This has been around for a while. It’s excellent, and I’ve tried to get it adopted at several life science organizations.
The problem is culture. Principal investigators are notoriously stubborn about how they do things, and that includes what they call things. Manufacturers and supply companies also don’t see much benefit to standardizing around this, and of course realize the large downside (alienating the aforementioned PIs).
What would help is introducing OWL into major software suites, primarily LIMS and ELN packages, along with tools for aligning terminology and concepts with OWL.
I think the various OBOFoundry ontologies are at different stages of maturity. One of their design principles is interoperability, but I'm not sure how often one is able to reason effectively across ontologies unless great care is taken to ensure that logical axioms are sound.
I'm involved in a project[0] that has an aim to provide better data integration by using OBOFoundry ontologies, but it's been a challenge in practice to merge the software with the ontologies in a coherent way.
There are a few which are really great. Basically my shortcut is: if Chris Mungall is involved, then it is logically and biologically ;) sound.
While we at SIB are more towards the RDF/SPARQL part of the spectrum (https://edu.sib.swiss/course/view.php?id=440), we do use OWL and OBO Foundry projects like Uberon and GO. For UniProt I took great pains to make sure it is really compatible. We did the same for Rhea and ChEBI. This has paid off handsomely in new query capabilities.
> Principal investigators are notoriously stubborn about how they do things
I've come across things as simple as "I wrote a Perl script that does that, but it uses KEGG not GO terms" as their reasoning. That, or – if they're okay with using the Gene Ontology – we have to also use the other system in parallel. And then the results become a discussion of how the two different annotation systems display them.
Given all the potential advantages of semantic technologies, I'm wondering whether their adoption is slowed down by the performance issues of inference engines (reasoners) on very large (> 100B nodes) datasets (e.g., AWS decided to exclude semantic inference functionality from their Neptune graph database, citing performance issues - though the relevant product leads have expressed interest in including inference, based on use cases etc.). Are there any recent achievements (preferably open source) on the front of dramatically speeding up the relevant engines?
As a semantic architect, this is not my experience. In fact, I see very few large graphs in the wild. The problem is, unsurprisingly, that describing data is difficult. Relating your own conceptualization of a domain to another's is frustrating and time consuming. It will always be easier to create a bespoke model. So, people just don't do it.
As for OBO, there are many interesting comments here. The OBO ontologies all utilize BFO as an upper-level and in this regard they are united. But otherwise, their quality and utility varies tremendously.
I still believe in this work and hope that one day everyone will think about their data as being longer-lived and more important than the software that generated it.
Thank you for sharing your thoughts. Just curious: If you were tasked with architecting and implementing a semantic layer for a complex SaaS platform in a large domain from scratch, what would be your approach and what technology stack would you prefer to use and why? What best practices would you adopt, if any?
Many large OBO ontologies use EL++ reasoning (e.g. Elk), performing DL reasoning on smaller chunks (e.g. relations). Having said that, newer reasoners like Konclude apparently do well with DL reasoning over combinations of large ontologies.
But ultimately it depends on what you want to do. In the life sciences subsets of FOL only buy you so much, and some kind of statistical or probabilistic inference is required. Mostly this is combined with logical inference in crude ad-hoc ways...
It is one of these misguided efforts to bring some order to the life sciences, but they go about it the wrong way.
It is so heavy-handed, and the website so obtuse and confusing, that no life scientist I know (and I have worked with hundreds of them) is even aware of it, let alone understands what an ontology is or how to use it.
Might be hard to believe for the uninitiated, but they even got some of the naming wrong from the start. For example, they have GO (Gene Ontology), but that ontology does not actually describe genes! It describes gene products (like proteins) - a huge difference! Not to mention grossly misleading the very life scientists it is meant to help.
Curious, coming from a user named 'glofish', that they are unaware of GO's pervasiveness in the zebrafish model-organism world. GO is everywhere in the genomics/model-organism world, and only growing in use. Also funny that the user is looking at labels, when concepts are the core of an ontology. Take a look at how many terms in GO are deprecated; it has evolved over time, and I know very few scientists who get things right the first time. Sure there are issues, but many FOSS principles apply throughout. Also, the NIH would beg to differ that OWL/OBO isn't important: https://monarchinitiative.org/.
People do not understand GO, and they misuse it. They think it is a gene ontology - it is not. It is a gene product ontology.
In addition, I guarantee you that the vast majority of people using GO understand none of the following:
- what the ontology actually is
- how it works
- who decides what gets into the data
- why something is labeled a certain way
- what evidence is there
- how the terms interconnect
- what the hierarchy all means
All that because the concepts are not explained properly, nor is the site of any use in helping you figure these out.
It is mostly an illusion - and I am saying that as someone who uses GO a lot. I am intimately familiar with all of its pitfalls. At best, what some people know is that a label is attached to a gene.
Finally, GO is perhaps the odd one out - the only ontology that is somewhat known, and that mostly because it is misused a lot.
I invite you to go to the link in the top post and note how many other ontologies there are ... hundreds? Ask a life scientist how many they have heard of.
Is my original statement really all that wrong? I don't think so. These ontologies are a dead end.
From in-depth experience I completely concur with your bullet points; they are all valid (spot on, really), and in some cases huge blockers for larger-scale adoption. I've had plenty of existential angst about the amount of time I've invested in the ontology world. Navigating, inferring, understanding - all the things necessary to make use of ontologies - need a lot of work, particularly new interfaces (but - hmm, sounds like Science). IMO, however, these issues don't invalidate the underlying effort or goals. Biology is vast, and difficult. In my view these ontologies are focal points that force biologists to think about what they are doing; this, by itself, is enough for me - everything else is bonus.
I've seen communities of biologists come to a new awareness of how bad their existing scientific terminology is when they go through ontology-building exercises. Scientists often use terms they think they know the meaning of because their academic ancestors all used those terms. Simply having scientists work through these issues is of value (again, Science == Slow).
Good luck using AI to understand human labels; you're going to need more structure (formalized scientific consensus). Ontologies are one way to contribute to this structure/consensus. Of course they, like every other knowledge base, are not a stand-alone answer.
This leaves me with: "I am saying that as someone that uses GO a lot." But why!? Since you're following this "dead end", I suspect you're not a scientist, but rather someone selling something, and as such you have no problem using a tool to make a $, even knowing it's pointless in the long run?
Can you expand on that? One of the main use cases for GO is statistical enrichment analyses. Of course, there are definitely problems here, from choice of appropriate way to correct for multiple hypotheses to appropriateness of frequentist vs Bayesian methods (e.g. Ontologizer). Is this what you are getting at?
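For context, the core of a GO term enrichment test is just a one-sided hypergeometric tail probability: given N genes of which K are annotated to a term, how surprising is it to see k or more annotated genes in a study set of n? A minimal stdlib-only sketch (all numbers hypothetical, and real tools like Ontologizer add multiple-testing correction on top):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """One-sided hypergeometric p-value: probability of observing k or
    more term-annotated genes in a study set of n, when K of the N
    background genes carry the annotation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical numbers: 20,000 background genes, 100 annotated to the
# term, a study set of 50 genes of which 5 carry the term.
p = enrichment_pvalue(20000, 100, 50, 5)
```

This is only the per-term statistic; correcting across the thousands of terms tested is where the methodological disagreements mentioned above come in.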
How do you work with it? Are you writing scripts and SPARQL that query an endpoint, or are you using an ontology-aware data browser that can do calculations? Or something completely different?
For a Python library that aims to provide a level of abstraction more appropriate to bioinformatics use cases, see https://github.com/biolink/ontobio/
Of course, it's always possible to use an RDF-level library such as rdflib, but this can be too low-level for OWL. Even simple bio-ontologies often make frequent use of existential restrictions and axiom annotations.
And of course it's possible to use a python-jvm bridge to access the fully featured java OWL API.
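To illustrate why triple-level access feels low-level: a single OWL existential restriction gets flattened into several RDF triples hung off a blank node. A sketch in plain Python (the class name "ex:NuclearPart" is made up; BFO:0000050 is part_of and GO:0005634 is the GO term for nucleus):

```python
# One OWL axiom, "ex:NuclearPart subClassOf (part_of some nucleus)",
# flattened to RDF triples; _:b0 stands for a blank node.
triples = [
    ("ex:NuclearPart", "rdfs:subClassOf", "_:b0"),
    ("_:b0", "rdf:type", "owl:Restriction"),
    ("_:b0", "owl:onProperty", "BFO:0000050"),     # part_of
    ("_:b0", "owl:someValuesFrom", "GO:0005634"),  # nucleus
]

def existential_parents(triples, cls):
    """Recover (property, filler) pairs for a class's existential
    restrictions by stitching the blank-node scaffolding back together."""
    restrictions = [o for s, p, o in triples
                    if s == cls and p == "rdfs:subClassOf"]
    out = []
    for b in restrictions:
        props = [o for s, p, o in triples if s == b and p == "owl:onProperty"]
        fillers = [o for s, p, o in triples if s == b and p == "owl:someValuesFrom"]
        out += [(pr, f) for pr in props for f in fillers]
    return out
```

An OWL-aware API does this bookkeeping for you; with a bare triple store you end up writing it yourself for every construct.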
I'm skeptical about these, as they often miss the point.
Two main challenges:
- one world view of categorization does not meet all needs and is fragile over time
- the challenge is still shared understanding between humans, not between computers.
Let me explain the second:
You have a large set of computer readable definitions - painstakingly built.
You have some new data that needs mapping into the ontology.
Typically a person has to do that.
To map they need to understand the ontology in the same way as every other person doing mappings.
The larger, more complex/'definitive' the ontology the harder this is.
Finally, science moves on and your ontology and/or understanding becomes out of date.
That's not to say shared definitions are not incredibly useful - just that writing them down in a computer friendly way doesn't guarantee that the definitions are actually shared between people.
There's an interesting clojure library called 'Tawny-OWL'[0] that is designed for building ontologies like this. It allows one to define a set of entities that follow patterns and logical axioms.
It has been tested heavily on all the ontologies on the obofoundry site.
But I agree that, in general, working with OWL ontologies (particularly those that use nested OWL constructs) can be difficult in non-JVM languages.
Can someone help me understand what an ontology is supposed to be or do? I've seen the word come up in a number of HN articles lately and any definition I find online seems impossibly vague.
One very simple thing it does is help you understand hierarchical semantic relationships. So, for example, you could have
sibling
  => brother
  => sister
Then you know, e.g. when parsing text, that if you observe the symbol "sister" in one place and "sibling" in another, they could be the same thing. This kind of thing is used in the medical space, where one doctor might describe a problem using a general term and the next in a more specific way, and you have to be able to match them.
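A minimal sketch of that matching, with the toy hierarchy above stored as child-to-parent links (terms are hypothetical, not from any real ontology):

```python
# Toy is-a hierarchy: each term points at its more general parent.
parents = {
    "brother": "sibling",
    "sister": "sibling",
    "sibling": "relative",
}

def ancestors(term):
    """All more-general terms reachable by following is-a links upward."""
    out = set()
    while term in parents:
        term = parents[term]
        out.add(term)
    return out

def could_match(a, b):
    """Two mentions could refer to the same thing if one term equals
    the other or is one of its generalizations."""
    return a == b or a in ancestors(b) or b in ancestors(a)
```

So "sister" and "sibling" could co-refer, while "sister" and "brother" cannot, since neither generalizes the other.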
Isn't that what word2vec does? The algorithm assigns vectors to words such that words appearing in similar contexts have similar vectors. I guess newer deep networks give even better results.
Unsupervised methods like w2v look more scalable than hand-made ontologies. Maybe ontologies help refine the results of word embeddings in this case?
Please remember, that these ontologies describe biological systems and are used as a standard for common nomenclature. It's orders of magnitudes harder to derive an ontology describing the localization of proteins in the cell by an algorithm, compared to w2v. Even then, there would be the fundamental issue that the ontology needs to be consistent over time in order to be useful.
It's a big graph, from graph theory. Think of a bunch of words connected to each other by lines ("edges") where the lines are some kind of causal link (e.g. both involved in processing sugars). It's basically organised like an XML file or JSON. The ontologies are actually quite small to download (46kb for the gene ontology). You can go look at one here [0]
The idea of the ontologies is to use the same words for the same things, and to have a kind of summary of all the research into the things listed in the ontology. Having this all standardised allows for a lot of interoperability. Big problems with the ontologies are that they are years out of date and have poor uptake in many parts of life science research. Which is a shame, because they are a fantastic fundamental tool; it's just that the life scientists are underfunded and already work ridiculous hours.
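The download mentioned above uses the OBO flat-file format, which is simple enough to read by hand: `[Term]` stanzas with `tag: value` lines. A minimal parser sketch (it only collects id, name, and is_a, ignoring the many other tags like def and synonym; the sample stanzas use the real GO ids for biological_process and metabolic process):

```python
def parse_obo_terms(text):
    """Minimal OBO flat-file parser: collect id, name, and is_a
    parents for each [Term] stanza. Real files carry many more tags."""
    terms, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"is_a": []}
        elif current is not None and ": " in line:
            tag, _, value = line.partition(": ")
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                # "is_a: GO:0008150 ! biological_process" -> keep the id
                current["is_a"].append(value.split(" ! ")[0])
    return terms

sample = """\
[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0008152
name: metabolic process
is_a: GO:0008150 ! biological_process
"""
terms = parse_obo_terms(sample)
```

For anything serious you'd use an existing library, but the point stands: the files are just structured text, not something exotic.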
I worked for a SemanticWeb startup, around the time of the folksonomy debates. Lots of pretension and malarkey. (The CTO liked to wax poetic about metametametadata. Not kidding.) Once I figured out that it's all just typed & labeled nodes and edges, everything kind of fell into place.
Various modeling systems will allow some parts to be implicit, for better conciseness. Analogy: when implementing a DAG with parents and children, the "edges" can be implemented with an array, which is mostly ignored from the outside.
An ontology is supposed to be a formal way of describing things so that the people who want to discuss their concepts have a very clear and disambiguated set of terms for conducting that discussion.
There's more to it than that, but really that's the most practical explanation of ontologies within the most common context you'll likely encounter it.
A good example is the BBC Program ontology[0] which you can see the practical usage of in the (still fantastic) talk 'Beyond the polar bear'[1]
Part of it is trivial, i.e. is "9606" used to identify a species or a PubMed paper? (While trivial, it's a huge source of errors - a bit like manually managing memory: trivial, but still a huge source of CVEs.)
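This is exactly what prefixed identifiers (CURIEs) solve: the bare number 9606 is ambiguous, but "NCBITaxon:9606" (human, as a species) can never be confused with "PMID:9606" (a paper). A tiny sketch with an illustrative prefix table (the OBO PURL and PubMed URL bases are real; error handling omitted):

```python
# Prefix-to-namespace table (illustrative subset).
PREFIXES = {
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
    "PMID": "https://pubmed.ncbi.nlm.nih.gov/",
}

def expand(curie):
    """Expand a CURIE like 'NCBITaxon:9606' to a full, unambiguous IRI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local
```

Two identifiers that share a local number now expand to entirely different IRIs, so the ambiguity cannot survive past the parsing step.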
The other is logical inference for data quality.
E.g. data is coded with a sub-sub-class, while the query asks for a middle class; the right data is still retrieved.
This goes beyond that into ever more complicated scenarios which are not doable with CONNECT BY queries.
Which gets into questions like: is the hind leg of an ant similar enough to a human leg to have an argument about the function of a protein transfer? (Uberon)
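The middle-class retrieval case can be sketched as a transitive superclass closure: data coded at the most specific class is still found by a query for any of its ancestors. Term names below are illustrative, loosely Uberon-flavored, not real ids:

```python
# Toy subclass axioms: each class points at its direct superclass.
subclass_of = {
    "hindlimb segment": "limb segment",
    "limb segment": "anatomical entity",
}

def superclasses(cls):
    """All classes above cls, via the transitive closure of subclass_of."""
    seen = set()
    while cls in subclass_of:
        cls = subclass_of[cls]
        seen.add(cls)
    return seen

# Data coded at different levels of specificity.
annotations = {"sample-1": "hindlimb segment", "sample-2": "anatomical entity"}

def query(cls):
    """Return items annotated with cls or any subclass of cls."""
    return sorted(item for item, c in annotations.items()
                  if c == cls or cls in superclasses(c))
```

A query for "limb segment" retrieves the hindlimb-coded item even though the two strings never match; that retrieval is the inference doing its job.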
I have a keen interest in the semantic web and linked data (I have written two books on the subject, and the current book I am writing on the Hy programming language also has linked data and knowledge representation examples).
Much of the most interesting work has been done in medicine and biology.
Off topic, but I am really split by wanting to use full graph databases vs. RDF/RDFS/OWL. Different but overlapping use cases.
> I am really split by wanting to use full graph databases vs. RDF/RDFS/OWL
I think having an alternative OWL->graph mapping that utilized the features of property graphs would help a lot with using non-RDF graph databases. The RDF layering is sub-optimal.
If you are in bio, the SPARQL use case is just so nice. I work on sparql.uniprot.org and others at SIB. Being able to federate without hassle is such a powerful thing for research.
The OBOFoundry has a set of principles[0] that not all ontologies conform to. An ontology project can register with the OBOFoundry through their GitHub site [1] if they follow the principles.
It is a combination of licensing. SNOMED CT is there in part via UMLS. But these days there are official SNOMED and LOINC via FHIR, maintained by their hosting organizations.
At one point when I was at another company I was working on a graph database that connected all of the ontologies you mentioned and then some. It was really cool to be able to traverse the entire chain all the way from an analyte and see how it was linked through genes up to and including specific tests that were available for said gene.
I'm glad I don't work in healthcare anymore though... CPT is such a nightmare!
One thought about this: ICD-9/10 are medical coding standards. They have only a very shallow semantic depth to them; they're more about completeness and specificity. They don't map well to what OWL tries to do.
2. I wish I could answer your question about the other projects. All I can say is that I saw this on Friday in a bioinformatics lecture. I'm not a bioinformatician. I don't know how all the various projects relate to each other, although there should be some kind of liaison effort (right?). I do think I remember something about SNOMED, but the rest are blanks.
I am part of OBO (and GO, which has also been mentioned here a few times). It's nice to see the discussion here, lots of good comments, including the critical ones!
I'm going to address a few themes that have emerged, hopefully this is informative and can generate useful discussion.
Yes, the OBO site is not really very biologist friendly (or really anyone-friendly). It is more geared towards ontology developers, biocurators, and the kind of people who might build tools biologists would use. I would recommend portals such as the OLS (https://www.ebi.ac.uk/ols) for biologists -- but even this site tends to be used by bioinformatics-savvy folks. Domain scientists and users often use ontologies indirectly. For example, the Human Phenotype Ontology is used in many clinical settings for entering patient phenotypes, and subsequently in diagnosis (making use of the logical structure of the ontology). The cell ontology is central to many single-cell seq efforts. And of course the GO is ubiquitous in interpretation of experimental data via enrichment analyses.
One thing we have tried hard to do with the OBO site is ensure we have up to date metadata for all registered ontologies -- including GitHub trackers. So if you have comments on any particular ontology then engaging the developers via their issue tracker is strongly encouraged. And OBO itself has a tracker, as well as a mail list anyone can join.
One thing we have not done a great job of is giving any kind of order to the many ontologies now listed. It can be overwhelming to someone not familiar with the field. Which ontology should I use (or avoid)? Which ontologies integrate well together, and which ones have duplication or incompatibilities? We have a number of new developments in the pipeline that should improve the situation here, through the development of a new mid-upper level ontology called OBO Core (https://github.com/OBOFoundry/Experimental-OBO-Core/). We're also encouraging all ontology developers to use standard tooling and best practices, which will make interoperation easier (https://github.com/INCATools/ontology-development-kit).
Finally, the OBO project itself receives only a tiny amount of short-term funding, and most of the ontologies that comprise it have little or no direct funding (exceptions being widely used ones such as GO). A lot of work is community effort. Not saying that to deflect any criticism - constructive criticism is good! Just to provide an explanation of why some things are the way they are.
I've tried to work with these in both health science and agricultural industries.
The bottlenecks are the reasoning they support and the quality of the data being annotated with the ontologies. The long and the short of it is that every organization is going to partially disagree with the subtleties of certain categorizations, which proliferates standards and ad-hoc modifications. There need to be means of adjusting mappings at runtime and of storing the changes. It also needs to be so stupid/simple that an old-guard biologist can use it and immediately comprehend the value.
Querying ontologies is easy, working with annotations-qua-annotations is more difficult than it has to be, and as such organizations typically will want to roll their own.