🎄 Data Structures
A semantic space is a collections of identifiers for concepts. For example, the Chemical Entities of Biomedical Interest (ChEBI) has a semantic space including identifiers for chemicals. Within ChEBI’s semantic space, 138488 corresponds to the chemical alsterpaullone.
Identifiers are local to a namespace
138488 is a local unique identifier. Other semantic spaces might use the same local unique identifier to refer to a different concept in their respective domain.
Therefore, local unique identifiers should be qualified with some additional information saying what semantic space it comes from. The two common formalisms for doing this are Uniform Resource Identifiers (URIs) and Compact URIs (CURIEs):
In many applications, it’s important to be able to convert between CURIEs and URIs. Therefore, we need a data structure that connects the CURIE prefixes like CHEBI
to the URI prefixes like http://purl.obolibrary.org/obo/CHEBI_
.
Prefix Maps
A prefix map is a dictionary data structure where keys represent CURIE prefixes and their associated values represent URI prefixes. Ideally, these are constrained to be bijective (i.e., no duplicate keys, no duplicate values), but this is not always done in practice. Here’s an example prefix map containing information about semantic spaces from a small selection of OBO Foundry ontologies:
{
"CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
"MONDO": "http://purl.obolibrary.org/obo/MONDO_",
"GO": "http://purl.obolibrary.org/obo/GO_"
}
Prefix maps have the benefit of being simple and straightforward. They appear in many linked data applications, including:
- the
@prefix
declarations at the top of Turtle (RDF) documents and SPARQL queries - JSON-LD
- XML documents
- OWL ontologies
Loading prefix maps
Prefix maps can be loaded using Converter.from_prefix_map
.
However, prefix maps have the main limitation that they do not have first-class support for synonyms of CURIE prefixes or URI prefixes. In practice, a variety of synonyms are used for both. For example, the NCBI Taxonomy database appears with many different CURIE prefixes:
CURIE Prefix | Resource(s) |
---|---|
taxonomy |
Identifiers.org, Name-to-Thing |
taxon |
Gene Ontology Registry |
NCBITaxon |
OBO Foundry, Prefix Commons, OntoBee |
NCBITAXON |
BioPortal |
NCBI_TaxID |
Cellosaurus |
ncbitaxon |
OLS |
P685 |
Wikidata |
fj07xj |
FAIRsharing |
Similarly, many different URIs can be constructed for the same ChEBI local unique identifier. Using alsterpaullone as an example, this includes (many omitted):
URI Prefix | Provider |
---|---|
https://www.ebi.ac.uk/chebi/searchId.do?chebiId= |
ChEBI (first-party) |
https://identifiers.org/CHEBI: |
Identifiers.org |
https://identifiers.org/CHEBI/ |
Identifiers.org |
http://identifiers.org/CHEBI: |
Identifiers.org |
http://identifiers.org/CHEBI/ |
Identifiers.org |
http://purl.obolibrary.org/obo/CHEBI_ |
OBO Foundry |
https://n2t.net/chebi: |
Name-to-thing |
In practice, we need to be able to support the fact that there are many CURIE prefixes and URI prefixes for most semantic spaces as well as specify which CURIE prefix and URI prefix is the “preferred” one in a given context. Prefix maps, unfortunately, have no way to address this. Therefore, we’re going to introduce a new data structure.
Extended Prefix Maps
Extended Prefix Maps (EPMs) address the issues with prefix maps by including explicit fields for CURIE prefix synonyms and URI prefix synonyms while maintaining an explicit field for the preferred CURIE prefix and URI prefix. An abbreviated example (just containing an entry for ChEBI) looks like:
[
{
"prefix": "CHEBI",
"uri_prefix": "http://purl.obolibrary.org/obo/CHEBI_",
"prefix_synonyms": ["chebi"],
"uri_prefix_synonyms": [
"https://identifiers.org/chebi:"
]
}
]
An EPM is simply a list of records (see Record
). EPMs have the benefit that they are still encoded in JSON and can easily be encoded in YAML, TOML, RDF, and other schemata. Further, prefix maps can be automatically upgraded into EPMs (with some caveats) using curies.upgrade_prefix_map
.
Loading extended prefix maps
We are introducing this as a new standard in the curies
package. They can be loaded using Converter.from_extended_prefix_map
. Later, we hope to have an external, stable definition of this data schema.