Protein Ontology Linked Open Data

Introduction

Protein Ontology (PRO) Linked Open Data exposes, shares, and connects data, information, and knowledge about protein-related entities on the Semantic Web using URIs and RDF. It provides an alternative way to access the Protein Ontology for querying and for integration with other linked open datasets. The data are served by an OpenLink Virtuoso Universal Server and can be queried using SPARQL; RDF data dumps are also available for download. We also provide a metadata description of PRO Linked Open Data that complies with the W3C HCLS specification.
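As a rough illustration of what querying over SPARQL involves, the sketch below builds (but does not send) an HTTP GET request in the style of the SPARQL 1.1 Protocol: the query text travels in a "query" parameter and the Accept header asks for JSON results. The endpoint URL and the helper name build_sparql_request are placeholders chosen for this example, not the actual address or API of the PRO service, and the query simply lists a few arbitrary triples.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (an assumption): substitute the real SPARQL endpoint URL.
ENDPOINT = "https://example.org/sparql"

# A generic query that returns up to five arbitrary triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    # The query string is URL-encoded into the "query" parameter.
    url = endpoint + "?" + urlencode({"query": query.strip()})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would send it. With the JSON results format, the rows of a SELECT query are found under "results" and then "bindings" in the decoded response.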

Ontologies are formal, explicit specifications of a domain of interest. They consist of precisely defined terms and the relationships between them, which impart a hierarchical organization. Ontologies are increasingly used to define the basic terms and relations in biological domains, often as the foundation for the search, integration, and exchange of biological data. The Protein Ontology (PRO) provides an ontological representation of protein-related entities. In addition to the ontology itself, PRO includes other information, such as annotations and cross-references.

Linked Data is an evolving extension of the current hypertext document Web: a paradigm in which data are published and interconnected on the Web using open standards such as URIs, HTTP, RDF, OWL, and SPARQL. This enables data from heterogeneous sources to be shared, integrated, and queried in a Web of Data. In his Web architecture note "Linked Data", Tim Berners-Lee introduced a set of best practices for publishing and interlinking structured data on the Web. These have become known as the Linked Data principles:

Use URIs as names of things.
Use HTTP URIs, so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
Include links to other URIs, so that they can discover more things.

The key technologies to support Linked Data include:

URI references: used to uniquely identify things in the world. These things can be Web documents, real-world objects, or abstract concepts. Because URIs serve as global identifiers for entities, hyperlinks can be set between entities in different data sources, so that all Linked Data can be connected into a single Web of Data. The links between URIs also enable the discovery of new data sources.
HTTP: provides a universal access mechanism on the Web that allows URIs to be dereferenced into descriptions of the things they identify.
RDF (Resource Description Framework): a data model based on labeled, directed graphs for publishing data on the Web. The description of a resource in RDF is represented by a number of triples. Each triple consists of three parts: Subject, Predicate, and Object. The Subject is the URI identifying the resource being described. The Object can be either a literal value or the URI of another resource related to the Subject. The Predicate is a URI that indicates the relationship between the Subject and the Object.
OWL (Web Ontology Language): RDF by itself does not provide domain-specific terms for describing things in the world and their relationships. That role is filled by SKOS (Simple Knowledge Organization System), which provides a vocabulary for expressing conceptual hierarchies, and by RDFS (RDF Schema) and OWL, which provide vocabularies for describing a formal, explicit specification of a shared conceptualization of a domain of interest. Using shared vocabularies allows terms from different vocabularies to be connected.
SPARQL (SPARQL Protocol and RDF Query Language): a semantic query language for retrieving and manipulating data stored in RDF format. It provides query forms (SELECT, CONSTRUCT, ASK, and DESCRIBE) and update operations (INSERT, DELETE, LOAD, CLEAR, CREATE, DROP, COPY, MOVE, and ADD), as well as Federated Query for executing queries distributed over different SPARQL endpoints.
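
The triple model and the idea of a query pattern described above can be sketched in a few lines of plain Python. This is only a toy, in-memory illustration and not how Virtuoso or any RDF library stores data: the example.org URIs and the names Triple, GRAPH, and match are invented for this example and are not real PRO identifiers, and literals and URIs are both held as plain strings for brevity. Only the rdfs:label and rdfs:subClassOf predicates are real RDFS terms.

```python
from typing import NamedTuple

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
EX = "http://example.org/"  # placeholder namespace, not real PRO identifiers


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI naming the relationship
    obj: str        # URI of a related resource, or a literal value


# A tiny hand-written graph: each entry is one subject-predicate-object statement.
GRAPH = [
    Triple(EX + "kinaseA", RDFS + "label", "kinase A"),
    Triple(EX + "kinaseA", RDFS + "subClassOf", EX + "protein"),
    Triple(EX + "kinaseA_phospho", RDFS + "subClassOf", EX + "kinaseA"),
    Triple(EX + "protein", RDFS + "label", "protein"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern.

    None acts as a wildcard, loosely like a variable in a SPARQL triple pattern.
    """
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Roughly: SELECT ?s WHERE { ?s rdfs:subClassOf ex:kinaseA }
for t in match(GRAPH, predicate=RDFS + "subClassOf", obj=EX + "kinaseA"):
    print(t.subject)
```

A real SPARQL engine goes further by joining several such patterns on shared variables, which is what lets one query follow links across resources and, with Federated Query, across endpoints.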

To cite our work:

Chuming Chen, Hongzhan Huang, Karen E. Ross, Julie E. Cowart, Cecilia N. Arighi, Cathy H. Wu & Darren A. Natale.
Protein ontology on the semantic web for knowledge discovery.
Scientific Data, 7:337, https://doi.org/10.1038/s41597-020-00679-9 (2020)