Knowledge graph ontology design

Why provenance is the key to AI success

Henri Egle Sorotos
Beamery Hacking Talent

--

This blog accompanies the talk given by the author at Open Data Science Conference East, Boston 2022.

Photo by Christopher Burns on Unsplash

I’m not going to open this blog with lots of superlatives describing how much data there is in the world. Take it as a given: the amount of data out there is near limitless.

At Beamery, we are centralising our understanding of the world in a knowledge graph, aggregating data from a wide variety of sources: everything from Human Capital Management systems to Wikipedia pages. We are a full talent-lifecycle company, so our domain is people, companies, skills and experiences.

We hold a fair amount of data as knowledge: billions of facts, growing all the time, all stored in a graph using RDF semantic web technology. Whilst I work in the talent technology space, what follows could be applied to virtually any domain: scientific, people, sales, inventory and so on.

Before we dive into the provenance conundrum, let’s briefly explain what a knowledge graph is. In my own words, we are referring to:

“A highly flexible NoSQL database which represents data as ‘knowledge’ through a graph-like structure of nodes and edges. Information is represented much as someone might draw a mind map, or creatively relate ideas on a piece of paper. The nodes are typically defined in an ontology, the set of concepts that describe the domain, and the graph can be traversed semantically using that domain knowledge.”

People often think about visualising knowledge graphs as sprawling node-and-edge diagrams.

Whilst these can be quite sexy for marketing material, the sheer amount of data usually makes them impractical. Realistically, their main benefit is in understanding the classes that comprise an ontology, rather than the instances of those classes.

Now, not all knowledge graphs use the same underlying technology. In my career, I’ve almost always used the Resource Description Framework (RDF), an open standard often referred to as the semantic web. We chose to adopt it because:

  • an open standard means we can remain database vendor agnostic
  • the technology is widely adopted in open data circles, meaning we can make use of publicly available linked data
  • there is a strong emphasis on ontology design, meaning we control the concepts that describe our domain and can traverse the graph semantically
  • the graph data model makes it extremely easy to add new data as knowledge (see the sketch below)
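To make that last point concrete, here is a minimal sketch in Python using the open source rdflib library. The namespace and entity names are hypothetical, purely for illustration; the point is that each new fact is simply another edge in the graph:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace and entity names, purely for illustration.
EX = Namespace("http://example.org/talent/")

g = Graph()
g.bind("ex", EX)

# Each fact is a (subject, predicate, object) triple: two nodes and an edge.
g.add((EX.beamery, RDF.type, EX.Company))
g.add((EX.beamery, RDFS.label, Literal("Beamery")))
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.worksFor, EX.beamery))

# Adding new knowledge later is just another triple; no schema migration needed.
g.add((EX.alice, EX.hasSkill, EX.ontologyDesign))

print(g.serialize(format="turtle"))
```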

Here is a very professional diagram I previously created to show where RDF sits in the wider database ecosystem. Note that proprietary systems like TigerGraph and Neo4j are not RDF databases.

A discussion of why we chose RDF and OpenLink Virtuoso was previously written by my great colleague Kasper. A full list of RDF databases can be found here.

What is provenance?

Now for some more definitions. Data provenance (often referred to as lineage) is metadata that describes the origin of data. My semantic web friends who authored the PROV ontology have provided a more concrete definition:

“a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.”

Crucially, data ingested into a knowledge graph can be derived from almost any source, open or proprietary. The idea is that we aggregate disparate datasets into a single unified source: knowledge. This is one of the main reasons provenance matters so much.

The flexibility of RDF comes into its own when we consider provenance: we can easily add new entities that describe the provenance of a given core entity. The W3C working group behind PROV-O authored an ontology that can describe the provenance of data in any domain using RDF, built around three core classes: Entity, Activity and Agent.

I’m conscious that this can all seem a little abstract. In reality, we adapt these concepts into our wider ontology to show the lineage of a particular asset or entity in our own domain. For instance, a company webpage captured at a monthly cadence provides assets for a given entity, as sketched below. This is why having a semi-rigid ontology is key: it allows us to ensure that concepts in our ontology are properly attributed using provenance principles.
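As a rough sketch of how that example might look using PROV-O terms (the crawler, snapshot and company names are hypothetical, not our production ontology):

```python
from datetime import datetime, timezone

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import PROV, RDF, XSD

# Hypothetical namespace; our real ontology is Beamery-internal.
EX = Namespace("http://example.org/talent/")

g = Graph()
g.bind("ex", EX)
g.bind("prov", PROV)

# The monthly snapshot of the company webpage is a prov:Entity...
g.add((EX.acmeHomepage_2022_04, RDF.type, PROV.Entity))

# ...generated by a crawl, which is a prov:Activity with a timestamp...
g.add((EX.monthlyCrawl_2022_04, RDF.type, PROV.Activity))
g.add((EX.acmeHomepage_2022_04, PROV.wasGeneratedBy, EX.monthlyCrawl_2022_04))
g.add((EX.monthlyCrawl_2022_04, PROV.endedAtTime,
       Literal(datetime(2022, 4, 1, tzinfo=timezone.utc), datatype=XSD.dateTime)))

# ...carried out by our crawler, a prov:Agent.
g.add((EX.webCrawler, RDF.type, PROV.Agent))
g.add((EX.monthlyCrawl_2022_04, PROV.wasAssociatedWith, EX.webCrawler))

# The core company entity keeps its lineage back to the snapshot.
g.add((EX.acmeCorp, PROV.wasDerivedFrom, EX.acmeHomepage_2022_04))
```

All the contextual metadata (when the snapshot was taken, by what process, by which agent) hangs off the asset rather than the core entity, so the entity itself stays clean while its lineage remains fully queryable.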

In my current role we use a flavour of the ADMS ontology, which was itself originally based on PROV. Its core concepts are the Asset, the Asset Distribution and the Asset Repository, sketched below.
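ADMS builds on DCAT, so the same webpage example might look something like the following. This is a hedged illustration using the public W3C vocabulary with hypothetical instance names, rather than our internal flavour:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import DCAT, RDF

# W3C ADMS namespace; rdflib does not ship it, so we declare it ourselves.
ADMS = Namespace("http://www.w3.org/ns/adms#")
EX = Namespace("http://example.org/talent/")

g = Graph()
g.bind("adms", ADMS)
g.bind("ex", EX)

# An Asset Repository groups all the assets harvested from one source.
g.add((EX.companyWebsites, RDF.type, ADMS.AssetRepository))

# Each monthly snapshot is an Asset...
g.add((EX.acmeHomepage_2022_04, RDF.type, ADMS.Asset))

# ...and the raw HTML file we stored is one Distribution of that asset.
g.add((EX.acmeHomepage_2022_04_html, RDF.type, ADMS.AssetDistribution))
g.add((EX.acmeHomepage_2022_04, DCAT.distribution, EX.acmeHomepage_2022_04_html))
```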

What does maintaining provenance in ontology design enable?

  1. Perfect playground for data science — the beauty of an RDF knowledge graph is that data is held in a highly flexible manner. It can be extracted at any granularity for machine learning tasks, including subgraphs for graph learning problems, all whilst maintaining the lineage of where the data came from.
  2. Ensures data quality — data science has a well-known catchphrase: garbage in, garbage out. In a knowledge graph holding such a huge amount of data from disparate sources, knowing where each piece came from is crucial.
  3. Maintains context — even when data is of high quality, it is important we understand the context behind its metadata. For instance, the sectoral classifications used by two different company-intelligence websites are not the same, even where some labels happen to be identical.
  4. Entity reconciliation — one of the biggest problems in the digital world is recognising when two separate pieces of information ultimately refer to the same instance of the same concept. This is known as entity reconciliation, and provenance modelling makes it far easier (see the sketch after this list).
  5. Compliance and security — understanding the origin of data that could potentially end up in the hands of a customer is crucial to ensuring compliance.
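To give a flavour of what this looks like in practice, here is a hedged sketch of a provenance-aware query. Two assets from different sources feed the same company entity; with lineage preserved, we can ask which sources each fact came from and filter or weight accordingly (names, as before, are hypothetical):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import PROV, RDF

EX = Namespace("http://example.org/talent/")

g = Graph()
g.bind("ex", EX)
g.bind("prov", PROV)

# One company entity, derived from two disparate sources.
g.add((EX.acmeCorp, RDF.type, EX.Company))
g.add((EX.acmeCorp, PROV.wasDerivedFrom, EX.acmeHomepageSnapshot))
g.add((EX.acmeCorp, PROV.wasDerivedFrom, EX.wikipediaAcmePage))

# Every company alongside the sources its knowledge was derived from.
query = """
    SELECT ?company ?source
    WHERE {
        ?company a ex:Company ;
                 prov:wasDerivedFrom ?source .
    }
"""
for company, source in g.query(query, initNs={"ex": EX, "prov": PROV}):
    print(company, "<-", source)
```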

Creating value for the future

It was a real privilege to speak at ODSC East 22. I plan to continue sharing my thoughts on why data provenance is so crucial in future work.

Speaking at ODSC East 22

Stay tuned for future blogs and talks on AI and the Semantic Web at Beamery:

  • Why semantic web is crucial to success in the HR and talent domains
  • How our core ontology was designed
  • How data from virtually any available data source can be integrated using entity reconciliation

Interested in joining our Engineering, Product & Design Team?

We’re looking for Data Scientists, Front- and Back-End Software Engineers (mid/senior), SRE Platform Engineers, Engineering Managers, Tech Leads, and Product Operations/Managers/Designers across all levels, plus many more roles, in London, Berlin, the USA and remote. Apply here!
