How to easily build your own domain-specific knowledge graph

Henri Egle Sorotos
3 min read · Dec 14, 2022


This post was inspired by a Stack Overflow Q&A:

So I am new to the world of the semantic web, RDF’s and Ontologies How does the process of creating knowledge graphs work? Say I want to create knowledge graphs about a specific team and link everything about it the players, trophies and everything how will it go? Do I first scrape data about the team? Do I convert from CSV to RDF triples. And where do Data Science, NLP and Machine Learning fall into all this?

With thanks to Nisi Zenuni for asking.

There are many aspects to this question, and I will take each in turn. Fundamentally, it is all about not reinventing the wheel.

How do you create a domain-specific knowledge graph?

Say I want to create knowledge graphs about a specific team and link everything about it the players, trophies and everything how will it go?

This is a broad question, and the answer is also quite high-level. Some steps:

  1. Design an ontology to represent the knowledge in your knowledge graph. The ontology defines the classes, which are then populated with instances. In this case a class could be players and an instance could be a player in your team. The players class could be linked to the trophies class to show which players have won trophies. This guide might prove useful.
  2. Procure data to populate your ontology. I don’t have domain knowledge of this area, but web data sounds like it could work.
  3. Find an appropriate database to store your graph. Based on the tags, it sounds like you want to use RDF. Virtuoso, GraphDB and MarkLogic all offer free versions you can run locally.
  4. Ingest your data. CRUD operations on RDF graphs can be executed using SPARQL. Take a look at the SPARQL INSERT operation. There are also more complex frameworks for turning data into knowledge graphs.
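As a rough illustration of steps 1, 2 and 4, here is a minimal sketch that uses only the Python standard library. It describes a tiny ontology and a few instances as triples, then renders them as a SPARQL INSERT DATA update you could send to your triple store. The `http://example.org/team/` namespace, the `Player`, `Trophy` and `wonTrophy` names, and the sample player are all made up for illustration. In practice a library such as rdflib would normally build and serialize the triples for you.

```python
# Hypothetical namespace for illustration; replace it with one you control.
EX = "http://example.org/team/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def iri(value: str) -> str:
    """Wrap an absolute IRI in angle brackets, as SPARQL expects."""
    return f"<{value}>"


# Step 1: a tiny ontology with two classes and one linking property.
ontology = [
    (EX + "Player", RDF_NS + "type", RDFS_NS + "Class"),
    (EX + "Trophy", RDF_NS + "type", RDFS_NS + "Class"),
    (EX + "wonTrophy", RDF_NS + "type", RDF_NS + "Property"),
]

# Step 2: instance data (invented here; yours would come from real sources).
instances = [
    (EX + "player/alice", RDF_NS + "type", EX + "Player"),
    (EX + "trophy/league_2022", RDF_NS + "type", EX + "Trophy"),
    (EX + "player/alice", EX + "wonTrophy", EX + "trophy/league_2022"),
]


def build_insert(triples) -> str:
    """Step 4: render IRI-only triples as a SPARQL 1.1 INSERT DATA update."""
    body = "\n".join(f"  {iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triples)
    return "INSERT DATA {\n" + body + "\n}"


update = build_insert(ontology + instances)
print(update)
```

The sketch only handles triples whose subject, predicate and object are all IRIs. Literal values such as names or dates would need quoting and escaping, which is one reason to reach for an RDF library once you move past a toy example.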

However, given the use case, I would ignore everything I’ve written above, as this sounds like a solved problem. The beauty of RDF is that there is a big community of open data and shared ontologies. It is likely that the graph you want to create could, at least partially, be sourced from existing public graphs which already aggregate and crowd-source data from the web.

See the public SPARQL endpoints, for example the DBpedia endpoint (http://dbpedia.org/sparql) used in the code below.

Using these you can extract data in a variety of formats, spin up local versions of these graphs, or create your own graph using the CONSTRUCT operation in SPARQL. An answer on Stack Overflow shows how this can be done with DBpedia using Python. See:

from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, RDF

# create an empty graph
g = Graph()

# execute a SPARQL CONSTRUCT query to get a set of RDF triples
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.addDefaultGraph("http://dbpedia.org")
# prefixes are declared explicitly so the query does not rely on endpoint defaults
query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
CONSTRUCT {
    ?s rdf:type dbo:PopulatedPlace .
    ?s dbp:iso3166code ?code .
    ?s dbo:populationTotal ?pop .
} WHERE {
    ?s rdf:type dbo:PopulatedPlace .
    ?s dbp:iso3166code ?code .
    ?s dbo:populationTotal ?pop .
    FILTER (?s = dbpedia:Switzerland)
}
"""
sparql.setQuery(query)
sparql.setReturnFormat(RDF)

try:
    # for a CONSTRUCT query, convert() returns an rdflib Graph object
    triples = sparql.query().convert()
except Exception as exc:
    print(f"query failed: {exc}")
else:
    # add triples to the graph only if the query succeeded
    g += triples

Do I first scrape data about the team? Do I convert from CSV to RDF triples?

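I would avoid scraping if you can, and rely instead on the public graphs mentioned above and the wider RDF linked data community. Scraping remains an option if the data you need is not already available, and converting CSV to RDF triples is a reasonable route when your data starts out in tabular form.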
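If you do end up with tabular data, the CSV-to-triples step can be sketched roughly as follows, again with only the standard library. The column names, the `http://example.org/team/` namespace and the `name` and `trophyCount` properties are assumptions made for this example, not part of any existing ontology. The output is N-Triples, a line-based RDF syntax that most triple stores can load.

```python
import csv
import io
from urllib.parse import quote

EX = "http://example.org/team/"  # hypothetical namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Hypothetical CSV content; in practice this would be read from a file.
CSV_TEXT = """player_id,name,trophies
p1,Alice Example,3
p2,Bob Example,1
"""


def escape_literal(text: str) -> str:
    """Escape characters that cannot appear raw in an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str) -> list:
    """Turn each CSV row into two N-Triples lines: a name and a trophy count."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the identifier so it is safe inside an IRI.
        subject = "<" + EX + "player/" + quote(row["player_id"], safe="") + ">"
        name = escape_literal(row["name"])
        trophies = int(row["trophies"])
        lines.append(f'{subject} <{EX}name> "{name}" .')
        lines.append(f'{subject} <{EX}trophyCount> "{trophies}"^^<{XSD_INTEGER}> .')
    return lines


ntriples = csv_to_ntriples(CSV_TEXT)
print("\n".join(ntriples))
```

Each row becomes one subject IRI with one triple per column, which is the simplest possible mapping. A real conversion would usually reuse properties from an existing shared ontology rather than inventing new ones.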

And where do Data Science, NLP and Machine Learning fall into all this?

Increasingly, knowledge graphs are being used as part of machine learning workflows. There are a few reasons for this:

  • Graphs provide a rich and highly connected web of data. Having more context is generally thought to result in better models, as feature variables are richer.
  • Data in a graph can be extracted at a specified granularity, so it is possible to serve a variety of downstream use cases whilst retaining semantic meaning.
  • The rise of models trained using graph neural networks is fuelling the increasing adoption of knowledge graphs.
  • Modern machine learning requires increasing amounts of data, often at a scale that is only available on the web. RDF has a long history of aggregating web data in public knowledge graphs.
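To make the point about granularity concrete, here is a small sketch that turns the same set of triples into features at two levels: per player and per team. The triples, the predicate names `playsFor` and `wonTrophy`, and the entity names are invented for illustration. A real pipeline would pull them from the graph with a SPARQL query.

```python
from collections import defaultdict

# Made-up triples (subject, predicate, object) standing in for graph data.
triples = [
    ("alice", "playsFor", "team_a"),
    ("alice", "wonTrophy", "league_2021"),
    ("alice", "wonTrophy", "cup_2022"),
    ("bob", "playsFor", "team_a"),
    ("bob", "wonTrophy", "cup_2022"),
]


def player_features(triples):
    """Fine granularity: count how often each predicate occurs per subject."""
    counts = defaultdict(lambda: defaultdict(int))
    for subject, predicate, _ in triples:
        counts[subject][predicate] += 1
    return {subject: dict(preds) for subject, preds in counts.items()}


def team_trophy_wins(triples):
    """Coarse granularity: sum the trophy wins of each team's players."""
    team_of = {s: o for s, p, o in triples if p == "playsFor"}
    wins = defaultdict(int)
    for subject, predicate, _ in triples:
        if predicate == "wonTrophy" and subject in team_of:
            wins[team_of[subject]] += 1
    return dict(wins)


per_player = player_features(triples)
per_team = team_trophy_wins(triples)
print(per_player)
print(per_team)
```

Note that the team-level figure counts player wins, so a trophy won by two team-mates is counted twice. Whether that is the right feature depends on the model you are building.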
