How we manage skills @ Beamery - Part I

Representing skills for today, and the unknown of tomorrow…

Henri Egle Sorotos
Beamery Hacking Talent

--

Photo by Moritz Kindler on Unsplash

Beamery Edge exists to bring practical AI solutions to the world of HR. Edge is a multi-disciplinary team of data scientists, knowledge engineers, software engineers and product managers. It’s the place I get to call home at Beamery, and I love it.

AI teams are nothing without high-quality, rich data on hand to create models, and Edge is no different. Crucially, this data needs to be accessible and representative of the domain being modelled. Enter the Beamery Talent Graph: a knowledge graph of 20 billion+ facts modelling core business and HR concepts. Our AI is trained on these facts, and the resulting models are made available to our customers to apply to their own data. A discussion by my great colleague Kasper Piskorski of why we chose a knowledge graph approach, along with details of our triple store, can be found here. It’s well worth a read.

This knowledge is created by centralising and reconciling data from two groups of sources: 1) enterprise HRIS, HRMS and other company data sources, which provide a holistic picture of an organisation when stitched together as knowledge; and 2) open data such as social media, news sources, company information, administrative data, and other linked datasets such as DBpedia. It is the single source of truth for Beamery and underpins all our AI work, as well as serving as an enrichment source for the other data sources we have. Knowledge is modified, added and deleted as we understand more about the domain, e.g. when a new administrative dataset is added and reconciled into our core knowledge graph, or an additional HRIS is plugged into our complete view of knowledge.

To give you a flavour, the following core entities are available within the graph:

  • people
  • companies
  • experiences
  • educations
  • skills
  • locations

All of these concepts are highly interconnected and inferences can be drawn by traversing the graph. This is one of the most important benefits of linked data, and why RDF semantic web technology was chosen to underpin this work.
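To make the idea of traversal concrete, here is a toy example in Python using rdflib. The entity and predicate names are invented for illustration; they are not the actual Talent Graph ontology, which is far richer.

```python
# A minimal, illustrative sketch (pip install rdflib). The schema below is
# hypothetical; it only mirrors the shape of the entities listed above.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/talent/")

g = Graph()
# A person, an experience at a company, and a skill.
g.add((EX.alice, EX.hasExperience, EX.exp1))
g.add((EX.exp1, EX.atCompany, EX.acme))
g.add((EX.alice, EX.hasSkill, EX.css))

# Traverse the graph: infer a company's capabilities from its people's skills.
for person, _, exp in g.triples((None, EX.hasExperience, None)):
    if (exp, EX.atCompany, EX.acme) in g:
        for _, _, skill in g.triples((person, EX.hasSkill, None)):
            print(f"Acme can draw on {skill} via {person}")
```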

Skills as first-class citizens

As part of this universe of concepts, skills are very important. In fact, I would say they are first-class citizens within the graph. Let me explain the three fundamental reasons why this is the case:

  1. Real world recruitment problems are optimising for skills, not people — finding the correct candidate for a job, writing a job specification efficiently, and assessing supply and demand in the market. These are real world problems currently costing businesses around the world time and money. At the core of all of these problems is our understanding of skills.
  2. Real world workforce problems are optimising for skills, not people — enterprise organisations face a deluge of challenges to ensure happy, productive and compliant workforces are achieving their complete potential. Career pathing, upskilling and coaching, identification of jobs at risk of automation — these all require a deep understanding of skills.
  3. Other concepts are proxies for skills — people, experiences, educations, companies: these are all proxy nodes for skills. A person has a collection of skills, so a representation of their capability comes via the skills they are linked to. Equally, a person can acquire skills through a specific role (an experience) or a programme of study (an education). Finally, companies have certain capabilities as a result of the people they employ, and these hires will, in turn, bring more skills.

Our ontology, the model we have created to represent the relationship between these concepts, has been designed with exactly this in mind. Skills are at the centre of our understanding in the Beamery Talent Graph, and accessing people, companies, experiences and education via skills can be done with ease.
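As a hedged illustration of what "accessing people via skills" could look like, here is a SPARQL query run against the toy graph from the earlier sketch (again, the property names are hypothetical, not our actual ontology):

```python
# Start from a skill and fan out to the people and companies attached to it.
query = """
PREFIX ex: <http://example.org/talent/>
SELECT ?person ?company WHERE {
    ?person ex:hasSkill ex:css ;
            ex:hasExperience ?exp .
    ?exp ex:atCompany ?company .
}
"""
for row in g.query(query):
    print(row.person, row.company)
```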

One truth above all others

Now, all of this talk about skills and their importance is completely pointless unless we have a shared understanding of what a skill can actually be. The benefits of our knowledge graph as a single source of truth cannot be realised unless distinct, normalised concepts are recognised.

This is where the concepts of ‘normalised’ or ‘canonical’ terms become important. In the realm of skills normalisation, there are two different problems under discussion (a quick sketch in code follows the list):

  1. An agreed set of skills concepts
  2. An agreed set of instances of these concepts
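One way to make that distinction concrete is types versus values: skill concepts are the agreed categories, and skill instances are canonical terms filed under one of them. A minimal sketch, with concept names borrowed from later in this post:

```python
from dataclasses import dataclass
from enum import Enum

class SkillConcept(Enum):
    """Problem 1: a closed, agreed set of skill types (illustrative names)."""
    SOFTWARE_SKILL = "software skill"
    LANGUAGE_SKILL = "language skill"
    TRANSVERSAL_SKILL = "transversal skill"

@dataclass(frozen=True)
class SkillInstance:
    """Problem 2: an agreed instance, filed under exactly one concept."""
    canonical_label: str
    concept: SkillConcept

css = SkillInstance("CSS", SkillConcept.SOFTWARE_SKILL)
```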

Problem 1

If you were to go into a room and ask people to list all different types of skill they know, you would likely get responses like the following:

  • soft skills
  • hard skills
  • people skills
  • transferable skills
  • knowledge
  • personal attributes
  • values
  • cultures
  • the art of doing something
  • competencies
  • proficiency
  • expertise
  • domain specific skills
  • domain generic skills
  • ability

This is a big problem. Not having a shared language for skill types is one of the biggest problems facing business: it means we are unable to quantify skill adoption, depth, availability and so on. In addition, none of these concepts are distinct or mutually exclusive, and many are unclear and subjective. It is crucial that there is an agreed, unambiguous set of concepts representing skills, with defined relationships between one another.

Some at Beamery think this problem is the biggest stifler of business productivity globally.

Problem 2

Take ten ATS (Applicant Tracking System) entries for the same role at different companies, and you will undoubtedly find the ‘same’ skills with different natural language representations. For example, take the terms ‘CSS’, ‘Cascading Style Sheets’ and ‘CSS/CSS3’. It is entirely plausible that these three different terms could appear on three different job descriptions, yet one could reasonably assume that they are just different representations of the same shared normalised skill: ‘CSS’.

It is this linking of unnormalised to normalised instances of concepts that is crucial to unlocking the value of skills in a knowledge graph. In other words, we need an agreed set of labels that represent normalised skills.
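At its simplest, that linking step is a lookup from surface form to canonical label. The alias table below is invented for illustration, and a production normaliser needs far fuzzier matching, but the shape of the problem is this:

```python
# Hypothetical alias table: raw surface forms -> canonical skill labels.
ALIASES = {
    "css": "CSS",
    "cascading style sheets": "CSS",
    "css/css3": "CSS",
}

def normalise_skill(raw: str) -> str | None:
    """Return the canonical label for a raw skill mention, if we know it."""
    return ALIASES.get(raw.strip().lower())

assert normalise_skill("Cascading Style Sheets") == "CSS"
assert normalise_skill("CSS/CSS3") == "CSS"
```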

These two problems fall under the umbrella of entity reconciliation, a field of significant interest in the semantic web community. A simple visual example comes from McCallum et al. (2000):

Here we are reconciling entities representing people.
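To give a feel for the mechanics (not the method used in the paper, and not ours), here is a toy reconciliation pass that clusters mentions by string similarity:

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Crude string similarity; real systems use much richer signals."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

mentions = ["J. McCallum", "Jane McCallum", "Python", "python 3"]
clusters: list[list[str]] = []
for mention in mentions:
    for cluster in clusters:
        if similar(mention, cluster[0]):
            cluster.append(mention)
            break
    else:
        clusters.append([mention])

print(clusters)  # [['J. McCallum', 'Jane McCallum'], ['Python', 'python 3']]
```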

Different approaches to skill representation

There is a plethora of providers that have created labels of distinct skills for consumption by both machines (AI training) and humans. Some, but not all, have also disaggregated what a skill is into different types of skill — this is what we were discussing in problem 1.

Providers gather unnormalised raw skills from source material and experts in the field: job descriptions, CVs, industry experts, news articles. Once these raw unnormalised skills have been found, they are mapped to agreed normalised forms to create a set of distinct terms: ‘normalised skills’. Some of these providers have gone a step further and classified these instances of skills into different types of ‘skill concept’.

If we look at the market, there are broadly two schools of thought in this area, i.e. two groupings of organisations that have taken different approaches to developing an agreed set of skills concepts and instances of those concepts:

  1. Public Sector Open Source — hand-written ontologies, curated by a mixture of academic and industry experts in partnership with the public sector.
  2. Private Proprietary Closed Source — machine-generated ontologies curated using keyword extraction from job descriptions, job adverts and other corpora of text relating to the world of work.

Open Source Skills Ontologies

If you have ever worked with skills ontologies, you are likely to have come across the two dominant open source players in this area: the USA-based O*NET (Occupational Information Network) and ESCO (European Skills, Competences, Qualifications and Occupations). However, before both of these models were created, ISCO, the International Standard Classification of Occupations from the International Labour Organisation (ILO), was born in 1958.

To provide some context, ISCO is a broad, high-level taxonomy of job roles and higher-level job groupings. It does not include skills, but has since been mapped to other skills resources. It was designed to bring some structure to economic analysis at the United Nations and to provide nation states with a useful resource for economic policy. To give you an idea of its granularity, ISCO has the following high-level occupation groups in the latest (2008) iteration:

  1. Managers
  2. Professionals
  3. Technicians and associate professionals
  4. Clerical support workers
  5. Service and sales workers
  6. Skilled agricultural, forestry and fishery workers
  7. Craft and related trades workers
  8. Plant and machine operators, and assemblers
  9. Elementary occupations
  10. Armed forces occupations

As with any taxonomy, these terms are populated with secondary and tertiary terms at lower levels to provide further context and usability.
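The ISCO-08 code itself encodes this nesting: each additional digit narrows the group. One real path through the taxonomy, written as a code-to-label mapping:

```python
# One branch of ISCO-08, from major group down to unit group.
isco_path = {
    "2": "Professionals",                               # major group
    "21": "Science and engineering professionals",      # sub-major group
    "211": "Physical and earth science professionals",  # minor group
    "2111": "Physicists and astronomers",               # unit group
}
```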

O*NET came along in the 1990s to fulfil what ISCO couldn’t: it provides more granular context around occupations. Occupations remain at the centre of the model but, crucially, there is further context on required skills and industrial classifications. See the diagram below, which describes the context available for each role:

Pretty useful stuff, huh!?

Well, ESCO has done something similar. However, the ESCO occupation terms are based entirely on ISCO as a starting point, whereas O*NET was only mapped to ISCO after its creation. ESCO is organised into three ‘pillars’: occupations; skills and competences; and qualifications.

This results in a deep hierarchy across the different concepts.

Again, a very useful tool in a variety of situations.

For the record, I am not suggesting that these are the only open source models available for modelling skills, just that they are the most well known; there are several others I have experience of using or evaluating.

Despite there being a number of these open source models, they take a similar approach to construction, and the results share common themes. In detail:

  • High degree of skill concept disaggregation — open source models provide multiple skills concepts. For example, transversal skills, software skills, language skills etc. This richness is invaluable.
  • Low degree of skill instance depth — despite having a number of different skill concepts, the instances of these concepts are often not as numerous as one would like.
  • They are generated top-down — construction of these models is done by creating high-level categories that are then linked to secondary and tertiary terms. This process is often completed by hand. What you are left with is an extremely clean, but often small, list of skills instances.
  • They are slow to create and update — model iteration takes many years and involves many stakeholders. This means long lead times, and often a lack of timeliness in the instances of the terms. ESCO and O*NET are also politically motivated projects, which further slows the process.
  • There is a high degree of domain involvement — generally speaking, open source ontologies are created in partnership with industry leaders and government.

Closed Source Skills Ontologies

Since the late 90s, various companies have entered the skills modelling space. This has been powered partly by the availability of computational resources that can quickly generate keywords from large corpora of text. They have generally entered the market from two different angles:

  • Job Aggregation startups — companies that began life creating candidate facing job portals.
  • Internal HR optimisation startups — entities that provide workplace analytics services and have moved into modelling skills capabilities of enterprise organisations.

A non-exhaustive overview of the market looks a little like this:

Some of these providers began their models using the work of open source providers such as ESCO and O*NET. Many have created huge lists of distinct skills by running keyword extraction over job descriptions, job adverts and other sources, often using techniques such as NER (named entity recognition) and RAKE (rapid automatic keyword extraction).
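As a rough sketch of that harvesting step, here is RAKE run over a made-up job description, via the rake_nltk package (it also needs NLTK’s stopword and tokeniser data downloaded):

```python
# pip install rake_nltk   (and: python -m nltk.downloader stopwords punkt)
from rake_nltk import Rake

job_description = (
    "We are hiring a frontend engineer with strong CSS, "
    "JavaScript and accessibility skills."
)

rake = Rake()
rake.extract_keywords_from_text(job_description)
print(rake.get_ranked_phrases())  # candidate raw skill phrases, best first
```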

Again, these providers exhibit similar characteristics. Generally speaking, they have:

  • Low degree of skill concept disaggregation — often these providers offer only a single concept for skills: ‘skills’. Where there is some disaggregation, only a small subset of the terms is assigned to a sub-concept, e.g. natural languages or ‘soft’ skills.
  • High degree of skill instance depth — using machines to find skills means a large number of terms can be found in a short space of time. That said, these terms can sometimes lack cleanliness and entity reconciliation is an issue.
  • They are generated bottom-up — these models are often created by generating a huge number of skill terms, with structure in the way of concepts and links provided later.
  • They are quick to create and update — the sheer volume of natural language data available representing jobs and skills means finding new skills and monitoring the usage of other terms can be done quickly compared to the hand-written approach of open-source models.

Why Beamery Edge is ‘doing skills’ a little differently than the rest…

Both of the approaches discussed clearly have their stand-out benefits:

  • Open Source approaches — high degree of distinct and interlinked skills concepts.
  • Closed Source approaches — high degree of distinct skills instances and speed of creation.

What Beamery is trying to do is clear: marry the best from both schools of thought discussed above. Each approach has its own merit, and we want to take the strengths of each. Think of it as a third way. Our approach is characterised by:

  • multiple distinct linked skills concepts.
  • high volume of accurate instances populating these concepts.

Thanks to the excellent work of my top colleague, Kaan Karakeben, we have our own skills ontology of ~16k distinct canonical skills, derived from ~20 million unnormalised skills in our Talent Graph. Every skill is then categorised into a sub-concept of ‘skill’. This provides a high degree of conceptual understanding as well as richness in the instances of those concepts.

More on how we created these concepts, and what we do to ensure the instances of those concepts remain current and reactive to the world around us, in the next part of this blog series on skills…

Interested in joining our Engineering, Product & Design Team?

We’re looking for Data Scientists, Software Engineers (front end and back end, mid/senior), SRE Platform Engineers, Engineering Managers, Tech Leads, and Product Operations, Product Managers and Product Designers at all levels, plus many more roles, across London, the USA, Berlin and remote, to join us. Apply here!
