GraphQL API Disambiguation Tutorial

Learn how to use the disambiguation service through the Golden GraphQL API.

Overview

This is a short tutorial on how to use the disambiguation GraphQL API during triple submission. We will cover the basics, so we can successfully determine the subject id of the triples we want to submit.

Setting the disambiguation target

For example, let's assume we have the following triples on Apple that we want to submit to the graph:
  • Name: Apple
  • Website: http://apple.com
  • Number of Employees: 154000
The above are triples corresponding to predicates that already exist in the protocol's schema.

Basic entity disambiguation

If we want to disambiguate those triples, to get a list of existing entities in the graph that might be related, we can run the following GraphQL query via the API (no login required):
query DisambiguationQuery {
  disambiguateTriples(
    payload: {
      triples: [
        {predicate: "Name", object: "Apple"},
        {predicate: "Website", object: "http://apple.com"},
        {predicate: "Number of Employees", object: "154000"}
      ]
    }
  ) {
    entities {
      id
      name
      date_created
      distance
      reputation
    }
    disambiguationCallId
  }
}
At the time of writing, this query returns the following:
{
  "data": {
    "disambiguateTriples": {
      "entities": [
        {
          "id": "debcb513-b842-4645-9856-2f4ea975002b",
          "name": "Apple (company)",
          "date_created": "2022-01-29T16:14:30.067541",
          "distance": 0.26666666666666666,
          "reputation": 1
        },
        {
          "id": "efbab68f-61a9-469a-857e-38965bc9116a",
          "name": "Clustering",
          "date_created": "2022-12-13T12:28:59.299942",
          "distance": 0.4,
          "reputation": 0.000005518280159191455
        },
        ...
      ],
      "disambiguationCallId": "ed86be05-5dd2-48e0-9aa0-498a7785f788"
    }
  }
}
Note: the disambiguationCallId is required to create an entity with the createEntity mutation.
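When the triples come from a script rather than being typed by hand, the query shown earlier in this section can be assembled programmatically. The following Python sketch builds the JSON body for a conventional GraphQL-over-HTTP POST request. The helper name build_disambiguation_request is hypothetical, and the endpoint URL, which this tutorial does not state, is deliberately left out.

```python
import json

# Fields requested for every candidate entity, as in the query above.
ENTITY_FIELDS = ["id", "name", "date_created", "distance", "reputation"]


def build_disambiguation_request(triples):
    """Return the JSON request body for a disambiguateTriples query.

    `triples` is a list of (predicate, object) string pairs. This helper
    is illustrative and not part of any official client library.
    """
    # json.dumps produces a double-quoted, escaped string, which for
    # ordinary text is also a valid GraphQL string literal.
    triple_literals = ", ".join(
        "{predicate: %s, object: %s}" % (json.dumps(p), json.dumps(o))
        for p, o in triples
    )
    query = (
        "query DisambiguationQuery { "
        "disambiguateTriples(payload: {triples: [%s]}) { "
        "entities { %s } disambiguationCallId } }"
        % (triple_literals, " ".join(ENTITY_FIELDS))
    )
    # GraphQL servers conventionally accept a JSON object with a "query" key.
    return json.dumps({"query": query})


body = build_disambiguation_request([
    ("Name", "Apple"),
    ("Website", "http://apple.com"),
    ("Number of Employees", "154000"),
])
```

The resulting body can be sent with any HTTP client as a POST request with a Content-Type of application/json.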
In this particular response, the potential disambiguation entities shown are the following:
  • Apple (company): distance ≈ 0.27, reputation 1
  • Clustering: distance 0.4, reputation ≈ 0.0000055
The API returns the results ranked by relevance, which, in this case, puts the correct entity in first place. The disambiguation service compares the submitted triples with values that already exist on the graph, and then returns a list of entities that have similar values, ranked by distance and reputation scores.
The distance score of the response is a value between 0 and 1, where 0 indicates a perfect match between the submitted triples and what's already on the graph, while 1 would, on the contrary, indicate a total mismatch.
The reputation score is relative to each particular disambiguation response: the entity with the highest reputation always has the value 1, and the remaining entities receive a fraction of that value. Conceptually, entity reputation measures how much effort has gone into creating a given entity. Entities with more contributions, votes, backlinks, and older creation dates rank higher than their more junior counterparts. We employ this system to discourage the addition of duplicates since, ideally, the graph would hold only one entry for each canonical entity it represents.
Most of the time, out of all the candidate entities for disambiguation, we want to choose the one with the lowest distance (debcb513-b842-4645-9856-2f4ea975002b in this case), since it is the closest match to our submission. If several results fall within the same distance range (≈±0.15), though, reputation (higher is better) and date_created (older is better) are useful for breaking the tie.
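The selection rule described above can be sketched in a few lines of Python. This is one possible reading of the rule, not the service's own logic: the function name pick_candidate is hypothetical, and treating "the same distance range" as "within 0.15 of the lowest distance" is an assumption.

```python
from datetime import datetime

# Width of the "same distance range" band mentioned above (about 0.15).
TIE_BAND = 0.15


def pick_candidate(entities, tie_band=TIE_BAND):
    """Pick the most plausible entity from a disambiguateTriples result.

    Illustrative only: the lowest distance wins, and candidates within
    `tie_band` of that distance are compared by highest reputation and
    then by oldest date_created.
    """
    if not entities:
        return None
    best_distance = min(e["distance"] for e in entities)
    close = [e for e in entities if e["distance"] - best_distance <= tie_band]
    return min(
        close,
        key=lambda e: (-e["reputation"], datetime.fromisoformat(e["date_created"])),
    )


# The two candidates shown in the response above.
candidates = [
    {
        "id": "debcb513-b842-4645-9856-2f4ea975002b",
        "name": "Apple (company)",
        "date_created": "2022-01-29T16:14:30.067541",
        "distance": 0.26666666666666666,
        "reputation": 1,
    },
    {
        "id": "efbab68f-61a9-469a-857e-38965bc9116a",
        "name": "Clustering",
        "date_created": "2022-12-13T12:28:59.299942",
        "distance": 0.4,
        "reputation": 0.000005518280159191455,
    },
]
best = pick_candidate(candidates)
```

With the sample data, both candidates fall inside the band, so reputation decides and the Apple (company) entity is selected.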

Determining triple submissions after disambiguation

Apart from a list of possible candidate entities, the disambiguateTriples() query can also return a diff listing which triples already exist on any given candidate, and which ones haven't been added. This lets us inform the user which data has already been submitted and thus avoid the creation of duplicate triples.
If we are interested in the diff, we can add it as part of the GraphQL request:
query DisambiguationQuery {
  disambiguateTriples(
    payload: {
      triples: [
        { predicate: "Name", object: "Apple" }
        { predicate: "Website", object: "http://apple.com" }
        { predicate: "Number of Employees", object: "154000" }
      ]
    }
  ) {
    errors
    entities {
      id
      name
      date_created
      distance
      reputation
      diff {
        matches {
          id
          validation_status
          predicate
          object
        }
        inserts {
          predicate
          object
        }
      }
    }
  }
}
The response for the debcb513-b842-4645-9856-2f4ea975002b entity now returns additional information in the diff property:
{
  "id": "debcb513-b842-4645-9856-2f4ea975002b",
  "name": "Apple (company)",
  "date_created": "2022-01-29T16:14:30.067541",
  "distance": 0.222222222222222,
  "reputation": 1,
  "diff": {
    "matches": [
      {
        "id": "55bc2c88-f81e-4908-9f53-bd4e0442d39c",
        "predicate": "Website",
        "validation_status": "PENDING",
        "object": "http://apple.com"
      },
      {
        "id": "c7e24fef-50a2-47de-9b65-74bf082d1153",
        "validation_status": "PENDING",
        "predicate": "Number of Employees",
        "object": "154000"
      }
    ],
    "inserts": [
      {
        "predicate": "Name",
        "object": "Apple"
      }
    ]
  }
}
In this case, we see that 2 of the 3 triples already exist in the graph (triples 55bc2c88-f81e-4908-9f53-bd4e0442d39c and c7e24fef-50a2-47de-9b65-74bf082d1153) and both of them are in PENDING status, as the voting on them has not yet reached consensus.
The ‘Name’ → ‘Apple’ triple appears as an insert, since the current value on the entity is different (‘Name’ → ‘Apple (company)’).
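As a minimal sketch of how the diff might be consumed, the Python helper below separates the triples that still need to be submitted from those that already exist on the chosen candidate. The function name triples_to_submit is hypothetical; the field names are the ones requested in the query above.

```python
def triples_to_submit(entity):
    """Split a candidate's diff into new triples and already existing ones.

    Illustrative helper: `entity` is one item of the `entities` list
    returned when `diff` is requested in the query.
    """
    diff = entity.get("diff") or {}
    # Triples listed under "inserts" are not on the entity yet.
    new = [(t["predicate"], t["object"]) for t in diff.get("inserts", [])]
    # Triples listed under "matches" already exist, so resubmitting them
    # would create duplicates.
    existing = [
        (t["predicate"], t["object"], t["validation_status"])
        for t in diff.get("matches", [])
    ]
    return new, existing


# Trimmed copy of the entity shown in the response above.
apple = {
    "id": "debcb513-b842-4645-9856-2f4ea975002b",
    "diff": {
        "matches": [
            {"id": "55bc2c88-f81e-4908-9f53-bd4e0442d39c", "predicate": "Website",
             "validation_status": "PENDING", "object": "http://apple.com"},
            {"id": "c7e24fef-50a2-47de-9b65-74bf082d1153", "predicate": "Number of Employees",
             "validation_status": "PENDING", "object": "154000"},
        ],
        "inserts": [{"predicate": "Name", "object": "Apple"}],
    },
}
new_triples, existing_triples = triples_to_submit(apple)
```

Here new_triples contains only the Name triple, while the Website and Number of Employees triples are reported as existing with their PENDING status.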

Recursive disambiguation

So far, we've only covered the case of predicates that have values as an object. That is, predicates that are not referencing another entity of the graph. But what happens if we have the following example?
  • Founder: Steve Jobs
  • CEO: Tim Cook
Both the Founder and CEO predicates take an entity as their object, so we can't perform a submission with just the name of the person.
Thankfully, we can still use the disambiguateTriples() query, this time to get the reference for the entity we want to use as the object of our triple. For example, we could search for Steve Jobs with the following query:
query DisambiguationQuery {
  disambiguateTriples(
    payload: { triples: [{ predicate: "Name", object: "Steve Jobs" }] }
  ) {
    errors
    entities {
      id
      name
      date_created
      distance
      reputation
      diff {
        matches {
          id
          validation_status
          predicate
          object
        }
        inserts {
          predicate
          object
        }
      }
    }
  }
}
Which yields the following response:
{
  "data": {
    "disambiguateTriples": {
      "errors": null,
      "entities": [
        {
          "id": "d8f80edd-a053-4d5d-9a24-eaf0d8fdfbad",
          "name": "Steve Jobs",
          "date_created": "2022-01-29T15:10:01.623848",
          "distance": 0,
          "reputation": 1,
          "diff": {
            "matches": [
              {
                "id": "300c6e27-40bd-49a5-8285-a69d8b39b74d",
                "validation_status": "PENDING",
                "predicate": "Name",
                "object": "Steve Jobs"
              }
            ],
            "inserts": []
          }
        }
      ]
    }
  }
}
So, we've found that ‘Name’ → ‘Steve Jobs’ matches the entity d8f80edd-a053-4d5d-9a24-eaf0d8fdfbad on the graph, and we can now use that reference when we submit our triple.
If we have some additional information on the object, we can use those additional triples for a more accurate search. Let's try it with Tim Cook:
query DisambiguationQuery {
  disambiguateTriples(
    payload: {
      triples: [
        { predicate: "Name", object: "Tim Cook" }
        { predicate: "Twitter URL", object: "https://twitter.com/tim_cook" }
        { predicate: "Date of Birth", object: "1955-02-24" }
      ]
    }
  ) {
    errors
    entities {
      id
      name
      date_created
      distance
      reputation
      diff {
        matches {
          id
          validation_status
          predicate
          object
        }
        inserts {
          predicate
          object
        }
      }
    }
  }
}
This returns:
{
  "data": {
    "disambiguateTriples": {
      "errors": null,
      "entities": [
        {
          "id": "87047cc6-e3bc-4f7b-8c6e-e880ded391b4",
          "name": "Tim Cook",
          "date_created": "2022-08-29T14:53:50.775486",
          "distance": 0.3333333333333333,
          "reputation": 0.35450469592382217,
          "diff": {
            "matches": [
              {
                "id": "cbd2f700-62b2-4e5b-95ca-a02541c62d6f",
                "validation_status": "PENDING",
                "predicate": "Name",
                "object": "Tim Cook"
              },
              {
                "id": "9efefc66-ff6a-4743-911a-a38785c7e376",
                "validation_status": "PENDING",
                "predicate": "Twitter URL",
                "object": "https://twitter.com/tim_cook"
              }
            ],
            "inserts": [
              {
                "predicate": "Date of Birth",
                "object": "1955-02-24"
              }
            ]
          }
        },
        {
          "id": "68958b31-653b-4071-a47e-e344646a2826",
          "name": "Alain Prost",
          "date_created": "2022-06-20T16:02:04.707419",
          "distance": 0.36007130124777187,
          "reputation": 0.3559179064486544,
          "diff": {
            "matches": [
              {
                "id": "7160725f-fdd0-467e-815a-bb7d9edcad49",
                "validation_status": "PENDING",
                "predicate": "Date of Birth",
                "object": "1955-02-24"
              }
            ],
            "inserts": [
              {
                "predicate": "Name",
                "object": "Tim Cook"
              },
              {
                "predicate": "Twitter URL",
                "object": "https://twitter.com/tim_cook"
              }
            ]
          }
        },
        ...
      ]
    }
  }
}
In this case, we've found that Tim Cook is represented on the graph by the entity 87047cc6-e3bc-4f7b-8c6e-e880ded391b4. We can also see that the Date of Birth triple could be submitted to the graph, since it is listed as an insert for that entity. Other people, such as Alain Prost, appear on the candidate list because their Date of Birth triple matches the submitted value, albeit at a higher distance score.
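The lookup of an entity-valued object can be wrapped in a small helper, sketched below in Python. Everything named here is hypothetical: resolve_reference is not an API function, and disambiguate stands for a caller-supplied function that sends the triples to disambiguateTriples and returns the entities list with diff included. Requiring the Name triple to appear among the matches is an extra safeguard assumed by this sketch, not a rule of the API.

```python
def resolve_reference(name, extra_triples, disambiguate):
    """Return the id of the entity that best matches `name`, or None.

    `disambiguate` is a caller-supplied function (hypothetical) that takes
    a list of (predicate, object) pairs and returns candidate entities.
    """
    triples = [("Name", name)] + list(extra_triples)
    candidates = disambiguate(triples)
    # Keep only candidates whose existing Name triple matches the search,
    # which drops entities that matched on other predicates alone.
    named = [
        e for e in candidates
        if any(m["predicate"] == "Name" and m["object"] == name
               for m in e["diff"]["matches"])
    ]
    if not named:
        return None
    # Among the remaining candidates, the lowest distance wins.
    return min(named, key=lambda e: e["distance"])["id"]


def fake_disambiguate(triples):
    # Stand-in for a real API call: a trimmed copy of the response above.
    return [
        {"id": "87047cc6-e3bc-4f7b-8c6e-e880ded391b4", "distance": 0.3333333333333333,
         "diff": {"matches": [{"predicate": "Name", "object": "Tim Cook"}]}},
        {"id": "68958b31-653b-4071-a47e-e344646a2826", "distance": 0.36007130124777187,
         "diff": {"matches": [{"predicate": "Date of Birth", "object": "1955-02-24"}]}},
    ]


ceo_id = resolve_reference(
    "Tim Cook",
    [("Twitter URL", "https://twitter.com/tim_cook")],
    fake_disambiguate,
)
```

Note that the exact-name filter would reject an entity whose stored name differs from the search string, as with ‘Apple’ and ‘Apple (company)’ earlier, so a looser comparison may be preferable in practice.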

Summary

In this tutorial, we've covered the main uses of the disambiguateTriples() query on the GraphQL API, which is the main mechanism to find the entity identifiers in the graph when we have some information about an entity, but we can't locate its reference. Calling disambiguateTriples() is required before entity creation, as the disambiguationCallId returned is required when calling createEntity().
Disambiguation is a key process during triple submission, as it prevents the creation of entity duplicates and increases the usefulness of the graph as a source of knowledge.
Finally, we've also covered how the different metrics of the disambiguateTriples() query help us determine which of the results best matches our search, and how we can use the diff it provides to adjust the subsequent data submission and avoid creating duplicate triples.