
CoKE

Contextualized Knowledge Graph Embedding

CoKE takes a sequence as input and uses a Transformer encoder to obtain contextualized representations. The model is then trained by predicting a missing component in the sequence, based on these contextualized representations.

Unlike sequential left-to-right or right-to-left encoding strategies, the Transformer uses a multi-head self-attention mechanism that allows each element to attend to all elements in the sequence, and is therefore more effective at context modeling.
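As a rough illustration of this point (a toy PyTorch snippet, not the authors' code), multi-head self-attention produces one attention row per element over all elements of the sequence, so every position sees the full context in a single step:

```python
# Toy illustration (not CoKE's actual implementation): multi-head self-attention
# lets every element of the sequence attend to every other element at once.
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 3, 16, 4      # e.g., an (s, r, o) edge of length 3
x = torch.randn(1, seq_len, d_model)      # one toy sequence of input representations

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
out, weights = attn(x, x, x)              # self-attention: query = key = value = x

print(out.shape)      # torch.Size([1, 3, 16]) -- contextualized representations
print(weights.shape)  # torch.Size([1, 3, 3])  -- each element attends to all 3 elements
```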

Figure: An example of Barack Obama, where the left subgraph shows his political role (dashed blue) and the right one his family role (solid orange).

Take the relation HasPart as an example, which also presents contextualized meanings, e.g., composition-related in (Table, HasPart, Leg) and location-related in (Atlantics, HasPart, NewYorkBay) (Xiao et al., 2016). Learning entity and relation representations that effectively capture these contextual meanings poses a new challenge to KG embedding.

Problem Formulation

We are given a KG composed of subject-relation-object triples {(s, r, o)}. Each triple indicates a relation r ∈ R between two entities s, o ∈ E, e.g., (Barack Obama, Has Child, Sasha Obama). Here, E is the entity vocabulary and R the relation set. These entities and relations form rich, varied graph contexts. Two types of graph contexts are considered here: edges and paths, both formalized as sequences composed of entities and relations.

  • An edge s → r → o is a sequence formed by a triple, e.g., Barack Obama → Has Child → Sasha Obama. This is the basic unit of a KG, and also the simplest form of graph context.
  • A path s → r_1 → ⋯ → r_k → o is a sequence formed by a list of relations linking two entities, e.g., Barack Obama → Has Child → Lives In → Official Language → English, which passes through the intermediate entities Sasha and the US without including them in the sequence. The length of a path is defined as the number of relations therein; the example above is a path of length 3. Edges can be viewed as special paths of length 1 (see the sketch after this list).
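To make the sequence view concrete, here is a minimal sketch (the vocabulary, names, and helper function are illustrative, not the official preprocessing) that maps an edge or a path onto a single token sequence over a shared entity/relation vocabulary:

```python
# Illustrative sketch: unify an edge or a path into one token-id sequence.
# Entity and relation names are just the running examples from the text.
entities = ["Barack Obama", "Sasha Obama", "English"]
relations = ["Has Child", "Lives In", "Official Language"]

# one shared vocabulary over E ∪ R
vocab = {name: idx for idx, name in enumerate(entities + relations)}

def to_sequence(head, relation_list, tail):
    """Edge (length-1 path) or longer path -> [entity, r_1, ..., r_k, entity] ids."""
    return [vocab[head]] + [vocab[r] for r in relation_list] + [vocab[tail]]

edge = to_sequence("Barack Obama", ["Has Child"], "Sasha Obama")   # length-1 path
path = to_sequence("Barack Obama",
                   ["Has Child", "Lives In", "Official Language"],
                   "English")                                      # length-3 path
print(edge)  # [0, 3, 1]
print(path)  # [0, 3, 4, 5, 2]
```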

Given a graph context, i.e., an edge or a path, we unify the input as a sequence $X = (x_1, x_2, \cdots, x_n)$, where the first and last elements are entities from $E$, and the others in between are relations from $R$. For each element $x_i$ in $X$, we construct its input representation as:

$$h_i^0 = x_i^{ele} + x_i^{pos},$$

where $x_i^{ele}$ is the element embedding and $x_i^{pos}$ the position embedding. The former is used to identify the current element, and the latter its position in the sequence. We allow an element embedding for each entity/relation in $E \cup R$, and a position embedding for each position within length $K$. After constructing all input representations, we feed them into a stack of $L$ successive Transformer encoders (Vaswani et al., 2017) to encode the sequence and obtain:

$$h_i^l = \mathrm{Transformer}(h_i^{l-1}), \quad l = 1, 2, \cdots, L,$$

where $h_i^l$ is the hidden state of $x_i$ after the $l$-th layer.
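The construction above can be sketched in a few lines of PyTorch. This is an assumption-laden illustration rather than the authors' released implementation: the class name, hyperparameters, and the use of nn.TransformerEncoder are placeholders for the described input representation $h_i^0 = x_i^{ele} + x_i^{pos}$ followed by $L$ Transformer layers.

```python
# Minimal sketch of the described encoder (not the official implementation).
import torch
import torch.nn as nn

class CoKEEncoder(nn.Module):
    def __init__(self, vocab_size, max_len, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        self.element_emb = nn.Embedding(vocab_size, d_model)   # x_i^{ele}, over E ∪ R
        self.position_emb = nn.Embedding(max_len, d_model)     # x_i^{pos}, positions < K
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # L layers

    def forward(self, ids):                                    # ids: (batch, seq_len)
        positions = torch.arange(ids.size(1), device=ids.device)
        h0 = self.element_emb(ids) + self.position_emb(positions)  # h_i^0
        return self.encoder(h0)                                 # h_i^L for every position

# usage: encode the length-3 edge (s, r, o) with ids [0, 3, 1]
model = CoKEEncoder(vocab_size=6, max_len=8)
hidden = model(torch.tensor([[0, 3, 1]]))
print(hidden.shape)   # torch.Size([1, 3, 256])
```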

Training Tasks

To train the model, we design an entity prediction task, i.e., to predict a missing entity from a given graph context. This task amounts to single-hop or multi-hop question answering on KGs.

  • Each edge s → r → o is associated with two training instances: ? → r → o and s → r → ?. This is a single-hop question answering task, e.g., Barack Obama → Has Child → ? answers "Who is the child of Barack Obama?".
  • Each path s → r_1 → ⋯ → r_k → o is also associated with two training instances, one to predict s and the other to predict o. This is a multi-hop question answering task, e.g., Barack Obama → Has Child → Lives In → Official Language → ? answers "What is the official language of the country where Barack Obama's child lives?" (see the sketch below).
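The following small sketch shows how such training instances might be generated (the [MASK] token and helper function are illustrative assumptions, not the paper's exact data pipeline): each edge or path yields one instance that masks the head entity and one that masks the tail entity.

```python
# Hedged sketch of entity-prediction instance generation: mask either the first or
# the last element (the entities) and keep the original entity as the label.
MASK = "[MASK]"

def entity_prediction_instances(sequence):
    """sequence = [s, r_1, ..., r_k, o]  ->  two (masked_input, label) pairs."""
    mask_head = ([MASK] + sequence[1:], sequence[0])    # ? -> r_1 ... r_k -> o, predict s
    mask_tail = (sequence[:-1] + [MASK], sequence[-1])  # s -> r_1 ... r_k -> ?, predict o
    return [mask_head, mask_tail]

edge = ["Barack Obama", "Has Child", "Sasha Obama"]
for masked, label in entity_prediction_instances(edge):
    print(masked, "->", label)
# ['[MASK]', 'Has Child', 'Sasha Obama'] -> Barack Obama
# ['Barack Obama', 'Has Child', '[MASK]'] -> Sasha Obama
```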

This entity prediction task resembles the masked language model (MLM) task studied in (Devlin et al., 2019). But unlike MLM, which randomly picks input tokens to mask and predict, we restrict masking and prediction solely to the entities of a given edge/path, so as to create meaningful question answering instances. Moreover, many downstream tasks considered in the evaluation phase, e.g., link prediction and path query answering, can be formulated in exactly the same way as entity prediction, which avoids a training-test discrepancy.
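For intuition only, the fragment below sketches how a downstream link prediction query could be scored under this formulation; the hidden state at the masked position and the output projection over the entity vocabulary are hypothetical stand-ins, not CoKE's API.

```python
# Hedged sketch: link prediction as entity prediction at the masked position.
# `hidden_at_mask` and `entity_weights` are placeholders for illustration.
import torch

d_model, num_entities = 256, 6
hidden_at_mask = torch.randn(d_model)                 # final hidden state at [MASK]
entity_weights = torch.randn(num_entities, d_model)   # output projection over E

scores = entity_weights @ hidden_at_mask              # one score per candidate entity
ranking = torch.argsort(scores, descending=True)      # ranking[0] is the predicted entity
print(ranking)
```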