CoKE
Contextualized Knowledge Graph Embedding
CoKE takes a sequence as input and uses a Transformer encoder to obtain contextualized representations. The model is then trained by predicting a missing component in the sequence, based on these contextualized representations.
Unlike sequential left-to-right or right-to-left encoding strategies, Transformer uses a multi-head self-attention mechanism, which allows each element to attend to all elements in the sequence, and thus is more effective in context modeling.
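To make this concrete, here is a minimal PyTorch sketch (not the CoKE implementation; the dimensions and toy sequence are arbitrary) showing that multi-head self-attention over a 3-element sequence such as an edge (s, r, o) produces an attention weight from every element to every other element, i.e., each position attends to the whole sequence in a single layer.

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 64, 4, 3   # e.g., an edge (s, r, o) has 3 elements

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(1, seq_len, embed_dim)     # one sequence of 3 element embeddings

# Self-attention: queries, keys, and values all come from the same sequence.
out, weights = attn(x, x, x)

print(out.shape)      # torch.Size([1, 3, 64])  contextualized representations
print(weights.shape)  # torch.Size([1, 3, 3])   every element attends to all 3 elements
```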

Take the relation HasPart as an example: it exhibits contextualized meanings, e.g., composition-related in (Table, HasPart, Leg) and location-related in (Atlantics, HasPart, NewYorkBay) (Xiao et al., 2016). Learning entity and relation representations that effectively capture such contextual meanings poses a new challenge to KG embedding.
Problem Formulation
We are given a KG composed of subject-relation-object triples {(s, r, o)}. Each triple indicates a relation r ∈ R between two entities s, o ∈ E, e.g., (Barack Obama, Has Child, Sasha Obama). Here, E is the entity vocabulary and R the relation set. These entities and relations form rich, varied graph contexts. Two types of graph contexts are considered here: edges and paths, both formalized as sequences composed of entities and relations.
- An edge s → r → o is a sequence formed by a triple, e.g., Barack Obama → Has Child → Sasha Obama. This is the basic unit of a KG, and also the simplest form of graph context.
- A path is a sequence formed by a list of relations linking two entities, e.g., Barack Obama → Has Child → (Sasha) → Lives In → (US) → Official Language → English, where the intermediate entities in parentheses are shown only for readability and are not part of the sequence. The length of a path is defined as the number of relations therein. The example above is a path of length 3. Edges can be viewed as special paths of length 1. Both cases are illustrated in the sketch after this list.
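As a concrete illustration, the sketch below (with hypothetical helper names, not taken from the CoKE code) turns an edge and a path into the unified entity/relation sequences described above.

```python
def edge_to_sequence(s, r, o):
    """An edge s -> r -> o becomes the sequence (s, r, o)."""
    return [s, r, o]

def path_to_sequence(s, relations, o):
    """A path becomes (s, r1, ..., rk, o); its length is the number of relations k."""
    return [s, *relations, o]

edge = edge_to_sequence("Barack Obama", "Has Child", "Sasha Obama")
path = path_to_sequence("Barack Obama",
                        ["Has Child", "Lives In", "Official Language"],
                        "English")

print(edge)           # ['Barack Obama', 'Has Child', 'Sasha Obama']
print(path)           # ['Barack Obama', 'Has Child', 'Lives In', 'Official Language', 'English']
print(len(path) - 2)  # 3 relations, i.e., a path of length 3; an edge is a path of length 1
```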
Given a graph context, i.e., an edge or a path, we unify the input as a sequence $X = (x_1, x_2, \cdots, x_K)$, where the first and last elements are entities from $E$, and the others in between are relations from $R$. For each element $x_i$ in $X$, we construct its input representation as:

$$h_i^0 = x_i^{ele} + x_i^{pos},$$

where $x_i^{ele}$ is the element embedding and $x_i^{pos}$ the position embedding. The former is used to identify the current element, and the latter its position in the sequence. We allow an element embedding for each entity/relation in $E \cup R$, and a position embedding for each position within length $K$. After constructing all input representations, we feed them into a stack of $L$ successive Transformer encoders (Vaswani et al., 2017) to encode the sequence and obtain:

$$h_i^{\ell} = \mathrm{Transformer}(h_i^{\ell-1}), \quad \ell = 1, 2, \cdots, L,$$

where $h_i^{\ell}$ is the hidden state of $x_i$ after the $\ell$-th layer.
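The two equations above can be sketched in PyTorch roughly as follows. This is an illustrative reimplementation rather than the official code, and the hidden size, number of attention heads, and number of layers are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size, max_len, hidden=256, heads=4, layers=6):
        super().__init__()
        self.element_emb = nn.Embedding(vocab_size, hidden)   # one embedding per entity/relation in E ∪ R
        self.position_emb = nn.Embedding(max_len, hidden)     # one embedding per position within length K
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           dim_feedforward=4 * hidden,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, ids):                                   # ids: (batch, seq_len)
        positions = torch.arange(ids.size(1), device=ids.device)
        h0 = self.element_emb(ids) + self.position_emb(positions)   # h_i^0 = x_i^ele + x_i^pos
        return self.encoder(h0)                               # h_i^L: (batch, seq_len, hidden)

model = ContextEncoder(vocab_size=1000, max_len=8)
hidden_states = model(torch.randint(0, 1000, (2, 5)))         # two length-5 sequences of element ids
print(hidden_states.shape)                                    # torch.Size([2, 5, 256])
```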
Training Tasks
To train the model, we design an entity prediction task, i.e., to predict a missing entity from a given graph context. This task amounts to single-hop or multi-hop question answering on KGs.
- Each edge s → r → o is associated with two training instances: ? → r → o and s → r → ?. It is a single-hop question answering task, e.g., Barack Obama → Has Child → ? is to answer "Who is the child of Barack Obama?".
- Each path is also associated with two training instances, one to predict s and the other to predict o. This is a multi-hop question answering task, e.g., Barack Obama → Has Child → Lives In → Official Language → ? is to answer "What is the official language of the country where Barack Obama's child lives?". Instance construction is illustrated in the sketch after this list.
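The sketch below (with a hypothetical [MASK] token and helper name) shows how each edge or path yields exactly two entity-prediction instances, one asking for s and one asking for o; relations are never masked.

```python
mask_token = "[MASK]"

def make_instances(sequence):
    """sequence = [s, r1, ..., rk, o] -> two (masked_sequence, answer) pairs."""
    ask_subject = ([mask_token] + sequence[1:], sequence[0])    # ? -> r1 -> ... -> rk -> o
    ask_object = (sequence[:-1] + [mask_token], sequence[-1])   # s -> r1 -> ... -> rk -> ?
    return [ask_subject, ask_object]

path = ["Barack Obama", "Has Child", "Lives In", "Official Language", "English"]
for masked, answer in make_instances(path):
    print(masked, "->", answer)
# ['[MASK]', 'Has Child', 'Lives In', 'Official Language', 'English'] -> Barack Obama
# ['Barack Obama', 'Has Child', 'Lives In', 'Official Language', '[MASK]'] -> English
```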
This entity prediction task resembles the masked language model (MLM) task studied in (Devlin et al., 2019). But unlike MLM, which randomly picks some input tokens to mask and predict, we restrict masking and prediction solely to entities in a given edge/path, so as to create meaningful question answering instances. Moreover, many downstream tasks considered in the evaluation phase, e.g., link prediction and path query answering, can be formulated in exactly the same way as entity prediction, which avoids a training-test discrepancy.
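For concreteness, here is a minimal sketch of one plausible prediction head, under the assumption of a simple linear classifier over the entity vocabulary (the exact output layer used by CoKE may differ): the hidden state at the masked position is scored against entities only, and the model is trained with a cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, num_entities = 256, 500                # illustrative sizes
classifier = nn.Linear(hidden, num_entities)   # scores over the entity vocabulary only

h = torch.randn(2, 5, hidden)            # encoder hidden states for two length-5 sequences
mask_pos = torch.tensor([0, 4])          # masked positions: the subject in one, the object in the other
target = torch.tensor([17, 321])         # ids of the entities to be recovered

h_masked = h[torch.arange(2), mask_pos]  # (2, hidden): hidden states at the masked positions
logits = classifier(h_masked)
loss = F.cross_entropy(logits, target)   # standard entity-prediction training loss
print(loss.item())
```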
