Benchmarks



Several datasets are commonly used as benchmarks for knowledge embedding, including FB15K, FB13, WN18 and WN11. We use FB15K and WN18 as examples to introduce the format of the input files for our framework.

Datasets are required in the following format, consisting of five files (a minimal loading sketch follows this list):

  • train.txt : the training file, with one triple (e1, e2, rel) per line; the first line gives the number of triples.

  • valid.txt : the validation file, same format as train.txt.

  • test.txt : the testing file, same format as train.txt.

  • entity2id.txt : all entities and corresponding ids, one per line.

  • relation2id.txt: all relations and corresponding ids, one per line.

  • The original data can also be downloaded from:

    FB15K and WN18 were published in "Translating Embeddings for Modeling Multi-relational Data" (2013). [download]

    FB13 and WN11 were published in "Reasoning With Neural Tensor Networks for Knowledge Base Completion". [download]
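
For concreteness, here is a minimal Python sketch (not part of the framework itself) for loading these files; the exact separator used in entity2id.txt and relation2id.txt is assumed to be whitespace or a tab, based on the description above.

    def load_triples(path):
        # train.txt / valid.txt / test.txt: the first line is the number of triples,
        # then one (e1, e2, rel) triple per line.
        with open(path) as f:
            num = int(f.readline())
            return [tuple(f.readline().split()[:3]) for _ in range(num)]

    def load_id_map(path):
        # entity2id.txt / relation2id.txt: one "name id" pair per line.
        mapping = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    mapping[parts[0]] = int(parts[1])
        return mapping

    train = load_triples("train.txt")            # list of (e1, e2, rel) string triples
    entity2id = load_id_map("entity2id.txt")
    relation2id = load_id_map("relation2id.txt")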

    Toolkits


    We provide several toolkits for knowledge embedding, comprising the following four repositories:

    OpenKE


    This is an efficient TensorFlow-based implementation for knowledge representation learning (KRL). We use C++ to implement some underlying operations such as data preprocessing and negative sampling. Each specific model is implemented in TensorFlow with Python interfaces, providing a convenient platform for running models on GPUs.

    OpenKE provides simple interfaces for training and testing various KRL models, without requiring much effort for redundant data processing or memory management. OpenKE implements a number of classic and effective models for knowledge embedding.

    We provide tutorials for training these models. Additionally, we provide some simple examples showing how to build a new model on top of OpenKE.
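
    The negative sampling mentioned above follows a common recipe for knowledge embedding: corrupt either the head or the tail of a true triple by replacing it with a random entity. OpenKE performs this in C++ for speed; the following is only a minimal Python sketch of the idea, with illustrative names rather than OpenKE's actual API:

        import random

        def corrupt(triple, num_entities, all_true):
            # Replace the head or the tail entity with a random entity,
            # resampling if the corrupted triple is itself a known true triple.
            h, t, r = triple
            while True:
                if random.random() < 0.5:
                    negative = (random.randrange(num_entities), t, r)   # corrupt head
                else:
                    negative = (h, random.randrange(num_entities), r)   # corrupt tail
                if negative not in all_true:
                    return negative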


    GitHub

    KB2E



    KB2E is our early implementation of several knowledge embedding models, and many of its resources are used in our subsequent works. Its code will gradually be integrated into the new framework OpenKE. KB2E is a basic and stable knowledge graph embedding toolkit that includes TransE, TransH, TransR and PTransE. The implementation follows the original settings of each paper, which makes it reliable for research experiments.
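
    For reference, the models in KB2E build on the TransE intuition that a relation acts as a translation in the embedding space, so h + r should be close to t for a true triple (h, r, t). Below is a small NumPy sketch of the TransE score and the margin-based ranking loss; the names are illustrative and not KB2E's API:

        import numpy as np

        def transe_score(h, r, t, norm=1):
            # Lower score = more plausible triple: h + r should be close to t.
            return np.linalg.norm(h + r - t, ord=norm)

        def margin_loss(pos_score, neg_score, margin=1.0):
            # Margin-based ranking loss over a true triple and a corrupted one.
            return max(0.0, margin + pos_score - neg_score)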


    GitHub

    Fast-TransX


    This is an efficient, lightweight implementation of TransE and its extensions for knowledge representation learning, including TransH, TransR, TransD, TranSparse and PTransE. The underlying framework has been redesigned for acceleration and supports multi-threaded training. Fast-TransX is designed for simple and quick deployment, following the framework of OpenKE.
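
    As an example of how these extensions differ from plain TransE, TransH first projects the head and tail embeddings onto a relation-specific hyperplane and then applies the translation there. A minimal NumPy sketch (names are illustrative, not Fast-TransX's API):

        import numpy as np

        def transh_score(h, t, d_r, w_r, norm=1):
            # w_r is the (unit-norm) normal vector of relation r's hyperplane,
            # d_r is the translation vector on that hyperplane.
            h_proj = h - np.dot(w_r, h) * w_r
            t_proj = t - np.dot(w_r, t) * w_r
            return np.linalg.norm(h_proj + d_r - t_proj, ord=norm)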


    GitHub

    TensorFlow-TransX


    This is a light and simple TensorFlow version of OpenKE, including TransE, TransH, TransR and TransD. Like Fast-TransX, TensorFlow-TransX avoids complicated encapsulation while following the same framework as OpenKE.


    GitHub

    Pretrained Embeddings



    We provide pretrained embeddings of existing large-scale knowledge graphs, trained using OpenKE. (These are all currently trained with TransE; more models will be added if necessary.)


    The knowledge graphs and embeddings contain the following five files:

  • Embeddings of the entities: the embedding of each entity (item) in the knowledge graph. The data are stored in binary format, one embedding per entity, where each embedding is a sequence of consecutive 32-bit floats.

  • Embeddings of the relations: the embedding of each relation (property) in the knowledge graph. The data are stored in binary format, one embedding per relation, where each embedding is a sequence of consecutive 32-bit floats.

  • Triple2id : the mapping from knowledge triples (facts) of the knowledge graph to their corresponding serial numbers. Each line contains a triple and its serial number, separated by a tab.

  • Entity2id : the mapping from entities (items) of the knowledge graph to their corresponding serial numbers. Each line contains an entity and its serial number, separated by a tab.

  • Relation2id : the mapping from relations (properties) of the knowledge graph to their corresponding serial numbers. Each line contains a relation and its serial number, separated by a tab.

  • File descriptions and download links:

    Knowledge Graph   Description                    Size       Download
    Wikidata          Embeddings of the entities     > 4GB      Download
                      Embeddings of the relations    < 1MB
                      List of the entity ids         360MB
                      List of the relation ids       < 1MB
                      List of the triple ids         1GB
    Freebase          Embeddings of the entities     > 15GB     Download
                      Embeddings of the relations    < 10MB
                      List of the entity ids         1.5GB
                      List of the relation ids       < 1MB
                      List of the triple ids         6GB
    XLORE             Embeddings of the entities     < 4GB      Download
                      Embeddings of the relations    < 60MB
                      List of the entity ids         < 500MB
                      List of the relation ids       < 2MB
                      List of the triple ids         < 1GB

    How to read the binary files:

  • Python

  • #Python code to read the binary files.
    import numpy as np
    filename = "relation2vec.bin"  # path to one of the .bin embedding files
    # Memory-map the file as a flat float32 array (one embedding after another).
    vec = np.memmap(filename, dtype='float32', mode='r')
    # See the reshaping sketch after the C example below for recovering
    # one vector per id from this flat array.

  • C/C++

  • //C(C++) code to read the binary files.
    #include <cstring>
    #include <cstdio>
    #include <cstdlib>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    
    const char* filename = "relation2vec.bin";  // path to one of the .bin embedding files
    struct stat statbuf;
    int fd;
    float* vec;
    
    int main() {
      // Memory-map the whole file as an array of consecutive floats.
      if (stat(filename, &statbuf) != -1) {
        fd = open(filename, O_RDONLY);
        if (fd != -1) {
          vec = (float*)mmap(NULL, statbuf.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          if (vec != MAP_FAILED) {
            // ... use vec[i] here ...
            munmap(vec, statbuf.st_size);
          }
          close(fd);
        }
      }
      return 0;
    }
    
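    The memory-mapped data above is a flat array of 32-bit floats; to recover one vector per entity or relation, it must be reshaped with the embedding dimension, which is not stored in the binary file itself. Below is a minimal Python sketch; the dimension value and the entity2vec.bin filename are assumptions and should be adjusted to the release you downloaded:

        import numpy as np

        dim = 100   # embedding dimension used at training time (assumed; adjust as needed)
        vec = np.memmap("entity2vec.bin", dtype="float32", mode="r")   # filename assumed, analogous to relation2vec.bin
        entity_matrix = vec.reshape(-1, dim)   # row i is the embedding of the entity with id i
        print(entity_matrix.shape)             # (number of entities, dim)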

    More information:

  • Here are the dumps of Wikidata. If necessary, you can click Wikidata for more information.

  • Here are the dumps of Freebase. If necessary, you can click Freebase for more information.

  • If necessary, you can click XLORE for more information.