# Repository of pre-generated graph embeddings
This repository contains pre-generated graph embeddings for a collection of commonly used graphs and embedding algorithms. The embeddings are stored in the `embeddings/` directory with the following file structure:
```
embeddings/
├── {graphName}
│   ├── {embeddingAlgorithm}
│   │   └── {graphName}_{embeddingAlgorithm}_{embeddingDimension}_embedding.npy
```
Feedback: For comments, questions, and feedback please contact David Liu at liu.davi@northeastern.edu.
The pre-generated embeddings are saved as NumPy arrays and can be loaded into memory with `np.load`. Each array has shape N x d, where N is the number of nodes in the graph and d is the embedding dimension. The order of the nodes in the array is the same as `G.nodes()`, where `G` is the networkx graph generated by reading the corresponding edgelist in `edgelists/`.
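For example, an embedding can be loaded and aligned with its graph's node order as follows. This is a self-contained sketch: the karate-club graph and a random array stand in for a real graph/embedding pair from this repository, and a temporary path stands in for a file under `embeddings/`.

```python
import os
import tempfile

import numpy as np
import networkx as nx

# Toy stand-ins: the karate-club graph and a random array play the role of a
# real graph/embedding pair; the temporary path plays the role of a file under
# embeddings/{graphName}/{embeddingAlgorithm}/.
G = nx.karate_club_graph()
d = 16
path = os.path.join(tempfile.mkdtemp(), "toy_embedding.npy")
np.save(path, np.random.default_rng(0).normal(size=(G.number_of_nodes(), d)))

# Loading mirrors how the repository's .npy files are read:
X = np.load(path)                            # shape (N, d)
assert X.shape == (G.number_of_nodes(), d)

# Row i of the array is the embedding of the i-th node in G.nodes():
node_to_vector = dict(zip(G.nodes(), X))
```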
The remainder of this documentation provides more detail about the graphs, embedding algorithms, and specific embedding execution details for reproducibility purposes.
Name | Type | Nodes (N) | Edges (E) | Source |
---|---|---|---|---|
Wikipedia | Word Co-occurrence | 4,777 | 184,812 | (Grover and Leskovec, 2016) |
Facebook | Social | 4,039 | 88,234 | SNAP |
Protein-Protein | Biological | 3,890 | 76,584 | (Grover and Leskovec, 2016) |
ca-HepTh | Citation | 9,877 | 25,998 | SNAP |
LastFM | Social | 7,624 | 27,806 | SNAP |
Autonomous Systems | Internet | 22,963 | 48,436 | Mark Newman’s Webpage |
The edgelists for all of the above graphs are stored in `edgelists/`. For sources that provide graphs in Matlab's `.mat` format, edgelists were generated with the script `util/convert_mat_edglist.py`.
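For reference, the core of such a conversion — turning a symmetric 0/1 adjacency matrix into an undirected edgelist — can be sketched as below. This is an illustration, not the script's actual code:

```python
import numpy as np

def adjacency_to_edgelist(A):
    """List each undirected edge of a symmetric 0/1 adjacency
    matrix once, as (u, v) with u < v."""
    rows, cols = np.nonzero(np.triu(np.asarray(A), k=1))
    return list(zip(rows.tolist(), cols.tolist()))

# A triangle (0, 1, 2) with a pendant node 3 attached to node 2:
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
edges = adjacency_to_edgelist(A)
# edges == [(0, 1), (0, 2), (1, 2), (2, 3)]
```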
Algorithm | Implementation | Notes |
---|---|---|
SVD | `sklearn.decomposition.TruncatedSVD` | |
HOPE | GEM | `beta` is set to 0.01 |
Laplacian Eigenmap | `sklearn.manifold.SpectralEmbedding` | |
Node2Vec | (Grover and Leskovec, 2016) | Local copy maintained in `node2vec/` |
SDNE | shenweichen/GraphEmbedding | |
Hyperbolic GCN | HazyResearch | Local copy maintained in `hgcn/` |
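As an illustration of the first row of the table, a d-dimensional SVD embedding can be produced roughly as follows. Feeding the plain adjacency matrix to `TruncatedSVD` is an assumption here; the repository's exact invocation and hyperparameters may differ.

```python
import networkx as nx
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for one of the repository's graphs.
G = nx.karate_club_graph()
d = 16

# Assumption: the adjacency matrix is the input factorized by TruncatedSVD.
A = nx.to_numpy_array(G)                     # N x N adjacency matrix
svd = TruncatedSVD(n_components=d, random_state=0)
X = svd.fit_transform(A)                     # N x d embedding, rows follow G.nodes()
assert X.shape == (G.number_of_nodes(), d)
```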
To confirm the utility of the pre-generated embeddings, they were evaluated with the hypercompare package, the source code behind the paper “Systematic comparison of graph embedding methods in practical tasks”. The package provides a library that computes mapping accuracy, greedy routing, and link prediction scores for arbitrary embeddings.
The evaluation code is contained in `eval/eval.py`. To set up the computing environment for evaluation, navigate to the `eval` directory and create a conda environment with the provided `requirements.txt` file. Also follow the instructions in the hypercompare repository to install the testing library.
The hypercompare library expects all graph edgelists to be copied into `hypercompare/hypercompare/code/lib/hypercomparison/data`. The files should be named according to the format `{graph_name}_edges.txt`, where the graph name is the name in `config/all.json` (e.g. `protein-protein_edges.txt`).
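Copying and renaming can be scripted along these lines. The temporary directories and the `{name}.edgelist` source filename pattern are illustrative assumptions, not the repository's actual layout:

```python
import shutil
import tempfile
from pathlib import Path

# Temporary directories stand in for edgelists/ and
# hypercompare/hypercompare/code/lib/hypercomparison/data; the
# "{name}.edgelist" source filename pattern is an assumption.
src_dir = Path(tempfile.mkdtemp())
dst_dir = Path(tempfile.mkdtemp())

graph_names = ["protein-protein", "wikipedia"]  # illustrative graph names
for name in graph_names:
    (src_dir / f"{name}.edgelist").write_text("0 1\n1 2\n")  # dummy content
    # Rename to the {graph_name}_edges.txt format hypercompare expects.
    shutil.copy(src_dir / f"{name}.edgelist", dst_dir / f"{name}_edges.txt")

assert sorted(p.name for p in dst_dir.iterdir()) == [
    "protein-protein_edges.txt",
    "wikipedia_edges.txt",
]
```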
To run the evaluation pipeline, execute `python eval.py ../config/all.json`.
Below, we report the mapping accuracy scores for each combination of graph and embedding algorithm. The mapping accuracy is the Pearson correlation coefficient between the pairwise distances in the embedding space and the pairwise shortest-path distances in the original graph. For additional details see (Zhang, 2021). Figure 2 in that paper shows that the median mapping accuracy (across 72 networks) for Node2Vec and Laplacian Eigenmap is between 0.5 and 0.75, which confirms that the accuracies reported below are reasonable.
Network | SVD | HOPE | Laplacian Eigenmap | Node2Vec | SDNE | HGCN |
---|---|---|---|---|---|---|
Wikipedia | -0.22 | -0.20 | 0.17 | -0.03 | -0.17 | 0.19 |
Facebook | -0.09 | -0.09 | 0.20 | 0.66 | -0.16 | 0.27 |
Protein-Protein | -0.46 | -0.45 | 0.45 | 0.16 | -0.46 | 0.25 |
ca-HepTh | -0.38 | -0.37 | 0.42 | 0.56 | -0.46 | 0.25 |
LastFM | -0.38 | -0.36 | 0.41 | 0.52 | -0.37 | 0.07 |
Autonomous Systems | -0.33 | -0.34 | 0.50 | 0.71 | -0.17 | 0.09 |
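For reference, the mapping accuracy metric reported above can be sketched as follows. This is a simplified illustration; the hypercompare implementation may differ in details such as pair sampling or the distance measure used:

```python
from itertools import combinations

import numpy as np
import networkx as nx

def mapping_accuracy(G, X):
    """Pearson correlation between pairwise Euclidean distances in the
    embedding and pairwise shortest-path distances in the graph."""
    nodes = list(G.nodes())
    index = {u: i for i, u in enumerate(nodes)}
    sp = dict(nx.all_pairs_shortest_path_length(G))
    emb_dist, graph_dist = [], []
    for u, v in combinations(nodes, 2):
        if v in sp[u]:                       # skip disconnected pairs
            emb_dist.append(np.linalg.norm(X[index[u]] - X[index[v]]))
            graph_dist.append(sp[u][v])
    return float(np.corrcoef(emb_dist, graph_dist)[0, 1])

# A random embedding of a toy graph should score near zero:
G = nx.karate_club_graph()
X = np.random.default_rng(0).normal(size=(G.number_of_nodes(), 8))
score = mapping_accuracy(G, X)
assert -1.0 <= score <= 1.0
```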
All of the embeddings were generated on the Discovery cluster with a GPU. The Discovery cluster has Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz CPUs and an NVIDIA Tesla K80 GPU.
The conda environments used to embed the graphs are specified in `environments/`. The `base.yml` environment was used to embed with SVD, HOPE, Laplacian Eigenmap, and Node2Vec; `sdne.yml` for SDNE; and `hgcn.yml` for Hyperbolic GCN.
When embedding with SDNE on the Discovery cluster, CUDA 9.0 was loaded with `module load cuda/9.0`.
For HGCN, it is necessary to execute `source set_env.sh` in `hgcn/` prior to launching the GPU job. Then, after the job launches, load CUDA 9.0 with `module load cuda/9.0`, as with SDNE.
For questions or requests to contribute/modify the repository please reach out to David Liu at liu.davi@northeastern.edu.