
Repository of Pre-Generated Graph Embeddings

This repository contains pre-generated graph embeddings for a collection of commonly used graphs and embedding algorithms. The embeddings are contained in the embeddings/ directory and the file structure is as follows:

├── {graphName}
│   ├── {embeddingAlgorithm}
│   │   └── {graphName}_{embeddingAlgorithm}_{embeddingDimension}_embedding.npy

Feedback: For comments, questions, and feedback please contact David Liu at liu.davi@northeastern.edu.

The pre-generated embeddings are saved as NumPy arrays and can be loaded into memory with np.load. Each array has dimensions N x d, where N is the number of nodes in the graph and d is the embedding dimension. The order of the nodes in the array matches G.nodes(), where G is the networkx graph produced by reading the corresponding edgelist in edgelists/.
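A minimal loading sketch. The real files live under embeddings/ and follow the directory layout above; to keep this snippet self-contained, it round-trips a small hypothetical embedding through an in-memory buffer instead of a file on disk.

```python
import io
import numpy as np

# Stand-in for a pre-generated embedding: 5 nodes, dimension 3.
N, d = 5, 3
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.save(buf, rng.standard_normal((N, d)))
buf.seek(0)

# In practice, load directly from the repository layout, e.g.:
#   X = np.load("embeddings/{graphName}/{embeddingAlgorithm}/"
#               "{graphName}_{embeddingAlgorithm}_{embeddingDimension}_embedding.npy")
X = np.load(buf)
assert X.shape == (N, d)  # rows follow the G.nodes() order of the matching edgelist
```

Row i of the array is the embedding of the i-th node returned by G.nodes() for the graph read from edgelists/.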

The remainder of this documentation provides more detail about the graphs, embedding algorithms, and specific embedding execution details for reproducibility purposes.

Graphs

| Name | Type | Nodes (N) | Edges (E) | Source |
| --- | --- | --- | --- | --- |
| Wikipedia | Word co-occurrence | 4,777 | 184,812 | (Grover and Leskovec, 2016) |
| Facebook | Social | 4,039 | 88,234 | SNAP |
| Protein-Protein | Biological | 3,890 | 76,584 | (Grover and Leskovec, 2016) |
| ca-HepTh | Citation | 9,877 | 25,998 | SNAP |
| LastFM | Social | 7,624 | 27,806 | SNAP |
| Autonomous Systems | Internet | 22,963 | 48,436 | Mark Newman's webpage |

The edgelists for all of the above graphs are stored in edgelists/.

For sources that provide graphs in MATLAB's .mat format, edgelists were generated with the script util/convert_mat_edglist.py.
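The conversion can be sketched as follows. This is a hypothetical re-implementation, not the contents of util/convert_mat_edglist.py; it assumes the .mat file stores an adjacency matrix (loaded in practice with scipy.io.loadmat) and emits one "u v" line per undirected edge.

```python
import numpy as np

def adjacency_to_edgelist(A):
    """Yield one 'u v' line per undirected edge (upper triangle only)."""
    A = np.asarray(A)
    rows, cols = np.nonzero(np.triu(A, k=1))  # skip diagonal and lower triangle
    return [f"{u} {v}" for u, v in zip(rows, cols)]

# Tiny 3-node triangle as a stand-in for a matrix loaded from a .mat file.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
lines = adjacency_to_edgelist(A)
# lines == ["0 1", "0 2", "1 2"]
```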

Algorithms

| Algorithm | Implementation | Notes |
| --- | --- | --- |
| SVD | sklearn.decomposition.TruncatedSVD | |
| HOPE | GEM | beta is set to 0.01 |
| Laplacian Eigenmap | sklearn.manifold.SpectralEmbedding | |
| Node2Vec | (Grover and Leskovec, 2016) | Local copy maintained in node2vec/ |
| SDNE | shenweichen/GraphEmbedding | |
| Hyperbolic GCN | HazyResearch | Local copy maintained in hgcn/ |
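As an illustration of the simplest method, an SVD embedding factors the graph's adjacency matrix and keeps the top-d components. The sketch below uses numpy's full SVD as a stand-in for sklearn.decomposition.TruncatedSVD (which computes the same factorization without materializing all components); the example graph is ours.

```python
import numpy as np

def svd_embed(A, d):
    """Return an N x d embedding from the top-d singular vectors of A."""
    U, S, _ = np.linalg.svd(np.asarray(A, dtype=float))
    return U[:, :d] * S[:d]  # scale each component by its singular value

# Adjacency matrix of a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
X = svd_embed(A, d=2)
assert X.shape == (4, 2)  # one row per node, as in the saved .npy files
```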

Evaluation

To confirm the utility of the pre-generated embeddings, they were evaluated with the hypercompare package, the source code behind the paper "Systematic comparison of graph embedding methods in practical tasks". The package computes mapping accuracy, greedy routing, and link prediction scores for arbitrary embeddings.

Setup

The evaluation code is contained in eval/eval.py. To set up the computing environment for evaluation, navigate to the eval directory and create a conda environment from the provided requirements.txt file. Also, follow the instructions in the hypercompare repository to install the testing library.

The hypercompare library expects all graph edgelists to be copied into hypercompare/hypercompare/code/lib/hypercomparison/data. The files should be named according to the format: {graph_name}_edges.txt where the graph name is the name in config/all.json (e.g. protein-protein_edges.txt).
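A hypothetical helper for the copy-and-rename step is sketched below. The structure of config/all.json is an assumption here (a flat list of graph names), as is the source file naming under edgelists/; only the {graph_name}_edges.txt target format comes from the text above.

```python
import json
import shutil
from pathlib import Path

def stage_edgelists(config_path, edgelist_dir, data_dir):
    """Copy each graph's edgelist into the layout hypercompare expects."""
    # Assumption: the config is a JSON list of graph names, e.g. ["protein-protein"].
    names = json.loads(Path(config_path).read_text())
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    for name in names:
        src = Path(edgelist_dir) / f"{name}.edgelist"  # assumed source naming
        dst = Path(data_dir) / f"{name}_edges.txt"     # required target naming
        shutil.copyfile(src, dst)
```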

To run the evaluation pipeline, execute

python eval.py ../config/all.json

Results

Below, we report the mapping accuracy scores for each combination of graph and embedding algorithm. The mapping accuracy is the Pearson correlation coefficient between the pairwise distances in the embedding space and the pairwise shortest-path distances in the original graph. For additional details, see (Zhang, 2021). Figure 2 in that paper shows that the median mapping accuracy (across 72 networks) for Node2Vec and Laplacian Eigenmap is between 0.5 and 0.75, which confirms that the accuracies reported below are reasonable.

| Network | SVD | HOPE | Laplacian Eigenmap | Node2Vec | SDNE | HGCN |
| --- | --- | --- | --- | --- | --- | --- |
| Wikipedia | -0.22 | -0.20 | 0.17 | -0.03 | -0.17 | 0.19 |
| Facebook | -0.09 | -0.09 | 0.20 | 0.66 | -0.16 | 0.27 |
| Protein-Protein | -0.46 | -0.45 | 0.45 | 0.16 | -0.46 | 0.25 |
| ca-HepTh | -0.38 | -0.37 | 0.42 | 0.56 | -0.46 | 0.25 |
| LastFM | -0.38 | -0.36 | 0.41 | 0.52 | -0.37 | 0.07 |
| Autonomous Systems | -0.33 | -0.34 | 0.50 | 0.71 | -0.17 | 0.09 |
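The mapping-accuracy computation can be sketched end to end on a toy graph. This is our own minimal version of the metric (hypercompare's implementation is the authoritative one): compute all pairwise shortest-path distances via BFS, all pairwise Euclidean distances in the embedding, and correlate the two.

```python
import numpy as np
from collections import deque

def bfs_dists(adj, s):
    """Unweighted shortest-path distances from node s."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Toy path graph 0-1-2-3 with a 1-D "embedding" that roughly preserves order.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
X = np.array([[0.0], [1.1], [1.9], [3.2]])

graph_d, embed_d = [], []
nodes = sorted(adj)
for i in nodes:
    d = bfs_dists(adj, i)
    for j in nodes:
        if i < j:
            graph_d.append(d[j])
            embed_d.append(np.linalg.norm(X[i] - X[j]))

# Mapping accuracy: Pearson correlation of the two distance lists.
r = np.corrcoef(graph_d, embed_d)[0, 1]
assert r > 0.9  # a near-isometric embedding correlates strongly
```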

Reproducibility

All of the embeddings were generated on the Discovery cluster with a GPU; the cluster has Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz CPUs and an NVIDIA Tesla K80 GPU.

The conda environments utilized to embed the graphs are specified in environments/. The base.yml environment was utilized to embed with SVD, HOPE, Laplacian Eigenmap, and Node2Vec; sdne.yml for SDNE; and hgcn.yml for Hyperbolic GCN.

When embedding with SDNE on the Discovery cluster, CUDA 9.0 was loaded with module load cuda/9.0.

For HGCN, it is necessary to execute source set_env.sh in hgcn/ prior to launching the GPU session. Then, after launching, load CUDA 9.0 with module load cuda/9.0, as with SDNE.

Contact / Contribute

For questions or requests to contribute/modify the repository please reach out to David Liu at liu.davi@northeastern.edu.