# Repository of pre-generated graph embeddings
This repository contains pre-generated graph embeddings for a collection of commonly used graphs and embedding algorithms. The embeddings are stored in the `embeddings/` directory with the following file structure:
```
embeddings/
├── {graphName}
│   ├── {embeddingAlgorithm}
│   │   └── {graphName}_{embeddingAlgorithm}_{embeddingDimension}_embedding.npy
```
Feedback: For comments, questions, and feedback please contact David Liu at liu.davi@northeastern.edu.
The pre-generated embeddings are saved as NumPy arrays and can be loaded into memory with `np.load`. Each array has shape N x d, where N is the number of nodes in the graph and d is the embedding dimension. The order of the nodes in the array is the same as `G.nodes()`, where `G` is the networkx graph generated by reading the corresponding edgelist in `edgelists/`.
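For example, an embedding can be loaded and aligned with its graph's node order as follows. This is a self-contained sketch: the karate-club graph and a random array stand in for a real graph/embedding pair from this repository, and a temporary path stands in for a file under `embeddings/`.

```python
import os
import tempfile

import numpy as np
import networkx as nx

# Toy stand-ins: the karate-club graph and a random array play the role of a
# real graph/embedding pair; the temporary path plays the role of a file under
# embeddings/{graphName}/{embeddingAlgorithm}/.
G = nx.karate_club_graph()
d = 16
path = os.path.join(tempfile.mkdtemp(), "toy_embedding.npy")
np.save(path, np.random.default_rng(0).normal(size=(G.number_of_nodes(), d)))

# Loading mirrors how the repository's .npy files are read:
X = np.load(path)                            # shape (N, d)
assert X.shape == (G.number_of_nodes(), d)

# Row i of the array is the embedding of the i-th node in G.nodes():
node_to_vector = dict(zip(G.nodes(), X))
```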
The remainder of this documentation provides more detail about the graphs, embedding algorithms, and specific embedding execution details for reproducibility purposes.
Name | Type | Nodes (N) | Edges (E) | Source |
---|---|---|---|---|
Wikipedia | Word Co-occurrence | 4,777 | 184,812 | (Grover and Leskovec, 2016) |
Facebook | Social | 4,039 | 88,234 | SNAP |
Protein-Protein | Biological | 3,890 | 76,584 | (Grover and Leskovec, 2016) |
ca-HepTh | Citation | 9,877 | 25,998 | SNAP |
LastFM | Social | 7,624 | 27,806 | SNAP |
Autonomous Systems | Internet | 22,963 | 48,436 | Mark Newman’s Webpage |
The edgelists for all of the above graphs are stored in `edgelists/`. For sources that provide graphs in Matlab's `.mat` format, edgelists were generated with the script `util/convert_mat_edglist.py`.
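For reference, the core of such a conversion — turning a symmetric 0/1 adjacency matrix into an undirected edgelist — can be sketched as below. This is an illustration, not the script's actual code:

```python
import numpy as np

def adjacency_to_edgelist(A):
    """List each undirected edge of a symmetric 0/1 adjacency
    matrix once, as (u, v) with u < v."""
    rows, cols = np.nonzero(np.triu(np.asarray(A), k=1))
    return list(zip(rows.tolist(), cols.tolist()))

# A triangle (0, 1, 2) with a pendant node 3 attached to node 2:
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
edges = adjacency_to_edgelist(A)
# edges == [(0, 1), (0, 2), (1, 2), (2, 3)]
```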
Algorithm | Implementation | Notes |
---|---|---|
SVD | `sklearn.decomposition.TruncatedSVD` | |
HOPE | GEM | `beta` is set to 0.01 |
Laplacian Eigenmap | `sklearn.manifold.SpectralEmbedding` | |
Node2Vec | (Grover and Leskovec, 2016) | Local copy maintained in `node2vec/` |
SDNE | shenweichen/GraphEmbedding | |
Hyperbolic GCN | HazyResearch | Local copy maintained in `hgcn/` |
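As an illustration of the first row of the table, a d-dimensional SVD embedding can be produced roughly as follows. Feeding the plain adjacency matrix to `TruncatedSVD` is an assumption here; the repository's exact invocation and hyperparameters may differ.

```python
import networkx as nx
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for one of the repository's graphs.
G = nx.karate_club_graph()
d = 16

# Assumption: the adjacency matrix is the input factorized by TruncatedSVD.
A = nx.to_numpy_array(G)                     # N x N adjacency matrix
svd = TruncatedSVD(n_components=d, random_state=0)
X = svd.fit_transform(A)                     # N x d embedding, rows follow G.nodes()
assert X.shape == (G.number_of_nodes(), d)
```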
To confirm the utility of the pre-generated embeddings, they were evaluated with the hypercompare package, the source code behind the paper “Systematic comparison of graph embedding methods in practical tasks”. The package provides a library that computes mapping accuracy, greedy routing, and link prediction scores for arbitrary embeddings.
The evaluation code is contained in `eval/eval.py`. To set up the computing environment for evaluation, navigate to the `eval` directory and create a conda environment with the provided `requirements.txt` file. Also follow the instructions in the hypercompare repository to install the testing library.
The hypercompare library expects all graph edgelists to be copied into `hypercompare/hypercompare/code/lib/hypercomparison/data`. The files should be named according to the format `{graph_name}_edges.txt`, where the graph name is the name in `config/all.json` (e.g. `protein-protein_edges.txt`).
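Copying and renaming can be scripted along these lines. The temporary directories and the `{name}.edgelist` source filename pattern are illustrative assumptions, not the repository's actual layout:

```python
import shutil
import tempfile
from pathlib import Path

# Temporary directories stand in for edgelists/ and
# hypercompare/hypercompare/code/lib/hypercomparison/data; the
# "{name}.edgelist" source filename pattern is an assumption.
src_dir = Path(tempfile.mkdtemp())
dst_dir = Path(tempfile.mkdtemp())

graph_names = ["protein-protein", "wikipedia"]  # illustrative graph names
for name in graph_names:
    (src_dir / f"{name}.edgelist").write_text("0 1\n1 2\n")  # dummy content
    # Rename to the {graph_name}_edges.txt format hypercompare expects.
    shutil.copy(src_dir / f"{name}.edgelist", dst_dir / f"{name}_edges.txt")

assert sorted(p.name for p in dst_dir.iterdir()) == [
    "protein-protein_edges.txt",
    "wikipedia_edges.txt",
]
```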
To run the evaluation pipeline, execute `python eval.py ../config/all.json`.
Below, we report the mapping accuracy scores for each combination of graph and embedding algorithm. The mapping accuracy is the Pearson correlation coefficient between the pairwise distances in the embedding space and the pairwise shortest-path distances in the original graph. For additional details see (Zhang, 2021). Figure 2 in that paper shows that the median mapping accuracy (across 72 networks) for Node2Vec and Laplacian Eigenmap is between 0.5 and 0.75, which confirms that the accuracies reported below are reasonable.
Network | SVD | HOPE | Laplacian Eigenmap | Node2Vec | SDNE | HGCN |
---|---|---|---|---|---|---|
Wikipedia | -0.22 | -0.20 | 0.17 | -0.03 | -0.17 | 0.19 |
Facebook | -0.09 | -0.09 | 0.20 | 0.66 | -0.16 | 0.27 |
Protein-Protein | -0.46 | -0.45 | 0.45 | 0.16 | -0.46 | 0.25 |
ca-HepTh | -0.38 | -0.37 | 0.42 | 0.56 | -0.46 | 0.25 |
LastFM | -0.38 | -0.36 | 0.41 | 0.52 | -0.37 | 0.07 |
Autonomous Systems | -0.33 | -0.34 | 0.50 | 0.71 | -0.17 | 0.09 |
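For reference, the mapping accuracy metric reported above can be sketched as follows. This is a simplified illustration; the hypercompare implementation may differ in details such as pair sampling or the distance measure used:

```python
from itertools import combinations

import numpy as np
import networkx as nx

def mapping_accuracy(G, X):
    """Pearson correlation between pairwise Euclidean distances in the
    embedding and pairwise shortest-path distances in the graph."""
    nodes = list(G.nodes())
    index = {u: i for i, u in enumerate(nodes)}
    sp = dict(nx.all_pairs_shortest_path_length(G))
    emb_dist, graph_dist = [], []
    for u, v in combinations(nodes, 2):
        if v in sp[u]:                       # skip disconnected pairs
            emb_dist.append(np.linalg.norm(X[index[u]] - X[index[v]]))
            graph_dist.append(sp[u][v])
    return float(np.corrcoef(emb_dist, graph_dist)[0, 1])

# A random embedding of a toy graph should score near zero:
G = nx.karate_club_graph()
X = np.random.default_rng(0).normal(size=(G.number_of_nodes(), 8))
score = mapping_accuracy(G, X)
assert -1.0 <= score <= 1.0
```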
All of the embeddings were generated on the Discovery cluster with a GPU. The Discovery cluster has Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz CPUs and an NVIDIA Tesla K80 GPU.
The conda environments used to embed the graphs are specified in `environments/`. The `base.yml` environment was used to embed with SVD, HOPE, Laplacian Eigenmap, and Node2Vec; `sdne.yml` for SDNE; and `hgcn.yml` for Hyperbolic GCN.
When embedding with SDNE on the Discovery cluster, CUDA 9.0 was loaded with `module load cuda/9.0`.
For HGCN, it is necessary to execute `source set_env.sh` in `hgcn/` prior to launching the GPU job. Then, after the job launches, load CUDA 9.0 with `module load cuda/9.0`, as with SDNE.
For questions or requests to contribute/modify the repository please reach out to David Liu at liu.davi@northeastern.edu.