GraphStorm v0.2.1 release
The GraphStorm v0.2.1 release contains several major feature enhancements. We have improved the model inference user experience by automatically mapping inference results (prediction results and generated node embeddings) into the raw node ID space, i.e., the same ID space as the raw input data. The resulting output is stored in Parquet format. We have added a new inference command (graphstorm.run.gs_gen_node_embedding) for computing node embeddings on any given graph with a trained GraphStorm model. We have improved our distributed graph processing pipeline to provide multiple feature transformations, including categorical feature transformation and numerical bucketing, among others. We added a GAT model to the GraphStorm model zoo. We also added a demo of running GraphStorm in a Jupyter Notebook.
Major features
- Automatically map inference results (prediction results and generated node embeddings) into Raw Node ID space (#481, #524, #527, #543, #533, #578, #597, #621, #633, #641)
- Provide a command line to generate GNN embeddings (#478)
- Provide multiple feature transformations, including categorical feature transformation (#623), Rank-Gauss (#615), numerical bucketing (#583), and Min/Max normalization (#575)
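For readers unfamiliar with two of the transformations named above, the following is a minimal, generic sketch of Min/Max normalization and equal-width numerical bucketing in plain Python. It is not GraphStorm's implementation, and the function names and parameters (`min_max_normalize`, `bucketize`, `num_buckets`) are hypothetical. GraphStorm's own transformations may handle options and edge cases differently.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bucketize(values, lo, hi, num_buckets):
    """Map each value to the index of an equal-width bucket over [lo, hi].

    Values outside the range are clipped to the first or last bucket.
    """
    width = (hi - lo) / num_buckets
    indices = []
    for v in values:
        clipped = min(max(v, lo), hi)
        idx = int((clipped - lo) / width)
        # The upper bound itself belongs to the last bucket.
        indices.append(min(idx, num_buckets - 1))
    return indices


if __name__ == "__main__":
    print(min_max_normalize([2.0, 4.0, 6.0]))
    print(bucketize([0, 5, 10, 15], lo=0, hi=10, num_buckets=2))
```

Bucket indices like these are typically turned into one-hot vectors before being used as node or edge features.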
Minor features
- Support caching BERT embeddings on disk for GNN model fine-tuning (#516)
- Allow customization of GLEM trainable parameter grouping (#506)
- Support using NVidia WholeGraph to store edge features (#555)
- Add contrastive loss for link prediction tasks (#619)
- Support in-batch negative sampling for link prediction tasks (#596)
- Support the NCCL backend for sparse embeddings (#549)
New Built-in Models
- GAT
Breaking changes
We changed the file format and the content of the saved node embeddings and prediction results produced by the GraphStorm training and inference pipelines. By default, if a task is launched through a command under graphstorm.run.*, GraphStorm automatically saves generated node embeddings and prediction results in Parquet files with the following columns:
- Node embeddings: two columns, “nid” storing the node IDs in the raw node ID space and “emb” storing the node embeddings.
- Node prediction results: two columns, “nid” storing the node IDs in the raw node ID space and “pred” storing the prediction results.
- Edge prediction results: three columns, “src_nid” and “dst_nid” storing the raw node IDs of the source and destination nodes respectively, and “pred” storing the prediction results.
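As an illustration of consuming the new output, here is a minimal sketch that loads saved node embeddings with pandas. It makes several assumptions: pandas and a Parquet engine such as pyarrow are installed, the output is split across one or more `*.parquet` files under a directory you supply, and the helper names (`find_parquet_files`, `load_node_embeddings`) are hypothetical rather than part of GraphStorm. Only the column names “nid” and “emb” come from the description above.

```python
from pathlib import Path


def find_parquet_files(output_dir):
    """Return all Parquet files under output_dir, sorted for a stable read order."""
    return sorted(Path(output_dir).rglob("*.parquet"))


def load_node_embeddings(output_dir):
    """Load saved node embeddings into one DataFrame with columns 'nid' and 'emb'."""
    # Imported lazily: requires pandas plus a Parquet engine such as pyarrow.
    import pandas as pd

    files = find_parquet_files(output_dir)
    if not files:
        raise FileNotFoundError(f"no Parquet files found under {output_dir}")
    frames = [pd.read_parquet(f, columns=["nid", "emb"]) for f in files]
    return pd.concat(frames, ignore_index=True)
```

Because “nid” already holds raw node IDs, the resulting frame can be joined directly against the original input tables without a separate ID remapping step. Prediction files can be read the same way by requesting the “pred” column (plus “src_nid” and “dst_nid” for edge tasks) instead of “emb”.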
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Israt Nisa from AWS
- Qi Zhu from AWS
- Zichen Wang from AWS
- Nicolas Castet from NVidia
- Chang Liu from NVidia