
Releases: awslabs/graphstorm

V0.3.1 Release Note

19 Aug 17:27

The GraphStorm V0.3.1 release contains a few major feature enhancements. In this version, we have reorganized the overall documentation and tutorials to give users a smoother learning path. The new documentation is organized into four sections: i) Getting Started, which offers a concise tutorial on using GraphStorm; ii) Command Line Interface User Guide, which provides an overview of the GraphStorm command line interfaces (CLIs); iii) Programming Interface User Guide, which details the application programming interfaces (APIs) of GraphStorm; and iv) Advanced Topics, which explores complex subjects such as custom model implementation, link prediction training optimization, multi-task learning, etc. In addition, we have enhanced the distributed graph processing functionalities to improve the user experience. We also provide four notebook examples that demonstrate the use of GraphStorm APIs in developing custom models and training/inference pipelines.

Major features

  • Reorganized the documentation and tutorials to group the main contents under two top-level menus, i.e., COMMAND LINE INTERFACE USER GUIDE and PROGRAMMING INTERFACE USER GUIDE. #956
    • Under the CLI user guide menu, regrouped the contents into two 2nd-level menus, i.e., GraphStorm Graph Construction and GraphStorm Model Training and Inference.
      • Under GraphStorm Graph Construction, added a new document, Input Raw Data Specification, to explain the specifications of the input data and provide a simple raw data example. #996
      • Added a new document, Single Machine Graph Construction, to introduce the gconstruct module, and provide a simple construction configuration JSON example. #996
      • In the Distributed Graph Construction, reorganized the document structure of GSProcessing. #907
    • Renamed DISTRIBUTED TRAINING to GraphStorm Model Training and Inference and moved it under COMMAND LINE INTERFACE USER GUIDE. #956
      • Added a new Model Training and Inference on a Single Machine 2nd-level menu to explain the launch commands.
        • Moved the Model Training and Inference Configurations section under it. #969
        • Added a new GraphStorm Training and Inference Output section to explain the intermediate outputs. #964
        • Added a new GraphStorm Output Node ID Remapping section to explain the CLI output and the remapping operation. #970
    • Under the PROGRAMMING INTERFACE USER GUIDE menu, grouped the GraphStorm API documentation together with the new notebook examples.
    • Refined the hard negative sampling and multi-task learning tutorials. #898 #944
  • Added a new GSProcessing launch script for EMR on EC2 that allows users to run a GSProcessing job as an EMR step, simplifying the user experience. #902

New examples

  • Add a Jupyter Notebook example for using GraphStorm APIs to implement a GraphStorm built-in GNN model. #919
  • Add a Jupyter Notebook example for using GraphStorm APIs to customize GNN model components. #929

Minor features

  • Add a hit@k evaluator for both classification and link prediction tasks. #911 #948
  • Remove the limit that the model saving frequency must be divisible by the evaluation frequency; users can now set the model saving frequency freely. #893 #948
  • Added a new truncate_dim argument to the GSProcessing no-op transformation and to gconstruct.construct_graph, as sketched below. #922
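
For illustration, a GSProcessing feature entry using the new argument could look like the following sketch. The truncate_dim key comes from this release; the surrounding structure and the column name are assumptions based on the GSProcessing configuration format, not a verified schema.

    {
        "column": "paper_embedding",
        "transformation": {
            "name": "no-op",
            "kwargs": { "truncate_dim": 64 }
        }
    }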

Breaking changes

  • Add a new argument norm in the __init__ of GraphStorm classification and regression decoders. This allows users to set layer or batch normalization on the neural network layers of these decoders. Only MLPFeatEdgeDecoder implements the normalization in this release. #948
  • Renamed pos_graph_feat_fields to pos_graph_edge_feat_fields in the GSgnnLinkPredictionDataLoaderBase class to make its meaning clearer. #934


GraphStorm v0.3 release

24 Jun 18:25

GraphStorm V0.3 release contains a few major feature enhancements. In this release, we have introduced support for multi-task learning, allowing users to define multiple training targets on different nodes and edges within a single training loop. The supported training supervisions for multi-task learning include node classification/regression, edge classification/regression, link prediction, and node feature reconstruction. Users can specify the training targets through the YAML configuration file. We have refactored the implementation of DataLoader and Dataset to decouple Dataset from DataLoader, simplifying the customization of both, and simplified the APIs of DataLoader, Dataset, and Evaluator. We now support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. We added the GATv2 model to the GraphStorm model zoo. We also added demos of running node classification and link prediction with custom GNN models in Jupyter Notebooks.

Major features

  • Support graph multi-task learning, which enables users to define multiple training targets, including node classification/regression, edge classification/regression, link prediction, and node feature reconstruction, in a single training loop; see the YAML sketch after this list. #804 #813 #825 #828 #842 #837 #834 #843 #852 #855 #860 #863 #871 #861
  • Refactor the implementations of DataLoader and Dataset to decouple Dataset from DataLoader and simplify the APIs of DataLoader, Dataset and Evaluator. #795 #820 #821 #822
  • Support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. #857 #870
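
For illustration, the snippet below sketches how multiple training targets might be declared in the YAML configuration file. The multi_task_learning block follows the shape described in the v0.3 multi-task tutorial, while the node type, label field, and values are hypothetical.

    gsf:
      multi_task_learning:
        - node_classification:
            target_ntype: "paper"
            label_field: "venue"
            num_classes: 10
            task_weight: 1.0
        - link_prediction:
            num_negative_edges: 4
            task_weight: 0.5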

New Examples

  • Add a Jupyter Notebook example for node classification using a custom GNN model. #830
  • Add a Jupyter Notebook example for link prediction using a custom GNN model. #846
  • Add link prediction support in the GPEFT example. #760
  • Add GraphStorm benchmarks using the MAG and Amazon Review datasets. #765 #818

Minor features

  • Allow re-partitioning to run on the Spark leader, removing the need for a follow-up re-partition job. #767
  • Add support for custom graph splits, allowing users to define their own train/validation/test sets. #761
  • Allow custom out_dtype for numerical feature transformations in GSProcessing. #739

New Built-in Models

  • GATv2

Breaking changes

GraphStorm API changes

Simplify graphstorm.initialize() by providing default values for ip_config, backend, and local_rank. (#781 #783)

  • The initialize() method adds default values: ip_config=None, backend='gloo', local_rank=0.
  • gsf.py adds a default device derived from local_rank, so other classes can call get_device() directly.
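
A minimal sketch of the simplified setup under these defaults, assuming a single-machine run:

    import graphstorm as gs
    from graphstorm.utils import get_device

    # Equivalent to gs.initialize(ip_config=None, backend="gloo", local_rank=0)
    gs.initialize()
    device = get_device()  # default device derived from local_rank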

Refactor evaluators with a new base class and several interfaces for different tasks. (#803 #807 #822)

  • Deprecate GSgnnInstanceEvaluator and GSgnnAccEvaluator in favor of GSgnnBaseEvaluator and GSgnnClassificationEvaluator. Refactor GSgnnRegressionEvaluator, GSgnnMrrLPEvaluator, and GSgnnPerEtypeLPEvaluator.
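
As a hedged sketch, constructing one of the refactored evaluators might look like the following; the eval_frequency and eval_metric_list arguments follow the v0.3 documentation, but the values are illustrative.

    import graphstorm as gs

    # Classification evaluator replacing the deprecated GSgnnAccEvaluator
    evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=1000,
                                                     eval_metric_list=["accuracy"])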

Unify the different GraphStorm data classes (GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, GSgnnEdgeInferData) into one GSgnnData class with one set of constructor arguments, and deprecate the four old classes. GSgnnData now only provides interfaces for accessing graph data, e.g., node features, edge features, labels, train masks, etc. A usage sketch follows the list below. (#795 #820 #821)

  • Update the init arguments of GSgnnData from (graph_name, part_config, node_feat_field, edge_feat_field, decoder_edge_feat, lm_feat_ntypes, lm_feat_etypes) to (part_config, node_feat_field, edge_feat_field, lm_feat_ntypes, lm_feat_etypes).
  • Add property functions:
    • graph_name(), returns the graph name string from the config JSON.
  • Add new functions:
    • get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), given the node ids (input_nodes) and node features to retrieve from the graph data (nfeat_fields), return the corresponding node features.
    • get_edge_feats(self, input_edges, efeat_fields, device='cpu'), given the edge ids (input_edges) and edge features to retrieve from the graph data (efeat_fields), return the corresponding edge features.
    • get_node_train_set(self, ntypes, mask), return the node training set.
    • get_node_val_set(self, ntypes, mask), return the node validation set.
    • get_node_test_set(self, ntypes, mask), return the node test set.
    • get_node_infer_set(self, ntypes, mask), return the node inference set.
    • get_edge_train_set(self, etypes, mask, reverse_edge_types_map), return the edge training set.
    • get_edge_val_set(self, etypes, mask, reverse_edge_types_map), return the edge validation set.
    • get_edge_test_set(self, etypes, mask, reverse_edge_types_map), return the edge test set.
    • get_edge_infer_set(self, etypes, mask, reverse_edge_types_map), return the edge inference set.
  • Update some functions:
    • get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), requires the caller to provide the node feature fields to access
    • get_edge_feats(self, input_edges, efeat_fields, device='cpu'), requires the caller to provide the edge feature fields to access
    • The get_labels function is replaced by get_node_feats and get_edge_feats called with label field names.
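
Putting the unified interface together, a hedged sketch of a typical access pattern (the partition config path, node type, and field names are hypothetical):

    import graphstorm as gs

    gs.initialize()
    data = gs.dataloading.GSgnnData(part_config="data/example_graph.json")
    # Training node set for a hypothetical "paper" node type
    train_idx = data.get_node_train_set(ntypes=["paper"])
    # The caller now names the feature fields to fetch
    feats = data.get_node_feats(train_idx, nfeat_fields={"paper": ["feat"]})
    # Labels are fetched the same way, via the label field name
    labels = data.get_node_feats(train_idx, nfeat_fields={"paper": ["label"]})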

Refactor all dataloader classes, adding new constructor arguments; see the sketch after this list. (#795 #820 #821)

  • GSgnnNodeDataLoaderBase and its subclasses require three new init arguments:
    • label_field: Label field of the node task.
    • node_feats: Node feature fields used by the node task.
    • edge_feats: Edge feature fields used by the node task.
  • GSgnnEdgeDataLoader and its subclasses require four new init arguments:
    • label_field: Label field of the edge task.
    • node_feats: Node feature fields used by the edge task.
    • edge_feats: Edge feature fields used by the edge task.
    • decoder_edge_feats: Edge feature fields used in the edge task decoder.
  • GSgnnLinkPredictionDataLoader and its subclasses require three new init arguments:
    • node_feats: Node feature fields used by the link prediction task.
    • edge_feats: Edge feature fields used by the link prediction task.
    • pos_graph_edge_feats: The field of the edge features used by the positive graph in link prediction.
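
A hedged sketch of constructing a node dataloader with the new arguments; the fanout, batch size, and field names are illustrative, and data/train_idx come from the GSgnnData sketch above.

    import graphstorm as gs

    train_dataloader = gs.dataloading.GSgnnNodeDataLoader(
        dataset=data,
        target_idx=train_idx,
        fanout=[10, 10],
        batch_size=1024,
        label_field="label",             # new required argument
        node_feats={"paper": ["feat"]},  # new required argument
        edge_feats=None,                 # new required argument
        train_task=True,
    )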

GraphStorm GSProcessing updates

GSProcessing now supports re-applying saved feature transformation rules to new data. GSProcessing now creates a file named precomputed_transformations.json in the output location. Users can copy that file to the top-level path of the new input data (at the same level as the input configuration JSON), and GSProcessing will re-use the existing transformations for the same features. This way, a model that has been trained on previous data can continue working even if new values appear in the new data. In this release, we only support re-applying categorical transformations.
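
For illustration, the expected layout of the new input data would look roughly like this; only the precomputed_transformations.json name is prescribed by this release, and the configuration file name is hypothetical.

    new-input-data/
    ├── gsprocessing_config.json          # input configuration JSON
    ├── precomputed_transformations.json  # copied from the previous run's output
    └── ...                               # raw node/edge data files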

Contributors

Special thanks to the DGL project and WholeGraph project for supporting GraphStorm 0.3 release.

V 0.2.2 Release Note

26 Feb 20:35

GraphStorm V0.2.2 release contains a few major feature enhancements. In this release, we have enhanced the NVIDIA WholeGraph support to speed up learnable embedding training and reads of cached BERT embeddings. We have added a customized negative sampling method for link prediction tasks, which enables users to define negative edges for each individual edge. We have provided two new feature transformations in our distributed graph processing pipeline: textual feature tokenization with HuggingFace models and textual feature encoding with HuggingFace models. We further simplified the command line interface for model prototyping by removing the requirement of setting up ssh for running GraphStorm jobs on a single machine. We also added an example of GPEFT training that enhances an LLM with graph data using the custom model interface.

Major features

  • Support using WholeGraph distributed embedding to speed up learnable embedding training. #677 #697 #734
  • Support using WholeGraph distributed embedding to speed up reads of cached BERT embeddings. #737
  • Support hard negative sampling for link prediction tasks. #678 #684 #703
  • The distributed graph processing pipeline supports using HuggingFace models to encode textual node features. #724
  • The distributed graph processing pipeline supports using HuggingFace models to tokenize textual node features. #700
  • Support running GraphStorm jobs on a single machine without using ssh. #712

New Examples

  • Add the GPEFT method, which enhances an LLM with graph data, as a GraphStorm example using the custom model interface. It trains a GNN model to encode the neighborhood of a target node as a prompt and performs parameter-efficient fine-tuning (PEFT) to enhance the LLM when computing the node representation of the target node. See the GPEFT example for how to run it. #673 #701

Minor features

  • Add support for balancing training/validation/test sets during graph partitioning for node classification tasks. #714 #741
  • Allow users to start training/inference jobs without specifying target_ntype/target_etype on homogeneous graphs. #686 #683
  • Unify the ID mapping output of GConstruct and GSProcessing. #461

Breaking changes

  • Previously, GConstruct created the ID mappings as a single Parquet file, with its filename prefixed by the node type. After the 0.2.2 release, GConstruct creates partitioned Parquet files for each node type under its own directory; see the reading sketch after this list. This change unifies the output of GConstruct and GSProcessing. See more details in #461.
  • We unified the behavior of handling errors in evaluation functions. Previously, evaluation functions such as roc_auc or f1 score would not raise an exception when an error happened. After the 0.2.2 release, evaluation functions stop execution and raise an exception with a corresponding error message when an error happens. See more details in #711.
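
As a hedged sketch, the partitioned mapping for one node type can be read back as a single table; the raw_id_mappings directory name is an assumption based on GConstruct's output layout.

    import pandas as pd

    # One directory of partitioned Parquet files per node type
    nid_map = pd.read_parquet("gconstruct_output/raw_id_mappings/paper/")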


GraphStorm v0.2.1 release

28 Nov 19:28

GraphStorm V0.2.1 release contains a few major feature enhancements. In this release, we have enhanced the GraphStorm model inference user experience by automatically mapping inference results (prediction results and generated node embeddings) into the Raw Node ID space, i.e., the same ID space as the input raw data. The resulting output is stored in Parquet format. We have added a new inference command (graphstorm.run.gs_gen_node_embedding) for computing node embeddings on any given graph with a trained GraphStorm model. We have improved our distributed graph processing pipeline to provide multiple feature transformations, including categorical feature transformation, numerical bucketing, etc. We added the GAT model to the GraphStorm model zoo. We also added a demo of running GraphStorm in a Jupyter Notebook.

Major features

  • Automatically map inference results (prediction results and generated node embeddings) into Raw Node ID space (#481, #524, #527, #543, #533, #578, #597, #621, #633, #641)
  • Provide a command line to generate GNN embeddings (#478)
  • Provide multiple feature transformations, including categorical feature transformation (#623), Rank-Gauss (#615), numerical bucketing (#583), and Min/Max normalization (#575)

Minor features

  • Support caching BERT embeddings on disk for GNN model fine-tuning. #516
  • Allow customization of GLEM trainable parameter grouping. #506
  • Support using NVIDIA WholeGraph to store edge features. #555
  • Add contrastive loss for link prediction tasks. #619
  • Support in-batch negatives for link prediction tasks. #596
  • Support the NCCL backend for sparse embeddings. #549

New Built-in Models

  • GAT

Breaking changes

We changed the file format and the content of saved node embeddings and saved prediction results of GraphStorm training and inference pipelines. By default, if the task is launched through a command under graphstorm.run.*, GraphStorm automatically saves generated node embeddings and prediction results in Parquet files. For node embeddings, the files contain two columns: column “nid” storing the node IDs in the raw node ID space and column “emb” storing the node embeddings. For node prediction results, the files contain two columns: column “nid” storing the node IDs in the raw node ID space and column “pred” storing the prediction results. For edge prediction results, the files contain three columns: columns “src_nid” and “dst_nid” storing the node IDs of source and destination nodes in the raw node ID space, respectively, and column “pred” storing the prediction results.
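
A minimal sketch of consuming these Parquet outputs with pandas; the file paths are hypothetical, while the column names follow the description above.

    import pandas as pd

    emb = pd.read_parquet("infer_output/paper/emb.parquet")    # columns: "nid", "emb"
    pred = pd.read_parquet("infer_output/paper/pred.parquet")  # columns: "nid", "pred"
    # Edge prediction outputs carry "src_nid", "dst_nid", and "pred" instead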


GraphStorm v0.2 release

02 Oct 18:05

GraphStorm V0.2 release contains a few major feature enhancements. In this release, we have added distributed graph processing support for large-scale graphs. Users may now execute distributed graph processing on Spark clusters, for example on SageMaker with PySpark. We have added multi-task learning support for node classification tasks. GraphStorm now supports even more Huggingface language models (LMs), like BERT, RoBERTa, ALBERT, etc. (see https://github.com/awslabs/graphstorm/blob/v0.2/python/graphstorm/model/lm_model/utils.py#L22 for more details). We have improved GraphStorm model training speed by supporting the NCCL backend, and further sped up node feature fetching during distributed GNN training by collaborating with NVIDIA on WholeGraph support. We have expanded graph model support by distilling a GNN model into a Huggingface DistilBertModel, and added two new models, HGT and GraphSage, to the GraphStorm model zoo. New GraphStorm docs and tutorials are available at https://graphstorm.readthedocs.io for all user groups.

Major features

  • Support multi-task learning for node classification tasks (#410)
  • Enable NCCL backend (#383, #337)
  • Publish GraphStorm doc on https://graphstorm.readthedocs.io.
  • Support using multiple language models available in Huggingface, including BERT, RoBERTa, ALBERT, etc., in graph-aware LM fine-tuning, GNN-LM co-training, and GLEM. (#385)
  • [Experimental] Distributed graph processing support (#435, #427, #419, #408, #407, #400)
  • [Experimental] Support using NVIDIA WholeGraph to speed up node feature fetching during distributed GNN training. (#428, #405)
  • [Pre-View] Support for distilling a GNN model into a Huggingface DistilBertModel. (#443, #463)

New Built-in Models

  • Heterogeneous Graph Transformer (HGT) (#396)
  • GraphSage (#352)
  • [Experimental] GLEM semi-supervised training for node tasks. (#327, #432)

Minor features

  • Support per edge type link prediction metric report (#393)
  • Support per class roc-auc report for multi-label multi-class classification tasks (#397)
  • Support batch norm and layer norm (#384)
  • Enable standalone mode that allows users to run the training/inference scripts without using the launch script (#331)

API breaking changes

  • We changed the filename format of saved embeddings (either learnable embeddings or node embeddings) and model prediction results from <name>.pt to <name>.part<zero-padded-rank>.pt. For example, with 4 trainers, the saved node embeddings are named emb.part00000.pt, emb.part00001.pt, emb.part00002.pt, and emb.part00003.pt.
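
A short sketch of loading the renamed shards back into one tensor, assuming PyTorch and a single output directory:

    import glob
    import torch

    # e.g., emb.part00000.pt ... emb.part00003.pt with 4 trainers
    parts = sorted(glob.glob("save_emb/emb.part*.pt"))
    emb = torch.cat([torch.load(p) for p in parts], dim=0)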


GraphStorm v0.1.2 release

07 Aug 17:09

V0.1.2 is a minor release of GraphStorm. In this release, we add CPU support for GML model training and inference, giving users more flexibility in choosing their environment settings. We add CSV data format support for the graph construction pipeline, giving users more flexibility in choosing their input data format. We add four new dataloaders for link prediction tasks to avoid the slow neighbor sampling execution path in DGL; together with DGL 1.0.4, we get a 2.4X speedup on training a 2-layer RGCN on the MAG dataset on 4 g5.48x instances. We also add two new models, GLEM and Network In Graph Neural Network, to the GraphStorm model zoo.

Major features

  • Add CPU support for GML model training and inference. #300
  • Add CSV data format support for graph construction pipeline. #324
  • Speed up link prediction training. With DGL 1.0.4, we can get a 2.4X speedup on training a 2-layer RGCN on the MAG dataset on 4 g5.48x instances. #279, #302

New Built-in Models

  • GLEM
  • Network In Graph Neural Network

Enhancements

  • Optimize GraphStorm package dependencies #319
  • Allow edge classification/regression inference tasks to work with graphs without test masks. #298
  • Add a MAE evaluation metric for regression tasks. #318
  • Allow passing of SageMaker Estimator arguments during job launch #350


GraphStorm v0.1.1 release

26 Jun 18:43

V0.1.1 is a minor release of GraphStorm. In this release, we add SageMaker support for graph construction, GML model training, and inference to simplify graph ML deployment. We also add multiple feature transformation methods in the graph construction pipeline to simplify graph data preparation. We provide a GraphStorm PyPI package for easy installation.

Major features

  • Add SageMaker support for graph construction, GML model training, and inference.
  • Provide a GraphStorm PyPI package for easy installation.

Enhancements

  • Support custom data split in graph construction. (#41)
  • Support categorical feature transformation in graph construction. (#50)
  • Support Max-Min feature transformation in graph construction. (#299)
  • Support Rank-Gauss feature transformation in graph construction. (#242)
  • Add weighted edge loss in link prediction training. (#63)
  • Support using edge features in edge classification and edge regression tasks. (#153)
  • Add a profiler to help understand the runtime performance. (#206)

API breaking changes

  • We changed the format of the value of --fanout and --eval-fanout. In the v0.1.0 release, GraphStorm expected “relation_type0:fanout0@relation_type1:fanout1,relation_type0:fanout2@relation_type1:fanout3” when one wants to specify the fanout for different edge types. We now change it to “srcntype/relation0/dstntype:fanout0@srcntype/relation1/dstntype:fanout1,srcntype/relation0/dstntype:fanout2@srcntype/relation1/dstntype:fanout3”, so that graphs can have edges with the same relation type but different source or destination node types.
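
For illustration, with hypothetical node and relation types, a two-layer per-edge-type fanout changes as follows:

    # old (v0.1.0) format: relation types only
    --fanout "rates:10@follows:10,rates:5@follows:5"
    # new format: source node type / relation / destination node type
    --fanout "user/rates/movie:10@user/follows/user:10,user/rates/movie:5@user/follows/user:5"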


GraphStorm v0.1 release

23 May 06:54

V0.1 is the first official release of GraphStorm, which provides an end-to-end user experience for graph machine learning (GML), from graph construction using data in Parquet or JSON format, to GML model training and inference. It provides a highly optimized and scalable pipeline for graph construction, model training, and inference. Users can build graph data and train or run inference with a built-in GML model with a single command, without writing any code. GraphStorm can scale GML training and inference to graphs with billions of nodes using multiple GPUs or multiple machines.

Major features

  • A single-machine data loading pipeline that accepts data stored in Parquet, JSON or HDF5 format, to build graph data for GML model training and inference. Users can build graph data with a single command without writing any code.
  • End-to-end training and inference pipelines that support common GML tasks including node classification, node regression, edge classification, edge regression and link prediction.
  • A collection of built-in GML models including RGCN, RGAT and LM-GNN. GraphStorm provides HGT as a custom model example.
  • A large collection of model configurations and training/inference configurations that allow users to tune GML model training without writing any code.
  • Scale GML training and inference on graphs with billions of nodes with multiple GPUs or multiple machines.
  • Support custom GML models. GraphStorm can scale user-defined models to billion-scale graphs with multiple GPUs and multiple machines.
  • Complete tutorials for graph construction, GML model training, and GML model inference.
  • Huggingface BERT-GNN co-training and graph-aware Huggingface BERT fine-tuning. Users can combine GML with language models to either improve the performance of GML tasks or extend the expressiveness of LMs with graph information.
  • AWS native support. We provide a guideline to build GraphStorm Docker images for AWS EC2 (https://github.com/awslabs/graphstorm/tree/main/docker).


GraphStorm v0.0.1 release

21 Mar 17:56

v0.0.1 is the first release of GraphStorm, which includes support for GNN model training in multi-GPU and multi-machine multi-GPU environments.

Major features

  • Support GNN model training in multi-GPU and multi-machine multi-GPU environments.
