The GraphStorm V0.3 release contains several major feature enhancements. In this release, we have introduced support for multi-task learning, allowing users to define multiple training targets on different nodes and edges within a single training loop. The supported training supervisions for multi-task learning include node classification/regression, edge classification/regression, link prediction, and node feature reconstruction. Users can specify the training targets through the YAML configuration file. We have refactored the implementations of DataLoader and Dataset to decouple Dataset from DataLoader, simplifying the customization of both, and simplified the APIs of DataLoader, Dataset, and Evaluator. We now support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. We added the GATv2 model to the GraphStorm model zoo, and added Jupyter Notebook demos of running node classification and link prediction with custom GNN models.
Major features
- Support graph multi-task learning, which enables users to define multiple training targets, including node classification/regression, edge classification/regression, link prediction, and node feature reconstruction, in a single training loop. #804 #813 #825 #828 #842 #837 #834 #843 #852 #855 #860 #863 #871 #861
- Refactor the implementations of DataLoader and Dataset to decouple Dataset from DataLoader and simplify the APIs of DataLoader, Dataset and Evaluator. #795 #820 #821 #822
- Support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. #857 #870
New Examples
- Add a Jupyter Notebook example for node classification using a custom GNN model #830
- Add a Jupyter Notebook example for link prediction using a custom GNN model #846
- Add link prediction support in GPEFT example #760
- Add GraphStorm benchmarks using the MAG and Amazon Review datasets. #765 #818
Minor features
- Allow re-partitioning to run on the Spark leader, removing the need for a follow-up re-partition job. #767
- Add support for custom graph splits, allowing users to define their own train/validation/test sets. #761
- Allow custom out_dtype for numerical feature transformations in GSProcessing. #739
New Built-in Models
- GATv2 #771
Breaking changes
GraphStorm API changes
Simplify graphstorm.initialize() by giving default values to arguments such as ip_config, backend, and local_rank. (#781 #783) A usage sketch follows the list below.
- The initialize() method now provides default values: ip_config=None, backend='gloo', local_rank=0.
- gsf.py now sets a default device based on local_rank, so other classes can call get_device() directly.
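A minimal sketch of the simplified entry point (assuming a single-machine run; the module location of get_device() is an assumption, check the API reference):

```python
import graphstorm as gs
from graphstorm.utils import get_device  # assumed location of get_device()

# With the new defaults (ip_config=None, backend='gloo', local_rank=0),
# a single-machine run can initialize GraphStorm with no arguments.
gs.initialize()

# The default device is derived from local_rank, so downstream code can
# query it directly instead of constructing the device itself.
device = get_device()
```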
Refactor evaluators with a new base class and several task-specific interfaces. (#803 #807 #822) A usage sketch follows the list below.
- Deprecate GSgnnInstanceEvaluator and GSgnnAccEvaluator, replacing them with GSgnnBaseEvaluator and GSgnnClassificationEvaluator. Refactor GSgnnRegressionEvaluator, GSgnnMrrLPEvaluator, and GSgnnPerEtypeLPEvaluator accordingly.
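A minimal sketch of the refactored classification evaluator (the eval_frequency value is a placeholder; see the API reference for the full argument list):

```python
from graphstorm.eval import GSgnnClassificationEvaluator

# GSgnnClassificationEvaluator replaces the deprecated GSgnnAccEvaluator.
# eval_frequency controls how often evaluation runs during training; the
# value below is only a placeholder.
evaluator = GSgnnClassificationEvaluator(eval_frequency=100)

# The evaluator is then attached to a trainer, e.g.:
#   trainer.setup_evaluator(evaluator)
```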
Unify the different GraphStorm data classes (GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, GSgnnEdgeInferData) into a single GSgnnData class with one set of constructor arguments, and deprecate GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, and GSgnnEdgeInferData. GSgnnData now only provides interfaces for accessing graph data, e.g., node features, edge features, labels, train masks, etc. (#795 #820 #821) A usage sketch follows this list.
- Update the init arguments of GSgnnData from (graph_name, part_config, node_feat_field, edge_feat_field, decoder_edge_feat, lm_feat_ntypes, lm_feat_etypes) to (part_config, node_feat_field, edge_feat_field, lm_feat_ntypes, lm_feat_etypes).
- Add property functions:
- graph_name(), returns the graph name (a string) stored in the config JSON.
- Add new functions:
- get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), given the node ids (input_nodes) and node features to retrieve from the graph data (nfeat_fields), return the corresponding node features.
- get_edge_feats(self, input_edges, efeat_fields, device='cpu'), given the edge ids (input_edges) and edge features to retrieve from the graph data (efeat_fields), return the corresponding edge features.
- get_node_train_set(self, ntypes, mask), return the node training set.
- get_node_val_set(self, ntypes, mask), return the node validation set.
- get_node_test_set(self, ntypes, mask), return the node test set.
- get_node_infer_set(self, ntypes, mask), return the node inference set.
- get_edge_train_set(self, etypes, mask, reverse_edge_types_map), return the edge training set.
- get_edge_val_set(self, etypes, mask, reverse_edge_types_map), return the edge validation set.
- get_edge_test_set(self, etypes, mask, reverse_edge_types_map), return the edge test set.
- get_edge_infer_set(self, etypes, mask, reverse_edge_types_map), return the edge inference set.
- Update some functions:
- get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), requires the caller to provide the node feature fields to access
- get_edge_feats(self, input_edges, efeat_fields, device='cpu'), requires the caller to provide the edge feature fields to access
- The get_labels function is replaced by get_node_feats and get_edge_feats, called with the label field names.
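A minimal sketch of the unified GSgnnData access pattern (the partition config path, node type, and field names are placeholders):

```python
from graphstorm.dataloading import GSgnnData

# One GSgnnData replaces GSgnnNodeTrainData/GSgnnNodeInferData/
# GSgnnEdgeTrainData/GSgnnEdgeInferData; the path below is a placeholder.
gdata = GSgnnData(part_config="/path/to/graph/part.json")

# Training indices for a node type come from the new accessor functions.
train_idxs = gdata.get_node_train_set(ntypes=["paper"])

# Feature access now requires explicit field names; labels are fetched with
# get_node_feats() using the label field instead of the removed get_labels().
feats = gdata.get_node_feats(train_idxs, nfeat_fields={"paper": ["feat"]})
labels = gdata.get_node_feats(train_idxs, nfeat_fields={"paper": ["label"]})
```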
Refactor all dataloader classes, adding new constructor arguments. (#795 #820 #821) A construction sketch follows this list.
- GSgnnNodeDataLoaderBase and its subclasses require three new init arguments:
- label_field: Label field of the node task.
- node_feats: Node feature fields used by the node task.
- edge_feats: Edge feature fields used by the node task.
- GSgnnEdgeDataLoader and its subclasses require four new init arguments:
- label_field: Label field of the edge task.
- node_feats: Node feature fields used by the edge task.
- edge_feats: Edge feature fields used by the edge task.
- decoder_edge_feats: Edge feature fields used in the edge task decoder.
- GSgnnLinkPredictionDataLoader and its subclasses require three new init arguments:
- node_feats: Node feature fields used by the link prediction task.
- edge_feats: Edge feature fields used by the link prediction task.
- pos_graph_edge_feats: The field of the edge features used by positive graph in link prediction.
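A minimal sketch of constructing a node-task dataloader with the new arguments (fanout, batch size, and field names are placeholders; see the API reference for the remaining arguments):

```python
from graphstorm.dataloading import GSgnnData, GSgnnNodeDataLoader

# Placeholder partition config, node type, and field names.
gdata = GSgnnData(part_config="/path/to/graph/part.json")
train_idxs = gdata.get_node_train_set(ntypes=["paper"])

# label_field, node_feats, and edge_feats are the new init arguments;
# edge features are omitted here.
dataloader = GSgnnNodeDataLoader(
    dataset=gdata,
    target_idx=train_idxs,
    fanout=[15, 10],
    batch_size=1024,
    label_field="label",
    node_feats={"paper": ["feat"]},
    train_task=True,
)
```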
GraphStorm GSProcessing updates
GSProcessing now supports re-applying saved feature transformation rules on new data. GSProcessing will now create a new file precomputed_transformations.json in the output location. Users can copy that file to the top-level path of new input data (at the same level as the input configuration JSON) and GSProcessing will use the existing transformations for the same features. This way, a model that has been trained on previous data can continue working even if new values appear in the new data. In this release, we only support re-applying categorical transformations.
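For example, with a hypothetical local layout, re-using the saved transformations only requires copying the file next to the new input configuration JSON:

```python
import shutil

# Hypothetical paths; on S3 the same copy can be done with the AWS CLI.
saved_transforms = "/data/gsprocessing-output/precomputed_transformations.json"
new_input_dir = "/data/new-input"  # directory holding the input configuration JSON

# GSProcessing picks up the file from the top level of the new input data and
# re-applies the existing (categorical) transformations to matching features.
shutil.copy(saved_transforms, new_input_dir)
```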
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Qi Zhu from AWS
Special thanks to the DGL project and the WholeGraph project for supporting the GraphStorm 0.3 release.