The GraphStorm V0.3 release contains several major feature enhancements. In this release, we have introduced support for multi-task learning, allowing users to define multiple training targets on different nodes and edges within a single training loop. The supported training supervisions for multi-task learning include node classification/regression, edge classification/regression, link prediction, and node feature reconstruction. Users can specify the training targets through the YAML configuration file. We have refactored the implementations of DataLoader and Dataset to decouple Dataset from DataLoader, simplifying the customization of both, and simplified the APIs of DataLoader, Dataset, and Evaluator. We now support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. We added the GATv2 model to the GraphStorm model zoo, and added Jupyter Notebook demos of running node classification and link prediction with custom GNN models.
Major features
- Support graph multi-task learning, which enables users to define multiple training targets, including node classification/regression, edge classification/regression, link prediction, and node feature reconstruction, in a single training loop. #804 #813 #825 #828 #842 #837 #834 #843 #852 #855 #860 #863 #871 #861
- Refactor the implementations of DataLoader and Dataset to decouple Dataset from DataLoader and simplify the APIs of DataLoader, Dataset and Evaluator. #795 #820 #821 #822
- Support re-applying saved feature transformation rules to new data in the distributed graph processing pipeline. #857 #870
New Examples
- Add a Jupyter Notebook example for node classification using a custom GNN model #830
- Add a Jupyter Notebook example for link prediction using a custom GNN model #846
- Add link prediction support in GPEFT example #760
- Add GraphStorm benchmarks using the MAG and Amazon Review datasets. #765 #818
Minor features
- Allow re-partitioning to run on the Spark leader, removing the need for a follow-up re-partition job. #767
- Add support for custom graph splits, allowing users to define their own train/validation/test sets. #761
- Allow custom out_dtype for numerical feature transformations in GSProcessing. #739
New Built-in Models
- GATv2 #771
Breaking changes
GraphStorm API changes
Simplify graphstorm.initialize() by giving default values to arguments such as ip_config, backend, and local_rank. (#781 #783) A usage sketch follows the list below.
- The initialize() method now provides default values: ip_config=None, backend='gloo', local_rank=0.
- gsf.py now sets a default device based on local_rank, so other classes can call get_device() directly.
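A minimal sketch of the simplified entry point (assuming a single-machine run; the module location of get_device() is an assumption, check the API reference):

```python
import graphstorm as gs
from graphstorm.utils import get_device  # assumed location of get_device()

# With the new defaults (ip_config=None, backend='gloo', local_rank=0),
# a single-machine run can initialize GraphStorm with no arguments.
gs.initialize()

# The default device is derived from local_rank, so downstream code can
# query it directly instead of constructing the device itself.
device = get_device()
```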
Refactor evaluators with a new base class and several task-specific interfaces. (#803 #807 #822) A usage sketch follows the list below.
- Deprecate GSgnnInstanceEvaluator and GSgnnAccEvaluator, replacing them with GSgnnBaseEvaluator and GSgnnClassificationEvaluator. Refactor GSgnnRegressionEvaluator, GSgnnMrrLPEvaluator, and GSgnnPerEtypeLPEvaluator accordingly.
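A minimal sketch of the refactored classification evaluator (the eval_frequency value is a placeholder; see the API reference for the full argument list):

```python
from graphstorm.eval import GSgnnClassificationEvaluator

# GSgnnClassificationEvaluator replaces the deprecated GSgnnAccEvaluator.
# eval_frequency controls how often evaluation runs during training; the
# value below is only a placeholder.
evaluator = GSgnnClassificationEvaluator(eval_frequency=100)

# The evaluator is then attached to a trainer, e.g.:
#   trainer.setup_evaluator(evaluator)
```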
Unify the different GraphStorm data classes (GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, GSgnnEdgeInferData) into a single GSgnnData class with one set of constructor arguments, and deprecate GSgnnNodeTrainData, GSgnnNodeInferData, GSgnnEdgeTrainData, and GSgnnEdgeInferData. GSgnnData now only provides interfaces for accessing graph data, e.g., node features, edge features, labels, train masks, etc. (#795 #820 #821) A usage sketch follows this list.
- Update the init arguments of GSgnnData from (graph_name, part_config, node_feat_field, edge_feat_field, decoder_edge_feat, lm_feat_ntypes, lm_feat_etypes) to (part_config, node_feat_field, edge_feat_field, lm_feat_ntypes, lm_feat_etypes).
- Add property functions:
- graph_name(), returns the graph name (a string) stored in the config JSON.
- Add new functions:
- get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), given the node ids (input_nodes) and node features to retrieve from the graph data (nfeat_fields), return the corresponding node features.
- get_edge_feats(self, input_edges, efeat_fields, device='cpu'), given the edge ids (input_edges) and edge features to retrieve from the graph data (efeat_fields), return the corresponding edge features.
- get_node_train_set(self, ntypes, mask), return the node training set.
- get_node_val_set(self, ntypes, mask), return the node validation set.
- get_node_test_set(self, ntypes, mask), return the node test set.
- get_node_infer_set(self, ntypes, mask), return the node inference set.
- get_edge_train_set(self, etypes, mask, reverse_edge_types_map), return the edge training set.
- get_edge_val_set(self, etypes, mask, reverse_edge_types_map), return the edge validation set.
- get_edge_test_set(self, etypes, mask, reverse_edge_types_map), return the edge test set.
- get_edge_infer_set(self, etypes, mask, reverse_edge_types_map), return the edge inference set.
- Update some functions:
- get_node_feats(self, input_nodes, nfeat_fields, device='cpu'), requires the caller to provide the node feature fields to access
- get_edge_feats(self, input_edges, efeat_fields, device='cpu'), requires the caller to provide the edge feature fields to access
- The get_labels function is replaced by get_node_feats and get_edge_feats, called with the label field names.
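A minimal sketch of the unified GSgnnData access pattern (the partition config path, node type, and field names are placeholders):

```python
from graphstorm.dataloading import GSgnnData

# One GSgnnData replaces GSgnnNodeTrainData/GSgnnNodeInferData/
# GSgnnEdgeTrainData/GSgnnEdgeInferData; the path below is a placeholder.
gdata = GSgnnData(part_config="/path/to/graph/part.json")

# Training indices for a node type come from the new accessor functions.
train_idxs = gdata.get_node_train_set(ntypes=["paper"])

# Feature access now requires explicit field names; labels are fetched with
# get_node_feats() using the label field instead of the removed get_labels().
feats = gdata.get_node_feats(train_idxs, nfeat_fields={"paper": ["feat"]})
labels = gdata.get_node_feats(train_idxs, nfeat_fields={"paper": ["label"]})
```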
Refactor all dataloader classes, adding new constructor arguments. (#795 #820 #821) A construction sketch follows this list.
- GSgnnNodeDataLoaderBase and its subclasses require three new init arguments:
- label_field: Label field of the node task.
- node_feats: Node feature fields used by the node task.
- edge_feats: Edge feature fields used by the node task.
- GSgnnEdgeDataLoader and its subclasses require four new init arguments:
- label_field: Label field of the edge task.
- node_feats: Node feature fields used by the edge task.
- edge_feats: Edge feature fields used by the edge task.
- decoder_edge_feats: Edge feature fields used in the edge task decoder.
- GSgnnLinkPredictionDataLoader and its subclasses require three new init arguments:
- node_feats: Node feature fields used by the link prediction task.
- edge_feats: Edge feature fields used by the link prediction task.
- pos_graph_edge_feats: The field of the edge features used by positive graph in link prediction.
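A minimal sketch of constructing a node-task dataloader with the new arguments (fanout, batch size, and field names are placeholders; see the API reference for the remaining arguments):

```python
from graphstorm.dataloading import GSgnnData, GSgnnNodeDataLoader

# Placeholder partition config, node type, and field names.
gdata = GSgnnData(part_config="/path/to/graph/part.json")
train_idxs = gdata.get_node_train_set(ntypes=["paper"])

# label_field, node_feats, and edge_feats are the new init arguments;
# edge features are omitted here.
dataloader = GSgnnNodeDataLoader(
    dataset=gdata,
    target_idx=train_idxs,
    fanout=[15, 10],
    batch_size=1024,
    label_field="label",
    node_feats={"paper": ["feat"]},
    train_task=True,
)
```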
GraphStorm GSProcessing updates
GSProcessing now supports re-applying saved feature transformation rules on new data. GSProcessing will now create a new file precomputed_transformations.json in the output location. Users can copy that file to the top-level path of new input data (at the same level as the input configuration JSON) and GSProcessing will use the existing transformations for the same features. This way, a model that has been trained on previous data can continue working even if new values appear in the new data. In this release, we only support re-applying categorical transformations.
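For example, with a hypothetical local layout, re-using the saved transformations only requires copying the file next to the new input configuration JSON:

```python
import shutil

# Hypothetical paths; on S3 the same copy can be done with the AWS CLI.
saved_transforms = "/data/gsprocessing-output/precomputed_transformations.json"
new_input_dir = "/data/new-input"  # directory holding the input configuration JSON

# GSProcessing picks up the file from the top level of the new input data and
# re-applies the existing (categorical) transformations to matching features.
shutil.copy(saved_transforms, new_input_dir)
```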
Contributors
- Da Zheng from AWS
- Xiang Song from AWS
- Jian Zhang from AWS
- Theodore Vasiloudis from AWS
- Runjie Ma from AWS
- Qi Zhu from AWS
Special thanks to the DGL project and the WholeGraph project for supporting the GraphStorm 0.3 release.