Graph Partition#
For users who are already familiar with DGL and know how to construct a DGL graph, GraphStorm provides two graph partition tools that partition DGL graphs into the input format required by the GraphStorm launch tool for training and inference.
partition_graph.py: partitions graphs for node/edge classification and regression tasks.
partition_graph_lp.py: partitions graphs for link prediction tasks.
partition_graph.py arguments#
--dataset: (Required) the graph dataset name defined for the saved DGL graph file.
--filepath: (Required) the file path of the saved DGL graph file.
--target-ntype: the node type to make predictions on, required for node classification/regression tasks. This must be the node type that has labels. GraphStorm currently supports only one prediction node type.
--ntype-task: the task to perform on the target node type. Only classification and regression are supported so far. Default is classification.
--nlabel-field: the field that stores labels on the prediction node type, required if --target-ntype is set. The format is nodetype:labelname, e.g., “paper:label”.
--target-etype: the canonical edge type to make predictions on, required for edge classification/regression tasks. This must be the edge type that has labels. GraphStorm currently supports only one prediction edge type. The format is src_ntype,etype,dst_ntype, e.g., “author,write,paper”.
--etype-task: the task to perform on the target edge type. Only classification and regression are supported so far. Default is classification.
--elabel-field: the field that stores labels on the prediction edge type, required if --target-etype is set. The format is src_ntype,etype,dst_ntype:labelname, e.g., “author,write,paper:label”.
--generate-new-node-split: a boolean value, required if you need the partition script to split nodes into training/validation/test sets. If you set this argument to true, you must also set --target-ntype.
--generate-new-edge-split: a boolean value, required if you need the partition script to split edges into training/validation/test sets. If you set this argument to true, you must also set --target-etype.
--train-pct: a float value (>0. and <1.) with default value 0.8. If you want the partition script to split nodes/edges into training/validation/test sets, set this value to control the percentage of nodes/edges used for training.
--val-pct: a float value (>0. and <1.) with default value 0.1. Set this value to control the percentage of nodes/edges used for validation.
Note
The sum of train-pct and val-pct must be less than 1. The percentage of test nodes/edges is 1 - (train_pct + val_pct).
--add-reverse-edges: if this argument is given, reverse edges are added to the given graph.
--retain-original-features: a boolean value that controls whether to use the original features generated by the dataset, e.g., embeddings of paper abstracts. If set to true, the original features are kept; otherwise, the tokenized text is used so that BERT models can generate embeddings.
--num-parts: (Required) an integer that specifies the number of partitions to split the DGL graph into. Remember this number because it must be set again in the model training step.
--output: (Required) the folder path where the partitioned DGL graph will be saved.
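The arguments above can be assembled programmatically. The following is a minimal Python sketch, not part of GraphStorm, that checks the split constraint from the note and builds a command line for a node classification partition. The helper names, the script location, the dataset name, and all paths are hypothetical. Passing the boolean as the literal string true is also an assumption based on the argument descriptions.

```python
import math
import shlex


def split_fractions(train_pct=0.8, val_pct=0.1):
    """Return (train, val, test) fractions per the rule: test = 1 - (train + val)."""
    for name, value in (("train_pct", train_pct), ("val_pct", val_pct)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be > 0 and < 1, got {value}")
    if train_pct + val_pct >= 1.0:
        raise ValueError("train_pct + val_pct must be less than 1")
    return train_pct, val_pct, 1.0 - (train_pct + val_pct)


def build_node_partition_cmd(dataset, filepath, output, num_parts,
                             target_ntype, label_name,
                             ntype_task="classification",
                             train_pct=0.8, val_pct=0.1,
                             add_reverse_edges=False,
                             script="partition_graph.py"):
    """Assemble an argument list for partition_graph.py (hypothetical helper)."""
    if ntype_task not in ("classification", "regression"):
        raise ValueError("ntype_task must be 'classification' or 'regression'")
    split_fractions(train_pct, val_pct)  # raises if the split is invalid

    cmd = [
        "python", script,  # assumption: script is in the working directory
        "--dataset", dataset,
        "--filepath", filepath,
        "--target-ntype", target_ntype,
        "--ntype-task", ntype_task,
        # Label field format from the docs: nodetype:labelname
        "--nlabel-field", f"{target_ntype}:{label_name}",
        # Assumption: the boolean is passed as the string "true"
        "--generate-new-node-split", "true",
        "--train-pct", str(train_pct),
        "--val-pct", str(val_pct),
        "--num-parts", str(num_parts),
        "--output", output,
    ]
    if add_reverse_edges:
        cmd.append("--add-reverse-edges")  # plain flag, takes no value
    return cmd


# Hypothetical dataset name and paths, for illustration only.
cmd = build_node_partition_cmd(
    dataset="my_graph",
    filepath="/tmp/my_graph",
    output="/tmp/my_graph_2p",
    num_parts=2,
    target_ntype="paper",
    label_name="label",
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run, or the printed line can be pasted into a shell. Validating the split before launching avoids starting a long partition job with percentages the script would reject.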
partition_graph_lp.py arguments#
--dataset: (Required) the graph name defined for the saved DGL graph file.
--filepath: (Required) the file path of the saved DGL graph file.
--target-etypes: (Required) the canonical edge type to make predictions on. GraphStorm supports only one prediction edge type. The format is src_ntype,etype,dst_ntype, e.g., “author,write,paper”.
--train-pct: a float value (>0. and <1.) with default value 0.8. If you want the partition script to split nodes/edges into training/validation/test sets, set this value to control the percentage of nodes/edges used for training.
--val-pct: a float value (>0. and <1.) with default value 0.1. Set this value to control the percentage of nodes/edges used for validation.
Note
The sum of train-pct and val-pct must be less than 1. The percentage of test nodes/edges is 1 - (train_pct + val_pct).
--add-reverse-edges: if this argument is given, reverse edges are added to the given graph.
--train-graph-only: a boolean value that controls whether to partition only the training graph. Default is true.
--retain-original-features: a boolean value that controls whether to use the original features generated by the dataset, e.g., embeddings of paper abstracts. If set to true, the original features are kept; otherwise, the tokenized text is used so that BERT models can generate embeddings.
--retain-etypes: the list of canonical edge types to retain before partitioning the graph. This can help remove noisy edges for a given application. Format example: --retain-etypes query,clicks,asin query,adds,asin query,purchases,asin asin,rev-clicks,query.
--num-parts: (Required) an integer that specifies the number of partitions to split the DGL graph into. Remember this number because it must be set again in the model training step.
--output: (Required) the folder path where the partitioned DGL graph will be saved.
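The canonical edge type format used by --target-etypes and --retain-etypes is easy to get wrong, so a small check before launching can help. The following Python sketch, not part of GraphStorm, validates the src_ntype,etype,dst_ntype format and assembles a link prediction partition command. The helper names, the script location, the dataset name, and the paths are hypothetical.

```python
import shlex


def parse_canonical_etype(text):
    """Split 'src_ntype,etype,dst_ntype' into a 3-tuple, or raise ValueError."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 'src_ntype,etype,dst_ntype', got {text!r}")
    return tuple(parts)


def build_lp_partition_cmd(dataset, filepath, output, num_parts,
                           target_etype, retain_etypes=(),
                           train_pct=0.8, val_pct=0.1,
                           script="partition_graph_lp.py"):
    """Assemble an argument list for partition_graph_lp.py (hypothetical helper)."""
    if train_pct + val_pct >= 1.0:
        raise ValueError("train_pct + val_pct must be less than 1")

    cmd = [
        "python", script,  # assumption: script is in the working directory
        "--dataset", dataset,
        "--filepath", filepath,
        # Re-join after parsing so stray whitespace is normalized away
        "--target-etypes", ",".join(parse_canonical_etype(target_etype)),
        "--train-pct", str(train_pct),
        "--val-pct", str(val_pct),
        "--num-parts", str(num_parts),
        "--output", output,
    ]
    if retain_etypes:
        # One flag followed by space-separated canonical edge types,
        # matching the format example in the argument description.
        cmd.append("--retain-etypes")
        cmd.extend(",".join(parse_canonical_etype(e)) for e in retain_etypes)
    return cmd


# Hypothetical dataset name and paths; edge types taken from the format example.
lp_cmd = build_lp_partition_cmd(
    dataset="my_graph",
    filepath="/tmp/my_graph",
    output="/tmp/my_graph_lp_2p",
    num_parts=2,
    target_etype="query,clicks,asin",
    retain_etypes=["query,clicks,asin", "asin,rev-clicks,query"],
)
print(shlex.join(lp_cmd))
```

Each canonical edge type stays a single comma-joined token, so the space-separated list after --retain-etypes remains unambiguous.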