
AllGather Message Size vs Workload TXT File Msg Size #74

Open

vamsiDT opened this issue Jan 15, 2025 · 4 comments

vamsiDT commented Jan 15, 2025

Hi,

The workload file specifies an AllGather communication whose message size is computed from the total params of the model. I assume this is the size of the data received at each node after gathering?

The simulator reads this message size and generates an AllGather using it as the initial data size. The final data received at each node then becomes the number of ranks multiplied by the initial message size specified in the TXT workload.

Is this intended? I am wondering whether the AllGather size specified in the TXT file needs to be divided by the number of ranks (world_size).
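For concreteness, a minimal sketch of the two interpretations I have in mind (function names are mine, not from SimAI or aicb; sizes in bytes):

```python
# Two possible meanings of the AllGather "Size" field in the TXT workload.

def gathered_from_per_rank(per_rank_size: int, world_size: int) -> int:
    """Size is each rank's input chunk: the gathered result grows to
    Size * world_size. This appears to be what the simulator currently does."""
    return per_rank_size * world_size

def per_rank_from_global(global_size: int, world_size: int) -> int:
    """Size is the total gathered data: each rank would contribute
    Size / world_size."""
    return global_size // world_size
```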

Please let me know.

--
Vamsi

@vamsiDT
Copy link
Author

vamsiDT commented Jan 15, 2025

For example, in the following workload:

The initial grad_gather specifies a size of nearly 1.6 GB. There are 256 GPUs in the topology. Based on how the simulator reads this, the AllGather generates 1.6 GB × 256 of data at each node (with ring, direct, or halvingDoubling). Summed over all nodes, the simulator generates 1.6 GB × 256 × 256 in total (see the quick arithmetic after the listing below).

This single operation completely dominates the training time, and the rest of the operations finish quickly by comparison.

It makes me wonder whether AllGather sizes need to be divided by the number of ranks in the TXT workload generator.

Or is this how it really works? 😯

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 vpp: 36 ga: 32 all_gpus: 256 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0 
2350                                                                                                                                                   
grad_gather     -1      1       NONE    0       1       NONE    0       1       ALLGATHER       1649410048      100                                    
grad_param_comm -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER   3298820096      100                                    
grad_param_compute      -1      1       NONE    0       16011264        NONE    0       1       NONE    0       100                                    
layernorm       -1      1       NONE    0       1       ALLREDUCE       1649410048      1       NONE    0       100                                    
embedding_grads -1      1       NONE    0       1       ALLREDUCE       33554432        1       NONE    0       100                                    
moe_grad_norm1  -1      1       NONE    0       1       NONE    0       1       ALLGATHER_DP_EP 0       100                                            
moe_grad_norm2  -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER_DP_EP     0       100                                    
embedding_layer -1      107004929       ALLREDUCE       33554432        1       NONE    0       8005632 NONE    0       100                            
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
mlp_layer       -1      1013846 ALLREDUCE       33554432        1013846 NONE    0       1013846 NONE    0       100                                    
attention_layer -1      968236  ALLREDUCE       33554432        968236  NONE    0       968236  NONE    0       100                                    
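
The quick arithmetic behind the numbers above (a throwaway sketch; values taken from the grad_gather line and the all_gpus header):

```python
# Back-of-the-envelope check of the grad_gather blow-up described above.
size_bytes = 1_649_410_048                 # grad_gather Size from the TXT workload
world_size = 256                           # all_gpus in the workload header

per_node_final = size_bytes * world_size   # data gathered at each node
total_all_nodes = per_node_final * world_size

print(f"TXT entry:        {size_bytes / 2**30:.2f} GiB")        # ~1.54 GiB
print(f"Final per node:   {per_node_final / 2**30:.0f} GiB")    # ~393 GiB
print(f"Across all nodes: {total_all_nodes / 2**40:.1f} TiB")   # ~98 TiB
```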


Huoyuan100861 (Collaborator) commented

In SimAI's workload format [Collective Comm] [Size], the Size refers to the global (all-rank) data volume. For instance, in an AllGather operation, the data per rank would be Size divided by the number of ranks. This aligns with the convention used in framework discussions about the communication volume of a collective operation. However, in the NCCL interface, the specified count refers to the amount for a single rank, so you need to be aware of this distinction.
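
A small sketch of that distinction, assuming 4-byte elements (the helper name and element size are my own, not part of SimAI):

```python
# Convert a SimAI workload Size (global bytes) into the per-rank count that
# an NCCL-style AllGather call expects (ncclAllGather's sendcount is per rank).

def nccl_sendcount_from_workload_size(workload_size_bytes: int,
                                      world_size: int,
                                      elem_bytes: int = 4) -> int:
    per_rank_bytes = workload_size_bytes // world_size   # global -> per-rank
    return per_rank_bytes // elem_bytes                  # bytes  -> element count

# e.g. grad_gather above: 1_649_410_048 bytes over 256 ranks
# -> 6_443_008 bytes per rank -> 1_610_752 float32 elements per rank
```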

vamsiDT (Author) commented Jan 16, 2025

Thanks for the response.

Is this correct then?

The workload generator specifies the AllGather size based on the total model params: https://github.com/aliyun/aicb/blob/d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b/workload_generator/AIOB_simAI_workload_generator.py#L571C1-L572C1

SimAI's Workload class reads this value directly, and NcclTreeFlowModel, for example, uses it as the initial size for the AllGather:

this->final_data_size = data_size * nodes_in_ring;

So either aicb or the simulator's AllGather input handling might need a correction, right? Unless the training process actually requires every GPU to send out the entire model params; I am not sure how that works, given the parallelism strategy.
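
Restated in code (a sketch; the division branch is the hypothetical correction I'm asking about, not existing SimAI or aicb behavior):

```python
# Python restatement of the C++ accounting in NcclTreeFlowModel.
data_size = 1_649_410_048          # read verbatim from the TXT workload
nodes_in_ring = 256                # ring size = world_size here

# Current behavior: TXT value treated as each rank's input chunk.
final_data_size = data_size * nodes_in_ring        # ~393 GiB gathered per node

# Hypothetical correction: TXT value treated as the total gathered size,
# so the generator (or the simulator) divides by world_size first.
chunk = data_size // nodes_in_ring
final_if_divided = chunk * nodes_in_ring           # back to ~1.5 GiB per node
```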

Any clarification would be very helpful to understand. Thanks!

azharlightelligence commented Feb 10, 2025

In my opinion, the payload size for each layer, e.g. for ALLREDUCE, is the buffer payload P that a GPU holds. It is divided by the TP group size (let's assume TP = 4, say in FWD), becoming P/4 for each GPU to send to one neighbor GPU and receive from another neighbor GPU, so that the final data size at each GPU is updated but still equals P. And this is correct: this->final_data_size = data_size * nodes_in_ring; because data_size is P/4 and nodes_in_ring is 4, the final data_size will be P. It is only data_size that is used for communication.
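
Numerically (a throwaway sketch with a made-up P; TP = 4 as assumed above):

```python
# Check: if data_size is the per-step chunk P/4, the ring brings each GPU back to P.
P = 1_649_410_048            # per-GPU buffer payload, bytes (any value works)
nodes_in_ring = 4            # TP group size

data_size = P // nodes_in_ring                    # P/4 communicated per step
final_data_size = data_size * nodes_in_ring       # the C++ line quoted above
assert final_data_size == P                       # final size per GPU equals P
```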
