AllGather Message Size vs Workload TXT File Msg Size #74
For example, in the following workload, the initial operation alone dominates the training time, and the rest of the operations finish quickly. It makes me wonder whether AllGather sizes in the TXT workload generator need to be divided by the number of ranks, or is this how it really works? 😯
In SimAI's workload format [Collective Comm] [Size], the Size refers to the global data volume across all ranks. For instance, in an AllGather operation, the data per rank would be Size divided by the number of ranks. This aligns with the convention used in framework discussions about the volume of communication in a collective operation. However, in the NCCL interface, the count you specify refers to the amount for a single rank, so you need to be aware of this distinction.
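A minimal sketch of that convention (the function name and numbers are mine, not SimAI's or NCCL's API):

```python
# Illustrative only: converts the [Size] field of an ALLGATHER line in the
# workload file (global volume across all ranks) into the per-rank count that
# an NCCL-style call would expect. Names and numbers are assumptions.
def per_rank_allgather_bytes(workload_size_bytes: int, num_ranks: int) -> int:
    return workload_size_bytes // num_ranks

# Example: a 1 GiB ALLGATHER entry with 8 ranks means each rank contributes 128 MiB.
print(per_rank_allgather_bytes(1 << 30, 8))  # 134217728
```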
Thanks for the response. Is this correct, then? The workload generator specifies the AllGather size based on the total model params: https://github.com/aliyun/aicb/blob/d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b/workload_generator/AIOB_simAI_workload_generator.py#L571C1-L572C1 SimAI's Workload class reads this directly, and, for example, NcclTreeFlowModel uses it as the initial size for AllGather:
Either Aicb or the AllGather input handling in the simulator might need a correction, doesn't it? Unless the training process actually requires every GPU to send out the entire model params. I am not sure how it works given the parallelism strategy. Any clarification would be very helpful. Thanks!
In my opinion, the payload size for each layer, e.g., for ALLREDUCE, is the buffer payload P that a GPU holds. It is further divided by the TP group size (let's assume 4, say in the forward pass), so each GPU sends P/4 to one neighbor GPU and receives P/4 from the other neighbor GPU, such that the final data size for each GPU ends up being P again. And this is correct: this->final_data_size = data_size * nodes_in_ring; because data_size is P/4 and nodes_in_ring is 4, the final data size will be P. It is only data_size that is used for communication. A worked version of this arithmetic is sketched below.
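A minimal sketch of that arithmetic with hypothetical numbers (P and the group size are assumptions, not values taken from the simulator):

```python
# Hypothetical ring sizes; only the arithmetic from the comment above is illustrated.
P = 1024                 # assumed total payload held per GPU, in MB
nodes_in_ring = 4        # assumed TP group size from the example

data_size = P / nodes_in_ring                 # chunk each GPU sends/receives per step
final_data_size = data_size * nodes_in_ring   # mirrors this->final_data_size
assert final_data_size == P                   # 256 MB * 4 = 1024 MB
print(data_size, final_data_size)             # 256.0 1024.0
```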
Hi,
The workload file specifies an AllGather communication with a message size computed based on the total params of the model. I am assuming this is the size of the data received at each node after gathering?
The simulator reads this message size and generates the AllGather using it as the initial data size. The final data received at each node then becomes the number of ranks multiplied by the initial message size specified in the TXT workload.
Is this intended? I am wondering whether the AllGather size specified in the TXT file needs a division by the number of ranks (world_size); a small sketch of what I mean follows.
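A minimal sketch of the two readings I am asking about (the variable names and sizes are mine, not from the simulator or the workload generator):

```python
# Hypothetical numbers to illustrate the question; not actual simulator code.
total_param_bytes = 4_000_000_000   # assumed total model params, in bytes
world_size = 8

# Reading 1: the TXT size is used directly as the initial per-rank size, so the
# final gathered data per node becomes world_size times the model size.
final_if_txt_is_initial = total_param_bytes * world_size   # 32 GB per node

# Reading 2: the TXT size is meant to be the gathered total, so it should be
# divided by world_size before being used as the initial size.
initial_if_divided = total_param_bytes // world_size
final_if_divided = initial_if_divided * world_size          # back to 4 GB per node
print(final_if_txt_is_initial, final_if_divided)
```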
Please let me know.
--
Vamsi