
Do you consider model accuracy issues, and how can we get the whole training time of the AI training process? #70

Open
horser1 opened this issue Jan 13, 2025 · 2 comments

Comments

@horser1

horser1 commented Jan 13, 2025

Hello, thanks for your excellent work on large-scale AI training simulation. I'm curious whether you consider model accuracy issues: do the parameters you provide for modification have an impact on model accuracy?
You say it can "Evaluate the time consumption of AI tasks", but as far as I know, Astra-sim can only get the time of a single batch, not the whole training process (which also depends on the number of epochs, and so on). So I'm also curious how you think about this problem.

@Huoyuan100861
Collaborator


  1. The model's performance and accuracy can vary with different parameters. For instance, if you split your model's parallelism to an extreme extent, it could lead to very small matrix multiplication dimensions, significantly reducing computational efficiency and causing high fluctuations in training time.
  2. Isn't the entire training process just a series of N global batch iterations? N is determined by the size of your training dataset and can be calculated through simple arithmetic. We welcome you to contribute a pull request to SimAI to improve this aspect.
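
For reference, a minimal sketch of that arithmetic in Python. The dataset size and global batch size below are made-up illustrative numbers, not SimAI inputs or defaults:

```python
import math

# Illustrative numbers only -- not SimAI parameters or defaults.
dataset_size = 1_000_000        # training samples in the dataset
global_batch_size = 2048        # samples consumed per global-batch iteration

# N: global-batch iterations needed to pass over the dataset once (one epoch)
iterations_per_epoch = math.ceil(dataset_size / global_batch_size)
print(iterations_per_epoch)     # 489
```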

@horser1
Author

horser1 commented Jan 15, 2025

Thanks for your answers. Yes, the total time can indeed be obtained by a simple calculation from the time of a single batch:

`whole_time = single_batch_time * N * num_epochs`

But some parameters may affect the convergence speed, leading to a higher number of epochs. After modifying some parameters, the number of epochs may change, so how do you decide the number of epochs?
Besides, I wonder whether you have considered the effect of modifying the parameters on the accuracy of the model. If modifying the parameters brings an increase in E2E performance but results in much worse model accuracy, then that optimization seems pointless.
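
As a rough worked example of that formula (all numbers below are illustrative assumptions; `single_batch_time` would come from the simulator, and `num_epochs` depends on convergence, which is exactly the open question here):

```python
single_batch_time = 2.5     # seconds per global batch, e.g. reported by the simulator (illustrative)
N = 489                     # global-batch iterations per epoch (from the dataset / batch-size arithmetic above)
num_epochs = 3              # assumed fixed here; in practice it depends on convergence

whole_time = single_batch_time * N * num_epochs   # seconds
print(f"{whole_time / 3600:.2f} hours")           # ~1.02 hours
```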
