Hello, thanks for your excellent work on large-scale AI training simulation. I'm curious whether you have considered model accuracy: do the parameters you expose for modification have an impact on model accuracy?
You said it can "Evaluate the time consumption of AI tasks", but as far as I know, Astra-sim can only measure the time of a single batch, not the whole training process (since that depends on the number of epochs, and so on). So I'm also curious how you think about this problem.
The model's performance and accuracy can vary with different parameters. For instance, if you split your model's parallelism to an extreme extent, it could lead to very small matrix multiplication dimensions, significantly reducing computational efficiency and causing high fluctuations in training time.
Isn't the entire training process just a series of N global batch iterations? N is determined by the size of your training dataset and can be calculated through simple arithmetic. We welcome you to contribute a pull request to SimAI to improve this aspect.
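That extrapolation can be sketched in a few lines. This is a minimal illustrative example, not SimAI's actual API — all names here are hypothetical:

```python
import math

def total_training_time(single_batch_time_s: float,
                        dataset_size: int,
                        global_batch_size: int,
                        num_epochs: int) -> float:
    """Extrapolate end-to-end training time from one simulated iteration.

    N = iterations per epoch, derived from the dataset size; a final
    partial batch still counts as one iteration (hence the ceil).
    """
    iters_per_epoch = math.ceil(dataset_size / global_batch_size)
    return single_batch_time_s * iters_per_epoch * num_epochs

# Example: 2 s per iteration, 1M samples, global batch 2048, 3 epochs
print(total_training_time(2.0, 1_000_000, 2048, 3))  # 2934.0 seconds
```

The one input the simulator cannot give you is `num_epochs`, which is a training-recipe choice rather than a systems quantity — which is exactly the follow-up question below.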
Thanks for your answers. Yes, the total time can indeed be obtained by a simple calculation from the time of a single batch: whole_time = single_batch_time * N * number_of_epochs.
But some parameters may affect the convergence speed, requiring a higher number of epochs. After modifying some parameters, the number of epochs may change — so how do you decide the number of epochs?
Besides, I wonder whether you have considered the effect of modifying the parameters on the accuracy of the model. If modifying a parameter brings an increase in E2E performance but results in much worse model accuracy, then that optimization seems pointless.