
Do you consider model accuracy issues, and how can we get the whole training time of the AI training process? #70

Open
horser1 opened this issue Jan 13, 2025 · 2 comments

Comments

@horser1

horser1 commented Jan 13, 2025

Hello, thanks for your excellent work on large-scale AI training simulation. I'm curious whether you consider model accuracy issues: do the parameters you provide for modification have an impact on model accuracy?
You say it can "Evaluate the time consumption of AI tasks", but as far as I know, Astra-sim can only get the time of a single batch, not the whole training process (which also depends on the number of epochs, and so on). So I'm also curious how you think about this problem.

@Huoyuan100861
Collaborator


  1. The model's performance and accuracy can vary with different parameters. For instance, if you split your model's parallelism to an extreme extent, it could lead to very small matrix multiplication dimensions, significantly reducing computational efficiency and causing high fluctuations in training time.
  2. Isn't the entire training process just a series of N global batch iterations? N is determined by the size of your training dataset and can be calculated through simple arithmetic. We welcome you to contribute a pull request to SimAI to improve this aspect.
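
For reference, a minimal sketch of that arithmetic in Python. The dataset size and global batch size below are made-up illustrative numbers, not SimAI inputs or defaults:

```python
import math

# Illustrative numbers only -- not SimAI parameters or defaults.
dataset_size = 1_000_000        # training samples in the dataset
global_batch_size = 2048        # samples consumed per global-batch iteration

# N: global-batch iterations needed to pass over the dataset once (one epoch)
iterations_per_epoch = math.ceil(dataset_size / global_batch_size)
print(iterations_per_epoch)     # 489
```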

@horser1
Author

horser1 commented Jan 15, 2025

Thanks for your answers. Yes, the total time can indeed be obtained by a simple calculation from the time of a single batch:

`whole_time = single_batch_time * N * num_epochs`

But some parameters may affect the convergence speed, leading to a higher number of epochs. After modifying some parameters, the number of epochs may change, so how do you decide the number of epochs?
Besides, I wonder whether you have considered the effect of modifying the parameters on the accuracy of the model. If modifying the parameters brings an increase in E2E performance but results in much worse model accuracy, then that optimization seems pointless.
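
As a rough worked example of that formula (all numbers below are illustrative assumptions; `single_batch_time` would come from the simulator, and `num_epochs` depends on convergence, which is exactly the open question here):

```python
single_batch_time = 2.5     # seconds per global batch, e.g. reported by the simulator (illustrative)
N = 489                     # global-batch iterations per epoch (from the dataset / batch-size arithmetic above)
num_epochs = 3              # assumed fixed here; in practice it depends on convergence

whole_time = single_batch_time * N * num_epochs   # seconds
print(f"{whole_time / 3600:.2f} hours")           # ~1.02 hours
```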
