Missing trainer_state.json #82

Open
anavarroa opened this issue Feb 21, 2025 · 0 comments
I was training TinyZero when my process was unexpectedly interrupted due to another process consuming all available GPU memory. I want to resume training from the last checkpoint at global_step_4100, but I noticed that the trainer_state.json file is missing from my checkpoint directory.

  • My checkpoint directory contains the model weights and tokenizer files but not trainer_state.json.
  • I modified the training script to load my actor and critic models from the latest checkpoint:
    actor_rollout_ref.model.path="/path/to/checkpoints/TinyZero/test-run-4/actor/global_step_4100"
    critic.model.path="/path/to/checkpoints/TinyZero/test-run-4/critic/global_step_4100"
  • When restarting training, it does not pick up from step 4100. Instead, it starts from step 1 again.
  • I searched for trainer_state.json in my checkpoint directory using find, but it is not there.
  • I checked previous checkpoints, and they also do not contain trainer_state.json.
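For reference, overriding only the model paths as above seems to reload the weights without restoring the step counter. I noticed that recent verl versions expose dedicated resume options; whether the verl version pinned by TinyZero supports them is something I have not confirmed:

```shell
# Hypothetical resume overrides -- option names taken from recent verl
# configs (trainer.resume_mode / trainer.resume_from_path); verify they
# exist in the verl version TinyZero pins before relying on them.
python3 -m verl.trainer.main_ppo \
    trainer.resume_mode=resume_path \
    trainer.resume_from_path="/path/to/checkpoints/TinyZero/test-run-4" \
    # ... plus the original training overrides
```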

Is trainer_state.json necessary to resume training properly? If so, is there a way to manually create or reconstruct it from the existing checkpoint files? Are there any settings I need to adjust in my training script to ensure proper resumption?
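In case manual reconstruction is viable, this is the minimal file I would try to recreate. I am assuming the field names follow the HuggingFace TrainerState JSON schema, and I have not confirmed that TinyZero's trainer reads this file at all:

```python
import json
from pathlib import Path

# Hypothetical location -- adjust to the real checkpoint directory.
ckpt_dir = Path("checkpoints/TinyZero/test-run-4/actor/global_step_4100")
ckpt_dir.mkdir(parents=True, exist_ok=True)

# Minimal state: only the step counter is recoverable (from the
# directory name); epoch and log history were lost with the original file.
state = {
    "global_step": 4100,
    "log_history": [],
}
(ckpt_dir / "trainer_state.json").write_text(json.dumps(state, indent=2))

# Verify the file parses back correctly.
print(json.loads((ckpt_dir / "trainer_state.json").read_text())["global_step"])
# → 4100
```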
