Hello,
I am a new user of DeepSpeed (DS) and have successfully trained checkpoints with it. However, I ran into an issue when trying to use a checkpoint for inference. I want to follow the tutorial, but whether I pass the folder containing the *.pt files or the path to a .pt file itself, I always get this error:
Traceback (most recent call last):
  File "deepspeed_infer2.py", line 28, in <module>
    ds_engine = deepspeed.init_inference(model,
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 288, in init_inference
    engine = InferenceEngine(model,
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in __init__
    self._apply_injection_policy(
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 316, in _apply_injection_policy
    checkpoint = SDLoaderFactory.get_sd_loader_json(
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/state_dict_factory.py", line 23, in get_sd_loader_json
    ckpt_list = data['checkpoints']
KeyError: 'checkpoints'
[2022-07-27 22:48:51,258] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 93887
[2022-07-27 22:48:51,258] [ERROR] [launch.py:184:sigkill_handler] ['/home/jirigesi/anaconda3/envs/deepspeed/bin/python', '-u', 'deepspeed_infer2.py', '--local_rank=0'] exits with return code = 1
The GH-2940 draft PR changes the InferenceEngine check in the code snippet above to check more explicitly for meta tensor usage, which allows checkpoints to be loaded as described in the tutorial. We're also planning to update the tutorial to reflect the current state of checkpoint loading.
Please let us know if you have any additional questions!
This is my checkpoint.json:
This is the code I used to get the inference engine:
I can use another approach to load the checkpoint:
and use this new model_engine for inference. What is the difference between the two methods, and why does the first approach not work?
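The `KeyError: 'checkpoints'` in the traceback is raised by the lookup `data['checkpoints']` in `get_sd_loader_json`, which suggests the JSON passed as the checkpoint description has no top-level `checkpoints` key. As a rough diagnostic sketch (not part of DeepSpeed), the helper below reports which expected keys are absent from such a file. The function name `missing_checkpoint_keys` is hypothetical, and only `checkpoints` is taken from the traceback; the loader may require other keys as well, so treat the `required` default as an assumption.

```python
import json
from pathlib import Path


def missing_checkpoint_keys(json_path, required=("checkpoints",)):
    """Return the required top-level keys that are absent from a JSON file.

    Hypothetical helper: "checkpoints" is the only key known from the
    traceback; extend `required` if the loader turns out to need more.
    """
    data = json.loads(Path(json_path).read_text())
    if not isinstance(data, dict):
        raise ValueError("checkpoint description must be a JSON object")
    return [key for key in required if key not in data]
```

Calling `missing_checkpoint_keys("checkpoint.json")` before `deepspeed.init_inference` returns an empty list when the key is present, and `["checkpoints"]` when it is missing, which would match the failure shown above.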