
[BUG] can not initialize DeepSpeed-Inference engine with deepspeed.init_inference() #2149

Closed
Jirigesi opened this issue Jul 27, 2022 · 2 comments
Labels: bug (Something isn't working), inference

Comments

@Jirigesi

Hello,
I am a new user of DeepSpeed (DS) and I successfully trained checkpoints with it. However, I ran into an issue when trying to use a checkpoint for inference. I followed the tutorial, and I tried pointing the checkpoint JSON at both the folder containing the *.pt file and the .pt file itself, but I always get this error:

Traceback (most recent call last):
File "deepspeed_infer2.py", line 28, in
ds_engine = deepspeed.init_inference(model,
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 288, in init_inference
engine = InferenceEngine(model,
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in __init__
self._apply_injection_policy(
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 316, in _apply_injection_policy
checkpoint = SDLoaderFactory.get_sd_loader_json(
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/state_dict_factory.py", line 23, in get_sd_loader_json
ckpt_list = data['checkpoints']
KeyError: 'checkpoints'
[2022-07-27 22:48:51,258] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 93887
[2022-07-27 22:48:51,258] [ERROR] [launch.py:184:sigkill_handler] ['/home/jirigesi/anaconda3/envs/deepspeed/bin/python', '-u', 'deepspeed_infer2.py', '--local_rank=0'] exits with return code = 1

This is my checkpoint.json:

{
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoint_path": "./ds_models/global_step1/mp_rank_00_model_states.pt"
}

This is the code I used to create the inference engine:

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                     dtype=torch.half,
                                     checkpoint="checkpoint.json",
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)

I can use another approach to load the checkpoint:

# Initialize the DeepSpeed-Inference engine
model_engine, _, _, _ = deepspeed.initialize(model=model,
                                             model_parameters=model.parameters(),
                                             config=ds_config)

# load checkpoint
load_dir = '../results/ds_models/global_step226'
_, client_sd = model_engine.load_checkpoint(load_dir)

and then use this model_engine for inference. I am not sure what the difference between the two methods is, and why the first approach is not working.

@Jirigesi added the bug label on Jul 27, 2022
@mrwyattii
Contributor

@Jirigesi Thanks for using DeepSpeed! I believe the problem when using init_inference is that your checkpoint.json is missing the key checkpoints:

KeyError: 'checkpoints'

Try replacing checkpoint_path with checkpoints:

{
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"]
}
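For illustration, here is a minimal standalone sketch (plain Python, no DeepSpeed needed) of why the original checkpoint.json fails: per the traceback, the loader in state_dict_factory.py indexes data['checkpoints'], which must be a list of .pt paths. The helper name get_ckpt_list is mine, not DeepSpeed's.

```python
# Reproduce the KeyError from the traceback with plain dicts.
broken = {
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoint_path": "./ds_models/global_step1/mp_rank_00_model_states.pt",
}
fixed = {
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"],
}

def get_ckpt_list(data):
    # mirrors the failing line: ckpt_list = data['checkpoints']
    return data["checkpoints"]

try:
    get_ckpt_list(broken)
except KeyError as e:
    print("reproduced:", repr(e))  # KeyError: 'checkpoints'

print(get_ckpt_list(fixed)[0])
```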

@lekurile
Contributor

lekurile commented Mar 8, 2023

Hello @Jirigesi,

Apologies for the delayed follow up to your issue. The inference tutorial is slightly out of date with the code. For checkpoint loading to work using a checkpoint.json as described in the tutorial, replace_with_kernel_inject must be False due to this check in the InferenceEngine:
https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/inference/engine.py#L95
This check was added in GH-2083 along with the Meta Tensors feature, which uses "meta tensors" to initialize the model, then loads the weights after module replacement.

The GH-2940 draft PR changes the InferenceEngine check in the code snippet above to check explicitly for meta tensor usage, allowing checkpoints to be loaded as described in the tutorial. We're also looking to update the tutorial to reflect the current state of checkpoint loading.
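Putting the two fixes together, a hedged sketch of the workaround looks like this. The JSON-writing part runs as-is; the engine construction is shown in comments only, since it needs deepspeed, torch, and a real model object (all paths here are placeholders):

```python
import json
import os
import tempfile

# Write a checkpoint descriptor in the schema the loader expects
# ("checkpoints" must be a list of .pt paths).
ckpt = {
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"],
}
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
with open(ckpt_path, "w") as f:
    json.dump(ckpt, f, indent=4)

# With a real model, the engine would then be built roughly like this;
# kernel injection must stay off for json-based checkpoint loading,
# per the InferenceEngine check linked above:
#
# import torch, deepspeed
# ds_engine = deepspeed.init_inference(model,
#                                      dtype=torch.half,
#                                      checkpoint=ckpt_path,
#                                      replace_method='auto',
#                                      replace_with_kernel_inject=False)
```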

Please let us know if you have any additional questions!

Thanks,
Lev
