
[BUG] can not initialize DeepSpeed-Inference engine with deepspeed.init_inference() #2149

Closed
Jirigesi opened this issue Jul 27, 2022 · 2 comments
Labels: bug (Something isn't working), inference

Comments

@Jirigesi

Hello,
I am a new user of DeepSpeed (DS) and I successfully trained checkpoints with it. However, I ran into an issue when trying to use a checkpoint for inference. I followed the tutorial, and I tried pointing the checkpoint JSON at both the folder containing the *.pt file and the .pt file itself, but I always get this error:

Traceback (most recent call last):
File "deepspeed_infer2.py", line 28, in
ds_engine = deepspeed.init_inference(model,
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 288, in init_inference
engine = InferenceEngine(model,
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in __init__
self._apply_injection_policy(
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 316, in _apply_injection_policy
checkpoint = SDLoaderFactory.get_sd_loader_json(
File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/state_dict_factory.py", line 23, in get_sd_loader_json
ckpt_list = data['checkpoints']
KeyError: 'checkpoints'
[2022-07-27 22:48:51,258] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 93887
[2022-07-27 22:48:51,258] [ERROR] [launch.py:184:sigkill_handler] ['/home/jirigesi/anaconda3/envs/deepspeed/bin/python', '-u', 'deepspeed_infer2.py', '--local_rank=0'] exits with return code = 1

This is my checkpoint.json:

{
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoint_path": "./ds_models/global_step1/mp_rank_00_model_states.pt"
}

This is the code I used to create the inference engine:

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                     dtype=torch.half,
                                     checkpoint="checkpoint.json",
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)

I can use another approach to load the checkpoint:

# Initialize the DeepSpeed-Inference engine
model_engine, _, _, _ = deepspeed.initialize(model=model,
                                             model_parameters=model.parameters(),
                                             config=ds_config)

# load checkpoint
load_dir = '../results/ds_models/global_step226'
_, client_sd = model_engine.load_checkpoint(load_dir)

and then use this model_engine for inference. I am not sure what the difference between the two methods is, and why the first approach is not working.

@Jirigesi added the bug label on Jul 27, 2022
@mrwyattii
Contributor

@Jirigesi Thanks for using DeepSpeed! I believe the problem when using init_inference is that your checkpoint.json is missing the key checkpoints:

KeyError: 'checkpoints'

Try replacing checkpoint_path with checkpoints:

{
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"]
}
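For illustration, here is a minimal standalone sketch (plain Python, no DeepSpeed needed) of why the original checkpoint.json fails: per the traceback, the loader in state_dict_factory.py indexes data['checkpoints'], which must be a list of .pt paths. The helper name get_ckpt_list is mine, not DeepSpeed's.

```python
# Reproduce the KeyError from the traceback with plain dicts.
broken = {
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoint_path": "./ds_models/global_step1/mp_rank_00_model_states.pt",
}
fixed = {
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"],
}

def get_ckpt_list(data):
    # mirrors the failing line: ckpt_list = data['checkpoints']
    return data["checkpoints"]

try:
    get_ckpt_list(broken)
except KeyError as e:
    print("reproduced:", repr(e))  # KeyError: 'checkpoints'

print(get_ckpt_list(fixed)[0])
```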

@lekurile
Contributor

lekurile commented Mar 8, 2023

Hello @Jirigesi,

Apologies for the delayed follow up to your issue. The inference tutorial is slightly out of date with the code. For checkpoint loading to work using a checkpoint.json as described in the tutorial, replace_with_kernel_inject must be False due to this check in the InferenceEngine:
https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/inference/engine.py#L95
This check was added in GH-2083 along with the Meta Tensors feature, which uses "meta tensors" to initialize the model, then loads the weights after module replacement.

The GH-2940 draft PR changes the InferenceEngine check in the code snippet above to check explicitly for meta tensor usage, allowing checkpoints to be loaded as described in the tutorial. We're also looking to update the tutorial to reflect the current state of checkpoint loading.
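Putting the two fixes together, a hedged sketch of the workaround looks like this. The JSON-writing part runs as-is; the engine construction is shown in comments only, since it needs deepspeed, torch, and a real model object (all paths here are placeholders):

```python
import json
import os
import tempfile

# Write a checkpoint descriptor in the schema the loader expects
# ("checkpoints" must be a list of .pt paths).
ckpt = {
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"],
}
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
with open(ckpt_path, "w") as f:
    json.dump(ckpt, f, indent=4)

# With a real model, the engine would then be built roughly like this;
# kernel injection must stay off for json-based checkpoint loading,
# per the InferenceEngine check linked above:
#
# import torch, deepspeed
# ds_engine = deepspeed.init_inference(model,
#                                      dtype=torch.half,
#                                      checkpoint=ckpt_path,
#                                      replace_method='auto',
#                                      replace_with_kernel_inject=False)
```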

Please let us know if you have any additional questions!

Thanks,
Lev
