
RuntimeError: synStatus=26 [Generic failure] Device acquire failed. #1611

Open
2 of 4 tasks
VinayHN1365466 opened this issue Dec 16, 2024 · 15 comments
Labels
bug Something isn't working

Comments

@VinayHN1365466

System Info

HL-SMI Version: hl-1.18.0-fw-53.1.1.1
Driver Version: 1.18.0-ee698fb
Docker: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

I'm getting the error below while running the text-generation example:
python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"



/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Fetching 1 files: 100%| 1/1 [00:00<00:00, 6898.53it/s]
Fetching 1 files: 100%| 1/1 [00:00<00:00, 4760.84it/s]
12/16/2024 08:28:10 - INFO - __main__ - Single-device run.
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 773, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 384, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 720, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 297, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1177, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in convert
    return t.to(
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. cd examples/text-generation/
  2. python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"

Expected behavior

The script should execute successfully.

@VinayHN1365466 VinayHN1365466 added the bug Something isn't working label Dec 16, 2024
@regisss
Collaborator

regisss commented Dec 16, 2024

@VinayHN1365466 It looks like your devices are already busy or are somehow unavailable. Can you run hl-smi and paste the output here please?

@VinayHN1365466
Author

No processes are running.
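(For reference, one way to double-check this is to look for open handles on the accelerator device nodes themselves. `/dev/accel/*` as the node path for the habanalabs driver is an assumption here and may differ per install.)

```shell
# List any processes holding the accelerator device nodes open, including
# those owned by other users. /dev/accel/* is assumed; adjust to your layout.
if command -v lsof >/dev/null 2>&1; then
    sudo lsof /dev/accel/* 2>/dev/null || echo "no open handles found"
else
    echo "lsof not installed"
fi
```

If this prints PIDs, something still holds the device even though `hl-smi` shows no processes.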

@VinayHN1365466
Author

[screenshot attachment]

@regisss
Collaborator

regisss commented Dec 16, 2024

Can you try adding --privileged to your docker run command?

@VinayHN1365466
Author

VinayHN1365466 commented Dec 16, 2024

Thanks @regisss, I tried --privileged with Docker; it's still the same error.

docker run --privileged -it --name optimum_118_8cards_vinay_new_1234 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /mode_file/:/root/.cache/ -v /optimum-habana:/root/optimum-habana vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

[screenshot attachment]

@regisss
Collaborator

regisss commented Dec 16, 2024

Can you paste here the complete logs you're getting?

@VinayHN1365466
Author

~/optimum-habana/examples/text-generation# python run_generation.py \
    --model_name_or_path gpt2 \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --prompt "Here is my prompt"
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Fetching 1 files: 100%| 1/1 [00:00<00:00, 14665.40it/s]
Fetching 1 files: 100%| 1/1 [00:00<00:00, 6316.72it/s]
12/16/2024 09:09:06 - INFO - __main__ - Single-device run.
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 773, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 384, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 720, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 297, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1177, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in convert
    return t.to(
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

@regisss
Collaborator

regisss commented Dec 16, 2024

Does running

import torch
import habana_frameworks.torch.hpu

a = torch.tensor(1, device="hpu")

work?

@VinayHN1365466
Author

I got the same error:
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/sample.py", line 4, in <module>
    a = torch.tensor(1, device="hpu")
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.
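(A slightly more defensive variant of that repro reports availability before allocating. `is_available()` and `device_count()` are part of Habana's `habana_frameworks.torch.hpu` bridge; the sketch below degrades gracefully where the stack is absent.)

```python
# Report HPU availability up front, so a busy or missing device is
# diagnosed before any allocation fails inside torch.tensor()/.to().
try:
    import habana_frameworks.torch.hpu as hthpu  # only present in Habana containers
    available = hthpu.is_available()
except ImportError:
    hthpu = None
    available = False

if available:
    print(f"HPU devices visible: {hthpu.device_count()}")
else:
    print("No HPU available: driver not loaded, device busy, or not a Gaudi host")
```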

@regisss
Collaborator

regisss commented Dec 16, 2024

Can you reboot this instance?

@VinayHN1365466
Author

Sorry, I don't have access to reboot the instance :(

@libinta
Collaborator

libinta commented Dec 16, 2024

@VinayHN1365466 can you capture dmesg -T ? thanks.
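(A minimal way to capture just the relevant kernel messages; the grep pattern is a guess at what the habanalabs driver lines look like.)

```shell
# Save recent kernel messages from the accelerator driver; these usually
# explain a failed device acquire (reset in progress, held by another PID, ...).
sudo dmesg -T 2>/dev/null | grep -iE 'habana|accel' | tail -n 100 > habana_dmesg.txt
wc -l habana_dmesg.txt
```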

@yuanwu2017
Contributor

no process are running

On some cloud machines, you need to add sudo to see processes from all users.
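(In practice that check might look like the sketch below; `fuser` on `/dev/accel/*` is an assumption about the device node path.)

```shell
# Without sudo, ps/hl-smi can miss processes owned by other users or other
# containers that still hold the accelerator.
sudo ps aux | grep -iE 'python|habana' | grep -v grep
sudo fuser -v /dev/accel/* 2>/dev/null  # PIDs with the device nodes open
```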

@VinayHN1365466
Author

[screenshot attachment]

@VinayHN1365466
Author

I rebooted the instance, but it's still the same issue :(
