Training Job Error for ray_xgboost_gpu.ipynb #10

Open
smart-patrol opened this issue May 19, 2023 · 0 comments
Comments

@smart-patrol
Contributor

I am getting the following error when trying to run the SageMaker notebook ray_xgboost_gpu.ipynb.

UnexpectedStatusException: Error for Training job pytorch-training-2023-05-19-19-07-59-014: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train_xgboost_airline.py"
2023-05-19 19:13:50,343	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.2.116.47:9339...
2023-05-19 19:13:50,364	INFO worker.py:1625 -- Connected to Ray cluster.
Traceback (most recent call last):
  File "train_xgboost_airline.py", line 125, in <module>
    main()
  File "train_xgboost_airline.py", line 107, in main
    evals=[(dtrain, "train"), (dval, "val")])
  File "/opt/conda/lib/python3.6/site-packages/xgboost_ray/main.py", line 1565, in train
    placement_strategy,
  File "/opt/conda/lib/python3.6/site-packages/xgboost_ray/main.py", line 959, in _create_placement_group
    f"Placement group creation timed out after {timeout} seconds. "
TimeoutError: Placement group creation timed out after 100 seconds. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'node:10.2.116.47': 0.98, 'memory': 39562652059.0, 'CPU': 16.0, 'acc
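
The error message asks whether the cluster has enough resources for the requested placement group. A minimal sketch of that comparison is below. It assumes the request is described by the `num_actors`, `cpus_per_actor` and `gpus_per_actor` values passed to `RayParams` in xgboost_ray. The helper `resource_shortfalls` and the request numbers are hypothetical and not taken from the notebook, and the resource dict contains only the fields visible in the truncated log line above, so the result says nothing about what the cluster really reported for GPUs.

```python
def resource_shortfalls(available, num_actors, cpus_per_actor, gpus_per_actor):
    """Return {resource: (needed, available)} for every resource the request exceeds.

    Hypothetical helper: it compares cluster-wide totals only.
    """
    needed = {
        "CPU": num_actors * cpus_per_actor,
        "GPU": num_actors * gpus_per_actor,
    }
    return {
        name: (amount, available.get(name, 0.0))
        for name, amount in needed.items()
        if amount > available.get(name, 0.0)
    }


# Only the fields visible in the truncated log line; the rest is unknown.
visible = {"CPU": 16.0, "memory": 39562652059.0}

# Hypothetical request (not from the notebook): 2 actors, 4 CPUs and 1 GPU each.
print(resource_shortfalls(visible, num_actors=2, cpus_per_actor=4, gpus_per_actor=1))

# On a live cluster the dict would come from Ray rather than from the log:
#   import ray
#   ray.init(address="auto")
#   available = ray.available_resources()
```

An empty result does not guarantee that the placement group can be created, because each actor's bundle also has to fit on a single node under the chosen placement strategy. A non-empty result does show that the request cannot be satisfied without more resources.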