[AWS] Support EFA for P5/P5e instances #4062
Thanks for this! A few issues:
Edit: it appears the first node was set up correctly but the second was not? It seems like the rsync command to move the skypilot runtime over is failing but after a few hours I can't figure out why. Most verbose output I could get: https://gist.github.com/zaptrem/9da876d43963f118a587ea9eb030d812
@Michaelvll apologies for the double ping. Do you know what may be causing the issue with rclone above? I'd like to try this with our training runs this week but unfortunately haven't yet cracked the above error.
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was closed because it has been stalled for 10 days with no activity.
To enable EFA:
sky launch --gpus A100:8 --num-nodes 2 -c test-efa
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh