GPU CI Fix (Pin runs-on GPU image) #1588

Merged: 30 commits, Jan 31, 2025

@lockshaw (Collaborator) commented Jan 30, 2025

Description of changes:

  • Pin the runs-on GPU image to prevent unexpected CI breakage when the image changes.
  • Place /nix in AWS ephemeral storage, alongside the rest of the job files, to prevent out-of-space failures.
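
The workflow change itself is not shown in this conversation, so the following is only a rough sketch of one common way to relocate /nix onto a larger volume: a bind mount from a directory on the ephemeral disk. The helper names (`free_gib`, `bind_mount_commands`) and the example path are hypothetical and are not taken from the PR; the actual job may use a different mechanism.

```python
import shutil
from pathlib import Path


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def bind_mount_commands(ephemeral_root: str, target: str = "/nix") -> list[list[str]]:
    """Build (but do not run) the commands that back `target` with a
    directory under `ephemeral_root` via a bind mount.

    Assumption: the runner allows passwordless sudo and `ephemeral_root`
    is already mounted. Neither is confirmed by the PR text.
    """
    backing = str(Path(ephemeral_root) / target.lstrip("/"))
    return [
        ["sudo", "mkdir", "-p", backing, target],
        ["sudo", "mount", "--bind", backing, target],
    ]


if __name__ == "__main__":
    # Hypothetical ephemeral path; substitute the runner's real work directory.
    for cmd in bind_mount_commands("/path/to/ephemeral/_work"):
        print(" ".join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; calling `free_gib("/")` before and after the mount would be one way to confirm that the root volume is no longer being filled.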




codecov bot commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.94%. Comparing base (41d2fb5) to head (9dfbc24).
The report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1588   +/-   ##
=======================================
  Coverage   60.94%   60.94%           
=======================================
  Files         618      618           
  Lines       14978    14978           
=======================================
  Hits         9129     9129           
  Misses       5849     5849           
Flag        Coverage Δ
unittests   60.94% <ø> (ø)

Flags with carried forward coverage won't be shown.

@lockshaw lockshaw marked this pull request as ready for review January 30, 2025 23:43
@lockshaw lockshaw requested a review from chenzhuofu January 30, 2025 23:43
@lockshaw lockshaw enabled auto-merge (squash) January 30, 2025 23:54
@lockshaw lockshaw disabled auto-merge January 30, 2025 23:54
@lockshaw lockshaw changed the title GPU CI Fix GPU CI Fix (Pin runs-on GPU image) Jan 30, 2025
@lockshaw lockshaw enabled auto-merge (squash) January 30, 2025 23:55
@lockshaw lockshaw merged commit 4d3294a into flexflow:master Jan 31, 2025
5 of 6 checks passed
oOTigger pushed a commit to oOTigger/FlexFlow that referenced this pull request Feb 5, 2025
* Debug

* Change to base DL AMI

* Print disk usage

* Run nvidia-smi

* Remove excess cuda installs in base ami

* Re-enable freeing space in GPU CI

* Try updating nix-develop version

* Check what happens if you just enter the non-nixGL environment

* Try switching AMIs

* Try to remove the module stuff

* Move to lockshaw/develop-action

* Try pointing at a fixed commit

* Update nix-develop action

* Update nix-develop action to use BASH_FUNC filtering

* Remove all the /usr/local/cuda entries

* Switch back to gpu-ci env

* Update the cuda arch

* Try out the new runs-on gpu image

* Move over to pinned runs-on image

* Remove a bunch more unnecessary stuff in image to get back disk space

* Try using an ephemeral store

* Try mounting

* Fix bug

* Try sudo

* Move nix into _work

* Rollback all unnecessary changes

* Re-enable waiting on cpu-ci