[CI] Framework and hardware-specific CI tests #997
Conversation
The documentation is not available anymore as the PR was closed or merged.
OMP_NUM_THREADS: 4
MKL_NUM_THREADS: 4
The CPU runner has 8 cores => 2 pytest workers * 4 cores each.
The speed isn't affected by this change (it's only faster thanks to the new Docker image).
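For reference, a small diagnostic sketch (not part of this PR) that could be run inside the container to confirm which thread caps actually took effect:

```python
# Diagnostic sketch only: print the thread caps exported by the CI job
# next to the intra-op thread count torch actually ends up using.
import os

import torch

print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
print("MKL_NUM_THREADS =", os.environ.get("MKL_NUM_THREADS"))
print("torch intra-op threads:", torch.get_num_threads())
```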
matrix:
  config:
    - name: Fast PyTorch CPU tests on Ubuntu
      framework: pytorch
      runner: docker-cpu
      image: diffusers/diffusers-pytorch-cpu
      report: torch_cpu
    - name: Fast Flax CPU tests on Ubuntu
      framework: flax
      runner: docker-cpu
      image: diffusers/diffusers-flax-cpu
      report: flax_cpu
    - name: Fast ONNXRuntime CPU tests on Ubuntu
      framework: onnxruntime
      runner: docker-cpu
      image: diffusers/diffusers-onnxruntime-cpu
      report: onnx_cpu
This matrix defines the different combinations of frameworks, docker images and runners to test
Very nice
class OnnxStableDiffusionPipelineIntegrationTests(unittest.TestCase):
    def test_inference(self):
        provider = (
            "CUDAExecutionProvider",
            {
                "gpu_mem_limit": "17179869184",  # 16GB.
                "arena_extend_strategy": "kSameAsRequested",
            },
        )
Onnx tests now run with the CUDA provider. This enables us to add more integration tests without worrying about inference speed.
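For context, a minimal sketch (not code from this PR; "model.onnx" is a placeholder path) of how a (provider name, provider options) tuple like the one above is passed to ONNX Runtime:

```python
# Sketch only: hand a provider tuple with options to ONNX Runtime.
import onnxruntime as ort

provider = (
    "CUDAExecutionProvider",
    {
        "gpu_mem_limit": "17179869184",  # 16GB arena limit
        "arena_extend_strategy": "kSameAsRequested",
    },
)

# Providers are tried left to right, so the CPU provider acts as a fallback.
session = ort.InferenceSession("model.onnx", providers=[provider, "CPUExecutionProvider"])
print(session.get_providers())
```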
assert images.shape == (8, 1, 128, 128, 3)
assert np.abs(np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 3.1111548) < 1e-3
assert np.abs(np.abs(images, dtype=np.float32).sum() - 199746.95) < 5e-1
Not sure why this slow (tpu) test had different values before. Updated the reference values to what I've got on the TPU runner with jax[tpu]
if jax_device == "tpu":
    assert abs(result_sum - 255.0714) < 1e-2
    assert abs(result_mean - 0.332124) < 1e-3
else:
    assert abs(result_sum - 255.1113) < 1e-2
    assert abs(result_mean - 0.332176) < 1e-3
The scheduler tests needed some adjustments for when they're running on TPU.
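As a hedged sketch (the exact helper used in the tests may differ), the jax_device switch in these snippets can be derived from JAX's default backend:

```python
# Sketch: detect the active JAX platform, which the jax_device switch keys on.
# jax.default_backend() returns "cpu", "gpu", or "tpu".
import jax

jax_device = jax.default_backend()

# Pick the platform-specific reference value, mirroring the assertion above.
expected_sum = 255.0714 if jax_device == "tpu" else 255.1113
print(jax_device, expected_sum)
```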
if jax_device == "tpu":
    pass
    # FIXME: both result_sum and result_mean are nan on TPU
    # assert jnp.isnan(result_sum)
    # assert jnp.isnan(result_mean)
else:
    assert abs(result_sum - 149.0784) < 1e-2
    assert abs(result_mean - 0.1941) < 1e-3
Probably not too urgent to fix
great! glad the tests are catching these things though!
"jax>=0.2.8,!=0.3.2,<=0.3.6", | ||
"jaxlib>=0.1.65,<=0.3.6", |
Removing the version cap from jax, as we can use the latest version now (and it's required for docker support)
The PR is now ready for review, lmk if something needs to be explained more! cc @muellerzr @ydshieh for optional reviews and/or inspiration :)
Wow - super cool! Great job :-)
Looks all good to me - happy to merge!
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
  python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
    -s -v -k "Flax" \
(nit) I think it's a bit safer/easier to work with environment variables, e.g. RUN_FLAX=True/False, and a test decorator, but ok for me for now!
Good idea, will add it soon!
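A minimal sketch of the suggested gate (RUN_FLAX is the hypothetical variable name from the comment above; diffusers may end up exposing a different helper):

```python
# Sketch of an env-var gate for Flax tests; RUN_FLAX is a hypothetical name.
import os
import unittest

_run_flax = os.environ.get("RUN_FLAX", "False").lower() in ("true", "1", "yes")


def require_flax_run(test_case):
    """Skip the decorated test unless RUN_FLAX is set to a truthy value."""
    return unittest.skipUnless(_run_flax, "set RUN_FLAX=True to run Flax tests")(test_case)


@require_flax_run
class FlaxSmokeTest(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)
```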
Very cool! Looks good to me
container:
-  image: python:3.7
+  image: ${{ matrix.config.image }}
+  options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
Don't we need --gpus 0 or --gpus all if we want to use the GPU inside Docker? In the transformers CI, we specify it.
Oh, these are the PR tests, and only on CPU. Sorry to bother!
env:
  HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
-  python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=tests_torch_gpu tests/
+  python -m pytest -n 0 \
Do we use -n 0 to disable xdist for Flax?
Precisely! Looks like jax[tpu] doesn't like being launched with multiprocessing at all: the TPU gets reserved by the parent process and the tests can't access it afterwards (see jax-ml/jax#10192).
Very nice and clean usage of matrix!
* [WIP][CI] Framework and hardware-specific docker images for CI tests
* username
* fix cpu
* try out the image
* push latest
* update workspace
* no root isolation for actions
* add a flax image
* flax and onnx matrix
* fix runners
* add reports
* onnxruntime image
* retry tpu
* fix
* fix
* build onnxruntime
* naming
* onnxruntime-gpu image
* onnxruntime-gpu image, slow tests
* latest jax version
* trigger flax
* run flax tests in one thread
* fast flax tests on cpu
* fast flax tests on cpu
* trigger slow tests
* rebuild torch cuda
* force cuda provider
* fix onnxruntime tests
* trigger slow
* don't specify gpu for tpu
* optimize
* memory limit
* fix flax tests
* disable docker cache
Now we have the following GitHub Actions runners as separate machines: docker-cpu, docker-gpu, docker-tpu, and apple-m1.

This PR sorts the tests to use the appropriate runners and base Docker images:

* Fast tests, not Flax or Onnx → diffusers-pytorch-cpu image on the docker-cpu runner, plus a conda env on the apple-m1 runner
* Fast Flax tests → diffusers-flax-cpu image on the docker-cpu runner
* Fast Onnx tests → diffusers-onnxruntime-cpu image on the docker-cpu runner
* Slow tests, not Flax or Onnx → diffusers-pytorch-cuda image on the docker-gpu runner
* Slow Flax tests → diffusers-flax-tpu image on the docker-tpu runner
* Slow Onnx tests → diffusers-onnxruntime-cuda image on the docker-gpu runner
* diffusers-pytorch-cuda image on the docker-gpu runner