qemu-aarch64 nondeterministic crashes #188

Closed
3 tasks done
lts-rad opened this issue Jan 5, 2025 · 9 comments

Comments

@lts-rad

lts-rad commented Jan 5, 2025

Contributing guidelines

I've found a bug, and:

  • The documentation does not mention anything about my problem
  • There are no open or closed issues that are related to my problem

Description

I wanted to report that I recently switched a build pipeline running on Ubuntu to the upstream Ubuntu qemu-user-static package. The upstream Ubuntu QEMU is also a newer version than the one this action installs; I'm mentioning it in case it saves people time.

We were seeing qemu-aarch64, as installed by this action, crash nondeterministically on GitHub's x86-64 runners as well as on self-hosted runners using 12th gen and 13th gen Intel systems. We don't have time to triage and diagnose the issue, and because it is nondeterministic, the underlying cause is not clear.

Expected behaviour

qemu-aarch64 doesn't crash at random

Actual behaviour

qemu-aarch64 crashed at random

Repository URL

No response

Workflow run URL

No response

YAML workflow

n/a

Workflow logs

No response

BuildKit logs


Additional info

No response

@smoke

smoke commented Jan 20, 2025

We have a Ruby app that depends on quite a few "native extension" packages, that is, packages that have to be built with gcc.

Targeting linux/amd64,linux/arm64

We consistently get Gem::Ext::BuildError: ERROR: Failed to build gem native extension. caused by gcc: internal compiler error: Segmentation fault signal terminated program cc1,
but this will affect anything built by gcc or similar (nodejs packages, etc.).

A similar thing happened with a Go project that needed some parts built with gcc.

I suspected this issue, but it turned out that another action, int128/create-ecr-repository-action@v1.342.0, is the rogue one. I am still puzzled how it could break things in such a way: int128/create-ecr-repository-action@v1.342.0...v1.348.0

UPDATE: Still investigating, no clarity ... Issue is happening sporadically, combined with caching it is really difficult to get a good grip on what is the cause and what is a fix.

@lts-rad
Author

lts-rad commented Jan 20, 2025

The QEMU shipped by this action seemed broken the last time I used it. I recommend trying your base image's QEMU instead.
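
For readers who want to try this suggestion, here is a minimal sketch of a workflow that installs QEMU from the Ubuntu archive instead of using this action. The workflow name, job name and final check are placeholders that do not come from this thread, and the sketch assumes the distribution package registers its binfmt handlers in a way that containers can use.

```yaml
# Sketch only: use the runner distribution's qemu-user-static package
# rather than the QEMU binaries installed by docker/setup-qemu-action.
name: distro-qemu        # hypothetical workflow name
on: push

jobs:
  build:                 # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install QEMU user-mode emulators from the Ubuntu archive
        run: |
          sudo apt-get update
          sudo apt-get install -y qemu-user-static binfmt-support

      # Placeholder check: start an arm64 container under emulation.
      - name: Check that an arm64 container starts
        run: docker run --rm --platform linux/arm64 ubuntu:24.04 uname -m
```

Whether this avoids the crashes will depend on which QEMU version the runner's Ubuntu release ships.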

@smoke

smoke commented Jan 21, 2025

Indeed, there are some very strange and sporadic qemu@v7 issues in combination with some other unidentified changes or inconsistencies (buildx, gh-actions, runners or whatever).
I have managed to reproduce them locally.

Qemu v7 worked well for quite some time, but for the past ~1-2 weeks it has been causing sporadic but stubborn segmentation faults in gcc and other executables.

The build cache made this very difficult to identify and fix, so I am still not sure how stable the fix is, but
what seems to fix it so far is using qemu@v8 through:

      uses: docker/setup-qemu-action@v3.3.0
      with:
        image: tonistiigi/binfmt:qemu-v8.1.5
        platforms: ${{ inputs.platforms }}

@crazy-max
Member

As this is qemu related can you open an issue on https://github.com/tonistiigi/binfmt? Thanks

@smoke

smoke commented Jan 21, 2025

There is a related ticket already: tonistiigi/binfmt#215

@mayeut

mayeut commented Jan 25, 2025

This does not only affect aarch64 running on x86_64 but also at least s390x & ppc64le.
All repos I have seen using setup-qemu-action with ubuntu-latest/ubuntu-24.04 started failing with the update of the runner this week.

@crazy-max, I understand the issue is upstream, so I'm not asking to re-open it, but could it at least be pinned so that users of the action can easily find the issue and the workaround while waiting for an upstream fix?

For aarch64, another solution is to switch from emulated to native with the availability of ubuntu-24.04-arm runners: https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/

@lts-rad
Author

lts-rad commented Jan 25, 2025

@mayeut or anyone else, would you have an example of how to use a runner to handle an architecture-specific part of a docker bake?
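
One possible shape for this, as a hedged sketch rather than a tested recipe: a job matrix runs each platform on a runner of the matching architecture and overrides the bake targets' platform with `--set`. The workflow and job names are hypothetical, and the sketch assumes a bake definition (for example `docker-bake.hcl`) in the repository root and a public repository, since the `ubuntu-24.04-arm` label mentioned above is in public preview.

```yaml
# Sketch only: build each platform natively, so no QEMU emulation is involved.
name: bake-native        # hypothetical workflow name
on: push

jobs:
  bake:                  # hypothetical job name
    strategy:
      fail-fast: false
      matrix:
        include:
          - runner: ubuntu-24.04
            platform: linux/amd64
          - runner: ubuntu-24.04-arm
            platform: linux/arm64
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3

      # Restrict every bake target to the architecture of this runner.
      - name: Bake for the runner's own architecture
        run: docker buildx bake --set "*.platform=${{ matrix.platform }}"
```

Each matrix job produces single-platform results. Combining them into one multi-arch image needs a further step, for instance with `docker buildx imagetools create`, which is left out here.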

cagnulein added a commit to cagnulein/qdomyos-zwift that referenced this issue Jan 25, 2025
josephleekl added a commit to PennyLaneAI/pennylane-lightning that referenced this issue Jan 29, 2025

**Context:**
The [aarch64 wheel build CI action has been
failing](https://github.com/PennyLaneAI/pennylane-lightning/actions/workflows/wheel_linux_aarch64.yml)
since circa 24 Jan 2025. They fail with a segmentation fault during the
CIBW process. This has also been observed for similar wheel builds with
QEMU with other repositories:
docker/setup-qemu-action#188
ssciwr/clang-format-wheel#124
tonistiigi/binfmt#215
tonistiigi/binfmt#165
and fix attempt:
ssciwr/clang-format-wheel#125

It is due to using an old version (v7) of qemu that comes with binfmt.
`setup-qemu-action` by default uses `binfmt:latest` image which has not
been updated in 2 years.

**Description of the Change:**
Use a newer QEMU image (v8) from binfmt.

**Benefits:**
aarch64 wheel builds will succeed again,
[e.g.](https://github.com/PennyLaneAI/pennylane-lightning/actions/runs/13019772888?pr=1056)
**Possible Drawbacks:**

**Related GitHub Issues:**

[sc-83297]

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Ali Asadi <10773383+maliasadi@users.noreply.github.com>
theihor added a commit to theihor/bpfci-runner that referenced this issue Jan 31, 2025
See: docker/setup-qemu-action#188 (comment)
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
@theihor

theihor commented Jan 31, 2025

Indeed, there are some very strange and sporadic qemu@v7 issues in combination with some other unidentified changes or inconsistencies (buildx, gh-actions, runners or whatever). I have managed to reproduce them locally.

Qemu v7 worked well for quite some time, but for the past ~1-2 weeks it has been causing sporadic but stubborn segmentation faults in gcc and other executables.

The build cache made this very difficult to identify and fix, so I am still not sure how stable the fix is, but what seems to fix it so far is using qemu@v8 through:

      uses: docker/setup-qemu-action@v3.3.0
      with:
        image: tonistiigi/binfmt:qemu-v8.1.5
        platforms: ${{ inputs.platforms }}

@smoke Hi there. Just sharing a data point: specifying a more recent qemu version helped in my case. Thanks a lot!

For Linux Kernel BPF CI, the GitHub Actions runner Docker images for arm64 and s390x are built with the docker/build-push-action. Recently the builds started to fail on apt install software-properties-common with a Python error:

Setting up python3-lazr.restfulclient (0.14.6-1) ...
2025-01-31T00:45:50.8910192Z #6 380.4 Traceback (most recent call last):
2025-01-31T00:45:51.0794488Z #6 380.4   File "/usr/bin/py3compile", line 323, in <module>
2025-01-31T00:45:51.0796688Z #6 380.4     main()
2025-01-31T00:45:51.0797106Z #6 380.4   File "/usr/bin/py3compile", line 302, in main
2025-01-31T00:45:51.0797592Z #6 380.4     compile(files, versions,
2025-01-31T00:45:51.0798096Z #6 380.4   File "/usr/bin/py3compile", line 203, in compile
2025-01-31T00:45:51.0798626Z #6 380.4     interpreter.magic_number(version),
2025-01-31T00:45:51.0799132Z #6 380.4     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-31T00:45:51.0799813Z #6 380.4   File "/usr/share/python3/debpython/interpreter.py", line 233, in magic_number
2025-01-31T00:45:51.0800624Z #6 380.4     result = self._execute('import importlib.util; print(importlib.util.MAGIC_NUMBER)', version)
2025-01-31T00:45:51.0801455Z #6 380.4              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-31T00:45:51.0801983Z #6 380.4   File "/usr/share/python3/debpython/interpreter.py", line 359, in _execute
2025-01-31T00:45:51.0802628Z #6 380.4     raise Exception('{} failed with status code {}'.format(command, output['returncode']))
2025-01-31T00:45:51.0803373Z #6 380.4 Exception: ('python3.12', '-c', 'import importlib.util; print(importlib.util.MAGIC_NUMBER)') failed with status code -11

I also saw .NET segfaulting when building an app for s390x in a similar environment, which might also be related.

theihor added a commit to kernel-patches/runner that referenced this issue Jan 31, 2025
See: docker/setup-qemu-action#188 (comment)
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
theihor added a commit to libbpf/libbpf that referenced this issue Jan 31, 2025
The run-on-arch action is using docker to run things on various
platforms. Docker in turn relies on qemu-user-static installed on the
system.

Recently there were various reports about multi-arch docker builds
failing with seemingly random issues, and it appears to boil down to
qemu [1]. I stumbled on this problem while updating s390x runners [2],
and setting up more recent version of qemu helped.

[1] docker/setup-qemu-action#188
[2] kernel-patches/runner#69

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
theihor added a commit to libbpf/libbpf that referenced this issue Jan 31, 2025
The run-on-arch action is using docker to run things on various
platforms. Docker in turn relies on qemu-user-static installed on the
system.

Recently there were various reports about multi-arch docker builds
failing with seemingly random issues, and it appears to boil down to
qemu [1]. I stumbled on this problem while updating s390x runners [2],
and setting up more recent version of qemu helped.

Install docker buildx and enforce its usage with DOCKER_BUILDKIT=1
[3], so that it's used by run-on-arch action.

[1] docker/setup-qemu-action#188
[2] kernel-patches/runner#69
[3] https://docs.docker.com/build/buildkit/#getting-started

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
theihor added a commit to theihor/libbpf that referenced this issue Jan 31, 2025
run-on-arch-action is simply a wrapper around docker. There is no
value in using it in libbpf, as it is not complicated to run
non-native arch docker images directly on github-hosted runners.

Docker relies on qemu-user-static installed on the system to emulate
different architectures.

Recently there were various reports about multi-arch docker builds
failing with seemingly random issues, and it appears to boil down to
qemu [1]. I stumbled on this problem while updating s390x runners [2]
for BPF CI, and setting up more recent version of qemu helped.

This change addresses recent build failures on s390x and ppc64le.

[1] docker/setup-qemu-action#188
[2] kernel-patches/runner#69
[3] https://docs.docker.com/build/buildkit/#getting-started

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
anakryiko pushed a commit to libbpf/libbpf that referenced this issue Feb 1, 2025
robert-scheck added a commit to openbgpd-portable/openbgpd-container that referenced this issue Feb 6, 2025
robert-scheck added a commit to rpki-client/rpki-client-container that referenced this issue Feb 7, 2025
philn added a commit to philn/glide that referenced this issue Feb 8, 2025
cagnulein added a commit to cagnulein/qdomyos-zwift that referenced this issue Feb 8, 2025
* Update main.yml (squashed commit message; this line appears 16 times)

docker/setup-qemu-action#188
@bettio

bettio commented Feb 10, 2025

I'm sorry for commenting on a closed issue. QEMU v8.1.5 does not seem to address the problem with arm64v8, at least in the run I did. It keeps segfaulting when running some commands.
In the end, reverting to Ubuntu 22.04 fixed the issue, but it doesn't sound like a long-term solution.
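
A minimal sketch of what that revert might look like in a workflow, assuming a GitHub-hosted runner. The workflow name, job name and the emulated command are placeholders, not taken from the AtomVM change.

```yaml
# Sketch only: pin the older runner image instead of ubuntu-latest / ubuntu-24.04.
name: emulated-build     # hypothetical workflow name
on: push

jobs:
  build:                 # hypothetical job name
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4

      - uses: docker/setup-qemu-action@v3
        with:
          platforms: arm64

      # Placeholder for the real emulated build or test step.
      - name: Run a command under emulation
        run: docker run --rm --platform linux/arm64 ubuntu:22.04 uname -m
```

This only pins the runner image; it does not change which QEMU version the action installs.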

bettio added a commit to atomvm/AtomVM that referenced this issue Feb 10, 2025
Run qemu on Ubuntu 22.04

See also:
- actions/runner-images#11471
- docker/setup-qemu-action#188
- docker/setup-qemu-action#198

Upgrading to QEMU v8.1.5 doesn't seem to help, so
closes #1529

I ran the CI multiple times and it always worked, so I think this downgrade really "fixes" the issue.

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
scoder added a commit to scoder/lupa that referenced this issue Feb 15, 2025
the80srobot pushed a commit to wowsignal-io/libbpf that referenced this issue Mar 5, 2025