Skip to content

pthread_exit crashes on threads created by std::thread::spawn in 1.84, not 1.83, breaking pyo3-log #135929

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
arielb1 opened this issue Jan 23, 2025 · 16 comments
Labels
A-thread Area: `std::thread` C-bug Category: This is a bug. S-has-bisection Status: a bisection has been found for this issue S-has-mcve Status: A Minimal Complete and Verifiable Example has been found for this issue T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@arielb1
Copy link
Contributor

arielb1 commented Jan 23, 2025

Meta

Tested on Ubuntu 24.04 and Amazon Linux 2, x86_64.

Workaround to the production problem

In your Cargo.toml that is compiling your Python module, set

[profile.release]
debug = 0
lto = false

This prevents the 1.84 crashes.

However, there is still UB going on even with this setting.

The Production Problem

The crate pyo3-log installs a bridge that makes log functions call into Python. This means that all calls to logging::info! etc will take the GIL.

Python has a stage during interpreter shutdown where attempts to take the GIL will cause a pthread_exit. Python 3.14 (still Alpha today, targeted to be released by the end of this year) will change this in python/cpython#87135 - but that will take some time to reach people.

This means that if you have a Python program that uses a Rust library and pyo3-log, that spawning a Rust thread, that is calling logging::info! in a way unsynchronized with interpreter exit, you'll have unpredicatable crashes in 1.84.

Minified Program

This program:

use std::ffi::c_void;

extern "C" {
    fn pthread_exit(retval: *const c_void);
}
fn main() {
    std::thread::spawn(|| {
        unsafe { pthread_exit(std::ptr::null()); }
    });
    std::thread::sleep(std::time::Duration::from_secs(1));
}

when compiled with the following options

rustc +1.84 d.rs -Cpanic=abort -Cdebuginfo=limited

crashes with this confusing error

thread '<unnamed>' panicked at core/src/panicking.rs:223:5:
panic in a function that cannot unwind
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_nounwind_fmt::runtime
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:119:22
   2: core::panicking::panic_nounwind_fmt
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/intrinsics/mod.rs:3535:9
   3: core::panicking::panic_nounwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:223:5
   4: core::panicking::panic_cannot_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:315:5
   5: std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/sys/pal/unix/thread.rs:99:9
   6: start_thread
   7: clone
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread caused non-unwinding panic. aborting.
Aborted

This crash happens:

  1. Only on 1.84, not on 1.83
  2. Only when debuginfo is enabled, but even if the binary is stripped.

When using -C panic=unwind instead, on all versions of the compiler, you get this error:

FATAL: exception not rethrown
Aborted (core dumped)

I seen the claim in Zulip (https://rust-lang.zulipchat.com/#narrow/channel/122651-general/topic/pthread_exit.20from.20a.20Rust-spawned.20thread) that this is undefined behavior, but I'll rather not break pyo3-log

@arielb1 arielb1 added the C-bug Category: This is a bug. label Jan 23, 2025
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jan 23, 2025
@jieyouxu jieyouxu added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jan 23, 2025
@arielb1
Copy link
Contributor Author

arielb1 commented Jan 23, 2025

See #129582 and #74990

@jieyouxu jieyouxu added A-thread Area: `std::thread` T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 23, 2025
@jieyouxu
Copy link
Member

Relevant discussions: rust-lang/unsafe-code-guidelines#404

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 23, 2025

Even if it's bad, I believe we need to have a working way to write Python libraries in Rust, even if the libraries spawn Rust threads that can potentially call into Python. debuginfo = 0 is a working way, but it feels too ugly.

The other argument is that Rust 1.84 has the MSRV-aware resolver, so e.g. reverting the change just for 1.84, leaving people stuck in Rust 1.84, is much better than leaving them stuck in 1.83.

I believe it would be possible to make pyo3-log marshal everything to a Python thread, but that would not work if there are non-pyo3-log causes of the problem.

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 23, 2025

cc @vorner for the pyo3-log problem

@Noratrieb Noratrieb added E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc S-has-mcve Status: A Minimal Complete and Verifiable Example has been found for this issue and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jan 23, 2025
@arielb1
Copy link
Contributor Author

arielb1 commented Jan 23, 2025

As expected:
Bad/regressed (new) commit: 06d261d Auto merge of #129582 - nbdd0121:unwind, r=nnethercote
Good (old) commit: dd51276 Auto merge of #131796 - cuviper:no-waker-waker, r=Mark-Simulacrum

@jieyouxu jieyouxu added S-has-bisection Status: a bisection has been found for this issue and removed E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc labels Jan 23, 2025
@arielb1
Copy link
Contributor Author

arielb1 commented Jan 23, 2025

A solution is also possible in pyo3:

Create C++ code that does

PyGILState_STATE safe_gil_ensure() {
    try {
        return PyGILState_Ensure();
    } catch(...) {
        for(;;) pause();
    }
}

and replace the calls to PyGILState_Ensure in pyo3 with it. It's basically the Python 3.14 solution but implemented within pyo3

@nikomatsakis
Copy link
Contributor

cc @davidhewitt to discuss best resolution here

@davidhewitt
Copy link
Contributor

See also vorner/pyo3-log#30 and PyO3/pyo3#2102 (@arielb1 I see you have already commented on the pyo3-log issue).

It seems to me that getting a fix into PyO3 is the most reasonable approach.

The proposed C++ snippet... seems fine, but is also somewhat heavy. I would imagine that requires adding cc as a build dependency of pyo3-ffi, and requiring a working C++ toolchain for users compiling PyO3 libraries. Not the end of the world of course.

I don't see exactly which change in 1.84 is responsible for the regression, is it well understood? I also think this has always been UB, and 1.84 didn't change that?

See also PyO3/pyo3#3386 - really what I think needs to happen is there needs to be a way to attempt to acquire the GIL which can fail gracefully. pyo3-log should then switch to using that. This probably needs upstream CPython API but has been out of my capacity to propose recently.

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 23, 2025

I don't see exactly which change in 1.84 is responsible for the regression, is it well understood? I also think this has always been UB, and 1.84 didn't change that?

As far as I can tell it has always been UB. The change in #129582 had made it crash with panic=abort, before that it also crashed with panic=unwind.

As far as I can tell Python has no non-racy way of fallibly acquiring the GIL. I think that doing the Python 3.14 thing is good enough.

The proposed C++ snippet... seems fine, but is also somewhat heavy. I would imagine that requires adding cc as a build dependency of pyo3-ffi, and requiring a working C++ toolchain for users compiling PyO3 libraries. Not the end of the world of course.

Sure. But I can't think of an easier non-racy way of doing things.

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 24, 2025

Is any of you intending to do the pyo3 C++ fix, or should I have a go at doing it myself?

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 26, 2025

Minimal example using pyo3:

Python:

import testmod

def makeloop():
  start = []
  cur = [start]
  for _ in range(1000*1000*10):
      cur = [cur]
  start.append(cur)

makeloop()

Rust:

use pyo3::prelude::*;

/// A Python module implemented in Rust.
#[pymodule]
fn testmod(_m: &Bound<'_, PyModule>) -> PyResult<()> {
    std::thread::spawn(|| {
        loop {
            Python::with_gil(|_py| ());
        }
    });
    Ok(())
}

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 27, 2025

@davidhewitt what do you think of PyO3/pyo3#4874 ?

@davidhewitt
Copy link
Contributor

Thanks, I've seen you've pushed that, I'm very short of time at the moment so please allow me a few days to find a moment to sit down and review fully.

@arielb1
Copy link
Contributor Author

arielb1 commented Jan 28, 2025

Cool.

github-merge-queue bot pushed a commit to PyO3/pyo3 that referenced this issue Feb 19, 2025
* hang instead of pthread_exit during interpreter shutdown

see python/cpython#87135 and
rust-lang/rust#135929

* relnotes

* fix warnings

* version using pthread_cleanup_push

* add tests

* new attempt

* clippy

* comment

* msrv

* address review comments

* update comment

* add comment

---------

Co-authored-by: Ariel Ben-Yehuda <arielby@amazon.com>
github-merge-queue bot pushed a commit to PyO3/pyo3 that referenced this issue Mar 9, 2025
* hang instead of pthread_exit during interpreter shutdown

see python/cpython#87135 and
rust-lang/rust#135929

* relnotes

* fix warnings

* version using pthread_cleanup_push

* add tests

* new attempt

* clippy

* comment

* msrv

* address review comments

* update comment

* add comment

* try to skip test on debug builds

---------

Co-authored-by: Ariel Ben-Yehuda <arielby@amazon.com>
Co-authored-by: David Hewitt <mail@davidhewitt.dev>
@davidhewitt
Copy link
Contributor

We handled a fix in PyO3 side for this, should we close this issue?

@arielb1
Copy link
Contributor Author

arielb1 commented Apr 8, 2025

Yea, closing

@arielb1 arielb1 closed this as completed Apr 8, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-thread Area: `std::thread` C-bug Category: This is a bug. S-has-bisection Status: a bisection has been found for this issue S-has-mcve Status: A Minimal Complete and Verifiable Example has been found for this issue T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants