Skip to content

park_timeout not waking up after timeout on macs #59020

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
lespea opened this issue Mar 8, 2019 · 6 comments · Fixed by #59676
Closed

park_timeout not waking up after timeout on macs #59020

lespea opened this issue Mar 8, 2019 · 6 comments · Fixed by #59676
Labels
C-bug Category: This is a bug. O-macos Operating system: macOS T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@lespea
Copy link

lespea commented Mar 8, 2019

I was trying to troubleshoot this issue: jonhoo/bus#18 and eventually we figured out that on my machine park_timeout wasn't waking up. I wrote this simple test code which demonstrates the issue:

use std::thread;
use std::time::Duration;

fn main() {
    let t1 = thread::spawn(|| {
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking1");
            thread::park_timeout(sleep);
        }
    });

    let t2 = thread::spawn(|| {
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking2");
            thread::park_timeout(sleep);
        }
    });

    t1.join().expect("Couldn't join thread 1");
    t2.join().expect("Couldn't join thread 2");
}

Most of the time it will only print this:

Parking1
Parking2

Every once in a while it will finish normally. This behavior happens on stable and nightly and I tried a handfull of stable versions all the way back to 1.12 and they all got stuck. Windows (both msvc and gnu appear to never get stuck (at least I couldn't reproduce the issue)). I was able to duplicate this an another mac as well (both running 10.14)

If only a single thread is started (or the park happens on the main thread) then I can't reproduce.

I'm on 10.14; uname: Darwin 34363bc7dc9c 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64 i386 MacBookPro11,3 Darwin

rustc 1.33.0 (2aa4c46cf 2019-02-28)
binary: rustc
commit-hash: 2aa4c46cfdd726e97360c2734835aa3515e8c858
commit-date: 2019-02-28
host: x86_64-apple-darwin
release: 1.33.0
LLVM version: 8.0
@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

This definitely seems broken. Here's dtruss output:

	PID/THRD  SYSCALL(args) 		 = return
Parking1
Parking2
48727/0x4294e5:  open("/dev/dtracehelper\0", 0x2, 0xFFFFFFFFE70CECA0)		 = 3 0
48727/0x4294e5:  ioctl(0x3, 0x80086804, 0x7FFEE70CEC00)		 = 0 0
48727/0x4294e5:  close(0x3)		 = 0 0
48727/0x4294e5:  access("/AppleInternal/XBS/.isChrooted\0", 0x0, 0x0)		 = -1 Err#2
48727/0x4294e5:  thread_selfid(0x0, 0x0, 0x0)		 = 4363493 0
48727/0x4294e5:  bsdthread_register(0x7FFF5D23ABEC, 0x7FFF5D23ABDC, 0x2000)		 = 1073742047 0
48727/0x4294e5:  mprotect(0x108B93000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108B9A000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108B9B000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108BA2000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108B91000, 0x88, 0x1)		 = 0 0
48727/0x4294e5:  mprotect(0x108BA3000, 0x1000, 0x1)		 = 0 0
48727/0x4294e5:  mprotect(0x108B91000, 0x88, 0x3)		 = 0 0
48727/0x4294e5:  mprotect(0x108B91000, 0x88, 0x1)		 = 0 0
48727/0x4294e5:  issetugid(0x0, 0x0, 0x0)		 = 0 0
48727/0x4294e5:  getpid(0x0, 0x0, 0x0)		 = 48727 0
48727/0x4294e5:  stat64("/AppleInternal/XBS/.isChrooted\0", 0x7FFEE70CE2A8, 0x0)		 = -1 Err#2
48727/0x4294e5:  stat64("/AppleInternal\0", 0x7FFEE70CE340, 0x0)		 = -1 Err#2
48727/0x4294e5:  csops(0xBE57, 0x7, 0x7FFEE70CDDE0)		 = -1 Err#22
dtrace: error on enabled probe ID 2198 (ID 557: syscall::sysctl:return): invalid kernel access in action #11 at DIF offset 28
48727/0x4294e5:  csops(0xBE57, 0x7, 0x7FFEE70CD6D0)		 = -1 Err#22
48727/0x4294e5:  sigaction(0xD, 0x7FFEE70CFA48, 0x7FFEE70CFA70)		 = 0 0
dtrace: error on enabled probe ID 2198 (ID 557: syscall::sysctl:return): invalid kernel access in action #11 at DIF offset 28
48727/0x4294e5:  getrlimit(0x1003, 0x7FFEE70CFA60, 0x0)		 = 0 0
48727/0x4294e5:  mmap(0x7FFEE68D0000, 0x1000, 0x3, 0x1012, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x7FFEE68D0000 0
48727/0x4294e5:  mprotect(0x7FFEE68D0000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  sigaction(0xB, 0x7FFEE70CFA78, 0x0)		 = 0 0
48727/0x4294e5:  sigaction(0xA, 0x7FFEE70CFA78, 0x0)		 = 0 0
48727/0x4294e5:  sigaltstack(0x0, 0x7FFEE70CFAD0, 0x0)		 = 0 0
48727/0x4294e5:  mmap(0x0, 0x20000, 0x3, 0x1002, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x108BA4000 0
48727/0x4294e5:  sigaltstack(0x7FFEE70CFAD0, 0x0, 0x0)		 = 0 0
48727/0x4294e5:  bsdthread_create(0x108B431C0, 0x7FDDC9C02730, 0x200000)		 = 104935424 0
48727/0x4294ec:  thread_selfid(0x0, 0x0, 0x0)		 = 4363500 0
48727/0x4294ec:  sigaltstack(0x0, 0x700006412EC0, 0x0)		 = 0 0
48727/0x4294e5:  bsdthread_create(0x108B431C0, 0x7FDDC9C029D0, 0x200000)		 = 107044864 0
48727/0x4294ec:  mmap(0x0, 0x20000, 0x3, 0x1002, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x108BC4000 0
48727/0x4294ec:  sigaltstack(0x700006412EC0, 0x0, 0x0)		 = 0 0
48727/0x4294ed:  thread_selfid(0x0, 0x0, 0x0)		 = 4363501 0
48727/0x4294ed:  sigaltstack(0x0, 0x700006615EC0, 0x0)		 = 0 0
48727/0x4294ed:  mmap(0x0, 0x20000, 0x3, 0x1002, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x108BE4000 0
48727/0x4294ed:  sigaltstack(0x700006615EC0, 0x0, 0x0)		 = 0 0
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
48727/0x4294ec:  psynch_mutexdrop(0x7FDDC9D00110, 0x203, 0x100)		 = 0 0
48727/0x4294ed:  psynch_mutexwait(0x7FDDC9D00110, 0x203, 0x0)		 = 515 0
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
48727/0x4294ec:  gettimeofday(0x700006412988, 0x0, 0x0)		 = 0 0
48727/0x4294ed:  gettimeofday(0x700006615988, 0x0, 0x0)		 = 0 0
48727/0x4294ec:  psynch_cvwait(0x7FDDC9C02830, 0x100000100, 0x0)		 = -1 Err#316
48727/0x4294ed:  psynch_cvwait(0x7FDDC9C02920, 0x100000100, 0x0)		 = -1 Err#316
^C
48727/0x4294e5:  __semwait_signal(0xD03, 0x0, 0x0)		 = -1 Err#4

@sfackler sfackler added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. C-bug Category: This is a bug. O-macos Operating system: macOS I-nominated and removed regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Mar 8, 2019
@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

It's a deadlock on the ONCE here: https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/time.rs#L241

* thread #2
  * frame #0: 0x00007fff5d073a46 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff5d23bb9d libsystem_pthread.dylib`_pthread_mutex_lock_wait + 83
    frame #2: 0x00007fff5d2394c8 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 253
    frame #3: 0x0000000100008c05 foo`std::thread::park::h40cbaa717417ef95 [inlined] std::sys::unix::mutex::Mutex::lock::h1c54f0b7e37dd0a3 at mutex.rs:56 [opt]
    frame #4: 0x0000000100008c00 foo`std::thread::park::h40cbaa717417ef95 [inlined] std::sys_common::mutex::Mutex::raw_lock::hecb01a8a1467a3f9 at mutex.rs:36 [opt]
    frame #5: 0x0000000100008c00 foo`std::thread::park::h40cbaa717417ef95 [inlined] _$LT$std..sync..mutex..Mutex$LT$T$GT$$GT$::lock::h012f35fcc06840a7 at mutex.rs:220 [opt]
    frame #6: 0x0000000100008bfc foo`std::thread::park::h40cbaa717417ef95 at mod.rs:891 [opt]
    frame #7: 0x000000010000d4ed foo`std::sync::once::Once::call_inner::h0b29d64214dba756 at once.rs:421 [opt]
    frame #8: 0x000000010000d7ed foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 [inlined] std::sync::once::Once::call_once::h15f56e63f2b5f3bd at once.rs:220 [opt]
    frame #9: 0x000000010000d7c8 foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 [inlined] std::sys::unix::time::inner::info::he0ee845c3bb0a3ac at time.rs:242 [opt]
    frame #10: 0x000000010000d7c8 foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 [inlined] std::sys::unix::time::inner::Instant::sub_instant::h2f90455c1f5f50e3 at time.rs:154 [opt]
    frame #11: 0x000000010000d7c8 foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 at time.rs:218 [opt]
    frame #12: 0x0000000100012255 foo`std::sys::unix::condvar::Condvar::wait_timeout::hdd4a51d3f762f929 [inlined] _$LT$std..time..Instant$u20$as$u20$core..ops..arith..Sub$GT$::sub::h0703e52941cc9b4d at time.rs:307 [opt]
    frame #13: 0x000000010001224d foo`std::sys::unix::condvar::Condvar::wait_timeout::hdd4a51d3f762f929 [inlined] std::time::Instant::elapsed::h544a171a1c00e632 at time.rs:242 [opt]
    frame #14: 0x0000000100012240 foo`std::sys::unix::condvar::Condvar::wait_timeout::hdd4a51d3f762f929 at condvar.rs:162 [opt]
    frame #15: 0x0000000100008fe8 foo`std::thread::park_timeout::h730ea3969ae2f6b8 [inlined] std::sys_common::condvar::Condvar::wait_timeout::h8f2d65bc554e47d5 at condvar.rs:51 [opt]
    frame #16: 0x0000000100008fdd foo`std::thread::park_timeout::h730ea3969ae2f6b8 [inlined] std::sync::condvar::Condvar::wait_timeout::h552dfa042e4361fb at condvar.rs:405 [opt]
    frame #17: 0x0000000100008fc3 foo`std::thread::park_timeout::h730ea3969ae2f6b8 at mod.rs:1003 [opt]
    frame #18: 0x0000000100003203 foo`foo::main::_$u7b$$u7b$closure$u7d$$u7d$::h014edfb0ea6d8161((null)=closure {

}) at main.rs:9
    frame #19: 0x0000000100003bed foo`std::sys_common::backtrace::__rust_begin_short_backtrace::hfdb85b8c915ae023(f=closure {

}) at backtrace.rs:135
    frame #20: 0x00000001000068fd foo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hcb677bec99ca4490 at mod.rs:469
    frame #21: 0x00000001000057ed foo`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hc80622d5e14b1c63(self=AssertUnwindSafe<closure>(closure {

}), _args=<unavailable>) at panic.rs:309
    frame #22: 0x00000001000050ca foo`std::panicking::try::do_call::hcae109a146a05cf6(data=&0x70000ef8bd38) at panicking.rs:297
    frame #23: 0x000000010001384f foo`__rust_maybe_catch_panic at lib.rs:92 [opt]
    frame #24: 0x0000000100004f33 foo`std::panicking::try::h4483321badd4a86e(f=AssertUnwindSafe<closure>(closure {

})) at panicking.rs:276
    frame #25: 0x000000010000583d foo`std::panic::catch_unwind::h9d60a569083fc382(f=AssertUnwindSafe<closure>(closure {

})) at panic.rs:388
    frame #26: 0x0000000100006766 foo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hed8c8ea5ca4cd51c at mod.rs:468
    frame #27: 0x0000000100006abf foo`_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h781bdd77c151bf87(self=&0x100300380, args=<unavailable>) at boxed.rs:734
    frame #28: 0x000000010001324c foo`std::sys::unix::thread::Thread::new::thread_start::h2ccd51efe09a6d88 [inlined] _$LT$alloc..boxed..Box$LT$$LP$dyn$u20$alloc..boxed..FnBox$LT$A$C$$u20$Output$u3d$R$GT$$u20$$u2b$$u20$$u27$a$RP$$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hf680f30793070da1 at boxed.rs:744 [opt]
    frame #29: 0x0000000100013249 foo`std::sys::unix::thread::Thread::new::thread_start::h2ccd51efe09a6d88 [inlined] std::sys_common::thread::start_thread::h30434b5681199bc5 at thread.rs:14 [opt]
    frame #30: 0x00000001000131ce foo`std::sys::unix::thread::Thread::new::thread_start::h2ccd51efe09a6d88 at thread.rs:81 [opt]
    frame #31: 0x00007fff5d23b661 libsystem_pthread.dylib`_pthread_body + 340
    frame #32: 0x00007fff5d23b50d libsystem_pthread.dylib`_pthread_start + 377
    frame #33: 0x00007fff5d23abf9 libsystem_pthread.dylib`thread_start + 13

@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

As a workaround, everything should work fine as long as you construct an Instant at least once in your process before parking.

@lespea
Copy link
Author

lespea commented Mar 8, 2019

Doesn't seem to help :(

Code

use std::thread;
use std::time::Duration;
use std::time;

fn main() {
    println!("{:?}", time::Instant::now());

    let t1 = thread::spawn(|| {
        println!("{:?}", time::Instant::now());
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking1");
            thread::park_timeout(sleep);
        }
    });

    let t2 = thread::spawn(|| {
        println!("{:?}", time::Instant::now());
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking2");
            thread::park_timeout(sleep);
        }
    });

    t1.join().expect("Couldn't join thread 1");
    t2.join().expect("Couldn't join thread 2");
}

Output

Instant { t: 18865250239116 }
Instant { t: 18865250405775 }
Parking1
Instant { t: 18865250430637 }
Parking2

@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

Oh right sorry, you actually have to subtract two Instants to force that Once to complete. If you instead do time::Instant::now().elapsed() at the start of main, it prevents the deadlock.

jonhoo added a commit to jonhoo/bus that referenced this issue Mar 8, 2019
@lespea
Copy link
Author

lespea commented Mar 8, 2019

Works perfectly now! Thanks!

alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 3, 2019
This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (rust-lang#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes rust-lang#59020
Centril added a commit to Centril/rust that referenced this issue Apr 3, 2019
std: Avoid usage of `Once` in `Instant`

This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (rust-lang#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes rust-lang#59020
alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 4, 2019
This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (rust-lang#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes rust-lang#59020
bors added a commit that referenced this issue Apr 4, 2019
std: Avoid usage of `Once` in `Instant`

This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes #59020
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
C-bug Category: This is a bug. O-macos Operating system: macOS T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants