
Increased amount of ANRs after disabling concurrent GC #9365

Open
TimBurik opened this issue Oct 3, 2024 · 56 comments
Assignees
Labels
Area: Mono Runtime Mono-related issues: BCL bugs, AOT issues, etc.

Comments

@TimBurik

TimBurik commented Oct 3, 2024

Android framework version

net8.0-android

Affected platform version

.NET 8.0.303

Description

After switching from Xamarin.Android to .NET 8 we started getting a lot of native crashes in monosgen, for example 222003f52fa3496385d14a89c778a6e4-symbolicated.txt

After a long investigation (and a hint from the issue dotnet/runtime#100311), it turned out that concurrent SGen is actually enabled by default in .NET for Android:

<AndroidEnableSGenConcurrent Condition=" '$(AndroidEnableSGenConcurrent)' == '' ">true</AndroidEnableSGenConcurrent>

so we explicitly disabled it. Now the number of native crashes in monosgen is minimal, but instead we are getting a lot of ANR reports in Sentry and the Google Play Console.

The ANRs seem to be reported via Android's ApplicationExitInfo mechanism; according to the stack traces, the main thread appears to be blocked waiting on a native mutex somewhere in monosgen (example: anr_stacktrace.txt)
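
For context, these exit reports can be read back on API 30+ roughly as follows; this is a minimal sketch, not our production code, and error handling is omitted:

import android.app.ActivityManager;
import android.app.ApplicationExitInfo;
import android.content.Context;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

class AnrExitReader {
    static void dumpAnrExits(Context context) throws IOException {
        ActivityManager am =
                (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);
        // pid = 0 and maxNum = 0 mean "all recorded exits for this package"
        List<ApplicationExitInfo> exits =
                am.getHistoricalProcessExitReasons(context.getPackageName(), 0, 0);
        for (ApplicationExitInfo exit : exits) {
            if (exit.getReason() == ApplicationExitInfo.REASON_ANR) {
                // Thread dump captured by the system at ANR time (may be null)
                InputStream trace = exit.getTraceInputStream();
                // ... inspect or upload the trace ...
            }
        }
    }
}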

Additional information that might be relevant:

  • according to the Google Play Console, the majority of the ANRs are related to the broadcast of either Intent { act=android.intent.action.SCREEN_OFF } or Intent { act=android.intent.action.SCREEN_ON };
  • the majority of the ANRs seem to happen while the application is in the background;

Steps to Reproduce

Unfortunately, we don't have exact steps to reproduce.
The only thing we know for sure is that it happens when targeting .net-android34.0 (the Xamarin.Android version doesn't have this issue), and that it started after adding the following to the csproj:
<AndroidEnableSGenConcurrent>false</AndroidEnableSGenConcurrent>

Did you find any workaround?

No workaround found yet

Relevant log output

No response

@TimBurik TimBurik added Area: App Runtime Issues in `libmonodroid.so`. needs-triage Issues that need to be assigned. labels Oct 3, 2024
@grendello
Contributor

@TimBurik I'm afraid we can't do anything for you here; this appears to be an issue split between MonoVM and the Sentry Native SDK. @lambdageek do you think someone on your side could take a look to see if anything can be done here?

There's an issue being worked on related to Sentry, #9055 - perhaps this issue is caused by the same problems?

@supervacuus, sorry for tagging you out of the blue, but would you be able to look into the Sentry side of things here?

I realize we have precious little information, but perhaps someone will be able to spot something in their respective areas and help out.

@grendello grendello added Area: Mono Runtime Mono-related issues: BCL bugs, AOT issues, etc. and removed Area: App Runtime Issues in `libmonodroid.so`. needs-triage Issues that need to be assigned. labels Oct 4, 2024
@TimBurik
Author

TimBurik commented Oct 4, 2024

@grendello thank you for the response.

Regarding #9055 - we are aware of that issue, and we have a separate native crash group which is most likely related to it. But the ANRs don't seem to be related to that issue or to Sentry, because we are seeing exactly the same picture in the Google Play Console - an increased number of ANRs, all containing libmonosgen-2.0 in their stack traces.

@grendello
Contributor

The GC ANRs might be related to something inside the GC bridge; hopefully @lambdageek will be able to help out here. However, this ANR looks familiar:

#00  pc 0x0000000000089cf0  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+32)
#01  pc 0x000000000008e6d8  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+148)
#02  pc 0x000000000009c3dc  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+112)
#03  pc 0x00000000001eeb20  /data/app/~~x6iHlhee06D2KSrJ3_D8mw==/<app.bundle.id>-UM4qcDy_vV_DZ6DnDt2oCA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so

You are most likely not using marshal methods, since they weren't enabled by default in .NET 8, but we recently fixed an issue related to them, and one of the repros had the above stack trace as well. The problem was that Java ("native") threads were incorrectly attached to the MonoVM runtime, putting the GC in a bad state. You can find more info about it in this and the following comments. @filipnavara, does this look familiar?

Perhaps something in your app causes a similar corruption of the GC state?

@TimBurik if you're able to reproduce the ANRs locally, would you be able to test with .NET 9 RC2?

@TimBurik
Author

TimBurik commented Oct 4, 2024

@grendello thank you for the pointers!

We are planning to test the app on .NET 9 in the near future, and we will make sure to use RC2 for those tests (once it is available).

Also, we have added debug symbols for libmonosgen-2.0 to the Google Play Console, and now we have more detailed logs. For example, this is the reported main thread stack trace from one of the ANRs:

  #00  pc 0x0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  #01  pc 0x000000000007d7e0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
  #02  pc 0x000000000008ba8c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
  #03  pc 0x00000000001eeb20  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204) 
  #04  pc 0x00000000001ee844  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+323) 
  #05  pc 0x00000000002a50f4  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349) 
  #06  pc 0x00000000002a4e84  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159) 
  #07  pc 0x00000000002c5350  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3974) 
  #08  pc 0x00000000002c21bc  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637) 
  #09  pc 0x00000000002c20ec  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616) 
  #10  pc 0x00000000002be8b8  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279) 
  #11  pc 0x00000000002bed9c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454) 
  #12  pc 0x00000000002a686c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690) 
  #13  pc 0x0000000000004438 

@grendello
Contributor

@TimBurik this looks more and more like the issue fixed by @filipnavara; he said he'd take a look next week - it would be great if you were able to test with .NET 9 RC2 before then :)

@grendello
Contributor

@TimBurik if RC2 isn't available by next week, do try with RC1 - the point is to see whether the issues you're seeing still exist in .NET 9

@supervacuus

I've been following the thread since you tagged me. It seems this isn't Sentry-related. I've informed the Sentry folks, just in case.

@grendello
Contributor

@supervacuus thanks, I tagged you just in case - this is all low-level stuff, it might be affecting Sentry as well in some ways.

@lambdageek
Member

/cc @vitek-karas

@TimBurik
Author

@grendello sorry for the late response.
We've just released a .NET 9 RC2 build to a small population in production (we still don't have a reliable way to reproduce the issue), and unfortunately we see the same ANR reports in the Google Play Console.
Here are some more examples of the main thread stack traces from those reports:

#00  pc 0x000000000004f85c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x0000000000054330  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x0000000000062d4c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
#03  pc 0x00000000001faecc  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002b2f84  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_vector+1119)
#12  pc 0x000000000000cb00 
#00  pc 0x000000000001daec  /system/lib64/libc.so (syscall+28)
#01  pc 0x0000000000028a88  /system/lib64/libc.so (sem_wait+88)
#02  pc 0x00000000001fac20  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_thread_info_wait_for_resume+204)
#03  pc 0x000000000020035c  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_enter_gc_unsafe_region_unbalanced_with_info+503)
#04  pc 0x0000000000283944  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_attach_coop_internal+4712)
#05  pc 0x0000000000283978  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_attach_coop+4731)
#06  pc 0x000000000000b5d8
#00  pc 0x00000000000902fc  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x00000000000950d0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x00000000000a3670  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+112)
#03  pc 0x00000000001faecc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002c0afc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454)
#12  pc 0x00000000002b2588  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690)
#13  pc 0x0000000000274ea0  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_object_new_specific_checked+5494)
#14  pc 0x00000000002746e4  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (extract_this_ptr+5341)
#15  pc 0x0000000000274b70  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_runtime_try_invoke_byrefs+5202)
#16  pc 0x000000000022fa28  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_InternalInvoke+3629)
#17  pc 0x0000000000238e40  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_InternalInvoke_raw+370)
#18  pc 0x0000000000004494
#00  pc 0x000000000008921c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x000000000008db00  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x000000000009c5a8  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+104)
#03  pc 0x00000000001faecc  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002b3168  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_string+1208)
#12  pc 0x000000000026da44  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_string_new_size_checked+6308)
#13  pc 0x000000000027690c  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_string_new_size_handle+6282)
#14  pc 0x000000000023d97c  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_System_String_FastAllocateString_raw+550)
#15  pc 0x0000000000008024 

@BrzVlad
Member

BrzVlad commented Oct 22, 2024

I don't know how the ANR system works; I would imagine they happen if the app is unresponsive for a while. If that is the case, I'm not sure this issue can actually be considered a bug. We seem to have a problem with the concurrent GC in the runtime, and that needs to be investigated somehow; disabling the concurrent GC just leads to longer GC pause times, which seem to trigger these ANRs. Is a GC pause time of 0.5-1 s supposed to trigger an ANR? If so, it seems to me that the best way to move forward would be to share the application, so we can see whether the serial GC takes longer than expected for whatever reason, and mainly to try some debug flags to see if we can detect problems with the concurrent collector.

@TimBurik
Author

@BrzVlad according to the Android documentation, one of the reasons for an ANR is a BroadcastReceiver failing to process an event within 5 seconds - this seems to be our case, as there's an additional description in the Google Play Console like "Broadcast of Intent { act=android.intent.action.SCREEN_ON }"

By the way, let me share a full report from the Google Play Console, which contains stack traces from all the threads: anr_all_stacktraces.txt
Maybe it contains some hints as to what the main thread may be waiting for.

I could also try to gather additional information regarding GC performance, but in general we didn't notice any significant performance degradation after disabling the concurrent GC.

@grendello
Contributor

@TimBurik is the BroadcastReceiver in your case running in an isolated process?

@BrzVlad if it's a problem with BroadcastReceiver, then perhaps it's "simply" a matter of a Java thread incorrectly attached to MonoVM?

@TimBurik
Author

TimBurik commented Oct 22, 2024

@grendello in our case most of the ANRs are related to the SystemEventsBroadcastReceiver from Sentry, which I believe runs in the same process by default:
https://github.com/getsentry/sentry-java/blob/4988d5bf9e0a5be708853cf53d07b8503e4c77be/sentry-android-core/src/main/java/io/sentry/android/core/SystemEventsBreadcrumbsIntegration.java#L124-L142

But this is not the only source of ANRs for us, as we also have reports:

All of those reports have very similar main thread stack traces related to GC.

It even seems that those broadcast events, service calls and input events are not really causing the issue - the main thread is already blocked by the time these events occur.

@TimBurik
Author

@grendello @BrzVlad is there anything else we could do to help you shed light on this issue?
Unfortunately, I don't believe we would be able to share the app, and, as mentioned before, we don't yet have a reliable way to reproduce the issue. But we could certainly check other theories or gather additional information if you have any ideas.

@BrzVlad
Member

BrzVlad commented Oct 23, 2024

Seeing that this issue happens after disabling the concurrent GC, I was biased into assuming that the GCs simply take longer to run and that this is causing the issues. After a second look at the stack traces, it indeed looks like the suspend machinery is hanging. It makes no sense to me why this issue would only happen with the serial GC, though. We have an env var, MONO_SLEEP_ABORT_LIMIT, which makes the application abort if the suspend takes longer than the provided limit in ms, while also printing each thread's state to the console. I think you should set this var to something that wouldn't trigger ANRs, maybe 1000 ms (suspending threads shouldn't take more than a few milliseconds). If we get explicit crashes in the runtime, that should confirm that the issue is in our runtime, and it may also make it easier to reproduce if we set a more aggressive suspend limit locally.

@TimBurik
Author

TimBurik commented Nov 7, 2024

@BrzVlad could you also give a hint as to where the thread dump can be found?
We did manage to find a fairly consistent way to reproduce the "stop the world hang" (although with an extreme abort limit of 40 ms). The hang always happens a few minutes after the application is backgrounded, but never while the app is in use.

In logcat we see the message:
E suspend_thread suspend took 40 ms, which is more than the allowed 40 ms
but the thread dump, which should be printed before it, is nowhere to be found.

We are using the following values as Android environment variables:

MONO_SLEEP_ABORT_LIMIT=40
MONO_LOG_LEVEL=info
MONO_LOG_MASK=gc

and also we are using the adb shell setprop debug.mono.log gc command.
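
For reference, here is roughly how we pass these variables (a minimal sketch, assuming the standard AndroidEnvironment build action; the file name is arbitrary). environment.txt:

MONO_SLEEP_ABORT_LIMIT=40
MONO_LOG_LEVEL=info
MONO_LOG_MASK=gc

referenced from the .csproj:

<ItemGroup>
  <AndroidEnvironment Include="environment.txt" />
</ItemGroup>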

@BrzVlad
Member

BrzVlad commented Nov 7, 2024

I would have expected the suspend state of each thread to be printed, but it looks like that logging path doesn't go through the Android log; it goes to the console instead. It would be useful for us to try to reproduce this failure, but given that the abort limit is not excessive and it only happens while in the background, it makes me wonder whether it's just the OS scheduling the app less frequently, with the suspend process taking a bit longer than usual because threads reach GC safepoints more slowly.
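
If you want to try capturing that console output anyway, Android can redirect stdout/stderr to logcat on emulators and rooted/userdebug devices (a sketch based on the documented log.redirect-stdio property; it is not supported on all devices):

adb shell stop
adb shell setprop log.redirect-stdio true
adb shell start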

@TimBurik
Author

TimBurik commented Nov 7, 2024

This issue doesn't seem to affect every GC in the background; in logcat we also see a couple of GC-related logs:

2024-11-07 16:08:11.722 19003-19011 <app.id>  I  Background concurrent copying GC freed 5559KB AllocSpace bytes, 57(1168KB) LOS objects, 49% free, 11MB/23MB, paused 3.069ms,1.275ms total 242.239ms
2024-11-07 16:08:55.162 19003-19003 <app.id>  I  Explicit concurrent copying GC freed 12MB AllocSpace bytes, 20(400KB) LOS objects, 49% free, 8040KB/15MB, paused 442us,230us total 150.742ms
.......
2024-11-07 16:09:31.504 19003-19011 <app.id>  I  Background young concurrent copying GC freed 6048KB AllocSpace bytes, 0(0B) LOS objects, 38% free, 9960KB/15MB, paused 577us,289us total 106.764ms
2024-11-07 16:09:58.109 19003-19011 <app.id>  I  Background concurrent copying GC freed 4932KB AllocSpace bytes, 0(0B) LOS objects, 49% free, 10MB/21MB, paused 196us,144us total 100.312ms
.......
2024-11-07 16:10:18.513 19003-19003 <app.id>  E  suspend_thread suspend took 40 ms, which is more than the allowed 40 ms

I'm also going to try to reproduce this issue with bigger values for MONO_SLEEP_ABORT_LIMIT, just to make sure this is a proper hang, not just a performance degradation spike of the background process. But it would surely be useful to collect as much information as possible when the reproduction happens. Do you know if there is any way to extract those thread dump messages from the Android device?

@TimBurik
Author

TimBurik commented Nov 8, 2024

@BrzVlad I just managed to reproduce the issue with a limit value of 1 s:

E suspend_thread suspend took 1000 ms, which is more than the allowed 1000 ms

The reproduction steps are the same - put the application in the background and wait - although this time it also required several tries (foregrounding/backgrounding the app) and a considerable amount of waiting (30m+ after each backgrounding)

@BrzVlad
Member

BrzVlad commented Nov 12, 2024

Would it be possible to provide the sample app with some reproduction steps?

@TimBurik
Author

Let me try to reproduce the issue on the sample app, and also discuss with the legal team the possibility of sharing our production code.

Meanwhile, I have found a similar issue with a deadlock in the GC stopping the world, mono/mono#21582, which ended up being fixed by changing the value of MONO_THREADS_SUSPEND. @BrzVlad do you know what the default thread suspension mode is for .net-android? Could this mode be related to this issue?

@lateralusX
Member

lateralusX commented Mar 14, 2025

@TimBurik I also looked at your latest ANR dumps:

ANR_suspend_signal_handler_full1.txt:

This one seems to be truncated, so some threads are cut out, but it still looks like we hit a rather interesting scenario. The main thread does some work that holds the loader lock, taken in mono_class_vtable_checked. Then there is "Thread-4", which runs finalizers and does a stack walk; it does a suspend-and-run to perform the stack walk, so I assume it has suspended the main thread while that thread holds the loader lock. Then, as part of the stack walk, it hits a case where it needs to create a class instance, which needs the loader lock, leading to a deadlock.

The only way to request a thread dump that triggers the stack walk is to explicitly call the mono_threads_request_thread_dump API. In Mono that happens only in the SIGQUIT handler, so that signal appears to have been fired, and the deadlock is a side effect of it. I can probably fix this by making the stack walk done in response to mono_threads_request_thread_dump async safe: since it is designed to be called from within signal handlers, it should only perform signal-safe operations, meaning it won't try to load additional classes that aren't already loaded, etc. Since this is a side effect of a SIGQUIT signal, it of course won't explain why that signal happens, but no other thread in the dump seems to be doing anything strange.

ANR_suspend_signal_handler_full2.txt:

This one is slightly different, but it ends up in a similar deadlock. In this case the thread "AndroidAnrWatchDog" calls into the Mono runtime to get a type name, which leads to a SIGSEGV (probably due to heap corruption); that invokes our SIGSEGV handler, which performs native crash reporting, triggering a full stack walk of the crashing thread. In parallel, code on "Thread-14" triggers a GC; it holds the GC lock and waits for threads to suspend. Another thread, "GLThread 1071", gets suspended while holding the loader lock, and the thread handling the SIGSEGV is about to stack walk but doesn't do it async safe, leading to a deadlock on the loader lock. It deadlocks inside the signal handler, so the suspend signal sent to this thread also waits, meaning the GC thread will wait forever on it.

Making sure we do an async-safe stack walk from within the crash handler would fix this chain of events, but it won't fix the root cause, which is probably heap corruption; at least it should turn this into a crash report instead of an ANR.

With these two dumps I have a couple of things to look at. It won't solve the underlying triggers (SIGQUIT and heap corruption), but it will at least make sure we handle these complex error-reporting scenarios better.

lateralusX added a commit to lateralusX/runtime that referenced this issue Mar 17, 2025
Mono's stack-walker can be signal/async safe when needed,
making sure it won't allocate additional memory or take internal
runtime locks. It is however up to the caller of the stack walking APIs
to decide if it should be signal and/or async safe. Currently this is
controlled using two different flags, MONO_UNWIND_SIGNAL_SAFE as well
as mono_thread_info_is_async_context.

This is problematic since callers want signal safe stack-walking, but
when both flags are not set, it will not behave fully signal safe.

dotnet/android#9365 hit a couple of scenarios
described here:

dotnet/android#9365 (comment)

that ended up deadlocking because they did stack-walking
from within a signal handler, or dumped the stack of a suspended
thread holding the runtime loader lock, without making the stack-walk async safe.

The fix makes sure that calls to stack-walk APIs can be made signal and/or
async safe and that the identified areas use the correct set of flags given
the state of threads when stack-walking.
@lateralusX
Member

The following PR should at least fix the deadlocks in the above ANRs: dotnet/runtime#113645.

@lateralusX
Member

lateralusX commented Mar 18, 2025

@TimBurik I looked back through the history of this issue and realized that you have a local repro of the issue - is that correct? If so, would it be possible for you to build a local runtime with some additional instrumentation (logging additional data to logcat) and use it together with your repro app? That way we could potentially get an idea of which thread doesn't seem to suspend, and then map that to the ANR to identify which thread(s) we are waiting on. We could first try this with the original hybrid suspend mode, and then potentially with preemptive suspend (with the above fix) for even more information.

@TimBurik
Author

@lateralusX Thank you very much for taking your time to look into the dumps!


I can probably fix this issue by making sure the stackwalk done in response to mono_threads_request_thread_dump is async safe since its designed to be called from within signal handlers, so it expects to only do signal safe operations, meaning it won't try to load additional classes if not already loaded etc.

This actually sounds great! We were suspecting that some of those ANRs happen while processing crashes (because of the high number of "background" ANRs in the Google Play Console). We would very much prefer those crashes to be reported as crashes, not as ANRs, as we currently have a disproportionately high ANR rate compared to our crash rate in production.


In this case we have thread "AndroidAnrWatchDog" that calls into Mono runtime to get a type name, that leads up to a SIGSEGV (probably due to a heap corruption), that will call our SIGSEGV handler that does a native crash, triggering a full stackwalk of crashing thread.

This one actually seems to be related to another crash ticket of ours: #9563. It seems that our custom ANR monitor implementation (the "AndroidAnrWatchDog" thread) was hitting some kind of infinite loop while collecting the context, requesting type name information over and over again, probably causing heap corruption in the end. Either way, after disabling this functionality the crash group is gone, and the related ANRs probably with it.


I looked back in history of this issue and realized that you had a local repro of the issue? Is that correct?

Yes, we do have a somewhat stable reproduction of the GC deadlock with a precise configuration:

  • <AndroidEnableSGenConcurrent> set to false;
  • MONO_THREADS_SUSPEND unset (so the default, "hybrid"?); when set to "preemptive", the deadlock cannot be reproduced with the same steps;
  • it seems to reproduce only on Samsung phones, although we haven't tested this theory much;

would it be possible for you to build a local runtime with some additional instrumentation (logging additional data into logcat) and use that together with your repro app?

Sure! Do you have any specific ideas in mind for what instrumentation we could add or turn on in the runtime? We would be able to run the test some time next week.

@lateralusX
Member

lateralusX commented Mar 19, 2025

@TimBurik Sounds great. Let me write up some instructions during the week and share them. Are you using an Android SDK app (dotnet new android) or MAUI (dotnet new maui)? You will need to do a local build of the runtime with some additional logging enabled in the code, then use that version of the runtime when building the app used for the reproduction steps. The additional logging goes to logcat, so we will need to pull some data out of logcat once the ANR reproduces; it will give more details on the threads we are waiting on as part of the STW (Stop The World) phase. Once we have that data, we can look at the ANR to see which thread is in a wrong state and why it doesn't reach a safe point in a timely manner.

@TimBurik
Author

@lateralusX we are using the Android SDK (android workload).
We are aware of the possibility of building .NET for Android locally via https://github.com/dotnet/android/blob/main/Documentation/building/unix/instructions.md, but we haven't tried it before.

lateralusX added a commit to dotnet/runtime that referenced this issue Mar 20, 2025
* [Mono]: Fix additional stack-walks to be async safe.

* Add signal async safe stack unwind option.

* Assert that walk_stack_full_llvm_only is only called in llvm only mode.

* Correct some bool usage.

* Make signal safe unwind option, signal async safe.

Mono's current MONO_UNWIND_SIGNAL_SAFE was not fully signal safe since
it was not async safe, which could lead to taking the loader lock. This
fixes MONO_UNWIND_SIGNAL_SAFE to be signal and async safe; it also
changes current uses of MONO_UNWIND_SIGNAL_SAFE to MONO_UNWIND_NONE since
the two were equal before this fix, meaning old calls using
MONO_UNWIND_SIGNAL_SAFE behave identically using MONO_UNWIND_NONE,
so there is no regression.
medhatiwari pushed a commit to medhatiwari/runtime that referenced this issue Mar 20, 2025
@akoeplinger
Member

akoeplinger commented Mar 20, 2025

Hm, that document is outdated - it talks about legacy Mono. I filed #9944.
The steps for Mono in dotnet/runtime are a bit different (and I'd say simpler), and you don't need to rebuild dotnet/android.

@lateralusX
Member

lateralusX commented Mar 20, 2025

I will follow up with details on Monday. It will be a local Mono dotnet/runtime Android build plus an override of ResolvedRuntimePack as described here: https://github.com/dotnet/runtime/blob/main/docs/workflow/debugging/mono/android-debugging.md#native-debugging-using-a-local-debug-build-of-mono.

@lateralusX
Member

@TimBurik, here are the steps needed to build a local arm64 Android runtime with additional STW logging:

Clone the dotnet/runtime repo and check out a release branch: for .NET 8, https://github.com/dotnet/runtime/tree/release/8.0; for .NET 9, https://github.com/dotnet/runtime/tree/release/9.0.
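
A sketch of that step for .NET 9 (use release/8.0 for .NET 8):

git clone https://github.com/dotnet/runtime.git
cd runtime
git checkout release/9.0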

Patch mono-threads-debug.h:

--- a/src/mono/mono/utils/mono-threads-debug.h
+++ b/src/mono/mono/utils/mono-threads-debug.h
@@ -7,13 +7,22 @@
 
 #include <config.h>
 #include <glib.h>
+#if defined (HOST_ANDROID)
+#include <android/log.h>
+#endif
 
 /* Logging - enable them below if you need specific logging for the category you need */
 #define MOSTLY_ASYNC_SAFE_FPRINTF(handle, ...) do { \
        g_async_safe_fprintf (handle, __VA_ARGS__); \
 } while (0)
 
+#if defined (HOST_ANDROID)
+#define MOSTLY_ASYNC_SAFE_PRINTF(...) do { \
+       __android_log_print (ANDROID_LOG_ERROR, "THREAD", __VA_ARGS__); \
+} while (0)
+#else
 #define MOSTLY_ASYNC_SAFE_PRINTF(...) MOSTLY_ASYNC_SAFE_FPRINTF(1, __VA_ARGS__);
+#endif
 
 #if 1
 #define THREADS_DEBUG(...)
@@ -21,7 +30,7 @@
 #define THREADS_DEBUG MOSTLY_ASYNC_SAFE_PRINTF
 #endif
 
-#if 1
+#if 0
 #define THREADS_STW_DEBUG(...)
 #else
 #define THREADS_STW_DEBUG MOSTLY_ASYNC_SAFE_PRINTF
 

Build the Android runtime locally; make sure to follow the runtime prerequisites in these instructions, but ignore the part about building runtime+tests:

https://github.com/dotnet/runtime/blob/release/8.0/docs/workflow/testing/libraries/testing-android.md

Build runtime+libs using the following command line:

./build.sh mono+libs -os android -arch arm64 -c Release -keepnativesymbols true

Once the build completes, the locally built runtime can be used when building a regular Android app by making this change to the project's .csproj file:

https://github.com/dotnet/runtime/blob/main/docs/workflow/debugging/mono/android-debugging.md#native-debugging-using-a-local-debug-build-of-mono
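
From memory, the override in that document looks roughly like this (a sketch only, assuming arm64/Release and a placeholder checkout path; check the linked doc for the exact form):

<!-- in the app's .csproj; /path/to/runtime is the dotnet/runtime checkout -->
<Target Name="UpdateRuntimePack" AfterTargets="ResolveFrameworkReferences">
  <ItemGroup>
    <ResolvedRuntimePack PackageDirectory="/path/to/runtime/artifacts/bin/microsoft.netcore.app.runtime.android-arm64/Release"
                         Condition="'%(ResolvedRuntimePack.FrameworkName)' == 'Microsoft.NETCore.App'" />
  </ItemGroup>
</Target>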

Make sure to rebuild the app (delete previous obj/bin folders).

If everything worked as expected, you should now see additional logging in logcat, prefixed with "[STW-" and using "THREAD" as the logcat tag, as soon as a GC is triggered.
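
To see only that output, a standard logcat tag filter should work:

adb logcat -s THREAD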

Once you have verified that you get the needed logging in logcat, run the repro and collect the logcat output plus the ANRs; this should give additional information on which threads the STW is waiting on, their call stacks, and their current GC mode.

Keep this setup around; we might need additional logging in case we still can't figure out why this happens with the logging enabled so far.

@TimBurik
Author

@lateralusX Thank you for the detailed instructions, they were very helpful!

We did manage to build a local Android runtime with the patch you provided, and we do see the additional STW messages in logcat.
However, we now also experience crashes almost immediately at app startup:

  • the app crashes only if the local Android runtime is used (replaced via <ResolvedRuntimePack>);
  • we have tried disabling the parts of the app where the crashes happen (like the crash reporter initialization at the early stages of startup), but the app keeps crashing in other, almost seemingly random, places;
  • some of the crash logs look "invalid" - for example, a managed stack trace with a call to a method that is never called in the source code, or a native SIGSEGV crash from an instance being cast to an interface it doesn't implement;
  • it seems as if there's an issue with method address mappings - have you ever seen similar symptoms?

Example of the STW section in logcat:

E  [STW-BEGIN][0x7696d69cb0] *** BEGIN SUSPEND *** 
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x765de5bcb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x765e061cb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7661029cb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7663c86cb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7664ffccb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x766fe29cb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7674b14cb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x767ce8acb0 skip false
E  [STW-BEGIN-SUSPEND-0] IGNORE thread 0x7696d69cb0 skip false reason 3
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7a3d958cb0 skip false
E  [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7a73c3b500 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x765de5bcb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x765e061cb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x7661029cb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x7663c86cb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x7664ffccb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x766fe29cb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x7674b14cb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x767ce8acb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x7a3d958cb0 skip false
E  [STW-BEGIN-SUSPEND-1] SUSPEND thread 0x7a73c3b500 skip false
E  [STW-RESTART] DONE thread 0x765de5bcb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x765e061cb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x7661029cb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x7663c86cb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x7664ffccb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x766fe29cb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x7674b14cb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x767ce8acb0 deemed fully suspended
E  [STW-RESTART] IGNORE RESUME thread 0x7696d69cb0 not been processed done 1 current 1 reason 0
E  [STW-RESTART] DONE thread 0x7a3d958cb0 deemed fully suspended
E  [STW-RESTART] DONE thread 0x7a73c3b500 deemed fully suspended
E  [STW-SUSPEND-END] thread 0x765de5bcb0 is suspended, stopped_ip = 0x7a33eecf1c, stack = 0x765de5af40 -> 0x765de5bcb0
E  [STW-SUSPEND-END] thread 0x765e061cb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x765e061210 -> 0x765e061cb0
E  [STW-SUSPEND-END] thread 0x7661029cb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x7661029210 -> 0x7661029cb0
E  [STW-SUSPEND-END] thread 0x7663c86cb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x7663c86090 -> 0x7663c86cb0
E  [STW-SUSPEND-END] thread 0x7664ffccb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x7664ffc310 -> 0x7664ffccb0
E  [STW-SUSPEND-END] thread 0x766fe29cb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x766fe29210 -> 0x766fe29cb0
E  [STW-SUSPEND-END] thread 0x7674b14cb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x7674b142b0 -> 0x7674b14cb0
E  [STW-SUSPEND-END] thread 0x767ce8acb0 is suspended, stopped_ip = 0x7a33eecf1c, stack = 0x767ce89ee0 -> 0x767ce8acb0
E  [STW-SUSPEND-END] thread 0x7696d69cb0 is NOT suspended, reason 3
E  [STW-SUSPEND-END] thread 0x7a3d958cb0 is suspended, stopped_ip = 0x7a33eecf20, stack = 0x7a3d9580b0 -> 0x7a3d958cb0
E  [STW-SUSPEND-END] thread 0x7a73c3b500 is suspended, stopped_ip = 0x7a33f428b8, stack = 0x7fc58b1680 -> 0x7fc58b5000
E  [STW-END] *** BEGIN RESUME ***
E  [STW-RESUME-WORLD] RESUME thread 0x765de5bcb0
E  [STW-RESUME-WORLD] RESUME thread 0x765e061cb0
E  [STW-RESUME-WORLD] RESUME thread 0x7661029cb0
E  [STW-RESUME-WORLD] RESUME thread 0x7663c86cb0
E  [STW-RESUME-WORLD] RESUME thread 0x7664ffccb0
E  [STW-RESUME-WORLD] RESUME thread 0x766fe29cb0
E  [STW-RESUME-WORLD] RESUME thread 0x7674b14cb0
E  [STW-RESUME-WORLD] RESUME thread 0x767ce8acb0
E  [STW-RESUME-WORLD] IGNORE thread 0x7696d69cb0, reason 3
E  [STW-RESUME-WORLD] RESUME thread 0x7a3d958cb0
E  [STW-RESUME-WORLD] RESUME thread 0x7a73c3b500

@lateralusX
Member

lateralusX commented Mar 31, 2025

@TimBurik, unfortunately not. I build and consume it from a dotnet new android template app, but I only ran it on an emulator - do you see the same issues on both emulator and device? Did you build a Release version of both the runtime and the libraries? I assume you also built a runtime version matching the dotnet Android SDK version you are currently using to build the app, correct? Maybe it would make sense for you to first try out the template app with the locally built runtime and see if that works as expected?

If we end up not getting this to work together with your app, we can do another hack and only replace libmonosgen-2.0.so directly in the Android runtime pack; that way only the locally built runtime shared library is updated, and a release version of libmonosgen-2.0.so is API compatible with the dotnet Android SDK given the same dotnet version. If you want to take this approach, locate the locally built libmonosgen-2.0.so in artifacts/bin/microsoft.netcore.app.runtime.android-arm64/Release/runtimes/android-arm64/native and replace the version included in the installed workload package used when building the Android app: find the dotnet runtime used when building the app (dotnet --list-runtimes), then replace libmonosgen-2.0.so in packs/Microsoft.NETCore.App.Runtime.Mono.android-arm64/[version]/runtimes/android-arm64/native (keep the old .so file so you can restore it later). Then do a full rebuild of the app and make sure the libmonosgen-2.0.so ending up in the APK is the new locally built version. A sketch of the swap follows.
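
A rough sketch of that swap (assuming a dotnet install under $HOME/.dotnet; [version] is the installed runtime pack version, and paths will differ per machine):

# locally built runtime, relative to the dotnet/runtime checkout
SRC=artifacts/bin/microsoft.netcore.app.runtime.android-arm64/Release/runtimes/android-arm64/native

# installed runtime pack used when building the app
DST=$HOME/.dotnet/packs/Microsoft.NETCore.App.Runtime.Mono.android-arm64/[version]/runtimes/android-arm64/native

cp "$DST/libmonosgen-2.0.so" "$DST/libmonosgen-2.0.so.orig"  # keep the original for restoring later
cp "$SRC/libmonosgen-2.0.so" "$DST/"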

@TimBurik
Author

do you see the same issues on emulator and device?

Yes, our application with the local Android runtime crashes at startup on both a real device and an emulator.

Did you build release version of both runtime and libraries?

I used the following command, which I assume builds both the runtime and the libraries:

./build.sh mono+libs -os android -arch arm64 -c Release -keepnativesymbols true

Just in case, here is the full output of the build process: android_runtime_build_output.txt

I assume you also build a runtime version matching the Android SDK version you are currently using building the app, correct?

I did not find any settings related to the Android SDK in the manual, except providing the paths for the SDK and NDK via environment variables. But I have been using the same Android SDK+NDK:

  • to build the app using default android runtime;
  • to build local android runtime;
  • to build the app using local android runtime;

Maybe it would make sense for you to first try out the template app with local build runtime and see if that works as expected?

I have checked the local Android runtime with a simple Android app (template + crash reporter initialization + a button to trigger an explicit nursery GC), and it works as expected on both device and emulator:

  • app doesn't crash neither at startup nor later;
  • STW logs appear in the logcat after GC;

Let me try the other approach with copying libmonosgen-2.0.so; hopefully it will work better with our app.

@lateralusX
Member

I did not find any settings related to Android SDK in the manual, except providing the path for SDK and NDK via environment variables. But I have been using the same Android SDK+NDK:

  • to build the app using default android runtime;
  • to build local android runtime;
  • to build the app using local android runtime;

Sorry, I was not precise about what I meant. I was referring to the version of the dotnet Android SDK used when building the app; you need to make sure you build the runtime using the same version as expected by the dotnet Android SDK.

As said above, in this case I think we would need to go with minimal patching of libmonosgen-2.0.so to eliminate any other broken dependencies.

@TimBurik
Author

@lateralusX sorry for the late response, but after a lot of trial and error I finally managed to reproduce the ANR using the local Android runtime!

I managed to make the local runtime work by using https://github.com/dotnet/runtime/tree/v9.0.2 as a base instead of https://github.com/dotnet/runtime/tree/release/9.0 (I was using dotnet/runtime@d9d0ae0 originally, which caused the crashes at startup).

Regarding the reproduction itself, these are the last app-related logs we see in logcat: ANR_reproduction_logcat_latest.txt
Here is the latest successful GC before the ANR, just in case: ANR_reproduction_logcat_last_succesful_GC.txt

At first glance, it seems the ANR happens in an early stage of the STW, and it is hard to tell the cause from these logs. Do you know of other logs we could enable to help us find the root cause of the issue?
Previously we tried using MONO_SLEEP_ABORT_LIMIT to crash the app when the GC takes too long, but we didn't manage to extract any useful information.

@lateralusX
Member

lateralusX commented Apr 11, 2025

@TimBurik, fantastic that you managed to get things working and that you have a repro with additional logging. I hoped our initial logging would let us identify the thread ids we are waiting on and then correlate them with the ANR report and call stacks, but it looks like the tid used in the ANR is the kernel id, while we use the pthread thread id in the logging. So we need to tweak that logging a little to include the kernel thread id, so we can correlate the threads in our logging with the threads in the ANR report.

The data collected so far gives us some more information: we block in the first phase of the STW, which is when we go over the state of all attached threads and, for those in RUNNING mode, ask them to cooperatively suspend and wait for completion. That indicates that at least one of the threads in the list is still in RUNNING mode, probably blocked outside of runtime or managed code, waiting on some external event. It is also worth noting that all threads included in the current STW were also part of the previous successful STW, so at that point all threads managed to suspend successfully.

In order to get the thread kernel id included in the STW logging, please apply this patch on top of the patch already applied:

--- a/src/mono/mono/metadata/sgen-stw.c
+++ b/src/mono/mono/metadata/sgen-stw.c
@@ -296,6 +296,11 @@ is_thread_in_current_stw (SgenThreadInfo *info, int *reason)
 	return TRUE;
 }
 
+static gpointer get_native_tid (SgenThreadInfo *info)
+{
+	return (gpointer)(gsize)(info->client_info.info.native_handle);
+}
+
 static
 void
 unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_stopped_func thread_stopped_callback)
@@ -306,7 +311,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 	g_assert (!mono_thread_info_will_not_safepoint (mono_thread_info_current ()));
 
 	mono_threads_begin_global_suspend ();
-	THREADS_STW_DEBUG ("[STW-BEGIN][%p] *** BEGIN SUSPEND *** \n", mono_thread_info_get_tid (mono_thread_info_current ()));
+	THREADS_STW_DEBUG ("[STW-BEGIN][%p [%p]] *** BEGIN SUSPEND *** \n", mono_thread_info_get_tid (mono_thread_info_current ()), get_native_tid (mono_thread_info_current ()));
 
 	for (MonoThreadSuspendPhase phase = MONO_THREAD_SUSPEND_PHASE_INITIAL; phase < MONO_THREAD_SUSPEND_PHASE_COUNT; phase++) {
 		gboolean need_next_phase = FALSE;
@@ -324,7 +329,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 
 			int reason;
 			if (!is_thread_in_current_stw (info, &reason)) {
-				THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] IGNORE thread %p skip %s reason %d\n", (int)phase, mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false", reason);
+				THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] IGNORE thread %p [%p] skip %s reason %d\n", (int)phase, mono_thread_info_get_tid (info), get_native_tid (info), info->client_info.skip ? "true" : "false", reason);
 				continue;
 			}
 
@@ -342,7 +347,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 				g_assert_not_reached ();
 			}
 
-			THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] SUSPEND thread %p skip %s\n", (int)phase, mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false");
+			THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] SUSPEND thread %p [%p] skip %s\n", (int)phase, mono_thread_info_get_tid (info), get_native_tid (info), info->client_info.skip ? "true" : "false");
 		} FOREACH_THREAD_END;
 
 		mono_thread_info_current ()->client_info.suspend_done = TRUE;
@@ -360,7 +365,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 
 			int reason = 0;
 			if (info->client_info.suspend_done || !is_thread_in_current_stw (info, &reason)) {
-				THREADS_STW_DEBUG ("[STW-RESTART] IGNORE RESUME thread %p not been processed done %d current %d reason %d\n", mono_thread_info_get_tid (info), info->client_info.suspend_done, !is_thread_in_current_stw (info, NULL), reason);
+				THREADS_STW_DEBUG ("[STW-RESTART] IGNORE RESUME thread %p [%p] not been processed done %d current %d reason %d\n", mono_thread_info_get_tid (info), get_native_tid (info), info->client_info.suspend_done, !is_thread_in_current_stw (info, NULL), reason);
 				continue;
 			}
 
@@ -373,7 +378,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 			if (!mono_thread_info_in_critical_location (info)) {
 				info->client_info.suspend_done = TRUE;
 
-				THREADS_STW_DEBUG ("[STW-RESTART] DONE thread %p deemed fully suspended\n", mono_thread_info_get_tid (info));
+				THREADS_STW_DEBUG ("[STW-RESTART] DONE thread %p [%p] deemed fully suspended\n", mono_thread_info_get_tid (info), get_native_tid (info));
 				continue;
 			}
 
@@ -385,7 +390,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 			if (!info->client_info.skip)
 				restart_counter += 1;
 
-			THREADS_STW_DEBUG ("[STW-RESTART] RESTART thread %p skip %s\n", mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false");
+			THREADS_STW_DEBUG ("[STW-RESTART] RESTART thread %p [%p] skip %s\n", mono_thread_info_get_tid (info), get_native_tid (info), info->client_info.skip ? "true" : "false");
 		} FOREACH_THREAD_END
 
 		mono_threads_wait_pending_operations ();
@@ -404,12 +409,12 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 		FOREACH_THREAD_EXCLUDE (info, flags) {
 			int reason = 0;
 			if (info->client_info.suspend_done || !is_thread_in_current_stw (info, &reason)) {
-				THREADS_STW_DEBUG ("[STW-RESTART] IGNORE SUSPEND thread %p not been processed done %d current %d reason %d\n", mono_thread_info_get_tid (info), info->client_info.suspend_done, !is_thread_in_current_stw (info, NULL), reason);
+				THREADS_STW_DEBUG ("[STW-RESTART] IGNORE SUSPEND thread %p [%p] not been processed done %d current %d reason %d\n", mono_thread_info_get_tid (info), get_native_tid (info), info->client_info.suspend_done, !is_thread_in_current_stw (info, NULL), reason);
 				continue;
 			}
 
 			if (!mono_thread_info_is_running (info)) {
-				THREADS_STW_DEBUG ("[STW-RESTART] IGNORE SUSPEND thread %p not running\n", mono_thread_info_get_tid (info));
+				THREADS_STW_DEBUG ("[STW-RESTART] IGNORE SUSPEND thread %p [%p] not running\n", mono_thread_info_get_tid (info), get_native_tid (info));
 				continue;
 			}
 
@@ -426,7 +431,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 				g_assert_not_reached ();
 			}
 
-			THREADS_STW_DEBUG ("[STW-RESTART] SUSPEND thread %p skip %s\n", mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false");
+			THREADS_STW_DEBUG ("[STW-RESTART] SUSPEND thread %p [%p] skip %s\n", mono_thread_info_get_tid (info), get_native_tid (info), info->client_info.skip ? "true" : "false");
 		} FOREACH_THREAD_END
 
 		mono_threads_wait_pending_operations ();
@@ -437,7 +442,7 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 		if (!is_thread_in_current_stw (info, &reason)) {
 			g_assert (!info->client_info.suspend_done || info == mono_thread_info_current ());
 
-			THREADS_STW_DEBUG ("[STW-SUSPEND-END] thread %p is NOT suspended, reason %d\n", mono_thread_info_get_tid (info), reason);
+			THREADS_STW_DEBUG ("[STW-SUSPEND-END] thread %p [%p] is NOT suspended, reason %d\n", mono_thread_info_get_tid (info), get_native_tid (info), reason);
 			continue;
 		}
 
@@ -446,8 +451,9 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
 		if (thread_stopped_callback)
 			thread_stopped_callback (info);
 
-		THREADS_STW_DEBUG ("[STW-SUSPEND-END] thread %p is suspended, stopped_ip = %p, stack = %p -> %p\n",
+		THREADS_STW_DEBUG ("[STW-SUSPEND-END] thread %p [%p] is suspended, stopped_ip = %p, stack = %p -> %p\n",
 			mono_thread_info_get_tid (info),
+			get_native_tid (info),
 			(gpointer)(MONO_CONTEXT_GET_IP (&(mono_thread_info_get_suspend_state (info)->ctx))),
 			info->client_info.stack_start ? info->client_info.stack_start : NULL, info->client_info.stack_start ? info->client_info.info.stack_end : NULL);
 
@@ -462,11 +468,11 @@ unified_suspend_restart_world (MonoThreadInfoFlags flags, unified_suspend_thread
 		int reason = 0;
 		if (is_thread_in_current_stw (info, &reason)) {
 			g_assert (mono_thread_info_begin_resume (info));
-			THREADS_STW_DEBUG ("[STW-RESUME-WORLD] RESUME thread %p\n", mono_thread_info_get_tid (info));
+			THREADS_STW_DEBUG ("[STW-RESUME-WORLD] RESUME thread %p [%p]\n", mono_thread_info_get_tid (info), get_native_tid (info));
 			if (thread_restarted_callback)
 				thread_restarted_callback (info);
 		} else {
-			THREADS_STW_DEBUG ("[STW-RESUME-WORLD] IGNORE thread %p, reason %d\n", mono_thread_info_get_tid (info), reason);
+			THREADS_STW_DEBUG ("[STW-RESUME-WORLD] IGNORE thread %p [%p], reason %d\n", mono_thread_info_get_tid (info), get_native_tid (info), reason);
 		}
 	} FOREACH_THREAD_END

We could increase the STW logging to get even more details on the suspend machinery; we just don't want to flood it with logging that could prevent the issue from reproducing. To enable more STW logging (but not all of it), you can apply this patch as well:

--- a/src/mono/mono/utils/mono-threads-debug.h
+++ b/src/mono/mono/utils/mono-threads-debug.h
@@ -36,7 +36,7 @@
 #define THREADS_STW_DEBUG MOSTLY_ASYNC_SAFE_PRINTF
 #endif

-#if 1
+#if 0
 #define THREADS_SUSPEND_DEBUG(...)
 #else
 #define THREADS_SUSPEND_DEBUG MOSTLY_ASYNC_SAFE_PRINTF

If the patches don't apply cleanly (they are generated from main), it should be straightforward to make the changes manually.

When you repro the ANR with the additional logging, please provide the logging plus the related ANR report (symbolicated), so we can figure out which threads we currently have in our STW and where they are blocked according to the callstacks in the ANR report.

Thank you so much for assisting on this, being able to run a repro on this issue with additional logging is fantastic. Looking forward to the results.

@TimBurik
Author

@lateralusX thank you for your help)

I have applied both of the patches from your latest message and right away managed to catch a freeze, which was later reported by the system as an ANR due to an input response timeout. Although this freeze/ANR was not reproduced using our usual reproduction steps, and thus might be a different, unrelated issue, I have collected all available data just in case:

Meanwhile, let me also try to reproduce the freeze/ANR using our usual reproduction steps, just to confirm whether this is the same issue or not.

@lateralusX
Member

lateralusX commented Apr 11, 2025

@TimBurik Took a quick first glance and it looks like it deadlocks on the additional logging. It uses __android_log_write_log_message, which is not fully async-signal safe, and since the additional logging can now happen on other threads that might get suspended, it can deadlock. I can look into making the logging async-signal safe, but maybe we should just go with the limited set of logging first, together with the new ability to match threads to the ANR report, so just back out:

--- a/src/mono/mono/utils/mono-threads-debug.h
+++ b/src/mono/mono/utils/mono-threads-debug.h
@@ -36,7 +36,7 @@
 #define THREADS_STW_DEBUG MOSTLY_ASYNC_SAFE_PRINTF
 #endif

-#if 1
+#if 0
 #define THREADS_SUSPEND_DEBUG(...)
 #else
 #define THREADS_SUSPEND_DEBUG MOSTLY_ASYNC_SAFE_PRINTF

and run again and see if it repros in the same way as it originally did.
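
For reference, the core problem with __android_log_write_log_message is that an async-signal-safe printer has to avoid locks, heap allocation and buffered stdio, and essentially hand a pre-built buffer straight to write(2). A rough sketch of that idea (this is only an illustration, not the actual MOSTLY_ASYNC_SAFE_PRINTF implementation):

#include <string.h>
#include <unistd.h>

/* Only calls write(2) on a pre-built buffer: no locks, no allocation,
   no buffered stdio, so a thread suspended in the middle of logging
   cannot leave shared logging state locked for every other thread. */
static void
async_safe_log (const char *msg)
{
	ssize_t unused = write (STDERR_FILENO, msg, strlen (msg));
	(void) unused;
}

__android_log_write, by contrast, may take internal locks, so if a thread gets suspended while holding one, the next thread that logs (including the STW initiator) deadlocks.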

@TimBurik
Author

@lateralusX I've disabled the additional THREADS_SUSPEND_DEBUG logging and managed to reproduce the freeze/ANR using our known reproduction steps, the same case as in #9365 (comment). Even though the app was frozen solid, I had a hard time making the Android OS recognize it as an ANR (show the popup), which is why the ANR dump was collected ~8 minutes after the freeze actually happened.

Here's all the information I managed to collect:

@lateralusX
Member

lateralusX commented Apr 14, 2025

@TimBurik, fantastic. I listed the threads below that are attached to MonoVM and appear to be in RUNNING state (we probably wait for them in the first phase of STW); they all seem to be waiting on external events and won't reach a safepoint in managed code, blocking the STW. Since we didn't get the more detailed thread output working yet, we don't know exactly which threads from the list below are already in BLOCKING mode vs RUNNING mode. A thread attached to the runtime in RUNNING mode (mono_thread_attach) must not wait on external events the way the threads below do.

  • Do you attach any threads on your own using the Mono embedding API's?
  • Do you use MarshalMethods in this repro?

If you are running 9.0.2, MarshalMethods should attach threads using mono_jit_thread_attach (the fix was done in 27b5d2e); before that they ended up attaching threads running in the Java thread pool in a way not compatible with the hybrid suspend model. It might be worth disabling them to see if things still repro.

The next step is to see if we can get some more logging, at least on the STW thread, to identify which threads are still in RUNNING vs BLOCKING state during the first STW phase when we hit the ANR; maybe also fix up the dump thread to work on a schedule, e.g. dump every second for the first couple of seconds of waiting.

Since we shouldn't be calling mono_thread_attach on these threads, we could also add an assert that fires if that happens, which would give us an abort plus a callstack showing where the call comes from.
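
As a reference point, this is roughly what a correctly attached native thread looks like under the hybrid/coop suspend model: it flips into a GC safe (BLOCKING) region before waiting on anything external. This is only a sketch against the public embedding API as I recall it; wait_for_external_event/do_managed_work are made-up placeholders:

#include <mono/jit/jit.h>
#include <mono/metadata/threads.h>

extern void wait_for_external_event (void); /* hypothetical blocking wait */
extern void do_managed_work (void);         /* hypothetical managed callback */

static void
embedder_worker (MonoDomain *domain)
{
	/* Attach so the runtime tracks this native thread. */
	mono_jit_thread_attach (domain);

	for (;;) {
		void *stackdata;

		/* Enter a GC safe (BLOCKING) region before blocking on an
		   external event, so STW never has to wait for this thread
		   to reach a managed safepoint. */
		void *cookie = mono_threads_enter_gc_safe_region (&stackdata);
		wait_for_external_event ();
		mono_threads_exit_gc_safe_region (cookie, &stackdata);

		do_managed_work ();
	}
}

A thread parked like the ones below while still in RUNNING state never passes a safepoint, which is exactly the situation that would block the first STW phase.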

Waiting on the thread pool for new work:

"queue-1-2" sysTid=10620
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the thread pool for new work:

"PlayBillingLibr" sysTid=10636
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the thread pool for new work:

"PlayBillingLibr" sysTid=10723
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the thread pool for new work:

"PlayBillingLibr" sysTid=10724
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the thread pool for new work:

"PlayBillingLibr" sysTid=10761
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the thread pool for new work:

"PlayBillingLibr" sysTid=10762
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the thread pool for new work:

"SentryAsyncConn" sysTid=10516
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 00000000004274bc  /apex/com.android.art/lib64/libart.so (art::Thread::Park(bool, long)+668) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 0000000000426f10  /apex/com.android.art/lib64/libart.so (art::Unsafe_park(_JNIEnv*, _jobject*, unsigned char, long) (.__uniq.319429422067363160645159987129209045680)+164) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026bc34c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #04 pc 0000000002250a30  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.LockSupport.park+144)
    #05 pc 00000000022508d8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block+136)
    #06 pc 00000000022505c4  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.unmanagedBlock+164)
    #07 pc 0000000002250418  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ForkJoinPool.managedBlock+232)
    #08 pc 000000000224f30c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+412)
    #09 pc 000000000224ef54  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.LinkedBlockingQueue.take+228)
    #10 pc 0000000002205a20  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.getTask+608)
    #11 pc 00000000021c5300  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor.runWorker+192)
    #12 pc 00000000021c5158  /memfd:jit-cache (deleted) (offset 0x2000000) (java.util.concurrent.ThreadPoolExecutor$Worker.run+136)
    #13 pc 0000000002004a70  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Thread.run+144)
    #14 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #18 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #19 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the Looper message queue:

"ConnectivityThr" sysTid=10514
    #00 pc 00000000000ab8b8  /apex/com.android.runtime/lib64/bionic/libc.so (__epoll_pwait+8) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 0000000000017edc  /system/lib64/libutils.so (android::Looper::pollInner(int)+188) (BuildId: 349e8a707584d82e0226d35def83e3d0)
    #02 pc 0000000000017dc0  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+120) (BuildId: 349e8a707584d82e0226d35def83e3d0)
    #03 pc 0000000000184494  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: 4b6c592933d9e19dac2d846de01f461b)
    #04 pc 00000000026b237c  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+140)
    #05 pc 000000000207ef44  /memfd:jit-cache (deleted) (offset 0x2000000) (android.os.MessageQueue.next+356)
    #06 pc 000000000207e19c  /memfd:jit-cache (deleted) (offset 0x2000000) (android.os.Looper.loopOnce+172)
    #07 pc 000000000207cd44  /memfd:jit-cache (deleted) (offset 0x2000000) (android.os.Looper.loop+532)
    #08 pc 00000000022ad288  /memfd:jit-cache (deleted) (offset 0x2000000) (android.os.HandlerThread.run+312)
    #09 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #10 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #11 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #12 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #13 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #14 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

?? :

"GLThread 184" sysTid=10494
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 000000000022a29c  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+136) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 00000000004244dc  /apex/com.android.art/lib64/libart.so (art::Monitor::Wait(art::Thread*, art::ObjPtr<art::mirror::Object>, long, int, bool, art::ThreadState)+7244) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026af104  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+116)
    #04 pc 000000000200bb8c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Object.wait+124)
    #05 pc 0000000002027294  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Object.wait+116)
    #06 pc 00000000027aea48  /memfd:jit-cache (deleted) (offset 0x2000000) (android.opengl.GLSurfaceView$GLThread.guardedRun+1448)
    #07 pc 0000000000780f44  /apex/com.android.art/lib64/libart.so (nterp_helper+3924) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #08 pc 00000000001d2ac8  /system/framework/framework.jar (android.opengl.GLSurfaceView$GLThread.run+52)
    #09 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #10 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #11 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #12 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #13 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #14 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

Waiting on the finalizer queue:

"FinalizerDaemon" sysTid=10374
    #00 pc 0000000000055f1c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #01 pc 000000000022a29c  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+136) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #02 pc 00000000004244dc  /apex/com.android.art/lib64/libart.so (art::Monitor::Wait(art::Thread*, art::ObjPtr<art::mirror::Object>, long, int, bool, art::ThreadState)+7244) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #03 pc 00000000026af104  /memfd:jit-cache (deleted) (offset 0x2000000) (art_jni_trampoline+116)
    #04 pc 000000000200bb8c  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Object.wait+124)
    #05 pc 000000000200aaac  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.ref.ReferenceQueue.remove+364)
    #06 pc 000000000200a884  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.ref.ReferenceQueue.remove+116)
    #07 pc 00000000026ed5e8  /memfd:jit-cache (deleted) (offset 0x2000000) (java.lang.Daemons$FinalizerDaemon.runInternal+472)
    #08 pc 0000000000368ad0  /apex/com.android.art/lib64/libart.so (art_quick_osr_stub+64) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #09 pc 0000000000352b2c  /apex/com.android.art/lib64/libart.so (art::jit::Jit::MaybeDoOnStackReplacement(art::Thread*, art::ArtMethod*, unsigned int, int, art::JValue*)+492) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #10 pc 000000000076dabc  /apex/com.android.art/lib64/libart.so (void art::interpreter::ExecuteSwitchImplCpp<false>(art::interpreter::SwitchImplContext*)+4920) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #11 pc 00000000003815d8  /apex/com.android.art/lib64/libart.so (ExecuteSwitchImplAsm+8) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #12 pc 000000000002b6dc  /apex/com.android.art/javalib/core-libart.jar (java.lang.Daemons$FinalizerDaemon.runInternal)
    #13 pc 0000000000354250  /apex/com.android.art/lib64/libart.so (art::interpreter::Execute(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame&, art::JValue, bool, bool) (.__uniq.112435418011751916792819755956732575238.llvm.16344227474899595276)+428) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #14 pc 0000000000362678  /apex/com.android.art/lib64/libart.so (bool art::interpreter::DoCall<false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, bool, art::JValue*)+4124) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #15 pc 000000000076f734  /apex/com.android.art/lib64/libart.so (void art::interpreter::ExecuteSwitchImplCpp<false>(art::interpreter::SwitchImplContext*)+12208) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #16 pc 00000000003815d8  /apex/com.android.art/lib64/libart.so (ExecuteSwitchImplAsm+8) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #17 pc 000000000002b314  /apex/com.android.art/javalib/core-libart.jar (java.lang.Daemons$Daemon.run)
    #18 pc 0000000000354250  /apex/com.android.art/lib64/libart.so (art::interpreter::Execute(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame&, art::JValue, bool, bool) (.__uniq.112435418011751916792819755956732575238.llvm.16344227474899595276)+428) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #19 pc 0000000000362678  /apex/com.android.art/lib64/libart.so (bool art::interpreter::DoCall<false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, bool, art::JValue*)+4124) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #20 pc 000000000076f734  /apex/com.android.art/lib64/libart.so (void art::interpreter::ExecuteSwitchImplCpp<false>(art::interpreter::SwitchImplContext*)+12208) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #21 pc 00000000003815d8  /apex/com.android.art/lib64/libart.so (ExecuteSwitchImplAsm+8) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #22 pc 0000000000117a28  /apex/com.android.art/javalib/core-oj.jar (java.lang.Thread.run)
    #23 pc 00000000003535dc  /apex/com.android.art/lib64/libart.so (artQuickToInterpreterBridge+1932) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #24 pc 000000000037f098  /apex/com.android.art/lib64/libart.so (art_quick_to_interpreter_bridge+88) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #25 pc 0000000000368774  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #26 pc 0000000000353f24  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+132) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #27 pc 0000000000947968  /apex/com.android.art/lib64/libart.so (_ZN3art9ArtMethod14InvokeInstanceILc86ETpTncJEEENS_6detail12ShortyTraitsIXT_EE4TypeEPNS_6ThreadENS_6ObjPtrINS_6mirror6ObjectEEEDpNS3_IXT0_EE4TypeE+60) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #28 pc 0000000000636878  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1344) (BuildId: a0856b3dbc826e8cee9d66738ac739d8)
    #29 pc 00000000000be908  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 2febda0bbcffa4b71545deb95b27767d)
    #30 pc 000000000005b3f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 2febda0bbcffa4b71545deb95b27767d)

@TimBurik
Author

@lateralusX thank you for looking into this, and sorry for the late response.

Do you attach any threads on your own using the Mono embedding API's?

No, and as far as I know we don't run any native threads ourselves, but some third-party libraries we use - might.

Do you use MarshalMethods in this repro?

I assume you're referring to <AndroidEnableMarshalMethods>. We don't specify the value of this flag manually, so I assume it is disabled by default. But I could enable/disable it manually for the next test, if needed.

Regarding the list of threads, do I understand correctly that one or more of them are actually responsible for the ANR? I am going to try to identify those threads; so far the easiest one to identify is "SentryAsyncConn" sysTid=10516:
https://github.com/getsentry/sentry-java/blob/58e9007ac62454e5867fba7cd0070763c78e2e42/sentry/src/main/java/io/sentry/transport/AsyncHttpTransport.java#L217-L221
This is part of the Sentry Java SDK, which is used as a binding by the Sentry .NET SDK, which is in turn used by our app. We don't have control over when and how this thread is started, but I can imagine that at some point it might execute a callback defined in managed code. Could such a scenario cause an issue during STW?

Meanwhile, if you have any ideas on how to further enhance the logging and make it more informative for this case, I would be glad to test them out.

@lateralusX
Member

lateralusX commented Apr 28, 2025

@TimBurik, regarding MarshalMethods: I believe they were disabled in .NET 8, and the known issues should have been fixed in .NET 9, so they should not be in play for this scenario; but to make sure that's the case, it could make sense to disable them and see if the issue still reproduces.

Regarding the threads: the ones I listed above are attached to the runtime, so they either need to reach safepoints, or already be in a GC safe region and end up preemptively suspended. The threads above are not preemptively suspended, so they are either already in a GC safe region, or in RUNNING, meaning they have been attached in a way not compatible with waiting on external scheduler events the way they do. We would need additional thread-specific logging to know what state each thread currently has.

We could try to detect whether threads get attached using mono_thread_attach, either by logging some details in that method, or by simply aborting the app when it happens so we get a callstack showing which code still attaches via mono_thread_attach. Go to src\mono\mono\metadata\threads.c and the mono_thread_attach function:

MonoThread *
mono_thread_attach (MonoDomain *domain)
{
	return mono_thread_attach_external_native_thread (domain, FALSE);
}

and change that to:

MonoThread *
mono_thread_attach (MonoDomain *domain)
{
	/* Abort on purpose: the resulting native callstack in logcat shows
	   which code is still attaching via mono_thread_attach. */
	g_assert (FALSE);
	return mono_thread_attach_external_native_thread (domain, FALSE);
}

rebuild the runtime, rebuild the app with the changed runtime, and re-run the repro.

So to summarize:

  • Re-run the repro with AndroidEnableMarshalMethods set to false. Does the issue still repro?
  • Re-run the repro with the above patch. Does the app crash on the assert? If so, what's the callstack dumped in logcat?

I will prepare some more detailed thread logging unless the above gives us more clues.

@TimBurik
Author

@lateralusX I've just tested both of the cases with following results:

  • the ANR is still reproduced with AndroidEnableMarshalMethods=false;
  • the app does not crash with the additional g_assert (FALSE); in mono_thread_attach, at least from application launch up until the ANR reproduction;

@lateralusX
Member

@TimBurik, thanks for checking these scenarios so we can rule them out. It looks like we need to extract more information from the threads, to pinpoint which thread(s) are not in a GC safe region (BLOCKING state) but are in RUNNING and never reach a safepoint in a timely manner. Let's start with the following patch (applied on top of the already implemented patches) to output the state transitions as part of the first STW phase logging:

--- a/src/mono/mono/metadata/sgen-stw.c
+++ b/src/mono/mono/metadata/sgen-stw.c
@@ -296,6 +296,8 @@ is_thread_in_current_stw (SgenThreadInfo *info, int *reason)
        return TRUE;
 }

+const char * mono_threads_get_current_state_name (MonoThreadInfo *info);
+
 static
 void
 unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_stopped_func thread_stopped_callback)
@@ -324,10 +326,12 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st

                        int reason;
                        if (!is_thread_in_current_stw (info, &reason)) {
-                               THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] IGNORE thread %p skip %s reason %d\n", (int)phase, mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false", reason);
+                               THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] IGNORE thread %p skip %s reason %d state %s\n", (int)phase, mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false", reason, mono_threads_get_current_state_name (&(info->client_info.info)));
                                continue;
                        }

+                       const char *previous_state = mono_threads_get_current_state_name (&(info->client_info.info));
+
                        switch (mono_thread_info_begin_suspend (info, phase)) {
                        case MONO_THREAD_BEGIN_SUSPEND_SUSPENDED:
                                info->client_info.skip = FALSE;
@@ -342,7 +346,9 @@ unified_suspend_stop_world (MonoThreadInfoFlags flags, unified_suspend_thread_st
                                g_assert_not_reached ();
                        }

-                       THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] SUSPEND thread %p skip %s\n", (int)phase, mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false");
+                       const char *current_state = mono_threads_get_current_state_name (&(info->client_info.info));
+
+                       THREADS_STW_DEBUG ("[STW-BEGIN-SUSPEND-%d] SUSPEND thread %p skip %s state %s->%s\n", (int)phase, mono_thread_info_get_tid (info), info->client_info.skip ? "true" : "false", previous_state, current_state);
                } FOREACH_THREAD_END;

                mono_thread_info_current ()->client_info.suspend_done = TRUE;

and

--- a/src/mono/mono/utils/mono-threads-state-machine.c
+++ b/src/mono/mono/utils/mono-threads-state-machine.c
@@ -180,6 +180,15 @@ trace_state_change (const char *transition, MonoThreadInfo *info, int cur_raw_st
        trace_state_change_with_func (transition, info, cur_raw_state, next_state, next_no_safepoints, suspend_count_delta, "");
 }

+const char *
+mono_threads_get_current_state_name (MonoThreadInfo *info)
+{
+       int raw_state, cur_state, suspend_count;
+       gboolean no_safepoints;
+       UNWRAP_THREAD_STATE (raw_state, cur_state, suspend_count, no_safepoints, info);
+       return state_name (cur_state);
+}
+
 /*
 This is the transition that signals that a thread is functioning.
 Its main goal is to catch threads been witnessed before been fully registered.

@TimBurik
Author

TimBurik commented Apr 30, 2025

@lateralusX I guess we've found the culprit, although a rather unexpected one, and I'm not sure what to make of it.

I reproduced the ANR with additional thread state information, and the last STW log looks like this:

04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN][0x73ab922500 [0x3787]] *** BEGIN SUSPEND *** 
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f2d4fbcb0 [0x3946] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f33f0acb0 [0x3937] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f4e151cb0 [0x3909] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f4f635cb0 [0x3908] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f5a05fcb0 [0x3936] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f5cbbfcb0 [0x38ba] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f648e7cb0 [0x38aa] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f6c9e2cb0 [0x389e] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f8b41acb0 [0x3863] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f8f5fbcb0 [0x3864] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f91221cb0 [0x3852] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f93848cb0 [0x384f] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f9599dcb0 [0x384e] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f9b225cb0 [0x3846] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f9d439cb0 [0x3843] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.544 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6f9fb73cb0 [0x3844] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6fa8ad6cb0 [0x3836] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6fb6026cb0 [0x382c] skip false state RUNNING->ASYNC_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6fb7620cb0 [0x3829] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x6fe805bcb0 [0x39eb] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7072bb4cb0 [0x378e] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] SUSPEND thread 0x7385364cb0 [0x37f1] skip false state STATE_BLOCKING->STATE_BLOCKING_SUSPEND_REQUESTED
04-29 17:17:53.545 14215 14215 E THREAD  : [STW-BEGIN-SUSPEND-0] IGNORE thread 0x73ab922500 [0x3787] skip false reason 3 state RUNNING

I assume this means that thread 0x382c (14380) is the one blocking the GC?

And in the ANR dump there's a record for this thread, but it looks rather strange:

".NET Long Runni" sysTid=14380
    #00 pc 000000000021e564  /data/app/~~BP3l4GdFh0DyKvNlMdk6wQ==/<app.bundle.id>---MYm7DTLAhW8Jj46icnrw==/lib/arm64/libaot-System.Private.CoreLib.dll.so

This is one of the long-running threads used in our application (either by us directly or by one of the third parties); we should be able to figure out which one it is if needed. But I was under the impression that AOT libs, unlike usual native libs, don't have any instructions to execute and only contain JIT-precompiled code as a data block. Could this single line in the thread dump mean that the Mono runtime is trying to load this AOT lib into memory?

Just in case, here is the additional context I managed to collect:

It is also worth pointing out that I've made another reproduction of the ANR using the same steps, and it shows all the same symptoms: one of the threads is in the RUNNING->ASYNC_SUSPEND_REQUESTED state transition, and it is a ".NET Long Running" thread with a single frame from an AOT lib. The difference is that it's a different lib:

".NET Long Runni" sysTid=9805
    #00 pc 00000000002ab808  /data/app/~~BP3l4GdFh0DyKvNlMdk6wQ==/<app.bundle.id>---MYm7DTLAhW8Jj46icnrw==/lib/arm64/libaot-BB.Features.dll.so

where BB.Features is one of the projects in our solution.

I've also checked the earlier reproduction of the ANR, and it also contains a single-aot-frame long-running thread, although there's no evidence that this thread was blocking the GC in that case:

".NET Long Runni" sysTid=10497
    #00 pc 0000000000004648  /data/app/~~5jR1hF13xd75ietBwMowQQ==/<app.bundle.id>-JopwmmXJ4Oty8gB9uaMk5w==/lib/arm64/libaot-BB.Utilities.BCL.dll.so

@lateralusX
Member

@TimBurik Nice! Yes, looking at the new logging, all threads are already in BLOCKING mode except 0x382c and 0x3787; 0x3787 is the thread running the GC, so it will be ignored. That leaves us with 0x382c: since it's in RUNNING, we ask it to suspend (ASYNC_SUSPEND_REQUESTED) and then wait for it to reach a safepoint in managed code. Apparently that never happens for this thread; if it had hit a safepoint you would see some native frames on that thread's callstack, so it's still executing managed code and appears to never reach a safepoint. Looking at the previous successful GC, 0x382c was in BLOCKING state during the first STW phase. Looking back at the last log entries for that thread, we get:

04-29 17:17:02.533 14215 14380 D Mono : AOT NOT FOUND: System.Runtime.CompilerServices.AsyncTaskMethodBuilder1/AsyncStateMachineBox1<System.Threading.Tasks.VoidTaskResult, BB.Features.Resources.Remote.Update.RemoteResourceUpdatePollingService/d__14>:MoveNext ().
04-29 17:17:02.534 14215 14380 D Mono : AOT NOT FOUND: System.Runtime.CompilerServices.AsyncTaskMethodBuilder1/AsyncStateMachineBox1<System.Threading.Tasks.VoidTaskResult, BB.Features.Resources.Remote.Update.RemoteResourceUpdatePollingService/d__14>:MoveNext (System.Threading.Thread).
04-29 17:17:02.538 14215 14380 D Mono : AOT NOT FOUND: System.Runtime.CompilerServices.AsyncTaskMethodBuilder1/AsyncStateMachineBox1<System.Threading.Tasks.VoidTaskResult, BB.Features.Resources.Remote.Update.RemoteResourceUpdatePollingService/d__14>:ExecutionContextCallback (object).
04-29 17:17:02.539 14215 14380 D Mono : AOT: FOUND method BB.Features.ResourcesPreloading.IdleMonitor:WaitForIdleAsync () [0x70024847d0 - 0x7002484850 0x7003ca0893]

So it had at least been running that code some time before hitting the ANR.

The reason you don't see a callstack in the ANR for thread 0x382c is that the Android native unwinder doesn't handle managed C# frames, so it can't unwind through them; we would need to dump the managed callstack of the thread to see what code it's running.

The AOT module contains pre-compiled native code that, when found by the loader, we load and execute instead of going through the JIT. It normally reduces startup time since it eliminates the need to JIT compile a large number of methods during startup.

Managed code checks for safepoints when calling functions, when handling exceptions, and on loop back-edges. AOT-compiled code should also include safepoints, given that it was compiled with a matching suspend model (hybrid), and there should be guards in the Mono loader that prevent a mismatch between the suspend model used when AOT compiling the libraries and the one used by the runtime, so the AOT code should have safepoints included. I will try to disassemble your AOT module to see whether it appears to have safepoints included, and also take a look at 000000000021e564 to see what the code seems to be doing.
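
To make the safepoint mechanism concrete, here is a sketch of what a compiler-inserted poll amounts to; the flag and helper names are made up for illustration, the real runtime uses its own polling flag/page and suspend machinery:

/* Hypothetical stand-ins for the runtime's poll state and park routine. */
extern volatile int suspend_requested;
extern void runtime_safepoint (void);

void
compiled_loop (void)
{
	for (;;) {
		/* ... loop body ... */

		/* Back-edge safepoint poll: a cheap load that the suspend
		   initiator can flip. Without polls like this, a hot loop
		   never cooperatively suspends and STW waits forever. */
		if (suspend_requested)
			runtime_safepoint ();
	}
}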

It would also be worth testing without the AOT-compiled methods, to rule out issues around missing safepoints in AOT-compiled code. I believe that can be accomplished by setting the MSBuild property RunAOTCompilation to false when building the app, but you need to double-check that it does the right thing. If it works, you shouldn't get any libaot .so files in the APK.

If you could get hold of the managed stack of that thread, that would be great. If not, I'll see if we can implement something that dumps managed thread stacks to logcat in response to an ANR, but I need to finish some other work items before pursuing that task.

@TimBurik
Author

@lateralusX Thanks for the analysis!

I've just managed to reproduce the ANR using the build with AOT disabled, and the thread responsible for blocking the GC has the following dump:

".NET Long Runni" sysTid=2728
    #00 pc 0000000000004664  <anonymous:738a343000>

I suppose this means that the issue is not AOT-specific.

I'm not sure I'd be able to get hold of the managed stacktrace at the exact moment of the ANR without further updates to the runtime source code, but I can definitely investigate this blocking thread from the app-logic perspective in the meantime.
