Suspect deadlock in InternalLoggerRegistry (intermittent) #3399
Comments
@nilsson-petter, thanks so much for the report. It indeed appears to be a bug.
Thank you for the quick response! I can find six references of that method in the thread dump. However, they are all stuck at
I have attached a redacted thread dump.
I will provide a bit more context for this issue. We use Spring Kafka in this application. We are consuming one topic, using ordinary threads. We are producing to another topic, doing so with a thread pool of 32 Java 21 Virtual Threads. Perhaps this is not playing along very nicely at times. We copied
The logging statements contain context on the thread and the name of the logger that is acquiring the lock. It starts to look a bit funky closer to the end, with
At the end, we have three statements of WAITING for the
Happened again in the same way, with
I have managed to reproduce the deadlock in our codebase. When switching to non-virtual threads, no deadlock can be observed.
@nilsson-petter, would you mind testing if you can still reproduce the problem using virtual threads and JDK 24 EA, please? (JDK 24 will ship JEP 491, delivering several improvements for pinning issues.)
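For readers unfamiliar with the pinning issue JEP 491 addresses, here is a minimal, hedged sketch, unrelated to the Log4j code in this issue: on JDK 21, running it with -Djdk.tracePinnedThreads=full should print a stack trace when the virtual thread blocks while holding a monitor, whereas JDK 24 lets the thread unmount instead of pinning its carrier.

```java
// Minimal pinning probe (illustrative only, not from this issue).
// JDK 21: run with -Djdk.tracePinnedThreads=full to see the pinned-thread trace.
public class PinningProbe {
    private static final Object MONITOR = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (MONITOR) {
                try {
                    Thread.sleep(100); // blocking inside synchronized pins the virtual thread to its carrier
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        vt.join();
    }
}
```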
I can not reproduce it with virtual threads using JDK 24 instead of 21. Local tests run on
We've been seeing these in the Infinispan testsuite and we don't enable vthreads by default.
@tristantarrant, which Log4j version? Which JDK distribution and version?
log4j It was not happening with
@tristantarrant, could you provide a minimal reproducer, please?
@tristantarrant, is the stack trace you shared complete/exhaustive? If not, could you provide a complete one, please?
No, I trimmed only the stack traces that were locked on that. I'm working on the simplest possible reproducer.
I am stumbling upon (probably) the same problem and can reliably reproduce it. I could share stack traces if that helps, but sadly I cannot hand out a minimal example because we are using an internal library in our setup. Our setup reads which log levels are configured from a ZooKeeper instance, but the classes that do that contain loggers themselves. Until now this wasn't a problem. I suspect that due to the changes in locking between
@freelon, would you mind helping with the following questions, please?
Would you mind elaborating on this, please? Maybe with a pseudo-code example demonstrating the case?
@vy So for the easy part:
For the pseudo-code part, I'll try my best to explain it here 🙈: We have a custom It overwrites
Zookeeper/Curator uses multiple threads (as you can see in the stack traces) to coordinate itself. While the application Since the TCP connection thread is now blocked, the connection to the ZooKeeper server cannot succeed, and therefore the main thread, waiting for the connection watcher to reach status CONNECTED, never unblocks, never releasing the InternalLoggerRegistry lock, etc. Is this helpful and understandable? If this does not get you any further, I will invest some time to try and build a reproducible example that I can share.
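To make the shape of that cycle concrete, here is a minimal, runnable simulation; all the names (registryLock, connected, the zk-connection thread) are stand-ins invented for illustration, not code from the setup described above.

```java
// Simulation of the reported cycle: the "main" thread holds the registry write
// lock while waiting for a CONNECTED signal that only the "connection" thread
// can deliver, and the connection thread first needs the same write lock.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RegistryDeadlockSketch {
    static final ReentrantReadWriteLock registryLock = new ReentrantReadWriteLock();
    static final CountDownLatch connected = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread connectionThread = new Thread(() -> {
            registryLock.writeLock().lock(); // blocks: main still holds the write lock
            try {
                connected.countDown();       // never reached
            } finally {
                registryLock.writeLock().unlock();
            }
        }, "zk-connection");

        registryLock.writeLock().lock();     // stands in for InternalLoggerRegistry's write lock
        try {
            connectionThread.start();
            connected.await();               // waits forever -> the program deadlocks here
        } finally {
            registryLock.writeLock().unlock();
        }
    }
}
```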
Looking into the stack traces further (I looked at the traces when Previously:
With the new
So the difference seems to be that in the old implementation the actual initialization of a logger was done outside the lock, while now it is done while holding the write lock, causing the deadlock in our case.
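A simplified sketch of the problematic shape described above; this is an assumption-laden stand-in, not the actual Log4j source. The point is only that the factory runs while the write lock is held, so any lookups or logging the constructor triggers also run under that lock.

```java
// Stand-in for the "new" shape: the logger factory runs under the write lock.
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

class LockedFactoryRegistry<L> {
    private final Map<String, WeakReference<L>> refByName = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    L computeIfAbsent(String name, Function<String, L> factory) {
        lock.writeLock().lock();
        try {
            WeakReference<L> ref = refByName.get(name);
            L logger = (ref == null) ? null : ref.get();
            if (logger == null) {
                logger = factory.apply(name); // constructor work happens while the write lock is held
                refByName.put(name, new WeakReference<>(logger));
            }
            return logger;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```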
It is very clear, thanks. We have already seen some unwanted effects if
Since all loggers with the same name are equivalent, we don't need to ensure that the logger is created only once. @vy, what do you think?
It has been reported that holding a lock on `InternalLoggerRegistry` during the creation of a logger can cause performance loss and deadlocks. The logger constructor can trigger property lookups and other pluggable operations that we don't entirely control. The fix to #3252 only covered one of these cases. This change moves the instantiation of new `Logger`s outside the write lock. While in some cases this will cause multiple instantiations of loggers with the same parameters, all such loggers are functionally equivalent. On the other hand, the change allows the creation of different loggers in parallel. Closes #3399
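A hedged sketch of the approach the message above describes, not the committed code: check under the read lock, instantiate with no lock held, then publish under the write lock and tolerate the occasional duplicate instantiation.

```java
// Stand-in for the fix: the factory runs outside any lock; duplicates are discarded.
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

class UnlockedFactoryRegistry<L> {
    private final Map<String, WeakReference<L>> refByName = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    L computeIfAbsent(String name, Function<String, L> factory) {
        L existing = readLocked(name);
        if (existing != null) {
            return existing;
        }
        L created = factory.apply(name);     // no lock held: lookups, plugins, even logging are safe
        lock.writeLock().lock();
        try {
            L raced = peek(name);            // another thread may have published one meanwhile
            if (raced != null) {
                return raced;                // functionally equivalent logger; drop ours
            }
            refByName.put(name, new WeakReference<>(created));
            return created;
        } finally {
            lock.writeLock().unlock();
        }
    }

    private L readLocked(String name) {
        lock.readLock().lock();
        try {
            return peek(name);
        } finally {
            lock.readLock().unlock();
        }
    }

    private L peek(String name) {
        WeakReference<L> ref = refByName.get(name);
        return ref == null ? null : ref.get();
    }
}
```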
Yes, I'm sure: I modified the Infinispan pom.xml to use that SNAPSHOT version and ran its core testsuite using Maven in offline mode and with debug options to ensure that the Surefire classpath pointed to the correct jar. I also used OpenJDK 17 to avoid any potential vthread interference. https://gist.github.com/tristantarrant/93b0a80974a14cbc01d6d3ed71f540ba The above includes read locks: there are 84 threads waiting for the same lock (
@tristantarrant, thanks so much for the detailed analysis, really appreciated it. 🙇 I examined the stack trace, but I still cannot see a smoking gun. Would you mind trying if one of the following helps, please? Edit
After changes, you can perform a quick local install of
Neither of those suggestions improved things. However, I have some additional info.
Here is an example of when bad things start happening. I get 100s of the following:
The Weak semantics are causing it to not find the entry in the map, so every single request for the
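A self-contained illustration of that effect with stand-in types (not Log4j or Infinispan code): once the caller drops its only strong reference, GC may clear the WeakReference, so the lookup misses and every call falls back to the slow, write-locked path.

```java
// Demonstrates stale WeakReference entries causing repeated cache misses.
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WeakMissSketch {
    static final Map<String, WeakReference<Object>> registry = new ConcurrentHashMap<>();

    static Object getLogger(String name) {
        WeakReference<Object> ref = registry.get(name);
        Object logger = (ref == null) ? null : ref.get();
        if (logger == null) {
            logger = new Object();                           // stands in for an expensive Logger
            registry.put(name, new WeakReference<>(logger)); // stands in for the write-locked path
            System.out.println("miss -> rebuilt " + name);
        }
        return logger;
    }

    public static void main(String[] args) {
        getLogger("local"); // caller drops the reference immediately
        System.gc();        // the weak referent may now be reclaimed
        getLogger("local"); // likely prints "miss" again: the entry is stale
    }
}
```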
@tristantarrant, do I understand correctly that your test is not going into a deadlock, but instead blocks many threads because the write lock is "just" really busy?
From the way I described it, it does come across like that, but observing the log, at some point it just hangs.
Ok, I've researched it further: the problem is the
I tried to replace the .computeIfAbsent(name, ignored -> new WeakReference<>(newLogger)) with .compute(name, (ignored, ref) -> (ref == null || ref.get() == null) ? new WeakReference<>(newLogger) : ref) but it doesn't solve the hanging threads.
@tristantarrant, would you mind providing us with a minimal set of steps to reproduce the issue on our side? Clone this Infinispan repo, compile like this, run the test like this, etc.
@tristantarrant, I presume the stack trace you shared last is during the deadlock, not while
Yes: our testsuite has a test-killer which detects if it's been unresponsive after a timeout. It will get a thread dump of the JVM and then kill itself. I believe I've figured out why it's happening a lot to our testsuite:
I've identified one place in our code which was unnecessarily obtaining a logger in a method and replaced it with a static instance. This has reduced the lock contention considerably, but doesn't solve the issue. I've created an Infinispan branch with my small fix which you can use for your investigations.
Just let it run. After a while it will hang and start creating
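For reference, the kind of change mentioned above looks roughly like this (class names are hypothetical): fetching the logger once into a static field keeps a strong reference alive and removes the registry lookup from the hot path.

```java
// Hypothetical before/after of the per-call logger lookup.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class BeforeFix {
    void doWork() {
        // Looked up on every invocation; with weakly referenced registry
        // entries this can hammer the registry's write-locked path under load.
        Logger log = LogManager.getLogger(BeforeFix.class);
        log.debug("working");
    }
}

class AfterFix {
    private static final Logger LOG = LogManager.getLogger(AfterFix.class);

    void doWork() {
        LOG.debug("working"); // single lookup at class initialization
    }
}
```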
I have some good news: combining the fix in #3399 (comment) with my fix in https://github.com/tristantarrant/infinispan/tree/log4j-lock-investigation seems to allow our core testsuite to complete.
@tristantarrant, great news! I am looking forward to your PR. 🤩 I suggest taking #3418 as the base, yet I think there is a (performance) bug that needs to be addressed.
Yes, fixing our "abuse" of per-instance, or worse, local, loggers doesn't solve the problem that with high contention this will deadlock.
Here's the Infinispan issue to remove local loggers: infinispan/infinispan#13927
Maybe we should look for a lock-less solution. A ConcurrentHashMap with support for weak keys?
CHM uses locks too. For a lock-free design, we can implement a copy-on-write in combination with a CAS loop on an AtomicReference, but this design would have its own performance trade-offs. I am inclined to keep the current simple rw-lock design. I think your stress test choking up on w-lock contention doesn't reflect a real-world use case to engineer further the concurrency of InternalLR.
Tristan, doesn’t #3418, plus your one-liner fixes, address your problem? I thought you would submit a PR for that, right?
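For completeness, a rough sketch of the copy-on-write plus CAS alternative mentioned above; this is not a committed design, just an illustration of the trade-off: reads are lock-free against a published snapshot, while writers copy the map and retry the compareAndSet on contention.

```java
// Copy-on-write registry sketch: lock-free reads, CAS-retried writes.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

class CopyOnWriteRegistry<L> {
    // The published map is never mutated after the CAS; writers replace it wholesale.
    private final AtomicReference<Map<String, L>> snapshot =
            new AtomicReference<>(new HashMap<>());

    L computeIfAbsent(String name, Function<String, L> factory) {
        L existing = snapshot.get().get(name); // lock-free read
        if (existing != null) {
            return existing;
        }
        L created = factory.apply(name);       // built outside any lock
        while (true) {
            Map<String, L> current = snapshot.get();
            L raced = current.get(name);
            if (raced != null) {
                return raced;                  // another thread published one first
            }
            Map<String, L> next = new HashMap<>(current);
            next.put(name, created);
            if (snapshot.compareAndSet(current, next)) {
                return created;                // our copy is now the published snapshot
            }
            // CAS failed: another writer won; retry against the fresh snapshot.
        }
    }
}
```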
Yes, I'll submit the PR tomorrow morning. Sorry, I was busy on other things.
@tristantarrant, checking #3427 you recently submitted... Implementing our own You have been very kind and of great help. I would appreciate it if you can give the most recent #3418 one more spin in your test suite.
Not implementing reference reaping means that the inner map will contain useless entries. Also, if you notice, I only implemented the most basic parts of the map, so it's not really a general-purpose class.
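For #3430, one common shape for reference reaping is sketched below; this is an assumption, not the planned implementation: each WeakReference is registered with a ReferenceQueue, and cleared entries are purged from the inner map on access.

```java
// Reaping sketch: cleared WeakReferences are drained from a ReferenceQueue
// and their map entries removed, so the registry does not accumulate dead entries.
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ReapingRegistry<L> {
    private static final class NamedRef<T> extends WeakReference<T> {
        final String name;
        NamedRef(String name, T referent, ReferenceQueue<T> queue) {
            super(referent, queue);
            this.name = name;
        }
    }

    private final Map<String, NamedRef<L>> refByName = new ConcurrentHashMap<>();
    private final ReferenceQueue<L> reclaimed = new ReferenceQueue<>();

    void put(String name, L logger) {
        purge();
        refByName.put(name, new NamedRef<>(name, logger, reclaimed));
    }

    L get(String name) {
        purge();
        NamedRef<L> ref = refByName.get(name);
        return ref == null ? null : ref.get();
    }

    private void purge() {
        Reference<? extends L> ref;
        while ((ref = reclaimed.poll()) != null) {
            // Remove only if the mapping still points at this exact cleared reference.
            refByName.remove(((NamedRef<?>) ref).name, ref);
        }
    }
}
```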
@tristantarrant, thanks so much for your understanding. I will merge #3418 (waiting for @ppkarwasz, since he is the original author) and mark this issue as closed. I created #3430 to see what we can do for entries referring to reclaimed loggers.
@ppkarwasz, thanks so much. 😍 Though we're not in a big hurry, I prefer to wait for your review.
Fixed by #3418. |
Description
Intermittent deadlock/thread starvation on startup in InternalLoggerRegistry#getLoggers in a Spring Boot 3.3.5 application.
Configuration
Version:
2.24.3
Operating system:
Red Hat Enterprise Linux 8.10 (Ootpa)
Kernel: Linux 4.18.0-553.27.1.el8_10.x86_64
JDK:
openjdk version "21.0.5" 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-21.0.5.0.10-1) (build 21.0.5+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.10-1) (build 21.0.5+10-LTS, mixed mode, sharing)
Logs
I can't find the thread that has locked <0x00000000e01f0be0>.
Reproduction
Happens intermittently on deploy. Can't reproduce it at will.