Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[ML] Fix thread safety issues in accessing ManagedCursorContainer.heap ArrayList #16049

Merged

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Jun 14, 2022

Motivation

  • heap is an ArrayList which isn't thread safe and it's currently accessed concurrently.

This could lead to thread safety issues. Here's an example: (stacktraces are from a fork of branch-2.10)

2022-06-13T13:57:27,163+0000 [bookkeeper-ml-cache-eviction-19-1] WARN  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Exception while performing cache eviction: Index 0 out of bounds for length 0
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
	at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64) ~[?:?]
	at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70) ~[?:?]
	at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248) ~[?:?]
	at java.util.Objects.checkIndex(Objects.java:372) ~[?:?]
	at java.util.ArrayList.get(ArrayList.java:459) ~[?:?]
	at org.apache.bookkeeper.mledger.impl.ManagedCursorContainer.getSlowestReadPositionForActiveCursors(ManagedCursorContainer.java:108) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.getEarlierReadPositionForActiveCursors(ManagedLedgerImpl.java:2185) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.doCacheEviction(ManagedLedgerImpl.java:2175) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$doCacheEviction$4(ManagedLedgerFactoryImpl.java:282) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.doCacheEviction(ManagedLedgerFactoryImpl.java:278) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.cacheEvictionTask(ManagedLedgerFactoryImpl.java:263) ~[com.datastax.oss-managed-ledger-2.10.0.5.jar:2.10.0.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]

Modifications

  • protect access using the existing rwLock

…eap" field access

- heap is an ArrayList which isn't thread safe
@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

The thread safety issue was introduced in #12045 . A similar pattern was used in #14985 changes.

@lhotari lhotari requested a review from hangc0276 June 14, 2022 07:37
@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

There's a chance that this fix causes a performance regression. If that's the case, we would have to revisit the solution for tracking the slowest positions.

@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

When evaluating the #12045 and #14985 changes, it seems that the solution breaks broker caching. I'll create an issue for tracking this.

@lhotari
Copy link
Member Author

lhotari commented Jun 14, 2022

The issue about the broker cache regression is #16054

@315157973
Copy link
Contributor

The previous PR is used to solve the memory problem caused by frequent traversal.
I don't think add a read lock for cache evictions have any impact on the Broker's read and write performance.

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. heap should be guarded by rwLock.

@michaeljmarshall
Copy link
Member

There's a chance that this fix causes a performance regression. If that's the case, we would have to revisit the solution for tracking the slowest positions.

@lhotari - is there a specific test scenario that we should run to proactively check for a performance regression? The changes looks good to me, but it'd be helpful to diagnose a performance regression sooner than later, if possible.

codelipenghui pushed a commit that referenced this pull request Jun 28, 2022
@codelipenghui codelipenghui added the type/bug The PR fixed a bug or issue reported a bug label Jun 28, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 4, 2022
…eap" field access (apache#16049)

(cherry picked from commit ec9676f)
(cherry picked from commit cec950e)
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area/broker cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs release/2.10.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants