Consuming Stopped in Pinot 1.2.0 #11613

Open
donghun-cho opened this issue Oct 24, 2024 · 1 comment

Problem Description

After upgrading Pinot from version 1.0.0 to 1.2.0 and running it for several weeks, consumption from one partition stopped.
Later, consumption stopped on all partitions handled by a single Pinot server.

How to Check

In the Pinot server log, the Helix task thread pool's completed tasks value stops increasing, while the queued tasks value keeps growing and active threads stays at 40 (the full pool size).

Log from the stopped server:

[2024-10-10 17:38:06.394] INFO [HelixTaskExecutor] [ZkClient-EventThread-108-real-zookeeper:2181] Scheduling message b2f4a6ab-1ce2-42f8-9207-48020206c2f5: inspectorStatAgent00_REALTIME:inspectorStatAgent00__24__144__20241008T0216Z, ONLINE->OFFLINE
[2024-10-10 17:38:06.394] INFO [HelixTaskExecutor] [ZkClient-EventThread-108-real-zookeeper:2181] Submit task: b2f4a6ab-1ce2-42f8-9207-48020206c2f5 to pool: java.util.concurrent.ThreadPoolExecutor@7fa15569[Running, pool size = 40, active threads = 40, queued tasks = 8721, completed tasks = 52682]
[2024-10-10 17:38:06.394] INFO [HelixTaskExecutor] [ZkClient-EventThread-108-real-zookeeper:2181] Message: b2f4a6ab-1ce2-42f8-9207-48020206c2f5 handling task scheduled
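The check described above can be roughly automated by parsing the thread pool state out of the "Submit task" log lines. This is a minimal sketch, not part of Pinot or Helix: the function names `parse_pool_state` and `looks_stalled` are hypothetical, and the stall rule (pool saturated, completed count unchanged, queue growing) is an assumption based on the symptoms in this issue.

```python
import re

# Matches the ThreadPoolExecutor state printed in the "Submit task" log line.
POOL_STATE = re.compile(
    r"pool size = (?P<pool_size>\d+), "
    r"active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+), "
    r"completed tasks = (?P<completed>\d+)"
)

# Pool portion of the "Submit task" line from the log above.
SAMPLE_LINE = (
    "Submit task: b2f4a6ab-1ce2-42f8-9207-48020206c2f5 to pool: "
    "java.util.concurrent.ThreadPoolExecutor@7fa15569[Running, pool size = 40, "
    "active threads = 40, queued tasks = 8721, completed tasks = 52682]"
)


def parse_pool_state(line):
    """Return the pool counters as a dict of ints, or None if the line has none."""
    match = POOL_STATE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def looks_stalled(earlier, later):
    """Assumed heuristic: every thread is busy, nothing completes, the queue grows."""
    return (
        later["active"] == later["pool_size"]
        and later["completed"] == earlier["completed"]
        and later["queued"] > earlier["queued"]
    )


if __name__ == "__main__":
    print(parse_pool_state(SAMPLE_LINE))
```

Comparing two samples taken some minutes apart with `looks_stalled` avoids flagging a pool that is merely busy for a moment.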

What to Do

Restarting the Pinot server restores consumption, but it does not prevent the problem from recurring.
To avoid data loss, adjust the Kafka retention period and the replicasPerPartition value of the realtime table.
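One way to reason about the retention setting is to check that messages produced when the stall begins are still in Kafka once the server has been restarted. The sketch below assumes this is the intent of the advice above; the function name and all numbers are hypothetical and only illustrate the arithmetic.

```python
def retention_covers_stall(retention_hours, hours_until_detected, hours_to_recover):
    """Return True if data produced at the start of a stall is still retained
    in Kafka when consumption resumes (assumed model: retention must exceed
    detection time plus recovery time)."""
    return retention_hours > hours_until_detected + hours_to_recover


if __name__ == "__main__":
    # Hypothetical values: 72 h retention, stall noticed after 48 h, 2 h to restart.
    print(retention_covers_stall(72, 48, 2))
    # Hypothetical values: 24 h retention is too short for the same stall.
    print(retention_covers_stall(24, 48, 2))
```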

Pinot already has a pull request for this issue: apache/pinot#13632


More Details on the Cause of the Problem

For a single partition, the following happens in sequence:

  1. A Helix task (OFFLINE->CONSUMING) is scheduled for the new segment.
  2. A request to /segmentConsumed returns a KEEP response,
    • which makes the consuming thread build (complete) the segment.

From the thread dump, the consuming thread is waiting for the segment lock, which is held by the (OFFLINE->CONSUMING) task thread.
The (OFFLINE->CONSUMING) task in turn needs to acquire the _partitionGroupConsumerSemaphore, which the consuming thread still holds.
Neither thread can proceed, so the two are deadlocked.
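The circular wait described above can be reproduced in a few lines. This is an illustrative Python sketch, not Pinot code: `segment_lock` and `consumer_semaphore` merely stand in for the segment lock and `_partitionGroupConsumerSemaphore`, the function name is hypothetical, and timeouts are used so the demo terminates instead of hanging the way the real server does.

```python
import threading


def simulate_circular_wait(timeout=0.2):
    """Each thread holds one resource and waits for the other's; both time out."""
    segment_lock = threading.Lock()              # stand-in for the segment lock
    consumer_semaphore = threading.Semaphore(1)  # stand-in for _partitionGroupConsumerSemaphore
    both_holding = threading.Barrier(2)          # both threads hold their first resource
    both_tried = threading.Barrier(2)            # both threads finished their second attempt
    outcome = {}

    def consuming_thread():
        consumer_semaphore.acquire()             # held while consuming the partition
        both_holding.wait()
        got = segment_lock.acquire(timeout=timeout)  # needed to build the segment
        outcome["consuming_got_segment_lock"] = got
        both_tried.wait()                        # release only after both attempts are done
        if got:
            segment_lock.release()
        consumer_semaphore.release()

    def transition_thread():
        segment_lock.acquire()                   # held by the OFFLINE->CONSUMING task
        both_holding.wait()
        got = consumer_semaphore.acquire(timeout=timeout)  # needed to start consuming
        outcome["transition_got_semaphore"] = got
        both_tried.wait()
        if got:
            consumer_semaphore.release()
        segment_lock.release()

    threads = [
        threading.Thread(target=consuming_thread),
        threading.Thread(target=transition_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(simulate_circular_wait())
```

Both acquisition attempts fail in this sketch. Without the timeouts, each thread would wait forever, which matches the Helix task threads that never complete in the log above.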

@emeroad added the bug label Oct 24, 2024
@emeroad pinned this issue Oct 24, 2024
@donghun-cho

The Pinot team is considering releasing a patched version.
Waiting for that release is another option.
