Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738
Comments
Does the job work well without auto-scale? What is the state of the pipeline after the checkpoint fails? Does the writer still handle inputs?
Yes, the job works with and without auto-scale if we don't enable MDT.
Are there any special logs in the JM logging?
I haven't seen any special log for this. Usually the checkpoint fails either when autoscaling spins up a new TM or when I kill a TM manually, and the data is discarded after that. Thanks
Are you saying the pipeline just hangs there and does nothing?
No, it simply moves to the next checkpoint and processes nothing. If you look at the 4th checkpoint, it is currently processing millions of records (in progress, as it's not completed) and taking approximately 4 minutes. However, once it fails, it moves to the next checkpoint and processes nothing, completing in milliseconds (the same goes for 2 and 3). Thanks
@danny0405 any findings?
No, without more detailed logs I cannot help much more, and I have no knowledge of Flink auto-scaling.
@maheshguptags do you mean that after the checkpoint fails, records ingested after that will be lost? If that happens, there are usually two types of problems:
btw, could you also paste the exception trace from the Flink dashboard?
Hi
@cshuo Let's assume 10 million records are ingested into the job. While processing these records, if the job manager triggers the creation of a new Task Manager (TM) due to auto-scaling, or if a TM is manually removed (to test the scenario without auto-scaling), a checkpoint failure can occur, causing all the previously ingested data (the 10 million records) to be discarded. If new data (e.g., 1 million records) is ingested after the checkpoint failure, the new data will be successfully processed and written to Hudi, provided the next checkpoint succeeds.
To summarize: ingest 10M records → checkpoint failure (due to TM change) → all of that data is discarded.
Thanks
So these records do not even go through a complete checkpoint lifecycle, and no commits occur.
I used an example to illustrate the issue. Example:
- chkpnt1 → succeeded → ingested 2.5M (out of 10M)
- attempts the next checkpoints
- chkpnt4 → succeeded → no data is written due to the failure at chkpnt3, and the checkpoint completes within milliseconds

Thanks
@danny0405 @cshuo, any progress on this?
You can try it without auto-scaling by forcefully deleting the TM while checkpointing is in progress (causing the checkpointing to fail) with MDT enabled. We can also connect over a call. Thanks
@maheshguptags The checkpoint failure can happen in different phases, e.g., during snapshot state in the TM or notify-checkpoint-complete in the JM. Can this problem be reproduced reliably? If yes, could you provide a jobmanager.log that contains the exception log for the ckp failure?
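For context, the two failure phases mentioned above correspond to different hooks of a Flink operator. A minimal sketch (illustrative only, not code from this job) of where the TM-side snapshot and the JM-driven completion notification happen:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

/** Illustrative operator showing the two checkpoint phases discussed above. */
public class CheckpointPhasesSketch
        implements MapFunction<String, String>, CheckpointedFunction, CheckpointListener {

    @Override
    public String map(String value) {
        return value; // normal record processing between checkpoints
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) {
        // Phase 1: runs on the TaskManager while the checkpoint barrier is processed.
        // Killing the TM here leaves the checkpoint incomplete and it is declined/failed.
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) {
        // Called on (re)start; state from the last *successful* checkpoint is restored here.
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Phase 2: only invoked after the JobManager has finalized the checkpoint.
        // Hudi's Flink writer finalizes its commit around this point (via its coordinator on the JM).
    }
}
```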
Sure @cshuo
Thanks
@maheshguptags Based on the log you provided, the issue still cannot be pinpointed. I'm trying to reproduce the problem locally based on:
However, I cannot reproduce the problem. The demo code is here; can you try reproducing the problem based on that code? It would be great if you can reproduce the problem with the demo code, so we can further narrow it down.
@cshuo I will try to run the demo code locally. However, upon reviewing the code, it looks like MDT hasn't been enabled for the Hudi table during its creation. I'm assuming you're using Hudi 1.0.0. The configuration for MDT that I've been using when creating the Flink Hudi table is:
Could you please try adding this configuration on your end?
Yes, I'm using 1.0.0. Actually, MDT is enabled by default, and concurrency.mode is 'SINGLE_WRITER' by default.
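One quick way to confirm whether MDT is actually active for an existing table is to check for the metadata directory under the table's `.hoodie` path. A minimal sketch using the Hadoop FileSystem API; the base path here is a placeholder, and S3 credentials/filesystem configuration are assumed to be available:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Quick check: a Hudi table with MDT enabled has a populated <basePath>/.hoodie/metadata directory. */
public class MetadataTableCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical base path; replace with the actual table path used by the job.
        String basePath = "s3a://my-bucket/hudi/my_table";
        Path metadataPath = new Path(basePath, ".hoodie/metadata");

        Configuration conf = new Configuration(); // expects S3 filesystem config/credentials on the classpath
        FileSystem fs = metadataPath.getFileSystem(conf);

        System.out.println("MDT directory exists: " + fs.exists(metadataPath));
    }
}
```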
@cshuo is it possible to get on a call? We can discuss the issue and the steps to reproduce. Thanks
@cshuo, I ran the sample code you provided, and it works as expected. However, it doesn't reproduce the issue I'm encountering. In your code, the Task Manager terminates after checkpointing is completed, but in our case, we manually delete the Task Manager while checkpointing is still in progress and hasn't completed yet. Additionally, I am providing the MDT configuration when creating the DDL. Do you think this could have an impact? From my understanding, it shouldn't. Could you try deleting the Task Manager while checkpointing is ongoing? Thanks!
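One way to make "delete the TM while checkpointing is in progress" reproducible without manual intervention is an operator that halts its own JVM inside snapshotState. A rough sketch, assuming it is wired between a data-generating source and the Hudi sink; the class and parameter names are illustrative, and this is not the demo code referenced above:

```java
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/** Halts the hosting TaskManager JVM during the Nth checkpoint to mimic a TM being deleted mid-checkpoint. */
public class CrashDuringCheckpoint extends ProcessFunction<String, String>
        implements CheckpointedFunction {

    private final int crashOnCheckpoint; // e.g. 3: let a few checkpoints succeed first
    private transient int seenCheckpoints;

    public CrashDuringCheckpoint(int crashOnCheckpoint) {
        this.crashOnCheckpoint = crashOnCheckpoint;
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        out.collect(value); // pass records through unchanged
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        seenCheckpoints++;
        if (seenCheckpoints == crashOnCheckpoint) {
            // Simulates deleting the TM pod while the checkpoint is still in progress.
            Runtime.getRuntime().halt(137);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // After recovery the job should restore from the last successful checkpoint and replay records.
    }
}
```

Letting a few checkpoints succeed before crashing (e.g. crashOnCheckpoint = 3) mirrors the timeline described earlier in this thread.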
Thanks to @cshuo for connecting over the call and offering suggestions. I spoke with @cshuo, and he recommended removing one of the configs.
@cshuo, one question arises: if I remove it, will a lock still be required?
Another approach I could try is to completely remove the MDT-related config, as you did (keep the defaults), and then redo the testing.
Also attaching the table creation DDL as discussed:
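The attached DDL did not come through in this thread; as a stand-in, here is a rough sketch of what a Flink Hudi COPY_ON_WRITE table on S3 with MDT explicitly enabled typically looks like (table name, schema, path, and the exact option set are placeholders, not the actual DDL):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/** Illustrative Flink SQL DDL for a Hudi COW table with the metadata table explicitly enabled. */
public class HudiTableDdlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TABLE hudi_sink ("
                        + "  id STRING PRIMARY KEY NOT ENFORCED,"
                        + "  name STRING,"
                        + "  ts TIMESTAMP(3)"
                        + ") WITH ("
                        + "  'connector' = 'hudi',"
                        + "  'path' = 's3a://my-bucket/hudi/hudi_sink',"   // hypothetical S3 path
                        + "  'table.type' = 'COPY_ON_WRITE',"
                        + "  'metadata.enabled' = 'true'"                  // MDT on (also the default in Hudi 1.0)
                        + ")");
    }
}
```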
Error after enabling:
@cshuo Since MDT is the default in Hudi 1.0, it requires the lock, and the file system lock is not recommended for production and S3. Also
Yes, it requires a lock if you do not provide any MDT config.
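For the lock discussion above: when the filesystem-based lock is not suitable (e.g. on S3), an external lock provider is the usual alternative. A hedged sketch of the options typically involved; the DynamoDB table, key, and region values are placeholders, and whether a lock is needed at all for this single-writer setup should follow @cshuo's guidance:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative lock-provider options sometimes used when a filesystem lock is not suitable (e.g. on S3). */
public class LockProviderOptionsSketch {
    public static Map<String, String> dynamoDbLockOptions() {
        Map<String, String> options = new HashMap<>();
        // Hypothetical values; the table, key, and region must match your AWS setup.
        options.put("hoodie.write.lock.provider",
                "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider");
        options.put("hoodie.write.lock.dynamodb.table", "hudi-locks");
        options.put("hoodie.write.lock.dynamodb.partition_key", "my_table");
        options.put("hoodie.write.lock.dynamodb.region", "us-east-1");
        return options;
    }
}
```

With the Flink connector these `hoodie.*` options can generally be passed through the table's WITH clause alongside the other properties.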
Will keep you posted.
@maheshguptags thanks for the update. As discussed in the call, there are a few directions you can try:
btw, I tried this, and unfortunately the problem cannot be reproduced either.
@cshuo Sure, I will give it a try with lower parallelism.
Issue
While performing load testing with METADATA enabled, I encountered a data loss issue. The issue occurs when deploying the job with Autoscale enabled. Specifically, if checkpointing fails due to reasons such as a TM being added or memory heap issues, all data is discarded, and no further data is processed after that failure.
Checkpointing failures lead to data loss.
After a failed checkpoint due to lack of resources, a new checkpoint is triggered but no data is processed.
I tried to replicate this behavior on Hudi 1.0, and the same issue persists.
Hudi Properties
Steps to reproduce the behavior:
Expected behavior
After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
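Whether the job keeps running after a failed checkpoint, and what it recovers when a TM is lost, is governed by the checkpoint failure tolerance and restart settings. A small sketch of the knobs involved, assuming they are configured in the job rather than in flink-conf.yaml:

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/** Checkpoint-tolerance and restart settings that influence what happens after a failed checkpoint. */
public class CheckpointToleranceSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000); // e.g. one checkpoint per minute
        // With the default of 0, a declined or timed-out checkpoint fails the job immediately;
        // a higher value lets the job keep running and attempt the next checkpoint instead.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);

        // If a task (or its TM) is actually lost, the job restarts and restores the last successful
        // checkpoint; records since then are expected to be replayed by the source, not dropped.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
    }
}
```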
Environment Description
Hudi version : 1.0.0
Flink version: 1.18
Hive version : NO
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Yes
Table Type: COPY_ON_WRITE
Additional context
Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with METADATA enabled, or is there a bug in Flink under resource-constrained scenarios?
cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua