-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
HDFS-17342. Fix DataNode may invalidates normal block causing missing block #6464
Conversation
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! LGTM.
@smarthanwang This is a bug fix after HDFS-16985 , do you have time to help review this?
Hi @zhangshuyan0 thanks for your review . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haiyang1987 Thanks for work. It makes sense to me, but more test cases need I think
Future<?> blockReaderFuture = executorService.submit(() -> { | ||
try { | ||
// Submit tasks for reading block. | ||
BlockReaderTestUtil.getBlockReader(cluster.getFileSystem(), blk, 0, 512); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoud we check FNE's thrown as expect here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smarthanwang for your review.
here will not be thrown java.io.FileNotFoundException
, because this exception will be captured output to the exception stack when processing the DataXceiver.
if we expect check FileNotFoundException
maybe need to call the initialization BlockSender
directly.
if (replica != null && (!checkFiles || | ||
(!replica.blockDataExists() || !replica.metadataExists()))) { | ||
volumeMap.remove(bpid, block); | ||
invalidate(bpid, replica); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If replica == null, invalidate(bpid, replica);
would not execute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If replica == null
should not need to execute invalidate(bpid, replica)
avoid cause NPE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, get it
} | ||
|
||
// Validate the replica is exits. | ||
assertNotNull(dnFSDataset.getReplicaInfo(blk.getBlock())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we need one more case to check the ReplicaInfo
and data file
would lost when checkFile=false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestFsDatasetImpl#tesInvalidateMissingBlock line[1991-1997]
here case will simulate the ReplicaInfo would be removed from ReplicaMap when checkFile=false
// Mock local block file not found when disk with some exception.
fsdataset.invalidateMissingBlock(bpid, replicaInfo, false);
// Assert local block file wouldn't be deleted from disk.
assertTrue(blockFile.exists());
// Assert block info would be removed from ReplicaMap.
assertEquals("null", fsdataset.getReplicaString(bpid, replicaInfo.getBlockId()));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It tests cases the block file not found for any causes. But I am not sure whether the situation as your description would lead to FNE, so I think the case should be constructed and tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is to verify whether the UT can reproduce FNE, we can add the following code for verification
GenericTestUtils.LogCapturer logCapturer = GenericTestUtils.LogCapturer.captureLogs(DataNode.LOG);
ReplicaInfo tmpReplicaInfo = dnFSDataset.getReplicaInfo(blk.getBlock());
// Check DN log for FileNotFoundException.
String expectedMsg = String.format("opReadBlock %s received exception " +
"java.io.FileNotFoundException: %s (No such file or directory)",
blk.getBlock(), tmpReplicaInfo.getMetadataURI().getPath());
assertTrue("Expected log message not found in DN log.",
logCapturer.getOutput().contains(expectedMsg));
How about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
@smarthanwang I have a question about HDFS-16985, Normally FileNotFoundException means that the meta file or data file maybe lost, so the replication on this datanode maybe corrupt, right? In your business(AWS EC2 + EBS) situation, you don't expect datanode to delete this replica directly, so HDFS-16985 just remove the replica from the memory of DN. But I want to see that DN should directly delete this corrupt replica If it can ensure that the replica is corrupt, such as: meta file or data file is lost. If @smarthanwang @zhangshuyan0 looking forward to your good ideas. |
Thanks @ZanderXu for your comment. |
Agree with you, it's better to use parameter to control corrupt block deletion. |
Update UT, @ZanderXu @zhangshuyan0 @smarthanwang please help review it again, thanks~ |
BTW, if necessary i'm very willing to implement this function. |
💔 -1 overall
This message was automatically generated. |
Thanks @smarthanwang for your response. @haiyang1987 Can you create a new ticket to support it? |
Sure, l will create a new ticket to implement it later, thanks~ |
1b21c05
to
d023cff
Compare
Resolve conflicts with trunk branch. |
💔 -1 overall
This message was automatically generated. |
The failed unit test seems unrelated to the change. |
Hi @zhangshuyan0 @ZanderXu @smarthanwang Could you mind to push this modification forward when you have free time ? Thank you very much. |
Hi @smarthanwang , do you have any further comments on this PR? |
String expectedMsg = String.format("opReadBlock %s received exception " + | ||
"java.io.FileNotFoundException: %s (No such file or directory)", | ||
blk.getBlock(), tmpReplicaInfo.getMetadataURI().getPath()); | ||
assertTrue("Expected log message not found in DN log.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, there are some deviations in my understanding in last above reply . In short words, I think we should verify two cases:
- block from
rbw
tofinialized
,without this patch: verify FNE wuold be thrown andReplicaInfo
would be removed - block from
rbw
tofinialized
, with this patch: verify FNE wuold not be thrown andReplicaInfo
would
not be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smarthanwang for your comment.
In fact, current case block from rbw to finalized
anyway will throw FNE
.
The current PR is to solve the ReplicaInfo
will not be remove in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZanderXu @zhangshuyan0 @smarthanwang do you have any further comments on this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, current case
block from rbw to finalized
anyway will throwFNE
.
OK,can we reproduce it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @smarthanwang If I understand correctly, here the UT has reproduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
String expectedMsg = String.format("opReadBlock %s received exception " + | ||
"java.io.FileNotFoundException: %s (No such file or directory)", | ||
blk.getBlock(), tmpReplicaInfo.getMetadataURI().getPath()); | ||
assertTrue("Expected log message not found in DN log.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, current case
block from rbw to finalized
anyway will throwFNE
.
OK,can we reproduce it?
Hi @zhangshuyan0 @ZanderXu @smarthanwang please help to push this modification forward If there are no any further comments. Thank you very much~ |
Committed to trunk. Thanks for your contribution @haiyang1987 @ZanderXu @smarthanwang . |
Thanks @zhangshuyan0 @ZanderXu @smarthanwang for your help review and merge. |
Description of PR
https://issues.apache.org/jira/browse/HDFS-17342
When users read an append file, occasional exceptions may occur, such as org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
This can happen if one thread is reading the block while writer thread is finalizing it simultaneously.
Root cause:
As described above, when the reader thread encountered FileNotFoundException is as expected, because the file is moved.
So we need to add a double check to the invalidateMissingBlock logic to verify whether the data file or meta file exists to avoid similar cases.