[HUDI-5008] Avoid unset HoodieROTablePathFilter in IncrementalRelation #6921

Merged 1 commit on Nov 16, 2022

boneanxs
Contributor

Change Logs

If a user creates an IncrementalRelation while joining against another existing Hive Hudi table, the path filter is unset inside IncrementalRelation, so all files under the Hive Hudi table are selected.

HoodieROTablePathFilter now accepts as.of.instant for time travel, so instead of unsetting the filter we pass as.of.instant to the DataFrame (without changing the Spark Hadoop conf globally) to avoid this issue.
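To illustrate the difference between a global setting and a per-read option, here is a minimal sketch in plain Scala. It is a simplified model, not Hudi's implementation: `BaseFile`, `PathFilterSketch`, `filterLatest`, and `read` are hypothetical names introduced only for this example, and the mutable map merely stands in for the shared Hadoop configuration.

```scala
// Hypothetical, simplified model of the idea; these are NOT Hudi's classes.
final case class BaseFile(fileGroup: String, commitTime: String)

object PathFilterSketch {
  // Stand-in for the process-wide Hadoop configuration that every relation shares.
  val globalConf: scala.collection.mutable.Map[String, String] =
    scala.collection.mutable.Map(
      "mapreduce.input.pathFilter.class" -> "org.apache.hudi.hadoop.HoodieROTablePathFilter")

  // Stand-in for a read-optimized path filter: per file group, keep the newest
  // base file whose commit time is <= asOfInstant (newest overall if none is given).
  def filterLatest(files: Seq[BaseFile], asOfInstant: Option[String]): Seq[BaseFile] =
    files
      .filter(f => asOfInstant.forall(f.commitTime <= _))
      .groupBy(_.fileGroup)
      .values
      .map(_.maxBy(_.commitTime))
      .toSeq
      .sortBy(_.fileGroup)

  // A read consults the global conf to decide whether a filter applies, but takes
  // as.of.instant from its own options, so the global conf is never mutated.
  def read(files: Seq[BaseFile], readOptions: Map[String, String]): Seq[BaseFile] =
    if (globalConf.contains("mapreduce.input.pathFilter.class"))
      filterLatest(files, readOptions.get("as.of.instant"))
    else
      files // with the filter unset, every file version is selected
}
```

In this model, reading the files `fg1@001`, `fg1@002`, `fg2@001` with `Map("as.of.instant" -> "001")` yields the older `fg1@001` alongside `fg2@001`, while a read with no options still sees only the latest version of each file group. Removing the key from `globalConf` instead would make every reader, including the one for the joined table, see all file versions, which is the behavior this PR avoids.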

Impact

Risk level: low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@@ -167,12 +167,15 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
// Read Incremental Query
// we have 2 commits, try pulling the first commit (which is not the latest)
val firstCommit = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "000").get(0)
// Setting HoodieROTablePathFilter here to test whether pathFilter can filter out correctly for IncrementalRelation
spark.sparkContext.hadoopConfiguration.set("mapreduce.input.pathFilter.class", "org.apache.hudi.hadoop.HoodieROTablePathFilter")
Contributor Author

This fixes the test added in #458.

Contributor

Why are we setting a filter from the test? Isn't it supposed to be set by the Relation?

Contributor Author

I think the intent of the test is to check that, when HoodieROTablePathFilter is set, the incremental relation can still read the old data correctly. But the test doesn't work as expected, because HoodieROTablePathFilter is not set by default.

You can see this by running the test without setting the path filter explicitly.
[Screenshot: Screen Shot 2022-11-12 at 14 59 41]

Contributor

I see now. Thanks for clarifying!
Shouldn't we write a test that causes HoodieROTablePathFilter to be set (by using globbing, for example)?

Contributor Author

Yes, there is a test in org.apache.hudi.functional.TestCOWDataSource#testReadPathsOnCopyOnWriteTable. Its use of

.option(DataSourceReadOptions.READ_PATHS.key, record1FilePaths)

actually tests this: toHadoopRelation adds the HoodieROTablePathFilter, and the test data also contains old-version files.

@boneanxs
Contributor Author

Hi @alexeykudinkin, could you please help review this PR?

@yihua yihua added priority:critical production down; pipelines stalled; Need help asap. incremental-query labels Oct 15, 2022
@boneanxs
Contributor Author

Gentle ping @alexeykudinkin

@alexeykudinkin alexeykudinkin merged commit 57961c0 into apache:master Nov 16, 2022
satishkotha pushed a commit to satishkotha/incubator-hudi that referenced this pull request Dec 12, 2022
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
Labels
incremental-query priority:critical production down; pipelines stalled; Need help asap.
Projects
Status: 🚧 Needs Repro
Archived in project
4 participants