Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: Parallel Arrow file format reading #8897

Merged
merged 5 commits into from
Jan 29, 2024

Conversation

my-vegetable-has-exploded
Copy link
Contributor

@my-vegetable-has-exploded my-vegetable-has-exploded commented Jan 17, 2024

Which issue does this PR close?

Closes #8503

Rationale for this change

What changes are included in this PR?

If file_meta.range is some, filter recordbatches according to range, then scan recordbatches.

Are these changes tested?

physical plan of scaning arrow files changes in repartition_scan.slt.

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jan 17, 2024
@my-vegetable-has-exploded my-vegetable-has-exploded marked this pull request as ready for review January 18, 2024 03:22
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me @my-vegetable-has-exploded -- thank you very much.

I think it is likely to not work well on remote object store given how many requests are made but I also think that could be handled by a follow on PR

My only concern with this PR as written is if the tests actually exercise the multi-batch reading code given how small the input files in repartition.slt are

@@ -61,6 +61,7 @@ postgres = ["bytes", "chrono", "tokio-postgres", "postgres-types", "postgres-pro
[dev-dependencies]
env_logger = { workspace = true }
num_cpus = { workspace = true }
tokio = { version = "1.0", features = ["rt-multi-thread"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

@@ -253,7 +253,16 @@ query TT
EXPLAIN SELECT * FROM arrow_table
----
logical_plan TableScan: arrow_table projection=[f0, f1, f2]
physical_plan ArrowExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow]]}, projection=[f0, f1, f2]
physical_plan ArrowExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:0..461], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:461..922], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:922..1383], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:1383..1842]]}, projection=[f0, f1, f2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks good to me -- though I I wonder will this actually read in parallel (or do these ranges all end up in the same reader)?

@alamb alamb merged commit 9bf0f68 into apache:main Jan 29, 2024
22 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallel Arrow file format reading
2 participants