-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
feat: Parallel Arrow file format reading #8897
feat: Parallel Arrow file format reading #8897
Conversation
8e4d9f2
to
9348b09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me @my-vegetable-has-exploded -- thank you very much.
I think it is likely to not work well on remote object store given how many requests are made but I also think that could be handled by a follow on PR
My only concern with this PR as written is if the tests actually exercise the multi-batch reading code given how small the input files in repartition.slt
are
@@ -61,6 +61,7 @@ postgres = ["bytes", "chrono", "tokio-postgres", "postgres-types", "postgres-pro | |||
[dev-dependencies] | |||
env_logger = { workspace = true } | |||
num_cpus = { workspace = true } | |||
tokio = { version = "1.0", features = ["rt-multi-thread"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
@@ -253,7 +253,16 @@ query TT | |||
EXPLAIN SELECT * FROM arrow_table | |||
---- | |||
logical_plan TableScan: arrow_table projection=[f0, f1, f2] | |||
physical_plan ArrowExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow]]}, projection=[f0, f1, f2] | |||
physical_plan ArrowExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:0..461], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:461..922], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:922..1383], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:1383..1842]]}, projection=[f0, f1, f2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 looks good to me -- though I I wonder will this actually read in parallel (or do these ranges all end up in the same reader)?
Which issue does this PR close?
Closes #8503
Rationale for this change
What changes are included in this PR?
If file_meta.range is some, filter recordbatches according to range, then scan recordbatches.
Are these changes tested?
physical plan of scaning arrow files changes in
repartition_scan.slt
.Are there any user-facing changes?