Fix sequential metadata fetching in ListingTable causing high latency #14918
Conversation
Force-pushed from 447a945 to ee06d47, then from ee06d47 to 6ac44b9.
-        let file_list = stream::iter(file_list).flatten();
+        let meta_fetch_concurrency =
+            ctx.config_options().execution.meta_fetch_concurrency;
+        let file_list = stream::iter(file_list).flatten_unordered(meta_fetch_concurrency);
This is the actual fix; the rest of the changes are unit tests.
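For readers unfamiliar with the combinator, here is a minimal standalone sketch of the difference (plain `futures` and `tokio`, not DataFusion's actual code; the 100ms sleep stands in for a remote head call):

use futures::stream::{self, BoxStream, StreamExt};
use std::time::{Duration, Instant};

// Model one metadata fetch as a one-item stream that takes ~100ms,
// mimicking a slow remote `head` request.
fn slow_head(i: u64) -> BoxStream<'static, u64> {
    stream::once(async move {
        tokio::time::sleep(Duration::from_millis(100)).await;
        i
    })
    .boxed()
}

#[tokio::main]
async fn main() {
    // `flatten` drains each inner stream to completion in order:
    // roughly 3 x 100ms for three files.
    let start = Instant::now();
    let ordered: Vec<u64> = stream::iter((0..3).map(slow_head))
        .flatten()
        .collect()
        .await;
    println!("flatten:           {ordered:?} in {:?}", start.elapsed());

    // `flatten_unordered(limit)` polls up to `limit` inner streams at once:
    // roughly 100ms total, with results yielded in completion order.
    let start = Instant::now();
    let unordered: Vec<u64> = stream::iter((0..3).map(slow_head))
        .flatten_unordered(3)
        .collect()
        .await;
    println!("flatten_unordered: {unordered:?} in {:?}", start.elapsed());
}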
Force-pushed from 6ac44b9 to 5d3bb48.
@@ -1115,7 +1117,7 @@ impl ListingTable {
                }
            })
            .boxed()
-           .buffered(ctx.config_options().execution.meta_fetch_concurrency);
+           .buffer_unordered(ctx.config_options().execution.meta_fetch_concurrency);
Ordering does not matter here, so might as well use buffer_unordered
to avoid stalling on an outlier.
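A standalone sketch of that difference (plain `futures`, not the DataFusion code itself): both combinators run futures concurrently, but `buffered` yields results in submission order, so everything queued behind a slow outlier waits, while `buffer_unordered` yields each result as soon as it completes.

use futures::stream::{self, StreamExt};
use std::time::Duration;

#[tokio::main]
async fn main() {
    // The first request is an outlier (300ms); the rest take 10-30ms.
    let delays = [300u64, 10, 20, 30];
    let request = |ms: u64| async move {
        tokio::time::sleep(Duration::from_millis(ms)).await;
        ms
    };

    // `buffered(4)` runs up to 4 futures at once but yields in input order,
    // so the fast results are held back behind the 300ms outlier.
    let in_order: Vec<u64> = stream::iter(delays)
        .map(request)
        .buffered(4)
        .collect()
        .await;
    println!("buffered:         {in_order:?}"); // [300, 10, 20, 30]

    // `buffer_unordered(4)` yields each result as soon as it is ready.
    let by_completion: Vec<u64> = stream::iter(delays)
        .map(request)
        .buffer_unordered(4)
        .collect()
        .await;
    println!("buffer_unordered: {by_completion:?}"); // [10, 20, 30, 300]
}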
Thanks @geoffreyclaude -- I kicked off the CI for this
Force-pushed from 5d3bb48 to fbe7be5.
When scanning an exact list of remote Parquet files, the ListingTable was fetching file metadata (via head calls) sequentially. This was due to using `stream::iter(file_list).flatten()`, which processes each one-item stream in order. For remote blob stores, where each head call can take tens to hundreds of milliseconds, this sequential behavior significantly increased the time to create the physical plan.

This commit replaces the sequential flattening with concurrent merging using `stream::iter(file_list).flatten_unordered(meta_fetch_concurrency)`. With this change, the `head` requests are executed in parallel (up to the configured `meta_fetch_concurrency` limit), reducing latency when creating the physical plan.

Note that the ordering loss introduced by `flatten_unordered` is perfectly acceptable, as the file list will in any case be fully sorted by path in `split_files` before being returned.

Additionally, tests have been updated to ensure that metadata fetching occurs concurrently.
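The limit itself is the pre-existing `datafusion.execution.meta_fetch_concurrency` configuration option; the fix introduces no new setting. A minimal sketch of raising it for a session (API names as in recent DataFusion releases; the value 64 is purely illustrative, and the option can also be changed at runtime through a SQL SET statement):

use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Build a session whose metadata fetches may run up to 64 at a time.
    let mut config = SessionConfig::new();
    config.options_mut().execution.meta_fetch_concurrency = 64;
    let ctx = SessionContext::new_with_config(config);

    assert_eq!(
        ctx.state().config_options().execution.meta_fetch_concurrency,
        64
    );
}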
Force-pushed from fbe7be5 to 94177d3.
I pushed a commit to fix the formatting and try to get a clean CI run
Thanks! I had environment issues that prevented me from running the CI locally. I figured adding an extra basic unit test was simple enough to skip that step 🤦
            .execution
            .meta_fetch_concurrency;
        let expected_concurrency = files.len().min(meta_fetch_concurrency);
        let head_blocking_store = ensure_head_concurrency(store, expected_concurrency);
this is very cool
Thank you @geoffreyclaude -- this is a really nice PR: well documented and well tested 🦾
"{} received head call for {location}", | ||
BlockingObjectStore::NAME | ||
); | ||
// Wait until the expected number of concurrent calls is reached, but timeout after 1 second to avoid hanging failing tests. |
👍
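The wait-with-timeout pattern generalizes nicely. A standalone sketch (hypothetical helper names, not the PR's actual `BlockingObjectStore`): each simulated head call waits on a shared barrier sized to the expected concurrency, so the check only passes if that many calls are genuinely in flight at once, and the 1-second timeout turns a sequential regression into a test failure instead of a hang.

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Barrier;
use tokio::time::timeout;

// Stand-in for the wait performed inside each intercepted `head` call:
// it resolves only once `expected_concurrency` callers have arrived,
// and errors out after 1 second if they never do.
async fn wait_for_expected_concurrency(barrier: Arc<Barrier>) -> Result<(), String> {
    timeout(Duration::from_secs(1), barrier.wait())
        .await
        .map(|_| ())
        .map_err(|_| "head calls arrived sequentially, not concurrently".to_string())
}

#[tokio::main]
async fn main() {
    let expected_concurrency = 4;
    let barrier = Arc::new(Barrier::new(expected_concurrency));

    // Spawn `expected_concurrency` concurrent "head calls": all of them reach
    // the barrier, so every wait resolves well before the timeout.
    let handles: Vec<_> = (0..expected_concurrency)
        .map(|_| tokio::spawn(wait_for_expected_concurrency(Arc::clone(&barrier))))
        .collect();

    for handle in handles {
        handle.await.unwrap().unwrap();
    }
    println!("all simulated head calls were in flight at the same time");
}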
I tried to verify these changes but I couldn't figure out how to create an external table by explicitly listing the file names via SQL. For posterity, here is what I tried:

This PR:

Main:

I am going to try and whip up a smaller reproducer
We got there through Substrait, by converting a custom input into a
I still could not reproduce any improvement with this PR, FWIW. I still think it is a good change so I merged it in, but it might be cool to find some benchmark results that show the improvement. Here is the reproducer I used:
use std::sync::Arc;
use std::time::Instant;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl};
use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::prelude::SessionContext;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let ctx = SessionContext::new();
let object_store_url = ObjectStoreUrl::parse("https://datasets.clickhouse.com").unwrap();
let object_store = object_store::http::HttpBuilder::new()
.with_url(object_store_url.as_str())
.build()
.unwrap();
ctx.register_object_store(object_store_url.as_ref(),
Arc::new(object_store));
// urls are like
// https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
//let base_url = ObjectStoreUrl::parse("https://datasets.clickhouse.com").unwrap();
let paths: Vec<ListingTableUrl> = (1..100).map(|i| format!("https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{i}.parquet"))
.map(|url| ListingTableUrl::parse(&url).unwrap())
.collect();
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::new()))
.with_collect_stat(true);
let start = Instant::now();
println!("Creating table / reading statistics....");
let config = ListingTableConfig::new_with_multi_paths(paths)
.with_listing_options(listing_options)
.infer_schema(&ctx.state()).await?;
let listing_table = ListingTable::try_new(config).unwrap();
let df = ctx.read_table(Arc::new(listing_table))?;
println!("Done in {:?}", Instant::now() - start);
println!("running query");
let start = Instant::now();
let batches = df.limit(0, Some(10))?.collect().await.unwrap();
println!("Got {} batches in {:?}", batches.len(), Instant::now() - start);
Ok(())
}
Some testing numbers (the results vary wildly):

On this branch

On main
Thanks again @geoffreyclaude
@alamb: I've updated your reproducer to do a count(*) aggregation. Without the fix, the physical plan creation takes ~2.5 seconds. The logical planning and query execution take the same time in both.

use arrow::util::pretty::pretty_format_batches;
use datafusion::config::{ParquetOptions, TableParquetOptions};
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::functions_aggregate::expr_fn::count;
use datafusion::logical_expr::utils::COUNT_STAR_EXPANSION;
use datafusion::physical_plan::displayable;
use datafusion::prelude::{Expr, SessionContext};
use futures::StreamExt;
use std::sync::Arc;
use std::time::Instant;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let ctx = SessionContext::new();
let object_store_url =
ObjectStoreUrl::parse("https://datasets.clickhouse.com").unwrap();
let object_store = object_store::http::HttpBuilder::new()
.with_url(object_store_url.as_str())
.build()
.unwrap();
ctx.register_object_store(object_store_url.as_ref(), Arc::new(object_store));
// urls are like
// https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
//let base_url = ObjectStoreUrl::parse("https://datasets.clickhouse.com").unwrap();
let paths: Vec<ListingTableUrl> = (1..100).map(|i| format!("https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{i}.parquet"))
.map(|url| ListingTableUrl::parse(&url).unwrap())
.collect();
let listing_options =
ListingOptions::new(Arc::new(ParquetFormat::new())).with_collect_stat(true);
println!("Creating table / reading statistics....");
let config = ListingTableConfig::new_with_multi_paths(paths)
.with_listing_options(listing_options)
.infer_schema(&ctx.state())
.await?;
let start = Instant::now();
let listing_table = ListingTable::try_new(config).unwrap();
let df = ctx
.read_table(Arc::new(listing_table))?
.aggregate(vec![], vec![count(Expr::Literal(COUNT_STAR_EXPANSION))])?;
println!("Done in {:?}", Instant::now() - start);
let logical_plan_str = df.logical_plan().display_indent().to_string();
println!("Logical plan:\n{logical_plan_str}");
println!("Creating physical plan...");
let start = Instant::now();
let physical_plan = df.create_physical_plan().await?;
println!("Done in {:?}", Instant::now() - start);
let physical_plan_str = displayable(physical_plan.as_ref()).indent(true).to_string();
println!("Physical plan:\n{physical_plan_str}");
println!("Running query...");
let start = Instant::now();
let mut result_stream = physical_plan.execute(0, ctx.task_ctx())?;
let mut batches = vec![];
while let Some(record) = result_stream.next().await {
batches.push(record?);
}
println!(
"Got {} batches in {:?}",
batches.len(),
Instant::now() - start
);
let response_str = pretty_format_batches(&batches)?.to_string();
println!("Query result:\n{response_str}");
Ok(())
}

main

fix/concurrent_heads
This was referenced by @sergiimk in https://discord.com/channels/885562378132000778/1290751484807352412/1356393367566553240 (they hit the same problem and were pleased to find it fixed!)
Fix sequential metadata fetching in ListingTable causing high latency (apache#14918)

* fix fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?

Rationale for this change

When scanning an exact list of remote Parquet files, the ListingTable was fetching file metadata (via head calls) sequentially. This was due to using `stream::iter(file_list).flatten()`, which processes each one-item stream in order. For remote blob stores, where each head call can take tens to hundreds of milliseconds, this sequential behavior significantly increased the time to create the physical plan.

What changes are included in this PR?

This commit replaces the sequential flattening with concurrent merging using `stream::iter(file_list).flatten_unordered(meta_fetch_concurrency)`. With this change, the `head` requests are executed in parallel (up to the configured `meta_fetch_concurrency` limit), reducing latency when creating the physical plan.

Note that the ordering loss introduced by `flatten_unordered` is perfectly acceptable, as the file list will in any case be fully sorted by path in `split_files` before being returned.

Are these changes tested?

Tests have been updated to ensure that metadata fetching occurs concurrently.

Are there any user-facing changes?

No user-facing changes besides reducing the latency in this particular situation.