WIP [Performance] Optimize DFSchema search by field #9104
Conversation
datafusion/common/src/dfschema.rs
Outdated
```rust
    );
    Ok(())
}
// #[test]
```
This test is currently incorrect IMHO: it expects an error, but it should run OK.
```diff
@@ -1146,35 +1212,6 @@ mod tests {
    Ok(())
}

#[allow(deprecated)]
```
Dropped, as we have other tests duplicating this behavior; furthermore, these tests referred to deprecated methods that have already been removed.
This makes sense 👍

I am curious though, is there any advantage to using a `BTreeMap` over a regular `HashMap`? It seems we aren't really making use of the properties of a `BTreeMap`, such as key ordering or self-balancing, as the map is always created once upfront (and during merges it's just created completely anew).
```diff
@@ -2242,7 +2242,7 @@ mod tests {
     dict_id: 0, \
     dict_is_ordered: false, \
     metadata: {} } }\
-    ], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }, \
+    ], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] }, fields_map: {\"a\": 0} }, \
```
I suppose there's no easy way to omit this new field from the `Debug` output 🙁 (other than newtyping it?)

Not a big deal I guess.
The `Debug` impl is derived; the only option I see is to implement the `Debug` trait manually for `DFSchema`.

I'll create a followup on this.
```
#query TTTT rowsort
#SELECT * from information_schema.tables WHERE datafusion.information_schema.tables.table_schema='information_schema';
#----
#datafusion information_schema columns VIEW
#datafusion information_schema df_settings VIEW
#datafusion information_schema schemata VIEW
#datafusion information_schema tables VIEW
#datafusion information_schema views VIEW
```
Was there a reason this is being commented out?
Thanks @Jefffrey. Ideally I think this query shouldn't work. While debugging I found the following:

```
Error: SchemaError(FieldNotFound { field: Column { relation: Some(Full { catalog: "datafusion", schema: "information_schema", table: "tables" }), name: "table_schema" }, valid_fields: [Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_catalog" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_schema" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_name" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_type" }] }, Some(""))
```

So DF looks for a fully qualified column name in a list of partially qualified ones; IMHO this test shouldn't pass. We typically allow a less qualified name to match a more qualified one, like `a -> t1.a` or `a -> schema.t1.a`, but not the opposite.

I'm open to discussion.
I think it should still work.

If the default catalog is `datafusion`, then all tables that don't have a catalog specified should implicitly fall under the `datafusion` catalog. So querying `information_schema.tables` is the same as querying `datafusion.information_schema.tables`, which means we should be able to use `datafusion.information_schema.tables.table_schema` in the WHERE clause and resolve it.

I think I've worked on or seen an issue related to this before, and I agree it might be a bit confusing.
Oh, you mean if the catalog is not specified, by default we should use the `catalog.default_catalog` value?

Would you mind if I address it in a followup PR? I want to keep this PR at a reviewable size.
Yeah that sounds good 👍
Thanks @comphead -- this looks quite promising.

I ran some benchmarks via `cargo bench --bench sql_planner` against this branch and 8b50774 (the merge base). It seems to actually make performance worse. Full results: bench.log.txt. Here is a sample:
```
logical_select_one_from_700
                        time:   [1.2767 ms 1.2792 ms 1.2818 ms]
                        change: [+92.387% +93.053% +93.694%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

physical_select_one_from_700
                        time:   [12.250 ms 12.262 ms 12.276 ms]
                        change: [+209.66% +210.10% +210.56%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

..
```
It also might be worth making the map more sophisticated (such as avoiding copies on lookups, and only initializing it when actually needed rather than always).
cc @matthewmturner who I think is messing around with this code
datafusion/common/src/dfschema.rs
Outdated
```rust
/// Fields map
/// key - fully qualified field name
/// value - field index in schema
fields_map: BTreeMap<String, usize>,
```
If you use a `String` here it still requires a lot of copying.

What if you made something more specialized, like `fields_map: BTreeMap<(Option<OwnedTableReference>, Name), usize>`? I think that would then let you look up elements in the map without having to construct an owned name.
`Name` is also a `String`? Should this also cause a copy?

Btw, why does it cause a copy in the first place?
`String` in Rust is fully owned (and thus copies bytes around). You need something like `Arc<str>` to get Java-like semantics, where copies just move refcounts around.

Maybe we can adjust the underlying storage to `Arc` somehow 🤔
```diff
@@ -237,6 +301,7 @@ impl DFSchema {
        }
    }
    self.fields.extend(fields_to_add);
    self.fields_map = Self::get_fields_map(&self.fields);
```
Since merge is one of the hot paths, could we update `self.fields_map` incrementally rather than recomputing it? 🤔
Thanks @alamb, I'll try the suggestions. EDIT: How do you compare results from
👍
What I did was:

```shell
git checkout `git merge-base comphead/dev apache/main` # checkout where your branch diverged from main
cargo bench --bench sql_planner
git checkout comphead/dev
cargo bench --bench sql_planner
```
There is a bigger problem behind this; even simple SQL calls
Adding more input rows

Looks like a problem
Perhaps for this PR we could keep the focus on improving search by field performance with your initial implementation plus the points from @alamb? I am curious to see if we could still get some gains just from that and then separately look into how
Thanks @matthewmturner, I just checked; we are probably going in the wrong direction. So
Filed #9144
Which issue does this PR close?
Closes #.
Potentially this PR can improve performance for tickets
#5309
#7698
Rationale for this change
This PR starts improving DFSchema performance.

Before this PR, looking up a specific field in the schema was O(n): the entire fields collection was traversed, with qualifier checks, string clones, etc. on each iteration, and this routine runs for every field.

This PR introduces a hidden BTreeMap data structure for looking up fields in the schema in O(log n) time. The map is computed only once and updated accordingly when the fields change.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?