WIP [Performance] Optimize DFSchema search by field #9104


Closed
wants to merge 6 commits

Conversation

@comphead (Contributor) commented Feb 2, 2024

Which issue does this PR close?

Closes #.

Potentially this PR can improve performance for these tickets:
#5309
#7698

Rationale for this change

This PR starts improving DFSchema performance.
Before this PR, searching for a specific field in the schema was O(n): the entire fields collection was traversed, with qualifier checks, string clones, etc. on each iteration. This routine runs for every field lookup.

The PR introduces a hidden BTreeMap data structure for finding fields in the schema in O(log n); the map is computed only once, and updated accordingly when the fields change.
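The idea can be sketched as follows (names are illustrative, not DataFusion's actual API): build a map from fully qualified field name to field index once, so lookups avoid the O(n) scan with per-field string clones.

```rust
use std::collections::BTreeMap;

// Minimal sketch of the PR's approach. `SchemaIndex` is a hypothetical
// stand-in for DFSchema; only the indexing idea is shown.
struct SchemaIndex {
    fields: Vec<String>,                 // qualified field names, e.g. "t1.a"
    fields_map: BTreeMap<String, usize>, // qualified name -> index into `fields`
}

impl SchemaIndex {
    fn new(fields: Vec<String>) -> Self {
        // Built once, up front, instead of scanning on every lookup.
        let fields_map = fields
            .iter()
            .enumerate()
            .map(|(i, name)| (name.clone(), i))
            .collect();
        Self { fields, fields_map }
    }

    // O(log n) lookup instead of an O(n) traversal.
    fn index_of(&self, qualified_name: &str) -> Option<usize> {
        self.fields_map.get(qualified_name).copied()
    }
}
```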

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the optimizer (Optimizer rules), core (Core DataFusion crate), and sqllogictest (SQL Logic Tests (.slt)) labels on Feb 2, 2024
);
Ok(())
}
// #[test]
Contributor Author:

This test is currently incorrect IMHO: it expects an error, but it should run OK.

@@ -1146,35 +1212,6 @@ mod tests {
Ok(())
}

#[allow(deprecated)]
Contributor Author:

Dropped, as we have other tests duplicating this behavior; furthermore, these tests referred to deprecated and already-removed methods.

@comphead comphead marked this pull request as ready for review February 2, 2024 06:22
@comphead comphead changed the title [WIP] optimize DFSchema [Performance] Optimize DFSchema search by field Feb 2, 2024
@comphead comphead requested review from alamb and Jefffrey February 2, 2024 06:26
@Jefffrey (Contributor) left a comment

This makes sense 👍

I am curious though: is there any advantage to using a BTreeMap over a regular HashMap? It seems we aren't really making use of the properties of a BTreeMap, such as key ordering or self-balancing, since the map is always created once upfront (and during merges it's just recreated from scratch).
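The trade-off can be illustrated briefly: for a build-once, lookup-many index, a HashMap gives O(1) average lookups versus a BTreeMap's O(log n), at the cost of key ordering (which the schema index does not use anyway). The helper functions below are illustrative only.

```rust
use std::collections::{BTreeMap, HashMap};

// Build the same name -> index mapping with both map types; they answer the
// same queries, only lookup complexity and key ordering differ.
fn btree_index(names: &[&str]) -> BTreeMap<String, usize> {
    names.iter().enumerate().map(|(i, n)| (n.to_string(), i)).collect()
}

fn hash_index(names: &[&str]) -> HashMap<String, usize> {
    names.iter().enumerate().map(|(i, n)| (n.to_string(), i)).collect()
}
```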

@@ -2242,7 +2242,7 @@ mod tests {
dict_id: 0, \
dict_is_ordered: false, \
metadata: {} } }\
], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }, \
], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] }, fields_map: {\"a\": 0} }, \
Contributor:

I suppose there's no easy way to omit this new field from debug 🙁 (other than newtyping it?)

Not a big deal I guess

Contributor Author:

The Debug is derived; the only option I see is to implement the Debug trait manually for DFSchema.
I'll create a follow-up on this.
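The suggested follow-up could look roughly like this (DFSchemaLike is a hypothetical stand-in, not the real DFSchema): implement Debug by hand so the internal fields_map index no longer appears in the output.

```rust
use std::collections::BTreeMap;
use std::fmt;

struct DFSchemaLike {
    fields: Vec<String>,
    fields_map: BTreeMap<String, usize>, // internal index, hidden from Debug
}

impl fmt::Debug for DFSchemaLike {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("DFSchemaLike")
            .field("fields", &self.fields)
            // fields_map is skipped on purpose
            .finish()
    }
}
```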

Comment on lines +118 to +125
#query TTTT rowsort
#SELECT * from information_schema.tables WHERE datafusion.information_schema.tables.table_schema='information_schema';
#----
#datafusion information_schema columns VIEW
#datafusion information_schema df_settings VIEW
#datafusion information_schema schemata VIEW
#datafusion information_schema tables VIEW
#datafusion information_schema views VIEW
Contributor:

Was there a reason this was commented out?

Contributor Author:

Thanks @Jefffrey
Ideally, I think this query shouldn't work. While debugging I found the following:
Error: SchemaError(FieldNotFound { field: Column { relation: Some(Full { catalog: "datafusion", schema: "information_schema", table: "tables" }), name: "table_schema" }, valid_fields: [Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_catalog" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_schema" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_name" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_type" }] }, Some(""))

So DataFusion looks for a fully qualified column name in a list of partially qualified ones; IMHO this test shouldn't pass.
We typically allow a less qualified name to match a more qualified one, like a -> t1.a or a -> schema.t1.a, but not the opposite.
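That matching rule can be sketched with a small helper (illustrative only, not DataFusion's implementation): a reference with fewer qualifier parts may match a field stored with more parts, but not the other way around.

```rust
// "a" matches "t1.a" and "schema.t1.a"; "schema.t1.a" does not match "t1.a".
fn reference_matches(reference: &str, stored: &str) -> bool {
    let r: Vec<&str> = reference.split('.').collect();
    let s: Vec<&str> = stored.split('.').collect();
    // The reference's parts must be a suffix of the stored field's parts.
    s.ends_with(&r)
}
```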

I'm open to discussions.

Contributor:

I think it should still work.

If the default catalog is datafusion, then all tables that don't have a catalog specified should implicitly fall under datafusion catalog. So querying information_schema.tables is the same as querying datafusion.information_schema.tables which means we should be able to use datafusion.information_schema.tables.table_schema in the where clause and resolve it.
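The resolution rule being described could be sketched like this (illustrative helper, not DataFusion's API): a partially qualified table reference is completed with the session's default catalog, so "information_schema.tables" and "datafusion.information_schema.tables" name the same table when the default catalog is "datafusion".

```rust
// Complete a dotted table reference against a default catalog.
fn resolve_table(default_catalog: &str, reference: &str) -> String {
    match reference.split('.').count() {
        3 => reference.to_string(),                    // already catalog.schema.table
        2 => format!("{default_catalog}.{reference}"), // schema.table
        _ => panic!("expected schema.table or catalog.schema.table"),
    }
}
```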

I think I've worked on or seen an issue related to this before, and I agree it might be a bit confusing.

Contributor Author:

Oh, you mean if the catalog is not specified, by default we should use the catalog.default_catalog value?
Would you mind if I address it in a follow-up PR? I want to keep this PR's size reviewable.

Contributor:

Yeah that sounds good 👍

@alamb (Contributor) left a comment

Thanks @comphead -- this looks quite promising

I ran some benchmarks via cargo bench --bench sql_planner against this branch and 8b50774 (the merge base)

It seems to actually make performance worse. Full results are in bench.log.txt. Here is a sample:

logical_select_one_from_700
                        time:   [1.2767 ms 1.2792 ms 1.2818 ms]
                        change: [+92.387% +93.053% +93.694%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

physical_select_one_from_700
                        time:   [12.250 ms 12.262 ms 12.276 ms]
                        change: [+209.66% +210.10% +210.56%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
..

It also might be worth making the map more sophisticated, such as avoiding copies on lookups, and only initializing it when actually needed rather than always.

cc @matthewmturner who I think is messing around with this code

/// Fields map
/// key - fully qualified field name
/// value - field index in schema
fields_map: BTreeMap<String, usize>,
Contributor:

If you use a String here it still requires a lot of copying

What if you made something more specialized like:

fields_map: BTreeMap<(Option<OwnedTableReference>, Name), usize>

I think that would then let you look up elements in the map without having to construct an owned name
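The copy-on-lookup concern can be made concrete (illustrative helpers, not DataFusion's code): with a String key, any lookup starting from a qualifier + name pair must first build an owned "qualifier.name" String, whereas a lookup by an existing &str allocates nothing, because BTreeMap::get accepts any borrowed form of the key.

```rust
use std::collections::BTreeMap;

// Cheap path: &str borrows the caller's data, no allocation per lookup.
fn lookup(fields_map: &BTreeMap<String, usize>, qualified_name: &str) -> Option<usize> {
    fields_map.get(qualified_name).copied()
}

// Costly path the review comment wants to avoid: an allocation per lookup
// just to form the composite key from its parts.
fn lookup_by_parts(
    fields_map: &BTreeMap<String, usize>,
    qualifier: &str,
    name: &str,
) -> Option<usize> {
    lookup(fields_map, &format!("{qualifier}.{name}"))
}
```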

Contributor Author:

Name is also a String; wouldn't that also cause a copy?
BTW, why does it cause a copy in the first place?

Contributor:

String in Rust is fully owned (and thus copies bytes around).

You need something like Arc<str> to get Java-like semantics, where copies just move refcounts around.

Maybe we can adjust the underlying storage to Arc somehow 🤔
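The Arc<str> point can be demonstrated in a few lines: cloning an Arc<str> bumps a reference count and shares the bytes, whereas cloning a String copies them.

```rust
use std::sync::Arc;

// Returns two handles to the same shared string data.
fn share(name: &str) -> (Arc<str>, Arc<str>) {
    let shared: Arc<str> = Arc::from(name); // bytes copied once, at creation
    let alias = Arc::clone(&shared);        // refcount bump, no byte copy
    (shared, alias)
}
```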

@@ -237,6 +301,7 @@ impl DFSchema {
}
}
self.fields.extend(fields_to_add);
self.fields_map = Self::get_fields_map(&self.fields);
Contributor:

Since merge is one of the hot paths, could we update self.fields_map incrementally rather than recomputing it? 🤔
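The suggested incremental update could look roughly like this (illustrative names, not the actual DFSchema::merge): when merge appends new fields, insert index entries for just those fields instead of rebuilding the whole map.

```rust
use std::collections::BTreeMap;

// Append fields and extend the index in place: O(k log n) for k new fields,
// instead of recomputing the entire map from scratch.
fn merge_fields(
    fields: &mut Vec<String>,
    fields_map: &mut BTreeMap<String, usize>,
    fields_to_add: Vec<String>,
) {
    let start = fields.len();
    for (offset, name) in fields_to_add.iter().enumerate() {
        fields_map.insert(name.clone(), start + offset);
    }
    fields.extend(fields_to_add);
}
```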

@comphead comphead requested a review from Jefffrey February 2, 2024 23:01
@comphead (Contributor Author) commented Feb 2, 2024

Thanks @alamb I'll try suggestions

EDIT: How do you compare results from cargo bench --bench sql_planner for different commits? Do you run the bench for both branches and then compare benches like logical_select_one_from_700?

@alamb (Contributor) commented Feb 3, 2024

Thanks @alamb I'll try suggestions

👍

EDIT: How do you compare results from cargo bench --bench sql_planner for different commits? Do you run the bench for both branches and then compare benches like logical_select_one_from_700?

What I did was

git checkout `git merge-base comphead/dev apache/main` # checkout place where your branch diverged from main
cargo bench --bench sql_planner
git checkout comphead/dev
cargo bench --bench sql_planner 

@alamb alamb marked this pull request as draft February 4, 2024 12:17
@comphead (Contributor Author) commented Feb 5, 2024

There is a bigger problem behind this; even a simple SQL query

        let sql = "select a, a + 1 b from (select 1 a union all select 2 a)";

calls Schema::new_with_metadata 58 times and Schema::merge 35 times.
I'll try to look at the problem from the other side, to avoid excessive schema transform calls.

@comphead (Contributor Author) commented Feb 5, 2024

Adding more input rows

        let sql = "select a, a + 1 b from (select 1 a union all select 2 a union all select 3 a)";
rows | new_with_metadata | merge
   2 |                58 |    35
   3 |               131 |    83
   4 |               165 |   109

Looks like a problem

@matthewmturner (Contributor):
Perhaps for this PR we could keep the focus on improving search by field performance with your initial implementation plus the points from @alamb? I am curious to see if we could still get some gains just from that and then separately look into how merge / new_with_metadata are called.

@comphead comphead changed the title [Performance] Optimize DFSchema search by field WIP [Performance] Optimize DFSchema search by field Feb 6, 2024
@comphead (Contributor Author) commented Feb 7, 2024

Perhaps for this PR we could keep the focus on improving search by field performance with your initial implementation plus the points from @alamb? I am curious to see if we could still get some gains just from that and then separately look into how merge / new_with_metadata are called.

Thanks @matthewmturner. I just checked, and we were probably going in the wrong direction: index_of_column_by_name is not a bottleneck, as it's called just 6 times. The bigger problem is the excessive number of calls to schema transform functions, so this PR doesn't make sense anymore; it cannot bring a benefit for now. I'm closing it and will create another issue to optimize the optimizer :) in the schema part.

@comphead (Contributor Author) commented Feb 7, 2024

Filed #9144

@comphead comphead closed this Feb 7, 2024