WIP [Performance] Optimize DFSchema search by field #9104
Conversation
datafusion/common/src/dfschema.rs
Outdated
```rust
    );
    Ok(())
}
// #[test]
```
This test is currently incorrect IMHO: it expects an error, but it should run OK.
```diff
@@ -1146,35 +1212,6 @@ mod tests {
    Ok(())
}

#[allow(deprecated)]
```
Dropped, as we have other tests duplicating this behavior; furthermore, these tests referred to deprecated methods that have already been removed.
This makes sense 👍

I am curious though, is there any advantage to using a `BTreeMap` over a regular `HashMap`? It seems we aren't really making use of the properties of a `BTreeMap`, such as key ordering or self-balancing, as the map is always created once upfront (and during merges it's just created completely anew).
```diff
@@ -2242,7 +2242,7 @@ mod tests {
     dict_id: 0, \
     dict_is_ordered: false, \
     metadata: {} } }\
-    ], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }, \
+    ], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] }, fields_map: {\"a\": 0} }, \
```
I suppose there's no easy way to omit this new field from the `Debug` output 🙁 (other than newtyping it?)

Not a big deal I guess.
The `Debug` impl is derived; the only option I see is to implement the `Debug` trait manually for `DFSchema`.

I'll create a followup on this.
```
#query TTTT rowsort
#SELECT * from information_schema.tables WHERE datafusion.information_schema.tables.table_schema='information_schema';
#----
#datafusion information_schema columns VIEW
#datafusion information_schema df_settings VIEW
#datafusion information_schema schemata VIEW
#datafusion information_schema tables VIEW
#datafusion information_schema views VIEW
```
Was there a reason this is being commented out?
Thanks @Jefffrey. Ideally I think this query shouldn't work. While debugging I found the following:

```
Error: SchemaError(FieldNotFound { field: Column { relation: Some(Full { catalog: "datafusion", schema: "information_schema", table: "tables" }), name: "table_schema" }, valid_fields: [Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_catalog" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_schema" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_name" }, Column { relation: Some(Partial { schema: "information_schema", table: "tables" }), name: "table_type" }] }, Some(""))
```

So DF looks for a fully qualified column name in a list of partially qualified ones; IMHO this test shouldn't pass. We typically allow a less qualified name to match a more qualified one, like `a -> t1.a` or `a -> schema.t1.a`, but not the opposite.

I'm open to discussion.
I think it should still work.

If the default catalog is `datafusion`, then all tables that don't have a catalog specified should implicitly fall under the `datafusion` catalog. So querying `information_schema.tables` is the same as querying `datafusion.information_schema.tables`, which means we should be able to use `datafusion.information_schema.tables.table_schema` in the WHERE clause and resolve it.

I think I've worked on or seen an issue related to this before, and I agree it might be a bit confusing.
Oh, you mean if the catalog is not specified, by default we should use the `catalog.default_catalog` value?

Would you mind if I address it in a followup PR? I want to keep this PR at a reviewable size.
Yeah that sounds good 👍
Thanks @comphead -- this looks quite promising.

I ran some benchmarks via `cargo bench --bench sql_planner` against this branch and 8b50774 (the merge base). It seems to actually make performance worse. Full results: bench.log.txt. Here is a sample:
```
logical_select_one_from_700
                        time:   [1.2767 ms 1.2792 ms 1.2818 ms]
                        change: [+92.387% +93.053% +93.694%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

physical_select_one_from_700
                        time:   [12.250 ms 12.262 ms 12.276 ms]
                        change: [+209.66% +210.10% +210.56%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

..
```
It also might be worth making the map more sophisticated (such as avoiding copies on lookups, and only initializing it when actually needed rather than always).
cc @matthewmturner who I think is messing around with this code
datafusion/common/src/dfschema.rs
Outdated
```rust
/// Fields map
/// key - fully qualified field name
/// value - field index in schema
fields_map: BTreeMap<String, usize>,
```
If you use a `String` here it still requires a lot of copying.

What if you made something more specialized, like `fields_map: BTreeMap<(Option<OwnedTableReference>, Name), usize>`? I think that would then let you look up elements in the map without having to construct an owned name.
`Name` is also a `String`? Should this also cause a copy?

Btw, why does it cause a copy in the first place?
`String` in Rust is fully owned (and thus copies bytes around). You need something like `Arc<str>` to get Java-like semantics, where copies just move refcounts around.

Maybe we can adjust the underlying storage to `Arc` somehow 🤔
```diff
@@ -237,6 +301,7 @@ impl DFSchema {
        }
    }
    self.fields.extend(fields_to_add);
    self.fields_map = Self::get_fields_map(&self.fields);
```
Since merge is one of the hot paths, could we update `self.fields_map` incrementally rather than recomputing it? 🤔
Thanks @alamb, I'll try the suggestions. EDIT: How do you compare results from
👍
What I did was:

```shell
git checkout `git merge-base comphead/dev apache/main` # checkout where your branch diverged from main
cargo bench --bench sql_planner
git checkout comphead/dev
cargo bench --bench sql_planner
```
There is a bigger problem behind this; even simple SQL calls
Adding more input rows

Looks like a problem
Perhaps for this PR we could keep the focus on improving search by field performance with your initial implementation plus the points from @alamb? I am curious to see if we could still get some gains just from that and then separately look into how
Thanks @matthewmturner, I just checked; we are probably going in the wrong direction. So
Filed #9144
Which issue does this PR close?
Closes #.
Potentially this PR can improve performance for tickets
#5309
#7698
Rationale for this change
This PR starts improving DFSchema performance.

Before this PR, looking up a specific field in the schema was O(n): the entire fields collection was traversed, with qualifier checks, string clones, etc. on each iteration, and this routine runs for every field.

This PR introduces a hidden BTreeMap data structure for looking up fields in the schema in O(log n) time. The map is computed only once and updated accordingly when the fields change.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?