Minor: Remove clone in transform_to_states #12707

Merged: jayzhan211 merged 2 commits into apache:main from rm-clone-v4 on Oct 2, 2024

Conversation

jayzhan211
Contributor

Which issue does this PR close?

While working on #12697, I found this optimization. Because its effect is not negligible, I've split it into its own PR so that the benchmark results for #12697 are more accurate.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Benchmark

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ rm-clone-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.53ms │      0.42ms │ +1.25x faster │
│ QQuery 1     │    45.57ms │     44.49ms │     no change │
│ QQuery 2     │    88.79ms │     80.65ms │ +1.10x faster │
│ QQuery 3     │    74.46ms │     67.41ms │ +1.10x faster │
│ QQuery 4     │   423.98ms │    391.22ms │ +1.08x faster │
│ QQuery 5     │   678.29ms │    673.73ms │     no change │
│ QQuery 6     │    40.60ms │     37.46ms │ +1.08x faster │
│ QQuery 7     │    45.41ms │     40.97ms │ +1.11x faster │
│ QQuery 8     │   614.83ms │    605.85ms │     no change │
│ QQuery 9     │   691.24ms │    665.57ms │     no change │
│ QQuery 10    │   207.39ms │    195.89ms │ +1.06x faster │
│ QQuery 11    │   226.88ms │    212.60ms │ +1.07x faster │
│ QQuery 12    │   738.67ms │    713.79ms │     no change │
│ QQuery 13    │   933.90ms │    879.69ms │ +1.06x faster │
│ QQuery 14    │   852.38ms │    821.42ms │     no change │
│ QQuery 15    │   507.49ms │    503.20ms │     no change │
│ QQuery 16    │  1722.33ms │   1310.97ms │ +1.31x faster │
│ QQuery 17    │  1422.20ms │   1175.26ms │ +1.21x faster │
│ QQuery 18    │  4047.53ms │   3119.16ms │ +1.30x faster │
│ QQuery 19    │    55.00ms │     56.74ms │     no change │
│ QQuery 20    │  1015.81ms │    922.51ms │ +1.10x faster │
│ QQuery 21    │  1217.63ms │   1213.17ms │     no change │
│ QQuery 22    │  3347.57ms │   3168.49ms │ +1.06x faster │
│ QQuery 23    │  8226.45ms │   7974.39ms │     no change │
│ QQuery 24    │   488.81ms │    510.23ms │     no change │
│ QQuery 25    │   490.35ms │    493.62ms │     no change │
│ QQuery 26    │   566.92ms │    553.18ms │     no change │
│ QQuery 27    │  1475.90ms │   1396.85ms │ +1.06x faster │
│ QQuery 28    │ 10447.53ms │  10713.14ms │     no change │
│ QQuery 29    │   395.86ms │    391.29ms │     no change │
│ QQuery 30    │   757.40ms │    707.42ms │ +1.07x faster │
│ QQuery 31    │   684.84ms │    653.02ms │     no change │
│ QQuery 32    │  3911.74ms │   3190.83ms │ +1.23x faster │
│ QQuery 33    │  4077.03ms │   3500.06ms │ +1.16x faster │
│ QQuery 34    │  4718.41ms │   3702.13ms │ +1.27x faster │
│ QQuery 35    │  1088.16ms │   1001.00ms │ +1.09x faster │
│ QQuery 36    │   149.16ms │    146.17ms │     no change │
│ QQuery 37    │   103.00ms │    102.51ms │     no change │
│ QQuery 38    │   108.70ms │    107.61ms │     no change │
│ QQuery 39    │   324.72ms │    311.87ms │     no change │
│ QQuery 40    │    32.81ms │     32.83ms │     no change │
│ QQuery 41    │    32.10ms │     31.78ms │     no change │
│ QQuery 42    │    40.00ms │     39.60ms │     no change │
└──────────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)          │ 57118.33ms │
│ Total Time (rm-clone-v4)   │ 52460.22ms │
│ Average Time (main)        │  1328.33ms │
│ Average Time (rm-clone-v4) │  1220.01ms │
│ Queries Faster             │         20 │
│ Queries Slower             │          0 │
│ Queries with No Change     │         23 │
└────────────────────────────┴────────────┘
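
For readers who want to check the "Change" column, the sketch below recomputes it as the ratio of the `main` time to the `rm-clone-v4` time. The function names are illustrative, and the 5% noise threshold is an assumption that happens to fit the rows above; it is not taken from the benchmark tooling.

```rust
/// Ratio of baseline time to candidate time.
/// Values above 1.0 mean the candidate branch is faster.
fn speedup(main_ms: f64, branch_ms: f64) -> f64 {
    main_ms / branch_ms
}

/// Label a pair of timings in the style of the "Change" column.
/// ASSUMPTION: `threshold` (e.g. 0.05 for 5%) is a guessed noise band,
/// not a value confirmed from the comparison script.
fn classify(main_ms: f64, branch_ms: f64, threshold: f64) -> String {
    let ratio = speedup(main_ms, branch_ms);
    if ratio > 1.0 + threshold {
        format!("+{ratio:.2}x faster")
    } else if 1.0 / ratio > 1.0 + threshold {
        format!("{:.2}x slower", 1.0 / ratio)
    } else {
        "no change".to_string()
    }
}

fn main() {
    // Timings copied from the table above.
    println!("Query 16: {}", classify(1722.33, 1310.97, 0.05));
    println!("Query 31: {}", classify(684.84, 653.02, 0.05));
    println!("Total:    {:.2}x", speedup(57118.33, 52460.22));
}
```

With these inputs, Query 16 lands in the "faster" band and Query 31 stays inside the assumed noise band, matching the table.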

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Oct 2, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Contributor

@alamb alamb left a comment

Thanks @jayzhan211 -- makes sense to me

I reminded myself about RecordBatch:
https://docs.rs/arrow-array/53.0.0/src/arrow_array/record_batch.rs.html#72

It does indeed appear that cloning a record batch requires cloning a Vec (not just bumping some reference counts).

Now that I see that I wonder if we should contemplate changing RecordBatch to not use Vec 🤔 (it could e.g. use Arc<[]>)
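
To make the difference concrete, here is a minimal standard-library sketch. `VecBatch` and `SharedBatch` are hypothetical stand-ins, not the real `RecordBatch`, and `Column` is a simplified placeholder for an Arrow array reference. It shows that cloning a `Vec` of `Arc`s allocates a new pointer list and bumps every column's reference count, while cloning an `Arc<[...]>` bumps a single count.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a reference-counted Arrow column.
type Column = Arc<Vec<i64>>;

// Models a batch that stores its columns in a Vec.
#[derive(Clone)]
struct VecBatch {
    columns: Vec<Column>,
}

// Models the alternative floated above: columns behind Arc<[..]>.
#[derive(Clone)]
struct SharedBatch {
    columns: Arc<[Column]>,
}

fn make_columns(n: usize) -> Vec<Column> {
    (0..n).map(|i| Arc::new(vec![i as i64; 4])).collect()
}

fn main() {
    let vec_batch = VecBatch { columns: make_columns(3) };
    let shared_batch = SharedBatch { columns: make_columns(3).into() };

    let vec_clone = vec_batch.clone();
    let shared_clone = shared_batch.clone();

    // Vec clone: a fresh allocation for the list of column pointers...
    assert_ne!(vec_batch.columns.as_ptr(), vec_clone.columns.as_ptr());
    // ...while the column data itself is still shared,
    assert!(Arc::ptr_eq(&vec_batch.columns[0], &vec_clone.columns[0]));
    // at the cost of one reference-count bump per column.
    assert_eq!(Arc::strong_count(&vec_batch.columns[0]), 2);

    // Arc<[..]> clone: both batches point at the same slice,
    assert!(Arc::ptr_eq(&shared_batch.columns, &shared_clone.columns));
    // and the per-column counts are untouched.
    assert_eq!(Arc::strong_count(&shared_batch.columns[0]), 1);
}
```

The trade-off is that an `Arc<[...]>` cannot be mutated in place once shared, so code that edits the column list would need to rebuild it.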

@jayzhan211
Contributor Author

jayzhan211 commented Oct 2, 2024

Now that I see that I wonder if we should contemplate changing RecordBatch to not use Vec 🤔 (it could e.g. use Arc<[]>)

Maybe; we need a benchmark to confirm it.

  • I think we need to check whether we have many unavoidable clones
  • And verify that Arc<[]> actually reduces the cost
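
A rough way to start on the second point is a standard-library micro-benchmark like the sketch below. It is only a first approximation: `Column` and `time_clones` are hypothetical names, the column count and iteration count are arbitrary, and a real comparison would use a proper benchmark harness and actual `RecordBatch` workloads.

```rust
use std::hint::black_box;
use std::sync::Arc;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a reference-counted Arrow column.
type Column = Arc<Vec<i64>>;

/// Time `iterations` clones of `value`, dropping each clone immediately.
/// `black_box` keeps the optimizer from removing the clone.
fn time_clones<T: Clone>(value: &T, iterations: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(black_box(value).clone());
    }
    start.elapsed()
}

fn main() {
    // 100 columns is an arbitrary width chosen for illustration.
    let columns: Vec<Column> = (0..100).map(|i| Arc::new(vec![i; 8])).collect();
    let shared: Arc<[Column]> = columns.clone().into();

    let iterations = 100_000;
    let vec_time = time_clones(&columns, iterations);
    let arc_time = time_clones(&shared, iterations);

    println!("Vec<Column> clone:   {vec_time:?} for {iterations} iterations");
    println!("Arc<[Column]> clone: {arc_time:?} for {iterations} iterations");
}
```

Build with `--release` before reading anything into the numbers; debug builds distort allocation-heavy loops.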

@jayzhan211 jayzhan211 merged commit a515dec into apache:main Oct 2, 2024
24 checks passed
@jayzhan211 jayzhan211 deleted the rm-clone-v4 branch October 2, 2024 11:31
@alamb
Contributor

alamb commented Oct 2, 2024

Maybe; we need a benchmark to confirm it.

100%

alamb added a commit that referenced this pull request Oct 8, 2024
* Add support for external tables with qualified names (#12645)

* Make  support schemas

* Set default name to table

* Remove print statements and stale comment

* Add tests for create table

* Fix typo

* Update datafusion/sql/src/statement.rs

Co-authored-by: Jonah Gao <jonahgao@msn.com>

* convert create_external_table to objectname

* Add sqllogic tests

* Fix failing tests

---------

Co-authored-by: Jonah Gao <jonahgao@msn.com>

* Fix Regex signature types (#12690)

* Fix Regex signature types

* Uncomment the shared tests in string_query.slt.part and removed tests copies everywhere else

* Test `LIKE` and `MATCH` with flags; Remove new tests from regexp.slt

* Refactor `ByteGroupValueBuilder` to use `MaybeNullBufferBuilder` (#12681)

* Fix malformed hex string literal in docs (#12708)

* Simplify match patterns in coercion rules (#12711)

Remove conditions where unnecessary.
Refactor to improve readability.

* Remove aggregate functions dependency on frontend (#12715)

* Remove aggregate functions dependency on frontend

DataFusion is a SQL query engine and also a reusable library for
building query engines. The core functionality should not depend on
frontend related functionalities like `sqlparser` or `datafusion-sql`.

* Remove duplicate license header

* Minor: Remove clone in `transform_to_states` (#12707)

* rm clone

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* Refactor tests for union sorting properties, add tests for unions and constants (#12702)

* Refactor tests for union sorting properties

* update doc test

* Undo import reordering

* remove unecessary static lifetimes

* Fix: support Qualified Wildcard in count aggregate function (#12673)

* Reduce code duplication in `PrimitiveGroupValueBuilder` with const generics (#12703)

* Reduce code duplication in `PrimitiveGroupValueBuilder` with const generics

* Fix docs

* Disallow duplicated qualified field names (#12608)

* Disallow duplicated qualified field names

* Fix tests

* Optimize base64/hex decoding by pre-allocating output buffers (~2x faster) (#12675)

* add bench

* replace macro with generic function

* remove duplicated code

* optimize base64/hex decode

* Allow DynamicFileCatalog support to query partitioned file (#12683)

* support to query partitioned table for dynamic file catalog

* cargo clippy

* split partitions inferring to another function

* Support `LIMIT` Push-down logical plan optimization for `Extension` nodes (#12685)

* Update trait `UserDefinedLogicalNodeCore`

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Update corresponding interface

Signed-off-by: Austin Liu <austin362667@gmail.com>

Add rewrite rule for `push-down-limit` for `Extension`

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add rewrite rule for `push-down-limit` for `Extension` and tests

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Update corresponding interface

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Reorganize to match guard

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Clena up

Signed-off-by: Austin Liu <austin362667@gmail.com>

Clean up

Signed-off-by: Austin Liu <austin362667@gmail.com>

---------

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Fix AvroReader: Add union resolving for nested struct arrays (#12686)

* Add union resolving for nested struct arrays

* Add test

* Change test

* Reproduce index error

* fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Adds macros for creating `WindowUDF` and `WindowFunction` expression (#12693)

* Adds macro for udwf singleton

* Adds a doc comment parameter to macro

* Add doc comment for `create_udwf` macro

* Uses default constructor

* Update `Cargo.lock` in `datafusion-cli`

* Fixes: expand `$FN_NAME` in doc strings

* Adds example for macro usage

* Renames macro

* Improve doc comments

* Rename udwf macro

* Minor: doc copy edits

* Adds macro for creating fluent-style expression API

* Adds support for 1 or more parameters in expression function

* Rewrite doc comments

* Rename parameters

* Minor: formatting

* Adds doc comment for `create_udwf_expr` macro

* Improve example docs

* Hides extraneous code in doc comments

* Add a one-line readme

* Adds doc test assertions + minor formatting fixes

* Adds common macro for defining user-defined window functions

* Adds doc comment for `define_udwf_and_expr`

* Defines `RowNumber` using common macro

* Add usage example for common macro

* Adds usage for custom constructor

* Add examples for remaining patterns

* Improve doc comments for usage examples

* Rewrite inner line docs

* Rewrite `create_udwf_expr!` doc comments

* Minor doc improvements

* Fix doc test and usage example

* Add inline comments for macro patterns

* Minor: change doc comment in example

* Support unparsing plans with both Aggregation and Window functions (#12705)

* Support unparsing plans with both Aggregation and Window functions (#35)

* Fix unparsing for aggregation grouping sets

* Add test for grouping set unparsing

* Update datafusion/sql/src/unparser/utils.rs

Co-authored-by: Jax Liu <liugs963@gmail.com>

* Update datafusion/sql/src/unparser/utils.rs

Co-authored-by: Jax Liu <liugs963@gmail.com>

* Update

* More tests

---------

Co-authored-by: Jax Liu <liugs963@gmail.com>

* Fix strpos invocation with dictionary and null (#12712)

In 1b3608d `strpos` signature was
modified to indicate it supports dictionary as input argument, but the
invoke method doesn't support them.

* docs: Update DataFusion introduction to clarify that DataFusion does provide an "out of the box" query engine (#12666)

* Update DataFusion introduction to show that DataFusion offers packaged versions for end users

* change order

* Update README.md

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* refine wording and update user guide for consistency

* prettier

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Framework for generating function docs from embedded code documentation (#12668)

* Initial work on #12432 to allow for generation of udf docs from embedded documentation in the code

* Add missing license header.

* Fixed examples.

* Fixing a really weird RustRover/wsl ... something. No clue what happened there.

* permission change

* Cargo fmt update.

* Refactored Documentation to allow it to be used in a const.

* Add documentation for syntax_example

* Refactoring Documentation based on PR feedback.

* Cargo fmt update.

* Doc update

* Fixed copy/paste error.

* Minor text updates.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Add IMDB(JOB) Benchmark [2/N] (imdb queries) (#12529)

* imdb dataset

* cargo fmt

* Add 113 queries for IMDB(JOB)

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add `get_query_sql` from `query_id` string

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Fix CSV reader & Remove Parquet partition

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add benchmark IMDB runner

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add `run_imdb` script

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add checker for imdb option

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add SLT for IMDB

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Clean up

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add missing license

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add IMDB(JOB) queries `2b` to `5c`

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Add `INCLUDE_IMDB` in CI verify-benchmark-results

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Prepare IMDB dataset

Signed-off-by: Austin Liu <austin362667@gmail.com>

Prepare IMDB dataset

Signed-off-by: Austin Liu <austin362667@gmail.com>

* use uint as id type

* format

* Seperate `tpch` and `imdb` benchmarking CI jobs

Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix path

Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix path

Signed-off-by: Austin Liu <austin362667@gmail.com>

Remove `tpch` in `imdb` benchmark

Signed-off-by: Austin Liu <austin362667@gmail.com>

* Remove IMDB(JOB) slt in CI

Signed-off-by: Austin Liu <austin362667@gmail.com>

Remove IMDB(JOB) slt in CI

Signed-off-by: Austin Liu <austin362667@gmail.com>

---------

Signed-off-by: Austin Liu <austin362667@gmail.com>
Co-authored-by: DouPache <douenergy@gmail.com>

* Minor: avoid clone while calculating union equivalence properties (#12722)

* Minor: avoid clone while calculating union equivalence properties

* Update datafusion/physical-expr/src/equivalence/properties.rs

* fmt

* Simplify streaming_merge function parameters (#12719)

* simplify streaming_merge function parameters

* revert test change

* change StreamingMergeConfig into builder pattern

* Fix links on docs index page (#12750)

* Provide field and schema metadata missing on cross joins, and union with null fields. (#12729)

* test: reproducer for missing schema metadata on cross join

* fix: pass thru schema metadata on cross join

* fix: preserve metadata when transforming to view types

* test: reproducer for missing field metadata in left hand NULL field of union

* fix: preserve field metadata from right side of union

* chore: safe indexing

* Minor: Update string tests for strpos (#12739)

* Apply `type_union_resolution` to array and values (#12753)

* cleanup make array coercion rule

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* change to type union resolution

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* change value too

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix tpyo

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* Add `DocumentationBuilder::with_standard_argument` to reduce copy/paste (#12747)

* Add `DocumentationBuilder::with_standard_expression` to reduce copy/paste

* fix doc

* fix standard argument

* Update docs

* Improve documentation to explain what is different

* fix `equal_to` in `PrimitiveGroupValueBuilder` (#12758)

* fix `equal_to` in `PrimitiveGroupValueBuilder`.

* fix typo.

* add uts.

* reduce calling of `is_null`.

* Minor: doc how field name is to be set (#12757)

* Fix `equal_to` in `ByteGroupValueBuilder` (#12770)

* Fix `equal_to` in `ByteGroupValueBuilder`

* refactor null_equal_to

* Update datafusion/physical-plan/src/aggregates/group_values/group_column.rs

* Allow simplification even when nullable (#12746)

The nullable requirement seem to have been added in #1401 but as far as
I can tell they are not needed for these 2 cases.

I think this can be shown using this truth table: (generated using
datafusion-cli without this patch)
```
> CREATE TABLE t (v BOOLEAN) as values (true), (false), (NULL);
> select t.v, t2.v, t.v AND (t.v OR t2.v), t.v OR (t.v AND t2.v) from t cross join t as t2;
+-------+-------+---------------------+---------------------+
| v     | v     | t.v AND t.v OR t2.v | t.v OR t.v AND t2.v |
+-------+-------+---------------------+---------------------+
| true  | true  | true                | true                |
| true  | false | true                | true                |
| true  |       | true                | true                |
| false | true  | false               | false               |
| false | false | false               | false               |
| false |       | false               | false               |
|       | true  |                     |                     |
|       | false |                     |                     |
|       |       |                     |                     |
+-------+-------+---------------------+---------------------+
```

And it seems Spark applies both of these and DuckDB applies only the
first one.

* Fix unnest conjunction with selecting wildcard expression (#12760)

* fix unnest statement with wildcard expression

* add commnets

* Improve `round` scalar function unparsing for Postgres (#12744)

* Postgres: enforce required `NUMERIC` type for `round` scalar function (#34)

Includes initial support for dialects to override scalar functions unparsing

* Document scalar_function_to_sql_overrides fn

* Fix stack overflow calculating projected orderings (#12759)

* Fix stack overflow calculating projected orderings

* fix docs

* Port / Add Documentation for `VarianceSample` and `VariancePopulation` (#12742)

* Upgrade arrow/parquet to `53.1.0` / fix clippy (#12724)

* Update to arrow/parquet 53.1.0

* Update some API

* update for changed file sizes

* Use non deprecated APIs

* Use ParquetMetadataReader from @etseidl

* remove upstreamed implementation

* Update CSV schema

* Use upstream is_null and is_not_null kernels

* feat: add support for Substrait ExtendedExpression (#12728)

* Add support for serializing and deserializing Substrait ExtendedExpr message

* Address clippy reviews

* Reuse existing rename method

* Transformed::new_transformed: Fix documentation formatting (#12787)

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* fix: Correct results for grouping sets when columns contain nulls (#12571)

* Fix grouping sets behavior when data contains nulls

* PR suggestion comment

* Update new test case

* Add grouping_id to the logical plan

* Add doc comment next to INTERNAL_GROUPING_ID

* Fix unparsing of Aggregate with grouping sets

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Migrate documentation for all string functions from scalar_functions.md to code  (#12775)

* Added documentation for string and unicode functions.

* Fixed issues with aliases.

* Cargo fmt.

* Minor doc fixes.

* Update docs for var_pop/samp

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Account for constant equivalence properties in union, tests (#12562)

* Minor: clarify comment about empty dependencies (#12786)

* Introduce Signature::String and return error if  input of `strpos` is integer (#12751)

* fix sig

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix error

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix all signature

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix all signature

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* change default type

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* clippy

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix docs

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm deadcode

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* Minor: improve docs on MovingMin/MovingMax (#12790)

* Add slt tests (#12721)

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: Austin Liu <austin362667@gmail.com>
Co-authored-by: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Co-authored-by: Jonah Gao <jonahgao@msn.com>
Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Tomoaki Kawada <kawada@kmckk.co.jp>
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Co-authored-by: Jay Zhan <jayzhan211@gmail.com>
Co-authored-by: HuSen <husen.xjtu@gmail.com>
Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
Co-authored-by: Simon Vandel Sillesen <simon.vandel@gmail.com>
Co-authored-by: Jax Liu <liugs963@gmail.com>
Co-authored-by: Austin Liu <austin362667@gmail.com>
Co-authored-by: JonasDev1 <jswipp@googlemail.com>
Co-authored-by: jcsherin <jacob@protoship.io>
Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
Co-authored-by: Andy Grove <agrove@apache.org>
Co-authored-by: Bruce Ritchie <bruce.ritchie@veeva.com>
Co-authored-by: DouPache <douenergy@gmail.com>
Co-authored-by: mertak-synnada <mertak67+synaada@gmail.com>
Co-authored-by: Bryce Mecum <petridish@gmail.com>
Co-authored-by: wiedld <wiedld@users.noreply.github.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Val Lorentz <vlorentz@softwareheritage.org>