Commit 414e7a3

docs: Add benchmarking guide (apache#444)
* add benchmarking guide
* add ASF header
1 parent fe071e0 commit 414e7a3

2 files changed: +63 -0 lines changed

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Comet Benchmarking Guide

To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Benchmarking scripts are
available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.

Here is an example command for running the benchmarks. This command will need to be adapted based on the Spark
environment and location of data files.

This command assumes that `datafusion-benchmarks` is checked out in a parallel directory to `datafusion-comet`.

```shell
$SPARK_HOME/bin/spark-submit \
    --master "local[*]" \
    --conf spark.driver.memory=8G \
    --conf spark.executor.memory=64G \
    --conf spark.executor.cores=16 \
    --conf spark.cores.max=16 \
    --conf spark.eventLog.enabled=true \
    --conf spark.sql.autoBroadcastJoinThreshold=-1 \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.comet.parquet.io.enabled=false \
    --conf spark.comet.batchSize=8192 \
    --conf spark.comet.columnar.shuffle.enabled=false \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.sql.adaptive.coalescePartitions.enabled=false \
    --conf spark.comet.shuffle.enforceMode.enabled=true \
    ../datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpch \
    --data /mnt/bigdata/tpch/sf100-parquet/ \
    --queries ../datafusion-benchmarks/tpch/queries
```
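
The command relies on `SPARK_HOME`, `COMET_JAR`, and a sibling `datafusion-benchmarks` checkout, so it can be useful to verify these before launching a long run. The following is a minimal sketch of such a pre-flight check, not part of the benchmark scripts. The function name `check_prereqs` and the `BENCH_DIR` variable are hypothetical names introduced here for illustration.

```shell
# Pre-flight check (sketch). BENCH_DIR and check_prereqs are hypothetical
# names used for illustration; they are not defined by Comet or by the
# datafusion-benchmarks repository.
BENCH_DIR="${BENCH_DIR:-../datafusion-benchmarks}"

check_prereqs() {
    status=0
    # spark-submit must exist and be executable under SPARK_HOME
    if [ ! -x "${SPARK_HOME:-}/bin/spark-submit" ]; then
        echo "spark-submit not found under SPARK_HOME" >&2
        status=1
    fi
    # COMET_JAR must point to an existing file
    if [ ! -f "${COMET_JAR:-}" ]; then
        echo "COMET_JAR does not point to a file" >&2
        status=1
    fi
    # The runner script and query directory used by the command above
    if [ ! -f "$BENCH_DIR/runners/datafusion-comet/tpcbench.py" ]; then
        echo "tpcbench.py not found under $BENCH_DIR" >&2
        status=1
    fi
    if [ ! -d "$BENCH_DIR/tpch/queries" ]; then
        echo "TPC-H query directory not found under $BENCH_DIR" >&2
        status=1
    fi
    return "$status"
}
```

Calling `check_prereqs && echo "ready"` from the `datafusion-comet` directory prints `ready` only when all four checks pass; otherwise each missing piece is reported on standard error.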

Comet performance can be compared to regular Spark performance by running the benchmark twice, once with
`spark.comet.enabled` set to `true` and once with it set to `false`.
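
One way to script the two runs is sketched below. This is an illustration under stated assumptions rather than a supported workflow: the `RUNNER` variable, the `run_benchmark` function, and the default values for `SPARK_HOME` and `COMET_JAR` are hypothetical, and the `--conf` list is abbreviated, so the remaining flags from the full command would need to be added for a real comparison. By default the sketch only prints each command.

```shell
#!/bin/sh
# Sketch: run the TPC-H benchmark twice, toggling only spark.comet.enabled.
# Assumptions (not from the guide): RUNNER defaults to "echo", so each
# command is printed instead of executed; the SPARK_HOME and COMET_JAR
# defaults are placeholders.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
COMET_JAR="${COMET_JAR:-/path/to/comet.jar}"
RUNNER="${RUNNER-echo}"   # set RUNNER="" to actually invoke spark-submit

# The --conf list is abbreviated; add the remaining flags from the full
# command so that both runs differ only in spark.comet.enabled.
run_benchmark() {
    comet_enabled="$1"   # "true" or "false"
    $RUNNER "$SPARK_HOME/bin/spark-submit" \
        --master "local[*]" \
        --jars "$COMET_JAR" \
        --conf spark.driver.extraClassPath="$COMET_JAR" \
        --conf spark.executor.extraClassPath="$COMET_JAR" \
        --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
        --conf spark.comet.enabled="$comet_enabled" \
        ../datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
        --benchmark tpch \
        --data /mnt/bigdata/tpch/sf100-parquet/ \
        --queries ../datafusion-benchmarks/tpch/queries
}

for enabled in true false; do
    run_benchmark "$enabled"
done
```

Keeping every other setting identical between the two invocations means any difference in the reported timings can be attributed to Comet rather than to a configuration change.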

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
 Comet Plugin Overview <contributor-guide/plugin_overview>
 Development Guide <contributor-guide/development>
 Debugging Guide <contributor-guide/debugging>
+Benchmarking Guide <contributor-guide/benchmarking>
 Profiling Native Code <contributor-guide/profiling_native_code>
 Github and Issue Tracker <https://github.com/apache/datafusion-comet>
