Add parallelism to parseBytes and transform #400
Merged
Conversation
benjben added a commit that referenced this pull request on Feb 13, 2025
istreeter requested changes on Feb 13, 2025
Review comment on modules/core/src/main/scala/com.snowplowanalytics.snowplow.bigquery/processing/Processing.scala (outdated, resolved)
istreeter approved these changes on Feb 13, 2025
For pods with more than 1 CPU, tests have shown that we get better throughput when adding some parallelism to the CPU-intensive steps. In fact, tests have shown that throughput is best when parallelism is added to the `transform` step only. I wonder if we should remove the parallelism on `parseBytes`, but given that it is there now, I'd be in favor of keeping it with a low default so that there is effectively no parallelism, and if one day we want to change it we can. Also, we've observed better throughput with `writeBatchConcurrency = 2`, so I updated that default.
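To make the idea concrete, here is a minimal fs2 sketch (in the spirit of the loader's Scala codebase, but not its actual code) of bounded parallelism on a CPU-intensive step. The names `parallelStep`, `pipeline` and the `cpuParallelism` parameter are illustrative assumptions; only `parseBytes`, `transform` and `writeBatchConcurrency` come from this change.

```scala
import cats.effect.IO
import fs2.Pipe

// Hypothetical sketch: run a CPU-intensive step with bounded parallelism.
// With cpuParallelism = 1 this degenerates to a sequential evalMap, i.e. the
// "low default, no parallelism" behavior described above.
def parallelStep[A, B](cpuParallelism: Int)(step: A => IO[B]): Pipe[IO, A, B] =
  _.parEvalMap(cpuParallelism)(step) // bounded concurrency, element order preserved

// Illustrative wiring: parseBytes kept sequential, transform parallelized.
def pipeline[Bytes, Event, Row](
  parseBytes: Bytes => IO[Event],
  transform: Event => IO[Row],
  cpuParallelism: Int
): Pipe[IO, Bytes, Row] =
  _.evalMap(parseBytes).through(parallelStep(cpuParallelism)(transform))
```

`parEvalMap` caps concurrency while preserving element order; the unordered variant `parEvalMapUnordered` can give slightly better utilization but reorders elements, which is usually undesirable when downstream batching and checkpointing depend on ordering.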
benjben added a commit that referenced this pull request on Feb 14, 2025
benjben added a commit that referenced this pull request on Feb 14, 2025
benjben added a commit that referenced this pull request on Feb 17, 2025
benjben pushed a commit that referenced this pull request on Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379, #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader