Add parallelism to parseBytes and transform #400
Merged
Conversation
benjben added a commit that referenced this pull request on Feb 13, 2025
istreeter requested changes on Feb 13, 2025
Review comment on modules/core/src/main/scala/com.snowplowanalytics.snowplow.bigquery/processing/Processing.scala (outdated, resolved)
istreeter approved these changes on Feb 13, 2025
For pods with more than 1 CPU, tests have shown that we get better throughput when adding some parallelism to the CPU-intensive steps. In fact, tests have shown that throughput is best when parallelism is added to the `transform` step only. I wonder if we should remove the parallelism on `parseBytes`, but given that it is there now, I'd be in favor of keeping it with a low default so that there is effectively no parallelism, and if one day we want to change it we can. Also, we've observed better throughput with `writeBatchConcurrency = 2`, so I updated that default.
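To make the idea concrete, here is a minimal fs2 sketch (in the spirit of the loader's Scala codebase, but not its actual code) of bounded parallelism on a CPU-intensive step. The names `parallelStep`, `pipeline` and the `cpuParallelism` parameter are illustrative assumptions; only `parseBytes`, `transform` and `writeBatchConcurrency` come from this change.

```scala
import cats.effect.IO
import fs2.Pipe

// Hypothetical sketch: run a CPU-intensive step with bounded parallelism.
// With cpuParallelism = 1 this degenerates to a sequential evalMap, i.e. the
// "low default, no parallelism" behavior described above.
def parallelStep[A, B](cpuParallelism: Int)(step: A => IO[B]): Pipe[IO, A, B] =
  _.parEvalMap(cpuParallelism)(step) // bounded concurrency, element order preserved

// Illustrative wiring: parseBytes kept sequential, transform parallelized.
def pipeline[Bytes, Event, Row](
  parseBytes: Bytes => IO[Event],
  transform: Event => IO[Row],
  cpuParallelism: Int
): Pipe[IO, Bytes, Row] =
  _.evalMap(parseBytes).through(parallelStep(cpuParallelism)(transform))
```

`parEvalMap` caps concurrency while preserving element order; the unordered variant `parEvalMapUnordered` can give slightly better utilization but reorders elements, which is usually undesirable when downstream batching and checkpointing depend on ordering.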
benjben added a commit that referenced this pull request on Feb 14, 2025
benjben added a commit that referenced this pull request on Feb 14, 2025
benjben added a commit that referenced this pull request on Feb 17, 2025
benjben pushed a commit that referenced this pull request on Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379, #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader