
Add parallelism to parseBytes and transform #400

Merged: 1 commit merged into v2 on Feb 14, 2025
Conversation

@benjben (Contributor) commented Feb 13, 2025

For pods with more than 1 CPU, tests have shown that we get better throughput when adding some parallelism on CPU-intensive steps.

Actually, tests have shown that we get the best throughput when adding parallelism to the transform step only.

I wonder if we should remove the parallelism on parseBytes, but given that it is there now, I'd be in favor of keeping it with a low default so that there is effectively no parallelism, and if one day we want to change it, we can.

Also, we've observed better throughput with writeBatchConcurrency = 2, so I updated the default.
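The defaults described above could be expressed in the loader's HOCON configuration along these lines. Note this is a hypothetical sketch: the key names `cpuParallelism`, `parseBytesFactor`, and `transformFactor` are illustrative assumptions, not the loader's actual config schema; only `writeBatchConcurrency = 2` is stated in this PR.

```hocon
{
  # Hypothetical keys illustrating the parallelism described in this PR
  "cpuParallelism": {
    # kept low by default, so parseBytes effectively stays sequential
    "parseBytesFactor": 0.1
    # parallelism on the transform step, where tests showed the best throughput
    "transformFactor": 0.5
  }

  "batching": {
    # updated default from this PR: 2 gave better throughput than 1
    "writeBatchConcurrency": 2
  }
}
```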

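The shape of the change can be sketched as a bounded parallel map over the CPU-intensive transform step. This is a hypothetical Python illustration (the loader itself would use its own streaming primitives); `transform_batch` and `parallelism` are made-up names. With `parallelism = 1` the step stays sequential, matching the low default suggested for parseBytes, while a higher value fans work out over a bounded pool.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(event):
    # stand-in for the CPU-intensive transform step on one event
    return {"id": event["id"], "payload": event["payload"].upper()}


def transform_batch(events, parallelism=1):
    """Apply transform to a batch, with bounded parallelism.

    parallelism <= 1 keeps the step sequential (the conservative default);
    a higher value distributes events over a fixed-size worker pool.
    """
    if parallelism <= 1:
        return [transform(e) for e in events]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(transform, events))
```

The key design point mirrored here is that parallelism is a tunable with a safe default: setting it to 1 reproduces the old sequential behaviour, so the knob can be raised later without a code change.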
@benjben benjben merged commit 13292fc into v2 Feb 14, 2025
2 checks passed
@benjben benjben deleted the parallelism branch February 14, 2025 09:22
benjben added a commit that referenced this pull request Feb 14, 2025
benjben added a commit that referenced this pull request Feb 14, 2025
benjben added a commit that referenced this pull request Feb 17, 2025
benjben pushed a commit that referenced this pull request Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379 #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader
benjben pushed a commit that referenced this pull request Feb 17, 2025