
Adjust key buffer sizes in CI based on parallelism #2275

Closed
wants to merge 1 commit into from

Conversation

tnozicka
Contributor

Description of your changes:
This PR adjusts the key buffer sizes based on the parallelism, so the cache has a real chance to be used effectively. This should help with many flakes while keeping the CPU allocation the same. We can adjust the CPU allocation separately - this just makes better use of what we have.

Which issue is resolved by this Pull Request:
Resolves #2274
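
As a rough illustration of the intended effect (the base sizes and variable names come from the change reviewed below; the parallelism value of 5 is an arbitrary example, not something set by this PR), the buffers scale linearly with the number of parallel e2e workers:

SO_E2E_PARALLELISM=5                                       # example value, for illustration only
CRYPTO_KEY_BUFFER_SIZE_MIN=$(( 6 * SO_E2E_PARALLELISM ))   # base 6  -> 30
CRYPTO_KEY_BUFFER_SIZE_MAX=$(( 10 * SO_E2E_PARALLELISM ))  # base 10 -> 50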

@tnozicka tnozicka added kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Dec 17, 2024
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 17, 2024
@tnozicka tnozicka added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 17, 2024
@tnozicka
Contributor Author

#2267 landed
/retest

@tnozicka
Contributor Author

/test images

@tnozicka
Contributor Author

/retest

Member

@rzetelskik left a comment


To make this effective, shouldn't we give the operator some warmup time? Otherwise it seems like it may actually make things worse?
So far we only saw the test with a short timeout fail in periodics, which actually have bigger buffers (10/30), so empirically it looks like the smaller buffers could have actually helped by giving the controllers more CPU time not spent on generating certs?

Comment on lines +39 to +46
CRYPTO_KEY_BUFFER_SIZE_MIN=6
export CRYPTO_KEY_BUFFER_SIZE_MIN
CRYPTO_KEY_BUFFER_SIZE_MAX=10
export CRYPTO_KEY_BUFFER_SIZE_MAX
if [[ -n "${SO_E2E_PARALLELISM-}" ]]; then
CRYPTO_KEY_BUFFER_SIZE_MIN=$(( "${CRYPTO_KEY_BUFFER_SIZE_MIN}" * "${SO_E2E_PARALLELISM}" ))
CRYPTO_KEY_BUFFER_SIZE_MAX=$(( "${CRYPTO_KEY_BUFFER_SIZE_MAX}" * "${SO_E2E_PARALLELISM}" ))
fi
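
In other words, the base sizes of 6/10 are kept as-is when SO_E2E_PARALLELISM is unset and are multiplied by the number of parallel e2e workers when it is set; the values are exported so they are visible to whatever the script runs afterwards (presumably the operator deployment), although the exact consumer is not shown in this snippet.
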
Member

this also needs to be done in the ci-deploy-release script, especially since we only saw the tests fail in periodics

Contributor Author

since we only saw the tests fail in periodics

this fails on presubmits as well, see the linked issue

Member

ok, haven't seen that before (still needs to be wired into the other script)
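
A hypothetical sketch of what that wiring might look like in the ci-deploy-release script, mirroring the snippet above; the 10/30 base values are the periodic buffer sizes mentioned earlier in this thread, and none of this is part of the PR as submitted:

# hypothetical sketch, not part of this PR: the same guard mirrored in the ci-deploy-release script
CRYPTO_KEY_BUFFER_SIZE_MIN=10
export CRYPTO_KEY_BUFFER_SIZE_MIN
CRYPTO_KEY_BUFFER_SIZE_MAX=30
export CRYPTO_KEY_BUFFER_SIZE_MAX
if [[ -n "${SO_E2E_PARALLELISM-}" ]]; then
  CRYPTO_KEY_BUFFER_SIZE_MIN=$(( CRYPTO_KEY_BUFFER_SIZE_MIN * SO_E2E_PARALLELISM ))
  CRYPTO_KEY_BUFFER_SIZE_MAX=$(( CRYPTO_KEY_BUFFER_SIZE_MAX * SO_E2E_PARALLELISM ))
fi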

@tnozicka
Contributor Author

To make this effective, shouldn't we give the operator some warmup time?

This is orthogonal; even without it there is still a period where we wait for the cluster rollout, generate 3 certs, and then wait later because we have too few of them.

Otherwise it seems like it may actually make things worse?

not for the certs to my knowledge, it just makes the load constant, but I don't think there are other compute-intensive tasks to compete with

So far we only saw the test with a short timeout fail in periodics which actually have bigger buffers (10/30)

nope, see the "resolves" issue, which is a presubmit failing with buffer size 3: #2274 (comment)
(and there are a lot of filed flakes that have not been identified yet, and this may affect all of them)

Contributor

@tnozicka: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-gke-parallel
Commit: 44c3b00
Details: link
Required: true
Rerun command: /test e2e-gke-parallel

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tnozicka
Contributor Author

but this might make more of a mess without the limits / guaranteed QoS, let's wait for that

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[ARCHIVED] CI is running parallel tests with undersized key buffers
2 participants