
Adjust key buffer sizes in CI based on parallelism #2275

Closed
wants to merge 1 commit into from

Conversation

tnozicka
Contributor

Description of your changes:
This PR adjusts the key buffer sizes based on the parallelism, so the cache has a real chance to be used effectively. This should help with many flakes while keeping the CPU allocation the same. We can adjust the CPU allocation separately - this just makes better use of what we have.

Which issue is resolved by this Pull Request:
Resolves #2274
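
As a rough illustration of the intended effect (the base sizes and variable names come from the change reviewed below; the parallelism value of 5 is an arbitrary example, not something set by this PR), the buffers scale linearly with the number of parallel e2e workers:

SO_E2E_PARALLELISM=5                                       # example value, for illustration only
CRYPTO_KEY_BUFFER_SIZE_MIN=$(( 6 * SO_E2E_PARALLELISM ))   # base 6  -> 30
CRYPTO_KEY_BUFFER_SIZE_MAX=$(( 10 * SO_E2E_PARALLELISM ))  # base 10 -> 50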

@tnozicka tnozicka added kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Dec 17, 2024
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 17, 2024
@tnozicka tnozicka added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 17, 2024
@tnozicka
Contributor Author

#2267 landed
/retest

@tnozicka
Contributor Author

/test images

@tnozicka
Contributor Author

/retest

Member

@rzetelskik left a comment


To make this effective, shouldn't we give the operator some warmup time? Otherwise it seems like it may actually make things worse?
So far we only saw the test with a short timeout fail in periodics, which actually have bigger buffers (10/30), so empirically it looks like the smaller buffers could have actually helped by giving the controllers more CPU time not spent on generating certs?

Comment on lines +39 to +46
CRYPTO_KEY_BUFFER_SIZE_MIN=6
export CRYPTO_KEY_BUFFER_SIZE_MIN
CRYPTO_KEY_BUFFER_SIZE_MAX=10
export CRYPTO_KEY_BUFFER_SIZE_MAX
if [[ -n "${SO_E2E_PARALLELISM-}" ]]; then
CRYPTO_KEY_BUFFER_SIZE_MIN=$(( "${CRYPTO_KEY_BUFFER_SIZE_MIN}" * "${SO_E2E_PARALLELISM}" ))
CRYPTO_KEY_BUFFER_SIZE_MAX=$(( "${CRYPTO_KEY_BUFFER_SIZE_MAX}" * "${SO_E2E_PARALLELISM}" ))
fi
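
In other words, the base sizes of 6/10 are kept as-is when SO_E2E_PARALLELISM is unset and are multiplied by the number of parallel e2e workers when it is set; the values are exported so they are visible to whatever the script runs afterwards (presumably the operator deployment), although the exact consumer is not shown in this snippet.
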
Member

this also needs to be done in the ci-deploy-release script, especially since we only saw the tests fail in periodics

Contributor Author

since we only saw the tests fail in periodics

this fails on presubmits as well, see the linked issue

Member

ok, haven't seen that before (still needs to be wired into the other script)
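
A hypothetical sketch of what that wiring might look like in the ci-deploy-release script, mirroring the snippet above; the 10/30 base values are the periodic buffer sizes mentioned earlier in this thread, and none of this is part of the PR as submitted:

# hypothetical sketch, not part of this PR: the same guard mirrored in the ci-deploy-release script
CRYPTO_KEY_BUFFER_SIZE_MIN=10
export CRYPTO_KEY_BUFFER_SIZE_MIN
CRYPTO_KEY_BUFFER_SIZE_MAX=30
export CRYPTO_KEY_BUFFER_SIZE_MAX
if [[ -n "${SO_E2E_PARALLELISM-}" ]]; then
  CRYPTO_KEY_BUFFER_SIZE_MIN=$(( CRYPTO_KEY_BUFFER_SIZE_MIN * SO_E2E_PARALLELISM ))
  CRYPTO_KEY_BUFFER_SIZE_MAX=$(( CRYPTO_KEY_BUFFER_SIZE_MAX * SO_E2E_PARALLELISM ))
fi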

@tnozicka
Contributor Author

To make this effective, shouldn't we give the operator some warmup time?

This is orthogonal; even without it there is still a period where we wait for the cluster rollout, generate 3 certs, and then wait later because we have too few of them.

Otherwise it seems like it may actually make things worse?

not for the certs to my knowledge, it just makes the load constant, but I don't think there are other compute-intensive tasks to compete with

So far we only saw the test with a short timeout fail in periodics which actually have bigger buffers (10/30)

nope, see the "resolves" issue, which is a presubmit failing with buffer size 3: #2274 (comment)
(and there are a lot of filed flakes that have not been identified yet, and this may affect all of them)

Contributor

@tnozicka: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-gke-parallel
Commit: 44c3b00
Details: link
Required: true
Rerun command: /test e2e-gke-parallel

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tnozicka
Contributor Author

but this might make more of a mess without the limits / guaranteed QoS, let's wait for that

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[ARCHIVED] CI is running parallel tests with undersized key buffers
2 participants