Automatically kill stuck PR tests and report back #2440

Open
iarspider wants to merge 13 commits into master

iarspider
Contributor

Split from #2418. For now, the job list and the criteria for killing builds are hardcoded. We can discuss how configurable we want this job to be: for example, a dict mapping a job name and filters on its parameters to a timeout.
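One possible shape for such a configuration, as a minimal sketch: a dict mapping a job name to an ordered list of rules, each pairing a parameter filter with a timeout. The job names, parameter names, and timeout values below are hypothetical and are not taken from this PR.

```python
from typing import Optional

# Hypothetical configuration: job name -> ordered list of (parameter filter, timeout in hours).
# Rules are checked in order; an empty filter matches any build of that job.
KILL_RULES = {
    "example-pr-test-job": [
        ({"TEST_FLAVOR": "gpu"}, 12),
        ({}, 8),
    ],
    "example-relval-job": [
        ({}, 24),
    ],
}


def timeout_for(job_name: str, params: dict) -> Optional[int]:
    """Return the timeout (hours) of the first matching rule, or None if the job is not managed."""
    for param_filter, timeout_hours in KILL_RULES.get(job_name, []):
        if all(params.get(key) == value for key, value in param_filter.items()):
            return timeout_hours
    return None


def should_kill(job_name: str, params: dict, running_hours: float) -> bool:
    """A build is a kill candidate only if some rule matches and its timeout is exceeded."""
    timeout_hours = timeout_for(job_name, params)
    return timeout_hours is not None and running_hours > timeout_hours
```

Jobs that are absent from the dict are never killed, which keeps the default behaviour conservative.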

@cmsbuild
Contributor

A new Pull Request was created by @iarspider for branch master.

@cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Feb 12, 2025

cms-bot internal usage


if upload_unique_id:
    with urllib.request.urlopen(
        "http://localhost/SDT/jenkins-artifacts/pull-request-integration/{0}/prs_commits.txt".format(

@iarspider, why localhost? This job runs on the Jenkins controller, while prs_commits.txt is available on the cmssdt server.
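A minimal sketch of one way to avoid hardcoding localhost: build the URL from a configurable base. The default host below is an assumption (it reuses the path from the snippet above on the cmssdt host) and is not confirmed anywhere in this PR; `prs_commits_url` is a hypothetical helper name.

```python
# Assumption: the artifacts are served by the cmssdt host under the same path
# as in the quoted snippet. The host name is a guess, not confirmed in this PR.
ARTIFACTS_BASE_URL = "https://cmssdt.cern.ch/SDT/jenkins-artifacts"


def prs_commits_url(upload_unique_id, base_url=ARTIFACTS_BASE_URL):
    """Build the prs_commits.txt URL for one upload id (hypothetical helper)."""
    return "{0}/pull-request-integration/{1}/prs_commits.txt".format(
        base_url.rstrip("/"), upload_unique_id
    )
```

Passing `base_url` explicitly lets the same code run on the controller or on the server without edits.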

@iarspider
Contributor Author

iarspider commented Feb 13, 2025

  • Proposed config structure:

  • Add a separate job that runs on cmssdt to figure out commit_id based on upload_unique_id (and trigger commit status change from it) - https://cmssdt.cern.ch/jenkins/job/kill-stuck-pr-test/

  • Post a PR comment that a test job was aborted due to all nodes being offline

  • For jobs triggered by ib-run-pr-tests, pass a parameter that would mark them as valid targets for cleanup

  • Pass full commit status name to jobs, instead of just prefix

The main job, jenkins-elasticsearch-monitor, triggers kill-stuck-pr-test, which in turn triggers the kill, the commit status update, and the PR comment. TODO: how do we avoid multiple comments when two or more jobs for a single PR are stuck?
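One possible answer to the TODO, sketched under the assumption that all stuck jobs found in a single monitor run are available as (PR id, job name) pairs: group them by PR and build one comment per PR. The function names, the PR id format, and the comment wording are hypothetical.

```python
from collections import defaultdict


def group_stuck_jobs(stuck_jobs):
    """Group (pr_id, job_name) pairs by PR, dropping duplicate job names."""
    by_pr = defaultdict(list)
    for pr_id, job_name in stuck_jobs:
        if job_name not in by_pr[pr_id]:
            by_pr[pr_id].append(job_name)
    return dict(by_pr)


def build_comments(stuck_jobs):
    """Return one comment body per PR, listing every aborted job for that PR."""
    comments = {}
    for pr_id, job_names in group_stuck_jobs(stuck_jobs).items():
        lines = ["The following test jobs were aborted because all nodes were offline:"]
        lines += ["- {0}".format(name) for name in job_names]
        comments[pr_id] = "\n".join(lines)
    return comments
```

This only deduplicates within one run; jobs for the same PR that get stuck in different runs would still produce separate comments.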

@cmsbuild
Contributor

Pull request #2440 was updated.

@iarspider changed the title from "[WIP] Split auto-killing of stuck tests from rocm-tests PR" to "[WIP] Automatically kill stuck PR tests and report back" on Feb 13, 2025

@iarspider
Contributor Author

The main job runs every 30 minutes. We can add one more check: write the identifier of a stuck job to a temporary file, and if the job is still there after 30 minutes, skip reconnecting the node and just kill the job.
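A minimal sketch of that check, assuming the state is kept as a JSON list of job identifiers in a file that survives between runs. The file format, the identifier strings, and the function names are hypothetical.

```python
import json
import os


def load_seen(state_path):
    """Return the set of job ids recorded by the previous run (empty if no file yet)."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path) as handle:
        return set(json.load(handle))


def decide_actions(stuck_now, state_path):
    """Map each stuck job id to 'reconnect' (first sighting) or 'kill' (already seen last run)."""
    seen_before = load_seen(state_path)
    actions = {
        job_id: ("kill" if job_id in seen_before else "reconnect")
        for job_id in stuck_now
    }
    # Remember only jobs that got a reconnect attempt in this run; jobs marked
    # for killing, and jobs that are no longer stuck, are dropped from the state.
    remaining = sorted(job_id for job_id, action in actions.items() if action == "reconnect")
    with open(state_path, "w") as handle:
        json.dump(remaining, handle)
    return actions
```

With a 30-minute schedule, a job is therefore killed on the second consecutive run in which it is found stuck.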

@iarspider changed the title from "[WIP] Automatically kill stuck PR tests and report back" to "Automatically kill stuck PR tests and report back" on Mar 12, 2025

3 participants