
feat: filter GitHub workflows via query parameter for better queue count accuracy #6519

Merged
5 commits merged into kedacore:main on Feb 12, 2025

Conversation

@silviu-dinu (Contributor) commented Feb 1, 2025

The current GitHub runner scaler implementation tries to determine the workflow queue length by fetching the latest 30 workflow runs via an API request to GitHub and then filtering the results by specific statuses (queued, in_progress) on the client side. This can be inaccurate when there are queued workflows older than the latest 30 items (the GitHub API's default page size), and it results in queued jobs not being picked up. The issue usually manifests when re-running older jobs.

The proposed solution is to filter workflows on the server side by passing the status=queued/in_progress query parameter to the /actions/runs API call. Additionally, per_page=100 (the maximum value) is set when calling the /actions/runs/{run_id}/jobs API, instead of the default limit of 30.
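For illustration, here is a minimal Go sketch of the kind of requests described above. It is not the scaler's actual code; the runsURL/jobsURL helpers and the example owner/repo values are hypothetical.

```go
// Illustrative sketch only: builds GitHub REST API URLs with the server-side
// status filter and the larger page size described in this PR.
package main

import (
	"fmt"
	"net/url"
)

// runsURL lists workflow runs filtered server side by status ("queued" or
// "in_progress") with the maximum page size of 100 instead of the default 30.
func runsURL(owner, repo, status string) string {
	q := url.Values{}
	q.Set("status", status)
	q.Set("per_page", "100")
	return fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runs?%s", owner, repo, q.Encode())
}

// jobsURL lists the jobs of a single run, again requesting up to 100 items.
func jobsURL(owner, repo string, runID int64) string {
	q := url.Values{}
	q.Set("per_page", "100")
	return fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runs/%d/jobs?%s", owner, repo, runID, q.Encode())
}

func main() {
	fmt.Println(runsURL("example-org", "example-repo", "queued"))
	fmt.Println(jobsURL("example-org", "example-repo", 12345))
}
```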

Checklist

  • When introducing a new scaler, I agree with the scaling governance policy
  • I have verified that my change is in accordance with the deprecations & breaking changes policy
  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • (N/A) A PR is opened to update our Helm chart (repo) (if applicable, i.e., when deployment manifests are modified)
  • (N/A) A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6519

@vogonistic commented

I've run into the same issue where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?

@silviu-dinu (Contributor, Author) commented

> I've run into the same issue where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?

@vogonistic Actually, there is a status filter on the /actions/runs API endpoint. This is the call I'm trying to change in this PR, which should solve the issue.

The API you're referring to is /actions/runs/{run_id}/jobs. That one doesn't have a filter, but we don't need it anyway since it's unlikely that a single run has more than 30 jobs.

I think adding this query parameter should work, but I'll need some time to understand and fix the failing tests.

@vogonistic commented

> > I've run into the same issue where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?
>
> @vogonistic Actually, there is a status filter on the /actions/runs API endpoint. This is the call I'm trying to change in this PR, which should solve the issue.
>
> The API you're referring to is /actions/runs/{run_id}/jobs. That one doesn't have a filter, but we don't need it anyway since it's unlikely that a single run has more than 30 jobs.
>
> I think adding this query parameter should work, but I'll need some time to understand and fix the failing tests.

@silviu-dinu Nice! I missed that you switched APIs. While this solved my problem, it does add a known limitation that others might have an issue with, and it should probably either be documented or solved with pagination as well.

@silviu-dinu (Contributor, Author) commented

> @silviu-dinu Nice! I missed that you switched APIs. While this solved my problem, it does add a known limitation that others might have an issue with, and it should probably either be documented or solved with pagination as well.

@vogonistic Not sure I understand. Which known limitation is the change in this PR adding?

@vogonistic commented

> @vogonistic Not sure I understand. Which known limitation is the change in this PR adding?

I’m talking about this part:

> but we don't need it anyway since it's unlikely that a single run has more than 30 jobs.

I’m guessing someone thought the same in the initial implementation, but I’ve spent several workdays trying to figure out where the problem stems from. So if it’ll never return more than 30 queued + 30 in_progress, it’s worth documenting in my opinion. Am I misunderstanding how it’ll work?

@silviu-dinu (Contributor, Author) commented Feb 6, 2025

> I’m guessing someone thought the same in the initial implementation, but I’ve spent several workdays trying to figure out where the problem stems from. So if it’ll never return more than 30 queued + 30 in_progress, it’s worth documenting in my opinion. Am I misunderstanding how it’ll work?

@vogonistic You're right, it will only scale up to 30 + 30 runners maximum, but that is per call cycle. KEDA keeps calling GitHub (e.g., every minute) and should spin up more runners if it still finds pending jobs. So, I don't see an issue with this, except maybe a slight delay when there are lots of jobs pending, depending on the configured loop interval.

> I've run into the same issue where I have a workflow that has >30 jobs

I also increased the job page size query parameter to the maximum value (100) with this commit for now, since it's much easier to implement than pagination.

@silviu-dinu silviu-dinu marked this pull request as ready for review February 6, 2025 18:43
@silviu-dinu silviu-dinu requested a review from a team as a code owner February 6, 2025 18:43
@JorTurFer (Member) commented Feb 8, 2025

> So, I don't see an issue with this, except maybe a slight delay when there are lots of jobs pending, depending on the configured loop interval.

If someone is using KEDA at scale, this could be a significant limitation :( As the API appears to be paginated, what about browsing the pages?

@silviu-dinu (Contributor, Author) commented

> If someone is using KEDA at scale, this could be a significant limitation :( As the API appears to be paginated, what about browsing the pages?

@JorTurFer Thanks for your input! I understand the concern, but the main intent behind this PR is to improve the count accuracy and fix the scenario where jobs are not picked up at all because they fall outside the last 30 workflows. We currently have an integration between GitHub and Microsoft Azure Container Apps which uses KEDA under the hood, and we frequently see jobs getting stuck due to this issue. So maybe this fix can make it into Microsoft's product eventually.

The proposed improvement, which should fix jobs getting stuck indefinitely, consists of the following:

  • Ensure only relevant workflows are fetched by filtering the queued & in_progress statuses server side via the status query parameter (instead of fetching all runs and filtering them client side)
  • Fetch the maximum allowed number of workflow runs by setting the per_page query parameter to 100
  • For each workflow run, fetch the maximum allowed number of jobs by setting the per_page query parameter to 100

Scaling throughput can currently be tuned by setting the pollingInterval when configuring this particular scaler. In theory, setting this interval to 1 second would allow spinning up to 600,000 runners per minute (100 runs x 100 jobs x 60 polls) after the above changes.
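As a rough sanity check of that ceiling, here is a small sketch using the numbers above; the 1-second pollingInterval is the hypothetical setting being discussed, not a recommendation, and none of the constants come from the scaler code.

```go
// Back-of-the-envelope check of the per-minute ceiling mentioned above.
// All constants mirror numbers from the discussion, not the scaler code.
package main

import "fmt"

func main() {
	const (
		runsPerPoll        = 100 // per_page on /actions/runs
		jobsPerRun         = 100 // per_page on /actions/runs/{run_id}/jobs
		pollingIntervalSec = 1   // hypothetical pollingInterval
	)
	pollsPerMinute := 60 / pollingIntervalSec
	fmt.Println(runsPerPoll * jobsPerRun * pollsPerMinute) // 600000
}
```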

Pagination may improve this speed slightly, but I see other challenges with it. For example:

  • Race conditions when new jobs are added or complete between page API calls
  • GitHub API rate limits (5,000 requests/hour with a PAT, or 15,000/hour with a GitHub App), especially when considering multiple repositories (a rough estimate follows below)

At the moment, I don't really see a viable way to overcome this limit for this particular scaler, until GitHub introduces a new API or enhances the existing one to fetch pending jobs filtered by labels in a single call.
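For a rough sense of how quickly repeated polling eats into those rate limits, here is an illustrative estimate; the 30-second interval, the 20 pending runs, and the one-jobs-call-per-run figure are all assumptions, not measurements from the scaler or from GitHub.

```go
// Back-of-the-envelope rate-limit estimate; every input below is an assumption
// chosen for illustration.
package main

import "fmt"

func main() {
	const (
		pollingIntervalSec = 30   // assumed KEDA polling interval
		statusCalls        = 2    // one /actions/runs call each for queued and in_progress
		pendingRuns        = 20   // assumed runs needing a /actions/runs/{run_id}/jobs call
		patLimitPerHour    = 5000 // PAT rate limit mentioned above
	)
	pollsPerHour := 3600 / pollingIntervalSec
	callsPerHour := pollsPerHour * (statusCalls + pendingRuns)
	fmt.Printf("~%d calls/hour for one repository (PAT limit: %d/hour)\n", callsPerHour, patLimitPerHour)
	// With these assumptions: 120 * 22 = 2640 calls/hour, so a second repository
	// polled the same way would already exceed the 5,000/hour PAT limit.
}
```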

@zroubalik (Member) commented

@silviu-dinu thanks for the summary, it does make sense to me and I think that at the moment it is a great fix.

What we can do is document this behavior. @silviu-dinu, could you please open a docs PR to improve the documentation of the scaler and discuss this specific issue (basically what you proposed with the pollingInterval etc.)?

@JorTurFer WDYT?

@JorTurFer (Member) commented

Yeah, thanks for the explanation! Let's just document the current situation and go ahead.

@silviu-dinu (Contributor, Author) commented

@zroubalik Thanks for reviewing!

> @silviu-dinu, could you please open a docs PR to improve the documentation of the scaler and discuss this specific issue (basically what you proposed with the pollingInterval etc.)?

Sure, no problem. I raised this PR for updating the docs as suggested: kedacore/keda-docs#1535

@JorTurFer (Member) commented Feb 12, 2025

/run-e2e github
Update: You can check the progress here

@zroubalik merged commit 0060862 into kedacore:main on Feb 12, 2025
20 checks passed