
feat: filter GitHub workflows via query parameter for better queue count accuracy #6519

Merged
5 commits merged into kedacore:main on Feb 12, 2025

Conversation

@silviu-dinu (Contributor) commented Feb 1, 2025

The current GitHub runner scaler implementation tries to determine the workflow queue length by fetching the latest 30 workflow runs via an API request to GitHub and then filtering the results by specific statuses (queued, in_progress) on the client side. This can be inaccurate when there are queued workflows older than the latest 30 items (the GitHub API's default page size), and it results in queued jobs not being picked up. The issue usually manifests when re-running older jobs.

The proposed solution is to filter workflows on the server side by passing the status=queued/in_progress query parameter to the /actions/runs API call. Additionally, per_page=100 (the maximum value) is set when calling the /actions/runs/{run_id}/jobs API, instead of the default limit of 30.
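For illustration, here is a minimal Go sketch of the kind of requests described above. It is not the scaler's actual code; the runsURL/jobsURL helpers and the example owner/repo values are hypothetical.

```go
// Illustrative sketch only: builds GitHub REST API URLs with the server-side
// status filter and the larger page size described in this PR.
package main

import (
	"fmt"
	"net/url"
)

// runsURL lists workflow runs filtered server side by status ("queued" or
// "in_progress") with the maximum page size of 100 instead of the default 30.
func runsURL(owner, repo, status string) string {
	q := url.Values{}
	q.Set("status", status)
	q.Set("per_page", "100")
	return fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runs?%s", owner, repo, q.Encode())
}

// jobsURL lists the jobs of a single run, again requesting up to 100 items.
func jobsURL(owner, repo string, runID int64) string {
	q := url.Values{}
	q.Set("per_page", "100")
	return fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runs/%d/jobs?%s", owner, repo, runID, q.Encode())
}

func main() {
	fmt.Println(runsURL("example-org", "example-repo", "queued"))
	fmt.Println(jobsURL("example-org", "example-repo", 12345))
}
```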

Checklist

  • When introducing a new scaler, I agree with the scaling governance policy
  • I have verified that my change is in accordance with the deprecations & breaking changes policy
  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • (N/A) A PR is opened to update our Helm chart (repo) (if applicable, i.e., when deployment manifests are modified)
  • (N/A) A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6519

@vogonistic commented

I've run into the same issue where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?

@silviu-dinu (Contributor, Author) commented

> I've run into the same issue where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?

@vogonistic Actually, there is a status filter on the /actions/runs API endpoint. This is the call I'm trying to change in this PR, which should solve the issue.

The API you're referring to is /actions/runs/{run_id}/jobs. That one doesn't have a filter, but we don't need it anyway since it's unlikely that a single run has more than 30 jobs.

I think adding this query parameter should work, but I'll need some time to understand and fix the failing tests.

@vogonistic commented

> > I've run into the same issue where I have a workflow that has >30 jobs, but looking at the GitHub REST API, I don't see any filters for status. Maybe a better fix would be to support pagination and just fetch enough pages until total_count jobs have been seen?
>
> @vogonistic Actually, there is a status filter on the /actions/runs API endpoint. This is the call I'm trying to change in this PR, which should solve the issue.
>
> The API you're referring to is /actions/runs/{run_id}/jobs. That one doesn't have a filter, but we don't need it anyway since it's unlikely that a single run has more than 30 jobs.
>
> I think adding this query parameter should work, but I'll need some time to understand and fix the failing tests.

@silviu-dinu Nice! I missed that you switched APIs. While this solved my problem, it does add a known limitation that others might have an issue with, and it should probably either be documented or solved with pagination as well.

@silviu-dinu (Contributor, Author) commented

> @silviu-dinu Nice! I missed that you switched APIs. While this solved my problem, it does add a known limitation that others might have an issue with, and it should probably either be documented or solved with pagination as well.

@vogonistic Not sure I understand. Which known limitation is the change in this PR adding?

@vogonistic commented

> @vogonistic Not sure I understand. Which known limitation is the change in this PR adding?

I’m talking about this part:

> but we don't need it anyway since it's unlikely that a single run has more than 30 jobs.

I’m guessing someone thought the same in the initial implementation, but I’ve spent several workdays trying to figure out where the problem stems from. So if it’ll never return more than 30 queued + 30 in_progress, it’s worth documenting in my opinion. Am I misunderstanding how it’ll work?

@silviu-dinu (Contributor, Author) commented Feb 6, 2025

> I’m guessing someone thought the same in the initial implementation, but I’ve spent several workdays trying to figure out where the problem stems from. So if it’ll never return more than 30 queued + 30 in_progress, it’s worth documenting in my opinion. Am I misunderstanding how it’ll work?

@vogonistic You're right, it will only scale up to 30 + 30 runners maximum, but that is per call cycle. KEDA keeps calling GitHub (e.g., every minute) and should spin up more runners if it still finds pending jobs. So, I don't see an issue with this, except maybe a slight delay when there are lots of jobs pending, depending on the configured loop interval.

> I've run into the same issue where I have a workflow that has >30 jobs

I also increased the job page size query parameter to the maximum value (100) with this commit for now, since it's much easier to implement than pagination.

@silviu-dinu silviu-dinu marked this pull request as ready for review February 6, 2025 18:43
@silviu-dinu silviu-dinu requested a review from a team as a code owner February 6, 2025 18:43
@JorTurFer (Member) commented Feb 8, 2025

> So, I don't see an issue with this, except maybe a slight delay when there are lots of jobs pending, depending on the configured loop interval.

If someone is using KEDA at scale, this could be a significant limitation :( As the API appears to be paginated, what about browsing the pages?

@silviu-dinu (Contributor, Author) commented

> If someone is using KEDA at scale, this could be a significant limitation :( As the API appears to be paginated, what about browsing the pages?

@JorTurFer Thanks for your input! I understand the concern, but the main intent behind this PR is to improve the count accuracy and fix the scenario where jobs are not picked up at all because they fall outside the last 30 workflows. We currently have an integration between GitHub and Microsoft Azure Container Apps which uses KEDA under the hood, and we frequently see jobs getting stuck due to this issue. So maybe this fix can make it into Microsoft's product eventually.

The proposed improvement, which should fix jobs getting stuck indefinitely, consists of the following:

  • Ensure only relevant workflows are fetched by filtering the queued & in_progress statuses server side via the status query parameter (instead of fetching all runs and filtering them client side)
  • Fetch the maximum allowed number of workflow runs by setting the per_page query parameter to 100
  • For each workflow run, fetch the maximum allowed number of jobs by setting the per_page query parameter to 100

Scaling throughput can currently be tuned by setting the pollingInterval when configuring this particular scaler. In theory, setting this interval to 1 second would allow spinning up to 600,000 runners per minute (100 runs x 100 jobs x 60 polls) after the above changes.
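As a rough sanity check of that ceiling, here is a small sketch using the numbers above; the 1-second pollingInterval is the hypothetical setting being discussed, not a recommendation, and none of the constants come from the scaler code.

```go
// Back-of-the-envelope check of the per-minute ceiling mentioned above.
// All constants mirror numbers from the discussion, not the scaler code.
package main

import "fmt"

func main() {
	const (
		runsPerPoll        = 100 // per_page on /actions/runs
		jobsPerRun         = 100 // per_page on /actions/runs/{run_id}/jobs
		pollingIntervalSec = 1   // hypothetical pollingInterval
	)
	pollsPerMinute := 60 / pollingIntervalSec
	fmt.Println(runsPerPoll * jobsPerRun * pollsPerMinute) // 600000
}
```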

Pagination may improve this speed slightly, but I see other challenges with it. For example:

  • Race conditions when new jobs are added or complete between page API calls
  • GitHub API rate limits (5,000 requests/hour with a PAT, or 15,000/hour with a GitHub App), especially when considering multiple repositories (a rough estimate follows below)

At the moment, I don't really see a viable way to overcome this limit for this particular scaler, until GitHub introduces a new API or enhances the existing one to fetch pending jobs filtered by labels in a single call.
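For a rough sense of how quickly repeated polling eats into those rate limits, here is an illustrative estimate; the 30-second interval, the 20 pending runs, and the one-jobs-call-per-run figure are all assumptions, not measurements from the scaler or from GitHub.

```go
// Back-of-the-envelope rate-limit estimate; every input below is an assumption
// chosen for illustration.
package main

import "fmt"

func main() {
	const (
		pollingIntervalSec = 30   // assumed KEDA polling interval
		statusCalls        = 2    // one /actions/runs call each for queued and in_progress
		pendingRuns        = 20   // assumed runs needing a /actions/runs/{run_id}/jobs call
		patLimitPerHour    = 5000 // PAT rate limit mentioned above
	)
	pollsPerHour := 3600 / pollingIntervalSec
	callsPerHour := pollsPerHour * (statusCalls + pendingRuns)
	fmt.Printf("~%d calls/hour for one repository (PAT limit: %d/hour)\n", callsPerHour, patLimitPerHour)
	// With these assumptions: 120 * 22 = 2640 calls/hour, so a second repository
	// polled the same way would already exceed the 5,000/hour PAT limit.
}
```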

@zroubalik (Member) commented

@silviu-dinu thanks for the summary, it does make sense to me and I think that at the moment it is a great fix.

What we can do is document this behavior. @silviu-dinu, could you please open a docs PR to improve the documentation of the scaler and discuss this specific issue (basically what you proposed with the pollingInterval etc.)?

@JorTurFer WDYT?

@JorTurFer (Member) commented

Yeah, thanks for the explanation! Let's just document the current situation and go ahead.

@silviu-dinu (Contributor, Author) commented

@zroubalik Thanks for reviewing!

> @silviu-dinu, could you please open a docs PR to improve the documentation of the scaler and discuss this specific issue (basically what you proposed with the pollingInterval etc.)?

Sure, no problem. I raised this PR for updating the docs as suggested: kedacore/keda-docs#1535

@JorTurFer (Member) commented Feb 12, 2025

/run-e2e github
Update: You can check the progress here

@zroubalik merged commit 0060862 into kedacore:main on Feb 12, 2025
20 checks passed