Discussion: CI resources #780
CI infra considerations apart, we should review our CI and try to cut jobs that would catch errors with almost no probability of being triggered.
For what it's worth, I have put together a quick hack to act as a basis for discussion: https://gitlab.haskell.org/bgamari/haskell-language-server/-/merge_requests/1. At the moment it is running about 8 concurrent jobs, with each taking around 10 minutes (although this should be shorter once the cabal caches are warmed). GHC CI load fluctuates pretty significantly, but we have some capacity set aside for "short" jobs (e.g. not GHC builds, which generally take ~90 minutes). One area where we are currently lacking is Darwin, although this will improve soon as we bring up a new set of donated runners. Note that GitLab can be used in a CI-only mode for a GitHub repository (although I'll admit I've not used it in this way yet).
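For readers unfamiliar with the setup being proposed, here is a minimal sketch of what a GitLab-hosted CI job for a GitHub-mirrored repo could look like. The image, cache layout, and build targets are assumptions for illustration, not the configuration from the linked merge request:

```yaml
# Illustrative .gitlab-ci.yml sketch; image, paths, and targets are assumptions.
stages:
  - test

test-linux:
  stage: test
  image: haskell:8.10
  # Cache the cabal store between pipelines to keep jobs in the ~10 minute range.
  cache:
    key: "$CI_JOB_NAME"
    paths:
      - .cabal-store
  script:
    - cabal update
    - cabal build all --store-dir=.cabal-store
    - cabal test all --store-dir=.cabal-store
```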
Another thing that really ought to be considered regardless of where CI happens: currently it appears that benchmarks are the largest contributor to CI build time. However, these really don't need to be run on every platform. The sorts of performance regressions that
It is too late now, but to take into account for future decisions: we lose a whole queue of free jobs by merging ghcide into hls, including its functional test suite and benchmarks.
Not sure about Linux and macOS, but performance on Windows could quite likely deviate, even in hls, taking into account that we are quite close to GHC, no?
Never mind, Windows is the one that could deviate the most, but it is the only one without benchmarks 🤔
A major problem with GitHub Actions is that it has no native support for cancelling redundant builds: https://github.heygears.community/t/github-actions-cancel-redundant-builds-not-solved/16025. Right now, 95% of the jobs in the Actions queue are redundant, since they refer to commits that are below the tip of their branch, or because the branch is out of date with master. There are a bunch of action providers that claim to solve this problem, so I think we need to pick and adopt one ASAP.
Summary of actions I have taken during the last couple of days:
The attempts to enable auto-cancel of redundant jobs in GitHub Actions have been unsuccessful. Since it is not a core feature, it must be scripted via an action provider. But those scripts need a GITHUB_TOKEN with write access in order to be able to cancel past jobs, and the issue is that PRs from forked repos do not get such write access, whereas PRs from local branches do. This can be changed at the org level (/cc @jaspervdj) but it raises security concerns. It's likely that our GitHub CI will continue to be overwhelmed until we manage to find a solution for redundant builds.
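For reference, GitHub Actions has since gained a first-class `concurrency` setting that cancels superseded runs for the same branch without needing a write-scoped token. A minimal sketch of what that looks like in a workflow file (the workflow name and trigger are illustrative):

```yaml
# Illustrative sketch: cancel superseded runs of the same workflow on the same ref.
name: test
on: [pull_request, push]

# Runs that share a concurrency group are deduplicated: a new push to the same
# branch cancels the still-running job for the previous commit.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: echo "build and test here"
```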
Things left to do, as I have run out of time to work on this myself:
I am investigating how to fix the Windows cache, test only the last minor versions, and restore macOS builds (hopefully the latter will give us more time than the former will cost): master...jneira:test-last-versions
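As a point of reference, this is roughly what caching the cabal store looks like with actions/cache. The paths and key below are an illustrative sketch and would need adjusting, in particular for the Windows cabal store location, which differs from Linux/macOS:

```yaml
# Illustrative sketch of caching the cabal store per OS and GHC version.
# The exact paths (especially on Windows) are assumptions and may need tweaking.
- uses: actions/cache@v3
  with:
    path: |
      ~/.cabal/packages
      ~/.cabal/store
      dist-newstyle
    # Invalidate the cache when the build plan changes.
    key: ${{ runner.os }}-ghc-${{ matrix.ghc }}-${{ hashFiles('cabal.project', '**/*.cabal') }}
    restore-keys: |
      ${{ runner.os }}-ghc-${{ matrix.ghc }}-
```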
I like to joke that the most effective way to improve performance in CI is to either do less work or do less work. What I mean by that is you can either run less or run it less frequently. Doing both will be good for the long term. I looked through all of the workflows and I have a few thoughts on how to accomplish both, listed in order of bang-for-buck in time:
More extreme savings are possible by switching the build system to something like Bazel and investing in dedicated CI infrastructure that can git checkout onto bare metal with no state cleared between runs. Git fetches get way faster, caches don't have to be shoved into/out of the cloud, and the machines are much faster. I don't really recommend this approach because it makes the tooling way less approachable for contributors, and you're going to end up having to do all of the above anyway, so you might as well get the mileage out of the low-hanging fruit first before considering something like Bazel.
Thanks Jared for your analysis; I agree with the first point. I would say:
On your second point, I'm not sure testing everything is the problem. The test workflow is fail-fast, so the whole matrix is cancelled as soon as anything fails. The test suites run in 20 minutes combined, which is not too bad. The real problem, imho, is redundant builds; those can easily overwhelm the CI pipeline. But all that said, the Mergify bot seems to be working wonders already.
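For context, the fail-fast behaviour mentioned here is a property of the job matrix. A minimal sketch, where the OS and GHC values are illustrative rather than the actual matrix:

```yaml
# Illustrative sketch: with fail-fast enabled, the first failing matrix entry
# cancels all other in-progress entries of the same matrix.
jobs:
  test:
    strategy:
      fail-fast: true   # the default; shown here for clarity
      matrix:
        os: [ubuntu-latest, macOS-latest, windows-latest]
        ghc: ['8.8.4', '8.10.4']
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v3
      - run: cabal test all
```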
Redundant builds are definitely a huge issue. We run into those quite a bit and I wish I had a better way to reliably control those. So getting those down is definitely a large win for usability, and Mergify is an excellent tool to help with that as well. Even if the test suites run in 20 minutes combined, if that can be reduced from 20 minutes (on average) across, say, 10-20 jobs to sometimes as low as 5-10 minutes across 2-5 jobs for smaller PRs, that has a surprisingly large cascading effect on the throughput of the overall CI queue. At the very least, it's worth thinking about for environmental concerns if the developer UX ones aren't compelling enough on their own.
@jared-w many thanks for your insights. So the other big point to be considered would be:
We had ghcide and hls separated in the past, and that, in addition to giving us double the CI resources, made it easier to test both separately (with hls depending on ghcide).
One pattern that's been proven to work in other projects is to run a … I'd lightly suggest …
Things have been alleviated:
However, a tool like the one described by @jared-w would be great anyway, to allow a more fine-grained selection of tests to run (changes in one plugin should not trigger tests for the rest).
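One way to get part of that, sketched here as a suggestion rather than the actual setup, is GitHub Actions' native path filters, which only trigger a workflow when the named paths change. The plugin paths and cabal target below are illustrative assumptions:

```yaml
# Illustrative sketch: only run this plugin's tests when its sources
# (or the core libraries it depends on) change. Paths are assumptions.
name: test-example-plugin
on:
  pull_request:
    paths:
      - 'plugins/example-plugin/**'
      - 'ghcide/**'
      - 'hls-plugin-api/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: cabal test example-plugin
```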
I think we can close this issue; CI performance is pretty reasonable nowadays with all the changes made since then in the GitHub workflows and CircleCI.
At the moment CI is becoming a major bottleneck for landing PRs.
Given that HLS is the first point of contact for many people with Haskell, I believe it is critical that we test as many different configuration options as possible, preferably on a continuous basis.
This goes along with a second important point: it must be easy to make a release, ideally just polishing a changelog and tagging, once a month.
We are hitting the limits of the free resources available to OSS projects.
The Haskell Foundation is now a thing, and there is funding for the GHC infrastructure. An initial discussion on IRC with @bgamari indicates that we may be able to host the HLS CI on the GHC GitLab infrastructure (keeping development in GitHub, as now).
Is this something we should explore?