Discussion: CI resources #780

Closed
alanz opened this issue Dec 31, 2020 · 17 comments
Labels
old_type: meta Planing and organizing other issues status: in discussion Not actionable, because discussion is still ongoing or there's no decision yet type: support User support tickets, questions, help with setup etc.

Comments

@alanz (Collaborator) commented Dec 31, 2020

At the moment CI is becoming a major bottleneck for landing PRs.

Given that HLS is the first point of contact for many people with Haskell, I believe it is critical that we test as many different configuration options as possible, preferably on a continuous basis.

This goes along with a second important point: it must be easy to make a release once a month, ideally involving nothing more than polishing a changelog and tagging.

We are hitting the limits of the free resources available to OSS projects.

The Haskell Foundation is now a thing, and there is funding for the GHC infrastructure. An initial discussion on IRC with @bgamari indicates that we may be able to host the HLS CI on the GHC gitlab infrastructure (keeping dev in GH, as now).

Is this something we should explore?

@alanz added the type: support, old_type: meta, and status: in discussion labels on Dec 31, 2020
@jneira (Member) commented Jan 1, 2021

CI infra considerations apart, we should review our CI and try to cut jobs that would only catch errors with almost no probability of being triggered,
e.g. only test the latest minor versions of GHC, like we did in ghcide.
Maybe @pepeiborra can help us see where we can drop more jobs (but not too many 😝)

@bgamari commented Jan 1, 2021

For what it's worth, I have put together a quick hack to act as a basis for discussion: https://gitlab.haskell.org/bgamari/haskell-language-server/-/merge_requests/1. At the moment it is running about 8 concurrent jobs, with each taking around 10 minutes (although this should be shorter once the cabal caches are warmed). GHC CI load fluctuates pretty significantly, but we have some capacity set aside for "short" jobs (e.g. not GHC builds, which generally take ~90 minutes). One area where we are currently lacking is Darwin, although this will improve soon as we bring up a new set of donated runners.

Note that GitLab can be used in a CI only mode for a GitHub repository (although I'll admit I've not used it in this way yet).

@bgamari commented Jan 1, 2021

Another thing that really ought to be considered regardless of where CI happens: Currently it appears that benchmarks are the largest contributor to CI build time. However, these really don't need to be run on every platform. The sorts of performance regressions that haskell-language-server will feel are almost certainly platform independent.

@jneira (Member) commented Jan 1, 2021

It is too late now, but something to take into account for future decisions: by merging ghcide into hls we lost a whole queue of free jobs, including its functional test suite and benchmarks.
Maybe we should review what we are testing in both suites and remove redundant tests.

@jneira (Member) commented Jan 1, 2021

Another thing that really ought to be considered regardless of where CI happens: Currently it appears that benchmarks are the largest contributor to CI build time. However, these really don't need to be run on every platform. The sorts of performance regressions that haskell-language-server will feel are almost certainly platform independent.

Not sure about Linux and macOS, but performance on Windows could quite likely deviate, even in hls, taking into account that we are quite close to GHC, no?

@jneira (Member) commented Jan 1, 2021

nvm, Windows is the one that could deviate the most, but it is the only one without benchmarks 🤔
Maybe we could drop the macOS benchmarks.

@pepeiborra (Collaborator) commented

A major problem with GitHub Actions is that it has no native support for cancelling redundant builds

https://github.heygears.community/t/github-actions-cancel-redundant-builds-not-solved/16025

Right now, 95% of the jobs in the action queue are redundant, since they refer to commits that are below the tip of their branch, or since the branch is out of date with master.

There are a bunch of action providers that claim to solve this problem, so I think we need to pick and adopt one ASAP

@pepeiborra (Collaborator) commented

Summary of actions I have taken during the last couple of days:

  • Halved the number of GitHub Actions workflows from 33 to 17
  • Streamlined the CircleCI jobs by removing the Cabal task and skipping testing that is already covered by GitHub Actions
  • Enabled auto-cancel of redundant jobs in CircleCI and triggering only on PRs

The attempts to enable auto-cancel of redundant jobs in GitHub Actions have been unsuccessful. Since it is not a core feature, it must be scripted in an action provider. But those scripts need a GITHUB_TOKEN with write access in order to be able to cancel past jobs, and the issue is that PRs from forked repos do not get such write access, whereas PRs from local branches do. This can be changed at the org level /cc @jaspervdj but it raises security concerns.
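For reference, the cancellation those provider scripts perform is roughly the sketch below, against the Actions REST API (list the in-progress runs for a branch, keep the newest, cancel the rest). `OWNER`, `REPO` and `BRANCH` are placeholders, and the write-token requirement is exactly the one described above:

```bash
#!/usr/bin/env bash
# Sketch only: roughly what the "cancel redundant builds" helpers have to do.
# OWNER, REPO and BRANCH are placeholders; GITHUB_TOKEN needs write access,
# which is the limitation for PRs from forks discussed above.
set -euo pipefail

api="https://api.github.com/repos/$OWNER/$REPO/actions/runs"

# In-progress runs for the branch, newest first; keep the tip, collect the rest.
stale=$(curl -s -H "Authorization: token $GITHUB_TOKEN" \
    "$api?branch=$BRANCH&status=in_progress" |
  jq -r '.workflow_runs | sort_by(.created_at) | reverse | .[1:] | .[].id')

for id in $stale; do
  curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "$api/$id/cancel"
done
```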

It's likely that our Github CI will continue to be overwhelmed until we manage to find a solution for redundant builds

@pepeiborra (Collaborator) commented

Things left to do, as I have run out of time to work on this myself:

  1. Fix the GitHub Actions cache on Windows. It looks like we run over the limit of 5GB relatively easily, so keep this in mind
  2. Find a solution for auto-canceling redundant builds or move away from GitHub Actions. Both CircleCI and Azure workflows support this as a core feature

@jneira (Member) commented Jan 4, 2021

Fix the GitHub Actions cache on Windows. It looks like we run over the limit of 5GB relatively easily, so keep this in mind

I am investigating how to fix the Windows cache, testing only the latest minor versions, and restoring the macOS builds (hopefully the latter will give us more time than the former will cost): master...jneira:test-last-versions
Alternatively, we could enable only builds for macOS.

@hazelweakly commented

I like to joke that the most effective way to improve performance in CI is to either do less work or do less work. What I mean by that is you can either run less or run it less frequently. Doing both will be good for the long term.

I looked through all of the workflows and I have a few thoughts on how to accomplish both, listed in order of bang-for-buck in time:

  1. Run less stuff. Currently everything runs on every single commit of every PR out there.

    • The first change I would suggest is to switch all of the benchmarking to manual trigger instead. People can trigger benchmarks when they want (ie when they have working code to benchmark) and it vastly reduces the amount of churn that goes through the cache.
    • The nix workflow can probably be run much less often than it is.
    • The second would be to only build a few versions of the LSP until "ready". Maybe as little as just the latest, maybe the latest 3 major versions, maybe just the latest on Windows/macOS and 3 versions on Linux? Lots of wiggle room there. There's a surprising amount of code that only ends up needing to be (initially) tested on a single platform in practice.
  2. Every single workflow tests everything. Which doesn't scale well as you can tell :)

    There are 13 packages in cabal.project and in an ideal world each one has a separate set of tests, benchmarks, etc., and the minimal subset is selected to run every time. In a really ideal world, caching everything and letting cabal do the thing would give you this property; however, git + transient caches + VMs + etc... just makes it way too hard for a build system to reliably do change detection like that. Even teams using build systems like Bazel (designed for this exact purpose) end up having to do partial targets on CI systems.

    • Another great change to think about is only building/testing parts that you need to. The granularity doesn't have to be very high here, but the payoff can be great. Drop dependency on shake for install.hs #63 is a great example of a PR that needs nothing else to run in its CI. Plugins are another one that could be an easy win (if that makes sense for how PRs are usually scoped).

More extreme savings are possible by switching the build system to something like Bazel and investing in dedicated CI infrastructure that can git checkout into bare metal with no state cleared between runs. Git fetches get way faster, caches don't have to be shoved into/out of the cloud, and the machines are much faster. I don't really recommend this approach because it makes the tooling way less approachable for contributors and you're going to end up having to do all of the above anyway, so you might as well milk the mileage out of the low hanging fruit first before considering something like Bazel.

@pepeiborra (Collaborator) commented

Thanks Jared for your analysis; I agree with the first point. I would say:

  1. Keep one of the benchmark jobs. I do check the benchmark results out of habit and there is a task to automate the check
  2. Keep one of the Nix jobs, it runs in <15 minutes and helps to keep the scripts working

On your second point, I'm not sure testing everything is the problem. The test workflow is fail-fast, so the whole matrix is cancelled as soon as anything fails. The test suites run in 20 minutes combined, which is not too bad.

The real problem imho is redundant builds, those can easily overwhelm the CI pipeline.

But all that said, the Mergify bot seems to be working wonders already

@hazelweakly commented

Redundant builds are definitely a huge issue. We run into those quite a bit and I wish I had a better way to reliably control those. So getting those down is definitely a large win for usability, and Mergify is an excellent tool to help with that as well.

Even if the test suites run in 20 minutes combined, if that can be reduced from 20 minutes (on average) across, say, 10-20 jobs to sometimes as low as 5-10 minutes across 2-5 jobs for smaller PRs, that has a surprisingly large cascading effect on the throughput of the overall CI queue. At the very least, it's worth thinking about for environmental concerns if the developer UX ones aren't compelling enough on their own.

@jneira (Member) commented Jan 19, 2021

@jared-w many thanks for your insights.
I think we have already reduced the work done to an acceptable set, with a good balance (for now) between coverage, valuable info, and time spent.

So the other big point to be considered would be:

Every single workflow tests everything. Which doesn't scale well as you can tell :)

Another great change to think about is only building/testing parts that you need to. The granularity doesn't have to be very high here, but the payoff can be great. #63 is a great example of a PR that needs nothing else to run in its CI. Plugins are another one that could be an easy win (if that makes sense for how PRs are usually scoped).

We had ghcide and hls separated in the past, and that, in addition to giving us double the CI resources, made it easier to test each one separately (hls depending on ghcide).
It would be great if CI ran tests only for the components effectively changed in a PR (and their dependants). But:

  • all plugins are tested in the same test suite, a known point for improvement: Test infrastructure for plugins #576
    • but we could still run specific tests using tasty's -pattern, as in the sketch after this list
  • we should set up a configuration able to detect which components changed, and their dependencies, to run only the relevant tests (to investigate)
    • maybe a starting point could be to manually use labels or commit keywords (like [ci-skip], but extended to mark which components you want to test). The problem is that it would be error-prone.
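To make the tasty -pattern point concrete, a minimal sketch; the func-test target name and the "Eval" pattern are illustrative assumptions, not the actual CI setup:

```bash
# Sketch: run only the tests whose names match one plugin, by passing tasty's
# -p/--pattern filter through cabal. The target and pattern are illustrative.
cabal test haskell-language-server:func-test --test-options='-p "Eval"'
```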

@hazelweakly commented

One pattern that's been proven to work in other projects is to run a get-targets type of script that will analyze git metadata, perhaps PR name, or whatever else you want to use and have it output the relevant subset of stuff to run. That could take the form of cabal test -- -pattern $(./get-targets), or for target in $(./get-targets); do cabal test $target; done, or a lot of other options.

I'd lightly suggest get-targets be written in a very lightweight language; probably bash, or python. The last thing you want is to add a few minutes to CI time in an effort to save CI time, or run into annoying bootstrapping problems. (Although once the logic gets worked out, if it doesn't really ever change, curling a built binary from "somewhere" is not the worst solution)
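A hypothetical get-targets along those lines, just to illustrate the shape of the idea (the path-to-target mapping is a guess at the repo layout, not something that exists today):

```bash
#!/usr/bin/env bash
# Hypothetical get-targets: map the files changed against master to cabal targets.
# The directory patterns and target names are illustrative guesses.
set -euo pipefail

targets=""
while read -r path; do
  case "$path" in
    ghcide/*)        targets="$targets ghcide" ;;
    plugins/hls-*)   targets="$targets $(echo "$path" | cut -d/ -f2)" ;;
    src/*|test/*)    targets="$targets haskell-language-server" ;;
    *)               targets="ghcide haskell-language-server"; break ;;
  esac
done < <(git diff --name-only origin/master...HEAD)

# One target per line, de-duplicated, ready for:
#   for target in $(./get-targets); do cabal test "$target"; done
echo "$targets" | tr ' ' '\n' | sort -u | sed '/^$/d'
```

Anything not obviously scoped falls back to testing everything, which keeps a script like this safe even when the mapping is incomplete.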

@jneira (Member) commented Oct 4, 2021

Things have been alleviated:

  • we have separated the ghcide test runs from the plugin ones (changes in plugins cancel the ghcide test suite)
  • from the 20 concurrent runners shared by the entire haskell org, we now have 180 job runners, thanks to GitHub.

However, a tool like the one @jared-w described would still be great, to allow a more fine-grained selection of which tests to run (changes in one plugin should not trigger tests in the rest).

@jneira (Member) commented Dec 16, 2021

I think we can close this issue; CI performance is pretty reasonable nowadays with all the changes made since then to the GitHub workflows and CircleCI.
Thanks all for the suggestions!

@jneira closed this as completed Dec 16, 2021