Discussion: Use a Docker image for building binaries/files prior to generating SLSA provenances #23
Thanks for sharing @rbehjati! I really like the approach of using Docker to accommodate complex build systems. We could "easily" wrap this up in a reusable workflow to streamline the work. @MarkLodato @asraa wdut? |
+1!!! I really like this -- currently this provenance generator is responsible for creating signed provenance populated with GitHub context information. I was just chatting with Laurent on how we could easily create shared code that could apply to these Docker image builds and other use cases on GitHub workflows. For general use cases like yours: (1) we can use the output of slsa-github-generator/slsa/provenance.go (line 34 at 9a875d0); (2) provide library functions to augment it with specific information like the buildConfig, a custom buildType, and materials; (3) generate and output signed provenances, either by (A) using the Fulcio signer in this library if a raw output is needed, or (B) using the cosign CLI signers in the case of uploading the provenance attached directly to the image (we can also raw-output this). In this tool the action items (AIs) for the above process would be:
|
Are you planning on signing the provenance in order for the output of the trusted builder to be non-forgeable? If so, where would you hold the provenance? Maybe on the provenance branch with a file commit like .sig? The .sig file would contain something like this, which contains the sig, certificate, and offline info to verify the transparency log information. Cosign understands this format as input. |
This looks pretty neat. I like the idea of having the build and provenance generation be more fully encapsulated. If we can make sure that the context where the builder image gets executed is safe enough and can't alter the provenance down the line, we can maybe allow folks to specify their own image and the reusable workflow will be responsible for executing it and incorporating the information on the commands run into the provenance. It seems worth exploring more. |
I also thought about users giving their own dockerfile: I was wondering if there are options (say, root docker) which may compromise security. But I think the consensus is that this is out of scope of SLSA, so I think it would work too. |
Yeah, from a technical standpoint we need to make sure that whatever does the build can't modify the provenance (i.e. command X was run but command Y was added to the provenance), but other than that we don't necessarily care what the build actually does. I think we should be able to do things like get the entrypoint from the image, or just add the … We probably do need a way to extract the build artifacts from the build job in a safe way too. Maybe the docker image would need to get the sha256sum of the artifacts and upload them itself? And then pass the names off to the provenance step somehow? |
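A rough sketch of that separation as a GitHub Actions workflow (job names, the builder image, and the output folder are all hypothetical; the point is that hashing happens in a trusted step outside the untrusted container, and only the digests flow to the provenance job):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      hashes: ${{ steps.hash.outputs.hashes }}
    steps:
      - uses: actions/checkout@v2
      - name: Run the untrusted builder image
        run: docker run --rm --volume "$PWD:/workspace" --workdir /workspace gcr.io/some-builder@sha256:xxx
      - name: Hash the artifacts (trusted step, runs outside the container)
        id: hash
        run: echo "::set-output name=hashes::$(sha256sum ./out/* | base64 -w0)"
  provenance:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Generate signed provenance over the received hashes
        run: echo "${{ needs.build.outputs.hashes }}" # placeholder for the actual generator
```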
I believe both of these are covered by @rbehjati's suggestion: the repo owner declares the Docker image digest (which is content-addressed, and needs to have been pre-built and pushed to some Docker registry), and the current git commit (which is also content-addressed) is mounted at a specific (possibly configurable) path under that Docker image; the owner also specifies the command to run under this configuration, and what output file to measure. The measurement needs to be done in a trustworthy way by the workflow after the command has finished running (i.e. we don't need to trust the owner for this either). These parameters may be provided to the workflow directly, or ideally as a separate TOML file. The latter has the advantage that we can build (and in fact have already built) tooling that parses it and runs exactly the same steps locally, i.e. on the developer's machine, without having to rely on GitHub actions, for instance while debugging things. It can also be ported trivially to other CI systems; e.g. the same TOML file may be the input to a GitHub actions workflow, but also a Google Cloud Build workflow (without having to maintain two distinct workflow files in sync). Finally, a provenance file is created with all these pieces of information, plus whatever other metadata we get from the builder via the OIDC token. |
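As a rough illustration of such a per-target config, with field names assumed (loosely modeled on the fields mentioned elsewhere in this thread):

```toml
# Hypothetical per-target build config; all names and values are illustrative.
builder_image = "gcr.io/some-builder@sha256:xxx"  # content-addressed image digest
command = ["./scripts/build", "--release"]        # command to run inside the image
output_path = "./out/binary"                      # file the workflow measures afterwards

# Filled in per invocation, e.g. by the CI job:
commit_hash = "0123456789abcdef..."
```

The same file could then drive both the GitHub Actions workflow and a local or Cloud Build run.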
Another option that I would like to explore at some point is to set up an instance of https://github.com/google/ent (a universal Content Addressable Store) and store provenances there. It would be a nice counterpart to Rekor: Rekor stores signatures over hashes, Ent stores the actual artifacts (or provenances, or anything else really) indexed by their hash. |
That is exactly the idea with the build tool in transparent-release. It would be nice to further generalize this tool and turn it into a reusable library that can be used in GitHub workflows. |
Yeah, I'm sure we could do something similar and then run the provenance generation step in a separate job from the builder, as I'd rather not trust the builder job, at least after the untrusted container gets executed. It seems the path to the repo could be provided, or you fetch it. I'm sure we could just check out the repo in a build step, but I'm curious if there is any benefit you saw to having the builder fetch the repo itself? |
It is just for convenience. When running as a GitHub action, the repo is already checked out, so the option of passing the path to the builder would have to be used. |
cc @loosebazooka working on distroless - this may be the way we can generate provenance using a script/dockerfile |
Just for clarity, there are two discussions we can have:
I think both have merit but probably need to be discussed separately. Probably implemented as different workflows. |
Once the container workflows are available, we can also verify the container image provenance before using it as a builder (in the re-usable workflow). Any strong reasons for using TOML vs YAML? |
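For the verification step mentioned above, a sketch using cosign's keyless attestation verification (the image name is a placeholder, and flag availability depends on the cosign version):

```sh
# Verify the builder image's own SLSA provenance before trusting it as a builder.
COSIGN_EXPERIMENTAL=1 cosign verify-attestation --type slsaprovenance gcr.io/oak-builder@sha256:xxx
```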
Follow-up: do we actually need a config file? The container should be able to do everything, especially for complex builds that require fetching sources from different places. I would imagine the interface to the builder could be just: …
A Dockerfile is great for a human to read, but poor in terms of immutability (apt-get, etc.). The image is great for immutability, but harder for a human to inspect unless it itself has provenance. |
I don't have a strong preference :)
Perhaps not. The snippet you have provided should work. I think our original design was intended to reuse the same builder image for building several binaries. But you could wrap all the options for building each binary into a new Dockerfile and a separate Docker image, and just use that for building the binaries. This is perhaps a better solution, especially if we are providing provenances for Docker images too (which should be straightforward). |
I think it's still nice to have a dedicated config file (in fact, one per target, of which there may be multiple). The main use case for this is that it would then be possible to build tooling to run the build locally (even though of course it would not generate a signed provenance). For instance, my main problem with GitHub actions at the moment is that it's impossible to run something similar to their own builder locally. At least GCB does allow triggering a job from a local machine, but ideally we should be able to run something equivalent completely locally and offline, and verify the output. In principle we could also build something that parses a GitHub actions workflow file to extract these fields from there, but it seems backwards to me. I am thinking of these TOML / YAML files as targets in a Makefile, and it should be possible to invoke any of them without involving GitHub actions at all. In fact, GitHub actions should itself delegate to our builder, and the GitHub actions workflow file should simply point to the relevant TOML / YAML files IMO. Re: TOML vs YAML, I don't mind too much, but TOML is substantially simpler to parse correctly than YAML, while still being sort of human readable. I don't think most users would even notice the difference anyway. |
I agree with Tiziano. Minimal BuildConfigs like the ones we currently have in transparent-release would be nice. |
You can trigger a remote build from a local machine using the GH API. Local builds are not possible, though, unless you use something like https://github.com/nektos/act. But that's not universal for all builders. I don't know if the … The two options that seem necessary are … For a rebuilder, the … Or do we need to differentiate the two use cases: builders and re-builders...? |
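For reference, the remote trigger mentioned above can be done with the GitHub CLI (the workflow file name here is assumed):

```sh
# Kick off a remote build of a named workflow from a local machine.
gh workflow run build.yml --ref main
```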
If you are calling …, then … If the idea is to have a separate builder tool, for instance for better testability, similar to what we have in transparent-release, then …
I agree. We are going to completely remove … |
We should also clarify exactly what the trust model is, and in particular what gets cryptographically bound to what and by whom. For instance, AFAICT, Fulcio is the root of trust that binds the identity of a job with a fresh signing certificate. Presumably information about the job is embedded in the certificate itself, but we need to start from that. Similarly, Fulcio in turn trusts a token generated by GitHub actions itself, so we should also look at what that token contains, and how it is bound to the workload. For instance, I expect the commit hash to be bound to the certificate somehow, but it would be good to clarify what a verifier would have to do to confirm this; in particular, a verifier would probably not trust the commit hash field in the provenance file, but it would actually look at the one bound in the certificate (or perhaps compare both of these, and ensure they are correct). |
How about the following:

```yaml
uses: path/to/builder
with:
  output-folder: ./some/folder/
  dockerfile: path/to/Dockerfile # Note: this would assume the Dockerfile is stored in the same repository
  # OR:
  image: gcr.io/bla@sha256:xxx
  configuration: something # An opaque string interpreted by the dockerfile / container image
```

Here users may want to access some env variables / GitHub context, so we would forward them. The provenance file would attest to the repo / hash and container image (TBD where we'd report it).

```yaml
uses: path/to/builder
with:
  output-folder: ./some/folder/
  image: gcr.io/oak-builder@sha256:xxx
  configuration: ./path/config.toml
```

Would the above work? |
You're correct: the repo / hash is bound to a cert. It's also inside the provenance, and a verifier should be able to trust it so long as they trust our builder, which is also embedded in the cert that Fulcio signs. Let me know if this clarifies the trust model or not. |
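As an illustration of that binding, a verifier could inspect the certificate directly; this assumes Fulcio's documented OID extensions (1.3.6.1.4.1.57264.1.3 is "GitHub Workflow SHA"), and the certificate file name is a placeholder:

```sh
# The commit SHA of the triggering workflow run is embedded in the Fulcio-issued
# certificate as an X.509 extension; compare it against the provenance contents
# rather than trusting the provenance file alone.
openssl x509 -in signing-cert.pem -noout -text | grep -A1 '1.3.6.1.4.1.57264.1.3'
```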
I think we can make it work. But I am still a bit worried about the testability of this approach. Before writing the workflow action, we'd want to be able to test it locally. I suppose occasionally people might need to debug the build as well. What is required for testing this locally? What does the builder (obtained from …) do? |
Since the Dockerfile + config defines everything, a user should be able to run the equivalent …
Correct. Nothing else. |
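For illustration, that local equivalent might look roughly like this (the image, paths, and entrypoint are all assumptions):

```sh
# Run the same builder image against a local checkout, then measure the output
# the same way the trusted workflow would.
docker run --rm \
  --volume "$PWD:/workspace" --workdir /workspace \
  gcr.io/oak-builder@sha256:xxx \
  ./scripts/build --config ./path/config.toml
sha256sum ./some/folder/*
```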
Thanks for yesterday's meeting. The following is my summary of the discussions, referring heavily to the following suggestion from @laurentsimon:

```yaml
uses: path/to/builder
with:
  output-folder: ./some/folder/
  image: gcr.io/oak-builder@sha256:xxx
  configuration: ./path/config.toml
```
Remaining questions:
Please add to or correct my summary if anything is missing or incorrect. |
Thanks for the update @rbehjati and @laurentsimon! I am not sure I understand how this would be used for transparent release in practice, and especially how someone would verify the generated provenance file. Could we go through an example? It would help me understand things better. Perhaps let's consider this build config file from Oak: https://github.com/project-oak/oak/blob/87a33746f3f512ec3ece204fa26704bdf9a08846/buildconfigs/oak_functions_loader_base.toml

What would the corresponding workflow be? In particular, how would things work without having the …? And why is the output path in the workflow instead of the build config? Is it because the workflow needs to do something special with it, and cannot parse the config file?
I think we should not allow providing extra arguments if we can avoid it at all. I would prefer we literally hardcode a single "standard" mount path (I think …).
I don't think baking the command in the docker image scales well; for instance, in Oak, we want to use the same docker image, but run different commands for different targets. Hardcoding the command in the image would require us to create (and maintain) as many images as commands, plus additional images for local development. But maybe you meant something else and I misunderstood the point?
I suggest not supporting Dockerfile, since in general building a docker image is not an idempotent operation, even from the same Dockerfile. This may introduce subtle issues when things appear to run on the same image, but actually the images are completely different. e.g. imagine the Dockerfile has a command that fetches a resource from a URL, and the target of the URL changes over time. This would be solved if we used content addressable stores for everything, but we are not there yet :) (shameless plug for https://github.com/google/ent ) |
I think this is to let the docker image parse it. This way the re-usable workflow we provide in this repo could be used for any purpose: users who just want to define their pipeline via a Dockerfile / image, or transparent-release users who use a common container image provided by you. Someone who wants to re-build will have the "configuration" available in the provenance, so they should be able to re-run the build. In a nutshell, we're just saying that this configuration is opaque to the builder in this repo, and the container image can interpret it the way it wants.
Using a "configuration" should solve the problem. The container image (which you control) will get the repository from the GH env variables, check out the repo and read the config, then extract the relevant information, including the path to the script. When re-running locally, we start with the provenance instead. The provenance records the env variables, GH context, the configuration string, and the command to run. You could replay it in GCB as well. Would that work? The AI on my end is to read your paper, so let me know if I misunderstood something. |
My understanding is that the generic builder would not have to read or parse the config file. If so, this proposal sounds good to me. What do you mean by the path to the script?
I suppose the command will be more complicated than that. For instance, for the image to be able to use the config file, it should be mounted. Also, some testing might be required before setting up the workflow on GitHub. At that point there is no provenance to start from. So I still think some additional tooling for testing locally should be provided. But we can work out the details of that as we make progress with the design. |
Sorry for the confusion. What I meant is that if the "configuration" option is a path to a script, then your image will be responsible for reading this file. If the "configuration" is JSON-like or in any other format, then you can use it right away. It's up to you to decide. Does this work?
I agree that some tooling will be required, but hopefully that can be hidden from end users. |
Yes. This generally sounds good to me. Thanks. |
If the image is the same, i.e. the same hash, this would be fine, correct? I've been thinking about Dockerfile support. I think many users will want to declare a Dockerfile, and not go through the extra steps of generating a container image themselves. GHA today can do that for users, i.e. you can define a Dockerfile to build a GHA. One idea could be to support Dockerfiles, but cache the corresponding images for subsequent builds. Something like the following:
Would this work? Wdut? Note: I don't think this is needed by the release transparency project; we can continue supporting an image as one of the inputs. |
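As an illustration of that caching idea (this is a sketch rather than the elided steps above; the registry name and tag are assumed):

```sh
# Build the image from the user's Dockerfile once, push it, and record the
# resulting digest so subsequent builds reuse the same content-addressed image.
docker build -f path/to/Dockerfile -t registry.example/builder:tmp .
docker push registry.example/builder:tmp
docker inspect --format='{{index .RepoDigests 0}}' registry.example/builder:tmp
```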
I am not a Docker or GHA expert, so I cannot really say if it works or not, but the solution sounds good to me, especially given your note about the release transparency project. Thanks. |
Follow-up discussion about the idea of using a Docker image as the builder/releaser, as we have in project Oak.
In project Oak, and as part of our transparent-release work, we use a builder image for building binaries. The builder image is a Docker image, which has all the tools required for building the binary installed, and the required environment variables set. It might be interesting to use a similar idea here for building the binaries and generating the provenances. This can be used as an alternative to tools like go-releaser.
Currently in our tooling for transparent-release, the build command is a `docker run` command that runs a given command in the builder image. When generating SLSA provenances, we include this information as the BuildConfig. See also our custom buildType. In addition, we include the builder image in the list of materials. The build tool fetches the specified Docker image and ensures that the command for building the binary is executed using the fetched builder image. The builder image is identified by a URI containing the digest of the image. If the versions of the toolchains are fixed in the Dockerfile (example from Oak) and the checksums are verified, then this can get very close to the idea of a trusted builder. Here is an example of such a SLSA provenance file, with BuildConfig and materials as described above.

This is our GitHub action that generates provenances. We generate provenances for each commit that is merged into the main branch. It currently doesn't use the build tool from transparent-release (because the build does not yet generate a provenance file), but we plan to use this build tool with a simple TOML file similar to this example. The idea is to have the TOML file checked into the repo as a static file (containing only the `command`, the `output_path`, and a few other fields), and let the GitHub actions job fill out the `commit_hash` and the `builder_image` URI that are different for each commit and invocation of the build tool. A similar minimal TOML can be used here for building the binary using a builder image provided by the maintainers of the repo.
Note that in our approach, we don’t fill out the invocation part in the SLSA predicate, as we think all the information is provided in the buildConfig, and materials.
cc @laurentsimon, @tiziano88