Add MergeOp to LLB #1431
Thanks for the pointer to our PR! As discussed in #1402, this functionality (and especially #870's lazy refs) will work well with recent containerd's stargz-based lazy image distribution. By leveraging it we can get layer snapshots and exec on them before the entire contents are available locally, which will hopefully speed up building "dev" (or non-exporting) stages. In #1402, we discussed the following concern about lazy image distribution with @tonistiigi:
I think that if we pull layers in the background, caches don't need to hold remote refs or check their accessibility. When we need the contents (e.g. when creating snapshots of non-stargz layers or exporting layers) and the pull hasn't completed, we just need to wait for completion. I think this "background pulling" can also be implemented together with containerd's content store. I don't completely understand the whole cache design, but I think I can help with the implementation (or the parts related to #870/#1402) if you give me design/implementation direction.
Example case: a first build that uses a stargz image runs without an exporter. Stargz pulls partially and the build completes. Then buildkitd is shut down (just to make sure that we can't continue pulls in the background; I think generally we don't want to pull data we don't need in the background without the user knowing anyway). Now buildkitd is started and a second build runs with the exporter. It gets the cache but cannot export the image, as it would need to pull more data. The same would happen if the second build runs without an exporter but has an additional command that needs more stargz data.
@tonistiigi I agree that we need to hold image refs for remotely mounted caches. In this case, I think the cache's health checking is a concern. For this we need to define the separation of responsibilities between BuildKit and the snapshotter. To achieve it, we first need to make the remotely mounted caches (stargz-based caches) accessible by a tuple of … The thing BuildKit needs to do is make sure the ref is alive when it searches the SourceOp cache. Here we can leverage the current llbsolver, which resolves the ref (= fetches the manifest) at that time; this is enough for testing the ref. Suppose the tuple was proved to be accessible by the above step in BuildKit. Then the snapshotter must check that the snapshot created using the same tuple is still usable. If it finally turns out that the snapshot is unavailable for some unpredictable reason, BuildKit should ignore the cache and pull the whole image contents just the same as for standard images, which is the worst case.
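To illustrate the "resolve the ref (= fetch the manifest)" check, here's a minimal sketch using containerd's resolver; this is not BuildKit's actual cache-check code path, just the shape of the test, and the function name is made up:

```go
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd/remotes/docker"
)

// checkRefAlive verifies that an image ref is still resolvable (i.e. the
// manifest can be fetched), which is roughly what resolving a SourceOp ref
// amounts to when testing accessibility.
func checkRefAlive(ctx context.Context, ref string) error {
	resolver := docker.NewResolver(docker.ResolverOptions{})
	name, desc, err := resolver.Resolve(ctx, ref)
	if err != nil {
		return fmt.Errorf("ref %s is not accessible: %w", ref, err)
	}
	fmt.Printf("resolved %s to manifest %s\n", name, desc.Digest)
	return nil
}
```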
@sipsma Was thinking that maybe one way to start with this would be to make an extended snapshotter interface with a `Merge` method. New logic would still be needed in the … Another thing I thought of was whether there is a way to keep the top layer mutable after a merge. Eg. maybe the …
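For discussion, such an extended interface might look roughly like this (the method name and signature are guesses based on this thread, not a settled design):

```go
package snapshot

import (
	"context"

	"github.com/containerd/containerd/snapshots"
)

// MergeSnapshotter is a hypothetical extension of containerd's snapshotter
// interface for this proposal.
type MergeSnapshotter interface {
	snapshots.Snapshotter

	// Merge records that the given committed snapshots should appear stacked
	// on top of each other and returns the key of the resulting (committed)
	// snapshot. Backends that cannot represent this lazily may fall back to
	// copying.
	Merge(ctx context.Context, parents ...string) (key string, err error)
}
```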
What would it take to allow access to the ref of the upperdir? For mounts, they contain an `Output`. Perhaps you could have something like:

```go
type Output interface {
	// ...
	Partitions(context.Context) []Output
}
```
```go
partitions := llb.Scratch().Run(
	llb.Shlex("echo foo > /foo"),
).AddMount(
	"/",
	llb.Merge(
		llb.Image("busybox"),
		llb.Scratch(), // <- Does this map to upperdir? or should it be omitted from the arguments to `llb.Merge`?
	),
).Partitions()

upperDir := partitions[len(partitions)-1]
```

Edit: Thinking about it some more, only the following may be needed:

```go
st := llb.NewState(
	llb.Scratch().Run(
		llb.Shlex("echo foo > /foo"),
	).AddMount("/", llb.Merge(llb.Image("busybox"))),
)
```
@hinshun Well, access to upperdir is only possible through the use of a differ. Moby used to have a special overlay differ reusing some parts of native layering, but even that was removed, and containerd doesn't have it either. Maybe for your use case some feature in the exec API makes more sense.
Yeah I think that's a good approach for keeping this change manageable in terms of size. Once we've figured out the design a bit more and the implementation starts I'll keep this in mind.
This question (basically: what should the output of an Exec on top of a merge op be?) is one of the main things I've been thinking about with MergeOp. Here are my thoughts (please correct me if I'm misunderstanding anything about the snapshotter model):
Overall I think in an ideal world we'd support both of those behaviors for Execs created on top of a merge-op.
The cool thing about this approach:

```go
// layer is an llb.State that consists only of whatever is built by running
// the command "gcc foo.c" on top of the merge-op; layer doesn't actually
// contain the busybox image or the other state
layer := llb.Merge(
	llb.Image("busybox"),
	someOtherState,
).Run(llb.Shlex("gcc foo.c")).Root()
```

Calling `.Run()` on the return value of `llb.Merge()` is thus different from calling `.Run()` on other types of state. If there's concern about that, maybe a different name can be given to this version of `Run` instead. However, with that you can actually achieve both results:

```go
layer := llb.Merge(
	llb.Image("busybox"),
	someOtherState,
).Run(llb.Shlex("gcc foo.c")).Root()

// graph consists of layer plus the mounts that were originally present
// during the exec that created layer
graph := llb.Merge(
	layer,
	llb.Image("busybox"),
	someOtherState,
)

// we could also add sugar for this like
graph = llb.Merge(
	llb.Image("busybox"),
	someOtherState,
).Run(llb.Shlex("gcc foo.c")).Graph()
```

There are some more things to think about, including some tricky stuff surrounding the ordering of merged mounts, whiteouts, and opaque directories, but I'm curious to get everyone's thoughts on this basic idea before diving deeper.
IIUC, I think I can see the benefit of this for the case of a naive snapshotter, where you just have to slow-copy instead of using actual overlayfs. Creating an exec on top of a merge-op would look like: …
For the true overlayfs implementation though, I'm not sure this idea would apply, at least the way I was imagining it would work: …
Let me know if I was misunderstanding, though.
No, that's not quite correct. There is one exec, and an exec has one parent. That parent may have multiple/different parent chains, though. The difference is that on a snapshotter that doesn't have layers, an exec cannot write data to multiple locations. Eg. in that case, any time an exec runs, every file would need to be written to as many places as there are different combinations of the mergeop sources. That is not possible. Or a differ would need to run after every exec, making a tarball that is extracted to every mergeop source location, which is highly inefficient. Doing it by merging existing chains allows avoiding the slow path until it is absolutely needed, and the slow path is always equal to the copy operation as it is today. I'm not ready to go overlay-only, and other storage methods should remain possible.
Yes, a mount of a mergeop needs to make all source data available.
Again, the problem with that approach is that on a non-layering snapshotter you cannot decide where this one layer will go. You have to fix on a parent before you can call mount.
I didn't get these examples. What is …?
You can't run an exec on top of a mutable ref. The exec itself needs to write somewhere. The case I had in mind was actually releasing and reusing mutable space. Eg. …
Even if copy did a merge here and just linked the context directory directly, the context directory would then become immutable, meaning that next time the full context needs to be sent again because there is no mutable destination directory. This might be even more inefficient than today, where the extra copy at least appears locally on one node.
In all cases, creating a mergeop should be constant-time and just store metadata (pointers to sources). Performance for accessing data may depend on the backend. Note that if you need to run a differ to capture the changes a process has made to the filesystem, you can just run a differ process directly in BuildKit itself. The same should be possible for the …
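For illustration, running such a differ in-process could look roughly like the sketch below, using containerd's continuity package to walk the changes between two directories; the function and its arguments are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/containerd/continuity/fs"
)

// diffDirs prints the changes between a parent snapshot directory (a) and
// the directory after an exec ran on top of it (b). A real implementation
// would record these changes (or stream them into a tar) instead.
func diffDirs(ctx context.Context, a, b string) error {
	return fs.Changes(ctx, a, b, func(kind fs.ChangeKind, path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		fmt.Printf("%v %s\n", kind, path)
		return nil
	})
}
```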
@tonistiigi Thanks, I see what you're saying about some of the difficulties in getting this to work equivalently with non-layering snapshotters. I have some updated ideas now. To start, just to ensure we're on the same basic page, this is a very rough outline (no error handling, cleanup, etc.) of what I'm imagining the snapshotter extension would look like.

For overlayfs-backed snapshotters:

```go
package snapshot // this is the existing buildkit/snapshot package

import (
	"context"
	"path/filepath"
	"strings"

	"github.com/containerd/containerd/mount"
	snapshotter "github.com/containerd/containerd/snapshots"
)
type overlayMergedSnapshotter struct {
containerdSnapshotter // assumed to be overlayfs implementation
metadataStore // TODO doesn't exist yet, just some interface for storing metadata about merged snapshots
}
// keys are the snapshot ids that you want to merge together
func (s *overlayMergedSnapshotter) Merge(ctx context.Context, keys ...string) (id string, err error) {
	id = newRandomId()
	// store in metadata that id consists of keys merged together and is already committed
	kind := "committed"
	s.metadataStore.PutMergedKeys(id, kind, keys...)
	return id, nil
}
func (s *overlayMergedSnapshotter) Prepare(ctx context.Context, key, parent string, opts ...snapshotter.Opt) ([]mount.Mount, error) {
	// GetMergedKeys returns the merged keys that make up parent, or returns an error if it's not a
	// merged snapshot
parentMergeKeys, err := s.metadataStore.GetMergedKeys(parent)
if err != nil {
return s.containerdSnapshotter.Prepare(ctx, key, parent, opts...)
}
// assemble the lowerdirs, which should be ordered highest layer -> lowest
var lowerdirs []string
for _, parentKey := range parentMergeKeys {
		viewId := newRandomId()
mounts, _ := s.View(ctx, viewId, parentKey)
// getLowerDirs returns the lowerdir options from overlay mounts or just the source from bind mounts.
lowerdirs = append(lowerdirs, getLowerDirs(mounts)...)
}
	upperId := newRandomId()
	upperMounts, _ := s.containerdSnapshotter.Prepare(ctx, upperId, "") // parent id is "", new empty snapshot
	// assume that upperMounts only has one mount and it is a rw bind mount since it has no parents
	// (is that too strong an assumption?)
	upperDir := upperMounts[0].Source
	// workdir must be an empty dir on the same filesystem as upperDir; may need a way to get that consistently
	workDir := filepath.Join(filepath.Dir(upperDir), "work")
	// store in metadata that key consists of upperId merged on top of its parents and is active
	kind := "active"
	mergedKeys := append([]string{upperId}, parentMergeKeys...)
	s.metadataStore.PutMergedKeys(key, kind, mergedKeys...)
	return []mount.Mount{{
		Type:    "overlay",
		Source:  "overlay",
		Options: []string{"lowerdir=" + strings.Join(lowerdirs, ":"), "upperdir=" + upperDir, "workdir=" + workDir},
	}}, nil
}
func (s *overlayMergedSnapshotter) View(ctx context.Context, key, parent string, opts ...snapshotter.Opt) ([]mount.Mount, error) {
parentMergeKeys, err := s.metadataStore.GetMergedKeys(parent)
if err != nil {
return s.containerdSnapshotter.View(ctx, key, parent, opts...)
}
var lowerdirs []string
for _, parentKey := range parentMergeKeys {
		viewId := newRandomId()
mounts, _ := s.View(ctx, viewId, parentKey)
lowerdirs = append(lowerdirs, getLowerDirs(mounts)...)
}
	return []mount.Mount{{
		Type:    "overlay",
		Source:  "overlay",
		Options: []string{"lowerdir=" + strings.Join(lowerdirs, ":")},
	}}, nil
}
// TODO rest of the methods in the snapshotter interface
```

For snapshotter backends that don't support true layering (or that otherwise don't work with the above implementation):

```go
package snapshot
type nativeMergedSnapshotter struct {
	containerdSnapshotter
	metadataStore // needed by Prepare below
}
// There are multiple options for this implementation. We could do the laziness (that is, not actually
// performing a merge until needed) on this level or we could handle laziness on higher levels like
// CacheManager. I think I prefer doing the laziness in CacheManager at the moment, so for now Merge
// will directly create the merged snapshot here. Can be updated though.
func (s *nativeMergedSnapshotter) Merge(ctx context.Context, keys ...string) (id string, err error) {
	mergedId := newRandomId()
	baseKey := keys[len(keys)-1]
	mergedMounts, err := s.containerdSnapshotter.Prepare(ctx, mergedId, baseKey)
	for i := len(keys) - 2; i >= 0; i-- { // TODO obviously handle len(keys) < 2 better
		viewId := newRandomId()
mounts, _ := s.View(ctx, viewId, keys[i])
// apply is similar to Applier.Apply except it takes []mount.Mount as the source instead of an oci
// descriptor. Need to figure out the best way to implement this, will mounting mounts and using
// continuity/fs.CopyDir suffice?
apply(mergedMounts, mounts)
}
id = newRandomId()
s.containerdSnapshotter.Commit(ctx, id, mergedId)
return id, nil
}
func (s *nativeMergedSnapshotter) Prepare(ctx context.Context, key, parent string, opts ...snapshotter.Opt) ([]mount.Mount, error) {
	// isMergeKey returns whether parent is a merged snapshot
	if !s.metadataStore.isMergeKey(parent) {
		return s.containerdSnapshotter.Prepare(ctx, key, parent, opts...)
	}
	// TODO: prepare on top of the already-materialized merged snapshot
	return nil, nil
}
```

Let me know if this looks close to what you're imagining; if so, I'll update my client-side LLB example accordingly, as it will be different from what I described previously.

Given just the above, there wouldn't be support for what @hinshun and I were talking about in terms of being able to "detach" upperdirs and use them as their own entity. However, I have a new idea on how to approach those use cases that could be implemented as an extension at a later time, namely by adding something like an … This is still a very vague idea; I need to think about it more, and I want to get confirmation that the approach above is right first.

One other thing that I want to note here, which requires more investigation as it could complicate Merge a fair bit: some behavior surrounding when the kernel creates opaque directories is going to be annoying. Namely, from my testing, if you have an overlayfs mount and make a change that results in a new directory being created in the upperdir, that directory can end up marked opaque.

This behavior is fine in other circumstances because it's expected that an upperdir will forever and always be used on top of the same set of lowerdirs; even if the upperdir is later used as a lowerdir in a new mount, it will still be on top of the same lowerdirs it was above previously. However, with merge-op we are now allowing stuff like stacking an upperdir on top of lowerdirs it was never created against, where an opaque directory can mask contents it was never meant to hide.

For similar reasons, I suspect merge-op will not work when the "inodes index" feature of overlayfs is turned on. I need to confirm this though, as I haven't actually tried it yet. There might be ways to work around this, but I want to investigate a little more and try out fixes before expanding too much; let me know if you have any thoughts in the meantime.
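As a usage illustration of the sketch above (the snapshot keys and target path are made up):

```go
package snapshot

import (
	"context"

	"github.com/containerd/containerd/mount"
)

// runOnMerged merges two committed snapshots, prepares an active snapshot on
// top of the merged result, and mounts it.
func runOnMerged(ctx context.Context, sn *overlayMergedSnapshotter) error {
	mergedID, err := sn.Merge(ctx, "snap-a", "snap-b")
	if err != nil {
		return err
	}
	mounts, err := sn.Prepare(ctx, "exec-rootfs", mergedID)
	if err != nil {
		return err
	}
	// On overlayfs this is a single overlay mount whose lowerdirs are the
	// joined lowerdirs of snap-a and snap-b plus a fresh upperdir; the
	// native variant would instead have materialized the merge by copying.
	return mount.All(mounts, "/tmp/rootfs")
}
```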
Note that BuildKit's snapshotter interface (which is a wrapper over containerd's) differs from the upstream one here.
For naive, I'm not sure if diff/apply logic is needed or if files can just be copied over with squashing logic. The blobs should be kept separate though, and not squashed/recreated.
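For readers following along, the wrapper's shape is roughly the following (paraphrased from memory; see the `buildkit/snapshot` package for the exact definition):

```go
package snapshot

import (
	"context"

	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/snapshots"
)

// Mountable defers mount assembly until needed and returns a release
// callback alongside the mounts.
type Mountable interface {
	Mount() ([]mount.Mount, func() error, error)
}

// Snapshotter mirrors containerd's interface, except Prepare/View do not
// return mounts directly; mounts are obtained through Mounts.
type Snapshotter interface {
	Name() string
	Prepare(ctx context.Context, key, parent string, opts ...snapshots.Opt) error
	Mounts(ctx context.Context, key string) (Mountable, error)
	// ... remaining methods follow containerd's snapshotter
}
```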
Yeah, that sounds really bad. Maybe we need to scan to find if these exist to determine if quick-merge is possible? Any other ideas?
We can control what features are turned on while mounting.
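For what it's worth, such a scan could be fairly cheap. Here's a minimal sketch that only checks for the `trusted.overlay.opaque` xattr; a real version would also need to look for whiteout character devices, and the helper name is made up:

```go
package main

import (
	"io/fs"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// hasOpaqueDirs reports whether any directory under upperdir carries the
// overlayfs opaque xattr, which would make the quick lowerdir-joining merge
// unsafe and force the slow copy fallback.
func hasOpaqueDirs(upperdir string) (bool, error) {
	found := false
	err := filepath.WalkDir(upperdir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.IsDir() {
			return err
		}
		buf := make([]byte, 1)
		if n, err := unix.Getxattr(path, "trusted.overlay.opaque", buf); err == nil && n == 1 && buf[0] == 'y' {
			found = true
			return filepath.SkipAll
		}
		return nil
	})
	return found, err
}
```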
So is the suggestion to disable the quick merge when there's an opaque directory and fall back to the slow merge? Definitely seems possible on a technical level, but at first thought it would be a pretty bad user experience IMO. At least with my use cases for merge-op in bincastle, previous attempts at doing slow merges were so slow and costly in terms of disk space that they were essentially unusable. Plus, the situations in which opaque directories arise aren't that uncommon but are also hard to explain to someone unfamiliar with them, so trying to explain to users why their merge-ops are randomly extremely slow and taking up tons of disk space would be very difficult. On that note, I looked into opaque directories a bit more and found another situation where they get created: …
This is another situation where using an upperdir with an opaque dir will lead to surprising behavior with merge-op. There's also the related issue surrounding whiteout devices in upperdirs, which can result in a path that's present in multiple merged states either appearing or not appearing depending on the order in which they are merged, which would also be confusing in some situations.

I don't have any good ideas on how to deal with all this yet, but I'll explain my thoughts in case they trigger better ideas in anyone else. Basically, my thinking is that the expected behavior for users should be that when they merge states, opacity and whiteouts created within one state only apply to that state's own chain, not to the other merged states.

Assume for a second that there's a way of creating such an overlay mount where opacity is limited in scope as described above. The next problem that will arise is when you do an exec on top of such a merged overlay mount: that exec will itself be capable of creating opaque dirs and whiteouts. This essentially creates a new "opacity scope", where the opaques and whiteouts created by the exec should only apply to the merged mounts present during the exec (which are themselves merged mounts with their own "opacity scopes"). So, using this terminology, you are now basically nesting opacity scopes within one another.

This is obviously really complicated, but if you were able to implement it, I think you could define a merge-op that, from a client perspective, actually behaves more intuitively (despite all the complication underneath). When you delete/rename a path during an exec, that only overrides the paths present during the exec. If you take the result of that exec and merge it with another separate state, the opacity doesn't apply to the separate state; things just get merged between them as expected.

So, the question then is how you could implement this. I've had some ideas, but they either don't quite work or have huge downsides: …
I'll think about this some more, but let me know what thoughts you have too (I wouldn't be surprised if I'm way over-thinking it at this point!)
Thought about this some more; I do have an idea that might work with reasonable performance. The only major downside I can think of would be the complexity of the implementation. Sort of similar to …

To explain: I noticed some helpful behavior where the opacity of a dir doesn't count if it's set on the "root" of the lowerdir. So if you have a dir that is marked opaque and you use that dir itself as the root of a lowerdir in a new overlay mount, its opacity is ignored in that mount.

So, say you have a merged mount where multiple lowerdirs in different opacity scopes contain a directory, and one of them has that directory marked opaque. You can create a separate overlay mount that uses the opaque directory as the root of a lowerdir (neutralizing its opacity) and stack it on top of the base merged mount at that directory's path.

Note that this is not the same thing as what's usually referred to as "overlay-on-overlay", which typically refers to using an overlay mount as either the upperdir or a lowerdir of another overlay mount.

For whiteouts of individual files, you can follow the same approach as for opaque dirs, but with the parent dir of the file that should show up in the final merged mount. Then, instead of mounting the overlay directly on top of the base overlay, you create the new overlay separately and bind mount the individual file from the new overlay on top of the base overlay, which lets you get copy-on-write on a single file in isolation.

The only remaining complication comes from the fact that each overlay mounted on top of another will require its own upperdir and workdir. …

Let me know if this makes sense or if there are any other suggestions. I need to spend a little time prototyping this to make sure it's sound before moving forward, which will probably help communicate the idea a bit better too.
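To make the nesting idea concrete, here's a minimal sketch using containerd's mount package; all layer paths are hypothetical, and it ignores the upperdir/workdir bookkeeping mentioned above:

```go
package main

import "github.com/containerd/containerd/mount"

// scopedMerge sketches the nested-mount idea: layerB's /foo is opaque, so
// instead of adding layerB to the base lowerdir list (where the opaque flag
// would hide /foo from every merged layer), a second overlay is stacked on
// top of the base mount at /foo. Opacity on the root of a lowerdir is
// ignored, so layerB/foo no longer masks the other layers' /foo contents
// outside this nested scope.
func scopedMerge() error {
	base := mount.Mount{
		Type:    "overlay",
		Source:  "overlay",
		Options: []string{"lowerdir=/layers/a:/layers/c"},
	}
	if err := base.Mount("/mnt/merged"); err != nil {
		return err
	}
	scoped := mount.Mount{
		Type:    "overlay",
		Source:  "overlay",
		// lowerdir paths are resolved at mount time, so referencing
		// /mnt/merged/foo here captures the base mount's view of /foo.
		Options: []string{"lowerdir=/layers/b/foo:/mnt/merged/foo"},
	}
	return scoped.Mount("/mnt/merged/foo")
}
```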
This is a replacement proposal for #871, which proposed changing results to a ref array to solve the same problems. The problem with that approach is that when subbuild results are reused with #1286, the client can't predict how many refs will be returned.
For example, a client wants to do a subbuild and then run `touch foo` on top of that result. This is simple as long as the subbuild returns a single ref: you can just convert that result to `llb.State` and run `state.Run()` on top of it. But should it return multiple results, the client would need to detect that, convert it into multiple states, and then the only thing it could do is use slow `llb.Copy` to join them into a single state that could be used as a rootfs for running the command. This is very slow, and we can't expect every client to write these complicated exceptions for every case. A subbuild is likely a black box to the main client/frontend.

Instead, an alternative solution is to add a new LLB operation type, `MergeOp`. MergeOp takes multiple inputs and layers them on top of each other, as if the files from a second input were copied over the first one.
Eg. a Dockerfile `COPY` could either be written as `llb.Image("alpine").Copy(llb.Local(), "/", "/")` or `llb.Merge(llb.Image("alpine"), llb.Local())`. The difference is that in the latter case we have avoided the potentially expensive copy, and cache chains are now tracked independently, while previously the copy had a dependency on the alpine image.
The key to making `MergeOp` more efficient than `Copy` is that the implementation should be lazily evaluated and only do expensive work when the result of the merge is needed. By default, `MergeOp` should return a reference that contains pointers to its input references without doing any work.

When a reference from `MergeOp` is sent to the exporter, the differ can generally work fine with the individual sub-snapshots. Eg. if two images are layered on top of each other, there is no need to run the differ again. Generally, if a sub-snapshot already has calculated blobs, they don't need to be reevaluated after the merge.

When the merged reference does need to be mounted (eg. to run a container on top of it), things get a bit more complicated. Now it depends on the underlying snapshot implementation whether this mount can be done efficiently or not. On overlay-based snapshotters you can just take the lowerdirs from each sub-snapshot and join them together for a single snapshot without moving data. For other implementations, a copy can't be avoided anymore, and data needs to be duplicated. Here, it is especially important that the copy does not happen before the mount is actually needed. Implementing this requires some significant changes to the cache ref manager. What the actual snapshot implementation does should be invisible to the LLB layer and solver/exporters.
This should also work well with the lazy refs proposal #870; it probably makes sense to implement them together. Supporting stargz (#1402) is also somewhat related.
`AddPrefix` and `TrimPrefix` can be used to access or create subdirectories over the input references, to support cases like `COPY /in /out`.

@hinshun @sipsma