
Add MergeOp to LLB #1431

Closed
tonistiigi opened this issue Apr 8, 2020 · 13 comments

Comments

@tonistiigi
Member

This is a replacement proposal for #871, which proposed changing results to a ref array to solve the same problems. The problem with that approach is that when subbuild results are reused with #1286, the client can't predict how many refs will be returned.

For example, the client wants to do a subbuild and then run touch foo on top of that result. This is simple as long as the subbuild returns a single ref: you can just convert that result to llb.State and run state.Run() on top of it. But should it return multiple refs, the client would need to detect that, convert them into multiple states, and then the only thing it could do is use the slow llb.Copy to join them into a single state that could be used as a rootfs for running the command. This is very slow, and we can't expect every client to write these complicated special cases. The subbuild is likely a black box to the main client/frontend.

Instead, an alternative solution is to add a new LLB operation type MergeOp.

MergeOp takes multiple inputs and layers them on top of each other, as if the files from each later input were copied over the earlier ones.

Eg. a Dockerfile

FROM alpine
COPY / /

Could either be written as

llb.Image("alpine").Copy(llb.Local(), "/", "/")

or

llb.Merge(llb.Image("alpine"), llb.Local())

The difference is that in the latter case we have avoided the potentially expensive copy, and the cache chains are now tracked independently, whereas previously the copy had a dependency on the alpine image.

The key to making MergeOp more efficient than Copy is that the implementation should evaluate lazily and only do expensive work when the result of the merge is needed. By default MergeOp should return a reference that contains pointers to its input references without doing any work.
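To make that concrete, here is a minimal sketch of such a lazy merged reference; the type and method names are hypothetical and not BuildKit's actual cache API:

package cache // hypothetical sketch of a lazy merged reference

import (
    "errors"

    "github.com/containerd/containerd/mount"
)

// ref stands in for BuildKit's immutable cache reference.
type ref interface {
    Mount() ([]mount.Mount, error)
}

// mergedRef only records pointers to its inputs; creating it is a
// constant-time, metadata-only operation.
type mergedRef struct {
    inputs []ref
}

// Mount is where the backend-specific (and possibly expensive) work happens:
// an overlay backend can combine lowerdirs here, other backends may have to
// fall back to copying the inputs into a new snapshot.
func (m *mergedRef) Mount() ([]mount.Mount, error) {
    return nil, errors.New("deferred until the merged filesystem is actually needed")
}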

When a reference from MergeOp is sent to the exporter, the differ can generally work fine with the individual sub-snapshots. E.g. if two images are layered on top of each other, there is no need to run the differ again. Generally, if a sub-snapshot already has calculated blobs, they don't need to be re-evaluated after the merge.

When the merged reference does need to be mounted (e.g. to run a container on top of it), things get a bit more complicated. Whether this mount can be done efficiently depends on the underlying snapshot implementation. On overlay-based snapshotters you can just take the lowerdirs from each sub-snapshot and join them together into a single snapshot without moving any data. For other implementations a copy can't be avoided anymore, and data needs to be duplicated. Here it is especially important that the copy does not happen before the mount is actually needed. Implementing this requires some significant changes to the cache ref manager. What the actual snapshot implementation does should be invisible to the LLB layer and the solver/exporters.

This should also work well with the lazy refs proposal #870. It probably makes sense to implement them together. Also, supporting stargz #1402 is somewhat related.

message MergeOp {
    repeated Partition partitions = 1;
}

message Partition {
    int64 input = 1 [(gogoproto.customtype) = "InputIndex", (gogoproto.nullable) = false];
    string addPrefix = 2;
    string trimPrefix = 3;
}

AddPrefix and TrimPrefix can be used to access or create subdirectories over the input references to support cases like COPY /in /out.
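For illustration only, assuming Go structs generated from the proto sketch above (these pb types do not exist yet), a Dockerfile COPY /in /out on top of a base image could be encoded as:

// input 0 = base image rootfs, input 1 = local build context (hypothetical generated types)
merge := &pb.MergeOp{
    Partitions: []*pb.Partition{
        {Input: 0},                                       // base rootfs used as-is
        {Input: 1, TrimPrefix: "/in", AddPrefix: "/out"}, // context's /in placed at /out
    },
}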

@hinshun @sipsma

@ktock
Collaborator

ktock commented Apr 13, 2020

This should also work well with the lazy refs proposal #870. It probably makes sense to implement them together. Also, supporting stargz #1402 is somewhat related.

Thanks for the pointer to our PR! As discussed in #1402, this functionality (and especially #870's lazy refs) will work well with containerd's recent stargz-based lazy image distribution. By leveraging it we can get layer snapshots and exec on them before the entire contents are locally available, which will hopefully speed up building "dev" (or non-exporting) stages.

In #1402, we discussed with @tonistiigi the following concern about lazy image distribution:

Yes, it is fine to run a container without all the snapshots/blobs locally if the implementation supports it. But at any point the record used by that container or any other part of the build should remain exportable. This also complicates caching a lot. It is possible that the cache needs to store the registry ref where the rest of the data can be pulled and loading from a cache needs to check if the new build still has access to the remote resource.

I think that if we pull layers in the background, caches don't need to hold remote refs nor check the accessibility. When we need the contents (e.g. when creating snapshots of non-stargz layers or exporting layers) and the pull hasn't completed, we just need to wait for the completion. Containerd's content store (or IngestManager) has functionality to poll the progress of content writing (IngestManager.Status), so synchronizing on pull completion isn't so hard to implement.
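As a minimal sketch of that waiting step (the ingest ref lookup and the exact completion signal are assumptions; error handling is simplified):

package main // illustrative only

import (
    "context"
    "time"

    "github.com/containerd/containerd/content"
    "github.com/containerd/containerd/errdefs"
)

// waitForIngest polls the content store until the write identified by
// ingestRef has finished. The ingest record disappears once the content is
// committed, so a not-found status is treated as completion here.
func waitForIngest(ctx context.Context, cs content.IngestManager, ingestRef string) error {
    for {
        st, err := cs.Status(ctx, ingestRef)
        if errdefs.IsNotFound(err) {
            return nil // committed (or never started); caller must verify the blob exists
        }
        if err != nil {
            return err
        }
        if st.Total > 0 && st.Offset >= st.Total {
            return nil // all bytes received
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(100 * time.Millisecond):
        }
    }
}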

I think this "background pulling" can also be implemented together. I don't completely understand the whole cache design but I think I can help the implementation (or the parts related to #870/#1402) if you give me design/implementation direction.

@tonistiigi
Member Author

I think that if we pull layers in the background, caches don't need to hold remote refs nor check the accessibility.

Example case: the first build that uses a stargz image runs without an exporter. Stargz pulls partially and the build completes. Then buildkitd is shut down (just to make sure that we can't continue pulls in the background; I think we generally don't want to pull data we don't need in the background without the user knowing anyway). Now buildkitd is started and a second build runs with the exporter. It gets the cache, but it cannot export the image as it would need to pull more data. The same would happen if the second build is without an exporter but has an additional command that needs more stargz data.

@ktock
Collaborator

ktock commented Apr 14, 2020

@tonistiigi I agree that we need to hold image refs for remotely mounted caches. In this case, I think the cache's health checking is a concern. For this we need to define the separation of responsibilities between buildkit and the snapshotter. To achieve that, we first need to make the remotely mounted caches (stargz-based caches) accessible by the tuple (digest, ref).

The thing buildkit needs to do is make sure the ref is alive when it searches the SourceOp cache. Here we can leverage the current llbsolver, which resolves the ref (i.e. fetches the manifest) at that time; that is enough for testing the ref. If all tuples (digest, ref) derived from the manifest hit stargz-based caches, buildkit can still use them and pull the blobs later (e.g. for exporting). In this case buildkit needs to trust that the corresponding snapshot is also alive, and that guarantee is the snapshotter's responsibility.

The tuple has been proven accessible by the above step in buildkit. The snapshotter must then check that the snapshot created using the same tuple (digest, ref) is also available until it's used, or maybe during the above step. For example, it might need to reload the creds etc.

If it finally turns out that the snapshot is unavailable for some unpredictable reason, buildkit should ignore the cache and pull the whole image contents just as with standard images, which is the worst case.

@tonistiigi
Member Author

@sipsma I was thinking that maybe one way to start with this would be to make an extended snapshotter interface, derived from the current one, with a method Merge(partition1, partition2) (id, error). The extended one would return combined mounts for overlay and do an actual copy for the others. This could be done as a separate step and unit tested.
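As a rough sketch of that shape (the interface name and exact signature are placeholders, not a settled design):

package snapshot // sketch of the extended interface mentioned above

import "context"

// MergeSnapshotter is a hypothetical extension of BuildKit's existing
// snapshot.Snapshotter interface.
type MergeSnapshotter interface {
    Snapshotter

    // Merge records that the given snapshots should be layered on top of
    // each other (later keys over earlier ones) and returns the id of the
    // merged snapshot. Overlay-backed implementations can make this a pure
    // metadata operation and return combined mounts later; other backends
    // fall back to copying when the merged snapshot is first needed.
    Merge(ctx context.Context, keys ...string) (id string, err error)
}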

New logic would still be needed in the cache package as well, e.g. for managing the parent chain after a merge of refs, but something that merges the extracted snapshots is needed anyway, and this would allow keeping the overlay specifics outside of the cache pkg.

Another thing I thought about was whether there is a way to keep the top layer mutable after a merge. E.g. maybe snapshotter.Merge always returns an "active" snapshot that needs to be committed. Not an important issue atm but maybe you have ideas.

@hinshun
Collaborator

hinshun commented Aug 7, 2020

What would it take to allow access to the ref of the upperdir after an ExecOp? Not sure if out of scope, but this aspect is something I'd like to help out with.

Mounts contain an OutputIndex of the ExecOp, but I'm not sure it's as simple for MergeOp, because if you chain llb.Merge(...).Run(...).Root() you lose the ability to reference the changes made on top of llb.Merge(...).

Perhaps you have to mount MergeOp at / like so:

type Output interface {
    // ...
    Partitions(context.Context) []Output
}

partitions := llb.Scratch().Run(
    llb.Shlex("echo foo > /foo"),
).AddMount(
    "/",
    llb.Merge(
        llb.Image("busybox"),
        llb.Scratch(), // <- Does this map to upperdir? or should it be omitted from the arguments to `llb.Merge`?
    ),
).Partitions()

upperDir := partitions[len(partitions) - 1]

-- Edit --

Thinking about it some more, only the upperdir of the merge is mutable, so perhaps this is unambiguous already.

st := llb.NewState(
    llb.Scratch().Run(
        llb.Shlex("echo foo > /foo"),
    ).AddMount("/", llb.Merge(llb.Image("busybox"))),
)

@tonistiigi
Member Author

@hinshun Well, access to the upperdir is only possible through the use of a differ. Moby used to have a special overlay differ reusing some parts of native layering, but even that was removed, and containerd doesn't have one either. Maybe for your use case some feature in the exec API makes more sense.

llb.Merge(llb.Image("busybox"), llb.Scratch()) is effectively the same as llb.Image("busybox"), as scratch just means a nil reference.

@sipsma
Collaborator

sipsma commented Aug 8, 2020

@sipsma I was thinking that maybe one way to start with this would be to make an extended snapshotter interface, derived from the current one, with a method Merge(partition1, partition2) (id, error). The extended one would return combined mounts for overlay and do an actual copy for the others. This could be done as a separate step and unit tested.

Yeah I think that's a good approach for keeping this change manageable in terms of size. Once we've figured out the design a bit more and the implementation starts I'll keep this in mind.

What would it take to allow access to the ref of the upperdir after an ExecOp? Not sure if out of scope, but this aspect is something I'd like to help out with.

This question (basically, what should the output of an Exec on top of a merge op be?) is one of the main things I've been thinking about w/ MergeOp. Here's my thoughts (please correct me if I'm misunderstanding anything about the snapshotter model):

  • The snapshotter model only allows you to create trees where each layer has exactly 0 or 1 direct parents. Merge-op, on the other hand, essentially allows layers to have multiple parents, enabling you to define more general DAGs. That is, if you do an llb.Exec on top of a merge-op, you could think of the layer created during the Exec as having one parent for each of the mounts that got merged together in the merge-op.
  • With the plain snapshotter model, when you mount a given layer, you inherently mount the chain of parents under that layer too. So, it would maybe make sense for merge-op to have similar behavior; when you mount a layer created by an Exec on top of a merge-op, you also mount all its merged parents under it. I think this is the desired behavior for some use cases, such as those where the result of the Exec has runtime-dependencies on the contents of its merged parents.
  • However, what's being discussed here is the possibility of treating the result of an Exec as its own entity, detached from its merged parents and consisting only of the filesystem changes made during the Exec (the upperdir in overlay-terms). I think this is possible on a technical level.
    • For example, in bincastle (where I sort of hacked a merge-op behavior together), the result of an Exec on top of a merge-op is a snapshot consisting of just 1 layer (the diff created during the exec). This allows you to mount it just by itself or combine it together arbitrarily with other snapshots via merge-ops.
    • This is pretty important for enabling features like bootstrapping (where something that once was a runtime dependency is no longer needed) and removing layers at runtime that were only needed at build-time. I'm not totally sure yet if buildkit should do the same thing under the hood, but it's my starting idea.

Overall I think in an ideal world we'd support both of those behaviors for Execs created on top of a merge-op:

  1. The llb.State includes all of its merged parents. When it's mounted, its merged parents will be mounted under it. When it's exported, it will be exported as a layer on top of its merged parents.
  2. The llb.State exists on its own and consists of just the filesystem changes made during the Exec that created it. When it's mounted or exported, only that single layer is mounted or exported.

The cool thing about 2 above is that if you support it, I think you essentially get 1 for free too anyways. To explain, here's what I was initially imagining in terms of the LLB client. Creating an Exec on top of a merge-op would, by default, just consist of the filesystem changes made during the Exec:

// layer is llb.State that consists only of whatever is built by running the command "gcc foo.c" on top of the merge-op
// layer doesn't actually contain the busybox image or the other state
layer := llb.Merge(
    llb.Image("busybox"),
    someOtherState,
).Run(llb.Shlex("gcc foo.c")).Root()

// Calling .Run() on the return value of llb.Merge() is thus different than calling .Run() on other types of state.
// If there's concern about that, maybe a different name can be given to this version of Run instead.

However, with that you can actually achieve the results of 1 above by doing something like this:

layer := llb.Merge(
    llb.Image("busybox"),
    someOtherState,
).Run(llb.Shlex("gcc foo.c")).Root()

// graph consists of layer plus the mounts that were originally present during the exec that created layer
graph := llb.Merge(
    layer, 
    llb.Image("busybox"), 
    someOtherState,
)

// we could also add sugar for this like
graph := llb.Merge(
    llb.Image("busybox"),
    someOtherState,
).Run(llb.Shlex("gcc foo.c")).Graph()

There are some more things to think about, including some tricky stuff surrounding the ordering of merged mounts, whiteouts and opaque directories, but I'm curious to get everyone's thoughts on this basic idea before diving deeper.

Another thing I thought about was whether there is a way to keep the top layer mutable after a merge. E.g. maybe snapshotter.Merge always returns an "active" snapshot that needs to be committed. Not an important issue atm but maybe you have ideas.

IIUC, I think I can see the benefit of this for the case of a naive snapshotter where you just have to slow-copy instead of using actual overlayfs. Creating an exec on top of a merge-op would look like:

  1. Merge-op: Create a new empty snapshot and copy each of the merged snapshots into it, leave it active/mutable
  2. Exec-op: Run the exec on top of that active merged snapshot
  3. ??? Somehow calculate the diff created by the exec-op and update the mutable snapshot to consist only of that diff.
    • I'm not familiar with the differ in depth, so I don't know off the top of my head what's possible here, but intuitively it feels possible to do this (though maybe with some extra work between steps 1 and 2 to calculate the initial merged diff?)

For the true overlayfs implementation though, I'm not sure this idea would apply, at least the way I was imagining it would work:

  1. Merge-op: Don't create any new snapshots, just store in cache metadata that the snapshots of some refs should be merged
  2. Exec-op: Create a new, empty snapshot for the exec, leave it active/mutable. Make an overlay mount where you set the upperdir to be that newly created snapshot's upperdir and you set the lowerdirs to be those of the merge-op snapshots joined together. Run the exec on top of that overlay.

Let me know if I was misunderstanding though.

@tonistiigi
Member Author

That is, if you do an llb.Exec on top of a merge-op, you could think of the layer created during the Exec as having one parent for each of the mounts that got merged together in the merge-op.

No, that's not quite correct. There is one exec and the exec has one parent. That parent may have multiple/different parent chains, though. The difference is that on a snapshotter that doesn't have layers, the exec cannot write data to multiple locations. E.g. in that case, any time an exec runs, every file would need to be written to as many places as there are different combinations of the mergeop sources. That is not possible. Alternatively, a differ would need to run after every exec, making a tarball that is extracted to every mergeop source location, which is highly inefficient. Merging existing chains allows us to avoid the slow path until it is absolutely needed, and that slow path is always equal to the copy operation as it is today.

I'm not ready to go overlay-only, and other storage methods should remain possible.

So, it would maybe make sense for merge-op to have similar behavior

Yes, a mount of a mergeop needs to make all source data available. Merge(a, b) is the same data as copy(b, a, "/", "/").

For example, in bincastle (where I sort of hacked a merge-op behavior together), the result of an Exec on top of a merge-op is a snapshot consisting of just 1 layer

Again, the problem with that approach is that on a non-layer snapshotter you can not decide where this 1 layer will go. You have to fix on a parent before you can call mount.

However, with that you can actually achieve the results of 1 above by doing something like this:

I didn't get these examples. What is Graph() supposed to do? And why are you merging busybox on top of gcc results?

Exec-op: Run the exec on top of that active merged snapshot

You can't run an exec on top of a mutable snapshot; the exec itself needs somewhere to write. The case I had in mind was actually releasing and reusing mutable space.

Eg.

from alpine
copy / /

Even if copy did a merge here and just linked the context directory directly, the context directory would now become immutable. That means that next time the full context needs to be sent again, because there is no mutable destination directory. This might be even more inefficient than today, where the extra copy at least appears locally on one node.

Merge-op: Don't create any new snapshots, just store in cache metadata that the snapshots of some refs should be merged

In all cases creating a mergeop should be constant-time and just store metadata (pointers to sources). Performance for accessing the data may depend on the backend.


Note that if you need to run a differ to capture the changes a process has made to the filesystem, you can just run a differ process directly in BuildKit itself. The same should be possible for the Exec API @coryb is building. Maybe (no promises) I'd be open to adding diffing as a native method so the overlay case could be faster. Fast diffing in overlay was, however, determined to be so unstable that it was removed from Moby, and containerd maintainers determined it was impossible to get right and refused to write it. This case is a bit simpler, as no tarballs would be needed, but I'm not sure what issues may come up if we allowed just picking a random overlay lower and putting it in another random location. The current overlay snapshotter would be completely unusable then as well.

@sipsma
Collaborator

sipsma commented Aug 17, 2020

@tonistiigi Thanks, I see what you're saying about some of the difficulties in getting this to work equivalently with non-layer snapshotters. I have some updated ideas now.

To start, just to ensure we're on the same basic page, this is the very rough outline (no error handling, cleanup, etc) of what I'm imagining the Merge method of the extended snapshotter interface you mentioned would look like. There would be different implementations depending on the backend snapshotter being used (though that will be abstracted away from CacheManager obviously):

For overlayfs backend snapshotters:

package snapshot // this is the existing buildkit/snapshot package

type overlayMergedSnapshotter struct {
    containerdSnapshotter // assumed to be overlayfs implementation
    metadataStore // TODO doesn't exist yet, just some interface for storing metadata about merged snapshots
}

// keys are the snapshot ids that you want to merge together
func (s *overlayMergedSnapshotter) Merge(ctx context.Context, keys ...string) (id string, err error) {
    id = newRandomId()
    // store in metadata that id consists of keys merged together and is already committed
    kind := "committed"
    s.metadataStore.PutMergedKeys(id, kind, keys...)
    return id, nil
}

func (s *overlayMergedSnapshotter) Prepare(ctx context.Context, key, parent string, opts ...snapshotter.Opt) ([]mount.Mount, error) {
    // GetMergedKeys returns the merged keys that make up parent, or returns an error if it's not a
    // merged snapshot
    parentMergeKeys, err := s.metadataStore.GetMergedKeys(parent)
    if err != nil {
                return s.containerdSnapshotter.Prepare(ctx, key, parent, opts...)
    }

    // assemble the lowerdirs, which should be ordered highest layer -> lowest
    var lowerdirs []string
    for _, parentKey := range parentMergeKeys {
        viewId := randomId()
        mounts, _ := s.View(ctx, viewId, parentKey)
        // getLowerDirs returns the lowerdir options from overlay mounts or just the source from bind mounts.
        lowerdirs = append(lowerdirs, getLowerDirs(mounts)...)
    }

    upperId := randomId()
    upperMounts, _ := s.containerdSnapshotter.Prepare(ctx, upperId, "") // parent id is "", new empty snapshot
    // assume that upperMounts only has one mount and is a rw bind mount since it has no parents (is that too
    // strong an assumption?)
    upperDir := upperMounts[0].Source
    workDir := "" // TODO: some directory on the same mount as the upperDir, may need a way to get that consistently?

    // store in metadata that key consists of upperId merged on top its parents and is active
    kind := "active"
    mergedKeys := append([]string{upperId}, parentMergeKeys...)
    s.metadataStore.PutMergedKeys(key, kind, mergedKeys...)

    return []mount.Mount{{
        Type: "overlay",
        Options: []string{"lowerdir="+strings.Join(lowerdirs, ":"), "upperdir="+upperDir, "workdir="+workDir},
    }}, nil
}

func (s *overlayMergedSnapshotter) View(ctx context.Context, key, parent string, opts ...snapshotter.Opt) ([]mount.Mount, error) {
    parentMergeKeys, err := s.metadataStore.GetMergedKeys(parent)
    if err != nil {
                return s.containerdSnapshotter.View(ctx, key, parent, opts...)
    }

    var lowerdirs []string
    for _, parentKey := range parentMergeKeys {
        viewId := randomId()
        mounts, _ := s.View(ctx, viewId, parentKey)
        lowerdirs = append(lowerdirs, getLowerDirs(mounts)...)
    }

    return []mount.Mount{{
        Type: "overlay",
        Options: []string{"lowerdir="+strings.Join(lowerdirs, ":")},
    }}, nil
}

// TODO rest of the methods in snapshotter interface

For snapshotter backends that don't support true layering (or otherwise don't work with the above implementation):

package snapshot

type nativeMergedSnapshotter struct {
    containerdSnapshotter
    metadataStore // tracks which keys are merged snapshots, as in the overlay version
}

// There are multiple options for this implementation. We could do the laziness (that is, not actually
// performing a merge until needed) on this level or we could handle laziness on higher levels like
// CacheManager. I think I prefer doing the laziness in CacheManager at the moment, so for now Merge
// will directly create the merged snapshot here. Can be updated though.
func (s *nativeMergedSnapshotter) Merge(ctx context.Context, keys ...string) (id string, err error) {
    mergedId := newRandomId()
    baseKey := keys[len(keys)-1]
    mergedMounts, _ := s.containerdSnapshotter.Prepare(ctx, mergedId, baseKey)
    for i := len(keys) - 2; i >= 0; i-- { // TODO obviously handle len(keys) < 2 better
        viewId := getRandomId()
        mounts, _ := s.View(ctx, viewId, keys[i])
        // apply is similar to Applier.Apply except it takes []mount.Mount as the source instead of an oci
        // descriptor. Need to figure out the best way to implement this, will mounting mounts and using
        // continuity/fs.CopyDir suffice?
        apply(mergedMounts, mounts)
    }
    id = newRandomId()
    s.containerdSnapshotter.Commit(ctx, id, mergedId)
    return id, nil
}

func (s *nativeMergedSnapshotter) Prepare(ctx context.Context, key, parent string, opts ...snapshotter.Opt) ([]mount.Mount, error) {
    // isMergeKey returns whether parent is a merged snapshot
    if !s.metadataStore.isMergeKey(parent) {
        return s.containerdSnapshotter.Prepare(ctx, key, parent, opts...)
    }
    // TODO: for this backend the merged snapshot was already materialized in
    // Merge above, so preparing on top of it should mostly defer to the wrapped
    // snapshotter once the merged id is resolved from metadata.
    return nil, errors.New("TODO: prepare on top of a merged snapshot")
}

Let me know if this looks close to what you're imagining, if so I'll update my client-side LLB example accordingly as it will be different than what I described previously.


Given just the above, there wouldn't be support for what I and @hinshun were talking about in terms of being able to "detach" upperdirs and use them as their own entity. However, I have a new idea on how to approach those use cases that could be implemented as an extension at a later time, namely adding something like an Unmerge op that lets you take a merged state and get a new state where a subset of the previously merged states are removed. That would support bincastle's use case of removing build-time-only dependencies from run-time graphs, and I suspect hinshun's use case could be satisfied too. Implementing it for layering snapshotters like overlayfs would probably be straightforward; non-layer snapshotters like native would obviously be more difficult and/or extremely slow, but maybe something could be figured out.

This is still a very vague idea, I need to think about it more and want to get confirmation that how I'm approaching Merge makes sense before expanding on what Unmerge would possibly look like.


One other thing that I want to note here which requires more investigation as it could complicate Merge a fair bit; some behavior surrounding when the kernel creates opaque directories is going to be annoying. Namely, from my testing if you have an overlayfs mount and make a change that results in a new directory being created in the upperdir that does not exist in any lowerdirs, that directory will be marked as opaque by the kernel.

This behavior is fine in other circumstances because it's expected that an upperdir will forever and always be run on top of the same set of lowerdirs; even if the upperdir is later being used as a lowerdir in a new mount, it will still be on top of the same lowerdirs it was above previously.

However, with merge-op, we are now allowing stuff like llb.Merge(llb.Image("foo"), llb.Image("bar")), so say foo and bar are independently created images and there's a directory /dir that is marked as opaque in both images. Now, when the merge happens, /dir is opaque, meaning the contents of /dir will not be merged and the overlay mount will only show the contents of /dir from one of the images, not both.

For similar reasons, I suspect merge-op will not work when the "inodes index" feature of overlayfs is turned on. Need to confirm this though, as I haven't actually tried it yet.

There might be ways to work around this, but I want to investigate it a little more and try out fixes before expanding too much; let me know if you have any thoughts in the meantime.

@tonistiigi
Member Author

Note that in BuildKit's snapshotter interface (which is a wrapper over containerd's), Prepare does not return mounts: https://github.com/moby/buildkit/blob/master/snapshot/snapshotter.go#L24 . So it can be a completely metadata-only operation. Also, where mounts are used, the interface returns a Mountable so the lifecycle of the mounts can be managed (e.g. when binds or double mounts are needed).
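Abridged from memory (see the linked file for the exact definitions), the relevant parts look roughly like this:

// Prepare is metadata-only and returns no mounts; mounts are obtained
// separately through a Mountable, whose release callback lets the caller
// manage the mount's lifecycle.
type Snapshotter interface {
    Mounts(ctx context.Context, key string) (Mountable, error)
    Prepare(ctx context.Context, key, parent string, opts ...snapshots.Opt) error
    View(ctx context.Context, key, parent string, opts ...snapshots.Opt) (Mountable, error)
    // ... Stat, Commit, Remove, etc.
}

type Mountable interface {
    Mount() ([]mount.Mount, func() error, error)
}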

For naive, I'm not sure if diff/apply logic is needed or if files can just be copied over with squashing logic. The blobs should be kept separate though, and not squashed/recreated.

some behavior surrounding when the kernel creates opaque directories is going to be annoying

Yeah, that sounds really bad. Maybe we need to scan to find if these exist to determine if a quick merge is possible? Any other ideas?

will not work when the "inodes index" feature of overlayfs is turned on

We can control what features are turned on while mounting.
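For example, assuming we build the overlay mount options ourselves in the snapshotter, the inodes index feature can be disabled per mount with the index=off option (sketch only):

package snapshot // sketch only

import (
    "strings"

    "github.com/containerd/containerd/mount"
)

// overlayWithIndexOff builds an overlay mount that explicitly disables the
// "inodes index" feature suspected above to break merged mounts.
func overlayWithIndexOff(lowerdirs []string, upperdir, workdir string) mount.Mount {
    return mount.Mount{
        Type:   "overlay",
        Source: "overlay",
        Options: []string{
            "index=off",
            "lowerdir=" + strings.Join(lowerdirs, ":"),
            "upperdir=" + upperdir,
            "workdir=" + workdir,
        },
    }
}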

@sipsma
Collaborator

sipsma commented Aug 18, 2020

Yeah, that sounds really bad. Maybe we need to scan to find if these exist to determine if a quick merge is possible? Any other ideas?

So is the suggestion to disable the quick merge when there's an opaque directory and fall back to the slow merge? It definitely seems possible on a technical level, but at first thought it would be a pretty bad user experience IMO. At least with my use cases for merge-op in bincastle, previous attempts at doing slow merges were so slow and costly in terms of disk space that they were essentially unusable. Plus, the situations in which opaque directories arise aren't that uncommon but are also hard to explain to someone unfamiliar with them, so trying to explain to users why their merge-ops are randomly extremely slow and taking up tons of disk space would be very difficult.

On that note, I looked into opaque directories a bit more and found another situation where they get created:

  1. Start out with a lowerdir containing a directory, call it /dir
  2. Create an overlayfs with that lowerdir and an upperdir, then delete /dir on the merged mount.
    • The upperdir will now show a whiteout device at /dir.
  3. Now go make a new directory at /dir on the merged mount
    • The upperdir will now show an actual directory at /dir instead of the whiteout device, but /dir will have the opaque xattr (to prevent the previous /dir in the lowers from showing up)

This is another situation where using an upperdir with an opaque dir will lead to surprising behavior with merge-op.

There's also the related issue surrounding whiteout devices in upperdirs, which can result in a path that's present in multiple merged states either appearing or not appearing depending on the order in which they are merged, which would also be confusing in some situations.


I don't have any good ideas on how to deal with this all yet, but I'll explain my thoughts in case they trigger better ideas in anyone else.

Basically, my thinking is that the expected behavior for users should be that when they do llb.Merge(llb.Image("foo"), llb.Image("bar")), the result should just be the merge of what they see when they look at the mounted foo or mounted bar image. So, it's okay for there to be opacity within foo and opacity within bar, but the opacity should be limited to each image; the opaques+whiteouts within foo should not impact the view of bar and vice versa. In short, opacity should be limited in scope.

Assume for a second that there's a way of creating such an overlay mount where opacity is limited in scope as described above. The next problem that will arise is when you do an exec on top of such a merged overlay mount: that exec will itself be capable of creating opaque dirs and whiteouts. This is essentially creating a new "opacity scope", where the opaques and whiteouts created by the exec should only apply to the merged mounts present during the exec (which are themselves merged mounts with their own "opacity scopes"). So, using this terminology, you are now basically nesting opacity scopes within one another.

This is obviously really complicated, but if you were able to implement it, I think you could define a merge-op that from a client perspective actually behaves more intuitively (despite all the complication underneath). When you delete/rename a path during an exec, that only overrides the paths present during the exec. If you take the result of that exec and merge it with another separate state, the opacity doesn't apply to the separate state, things will just get merged between them as expected.

So, the question then is how you could implement this. I've had some ideas but they either don't quite work or have huge downsides:

  1. You are allowed to set one overlay mount as the lowerdir of another, but with the crucial (and undocumented?) limitation of only up to a depth of 2 (that is you can create overlay1 and use it as a lowerdir of overlay2, but you can't use overlay2 as the lowerdir of a new overlay3).
    • Each time you create an overlay mount you are essentially defining an opacity scope, but the fact that you are limited to a stacking depth of 2 means you can only do one level of nesting those scopes, so I don't think it will work
  2. You could scan merge mounts to analyze the trees and find situations where opacity will be incorrectly applied and try to fix it up
    • This is really tricky though. You can't modify lowerdirs (even xattrs) since they should be immutable. The only generic solution I can think of would be to find conflicts and then create a new lowerdir that results in the correct output being shown.
    • So, for example, if you see that /dir is set to opaque across different opacity scopes, you could create a new lowerdir that goes above the conflicting scopes and contains the correct merged result of /dir.
    • The problem with the above is that you're essentially re-implementing the merge logic of overlayfs and in the worst cases re-creating the entire tree of merged lowerdirs. It could also either waste a ton of disk space due to copying or use up a ton of inodes due to hardlinking (bind-mounts won't work due to overlayfs implementation).
  3. You could of course write a fuse fs that implements this idea (or write a plugin for fuse-overlayfs), but forcing users to use that as their snapshotter backend would obviously be something preferable to avoid

I'll think about this some more, but let me know what thoughts you have too (I wouldn't be surprised if I'm way over-thinking it at this point!)

@sipsma
Collaborator

sipsma commented Aug 23, 2020

I thought about this some more, and I do have an idea that might work with reasonable performance; the only major downside I can think of is the complexity of the implementation.

Sort of similar to 2 in my previous comment, you can scan the mounts being merged together to find places on the final merged fs tree where opaques+whiteouts would result in unexpected behavior. The new idea is to fix those up not by doing any sort of copying or linking but by creating mounts on top of the base overlay.

To explain, I noticed some helpful behavior where the opacity of a dir doesn't count if it's set on the "root" of the lowerdir. So if you have a dir lower/foo, where foo is opaque, then the opacity counts if you set lowerdir=lower, but it doesn't if you set lowerdir=lower/foo, because then technically foo isn't actually contained within the lowerdir. See this gist for an example.

So, say you have a merged mount where multiple lowerdirs in different opacity scopes contain a directory /foo that is opaque in each of them. You can determine which /foo dirs should actually be merged together and which should be ignored, then create a separate overlay mount for just /foo using the lowers that should be merged. Then, the final mounts you return are an overlay for / and that separate overlay that gets mounted on top of the first overlay at /foo.
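A rough sketch of the resulting mount sequence (read-only case, placeholder paths, using plain mount syscalls for illustration):

package main // illustrative sketch of the stacked-mount idea

import (
    "strings"

    "golang.org/x/sys/unix"
)

// mountWithOpaqueFixup mounts the base merged overlay at target, then mounts
// a second, smaller overlay on top of it at target/foo built only from the
// lowerdirs whose /foo contents should actually be visible together.
func mountWithOpaqueFixup(target string, baseLowers, fooLowers []string) error {
    baseOpts := "lowerdir=" + strings.Join(baseLowers, ":")
    if err := unix.Mount("overlay", target, "overlay", 0, baseOpts); err != nil {
        return err
    }
    // Not overlay-on-overlay in the lowerdir/upperdir sense: this is just a
    // plain mount stacked on top of the first one at a subpath.
    fooOpts := "lowerdir=" + strings.Join(fooLowers, ":")
    return unix.Mount("overlay", target+"/foo", "overlay", 0, fooOpts)
}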

Note that this is not the same thing as what's usually referred to as "overlay-on-overlay", which typically refers to using an overlay mount as either the lowerdir or upperdir of another overlay. All you do here is create one overlay and then mount another one on top of it, which the kernel has no restrictions on.

For whiteouts of individual files, you can follow the same approach as for opaque dirs but with the parent dir of the file that should show up in the final merged mount. Then, instead of mounting the overlay directly on top of the base overlay, you create the new overlay separately and bind mount the individual file from the new overlay on top of the base overlay, which will allow you to get copy-on-write with a single file in isolation.

The only remaining complication comes from the fact that each of the overlays mounted on top of each other will require its own upperdir, whereas we want the end effect to be the whole mount having a single upperdir. Using overlapping upperdir paths leads to undefined behavior, so I think the only option will be to actually use separate upperdir paths for each overlay and then, as part of Commit, combine those upperdirs together into the expected final state. Given that these upperdirs should all be on the same filesystem, you should be able to combine them efficiently via rename(2).
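A simplistic sketch of that combine step (real logic would need to handle name conflicts, whiteouts and xattrs; the function and paths are placeholders):

package main

import (
    "os"
    "path/filepath"
)

// combineUpperdirs moves the top-level entries of src into dst. Because both
// upperdirs live on the same filesystem, rename(2) moves them without copying.
func combineUpperdirs(dst, src string) error {
    entries, err := os.ReadDir(src)
    if err != nil {
        return err
    }
    for _, e := range entries {
        // Fails if dst already has a non-empty directory with the same name;
        // a real implementation has to merge such conflicts recursively.
        if err := os.Rename(filepath.Join(src, e.Name()), filepath.Join(dst, e.Name())); err != nil {
            return err
        }
    }
    return nil
}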

Let me know if this makes sense or if there's any other suggestions. I need to spend a little time prototyping this to make sure it's sound before moving forward, which will probably help communicate the idea a bit better too.

@tonistiigi
Member Author

#2335
