WIP generic object spec #37

justincormack · 2021-03-26T18:52:38Z

Previously at https://gist.github.com/justincormack/523dc229f0dd7b882edf19c60aed1581

Here for more detailed comments.

Signed-off-by: Justin Cormack <justin@specialbusservice.com>


dlorenc · 2021-03-26T19:08:03Z

universal.md

+
+- **`mediaType`** *string*
+
+  This field contains the `mediaType` of this document. This MUST be `application/vnd.oci.object.manifest.v1+json` for the JSON encoding or `application/vnd.oci.object.manifest.v1+jwt` for JSON web signature (RFC7515) encoding. The server needs this to return the media type to the application requesting this object.


I'm not following the jwt media type here. What would this look like and be used for?

This was the source of a long discussion on how to define a unique type, without having to make OCI be a clearing-house.
As a result, we went with the iana.org registrations, with the process outlined here: Defining a Unique Artifact Type

Sorry - I understand what the media type is for. I don't understand what it would mean to set it as a JWT.

The media type may be incorrect; maybe I misread it. We have a choice between the compact serialisation for JWS (base64, dot-separated) and the JSON serialisation. Thinking about it, the JSON one probably makes more sense, i.e.:

In the JWS JSON Serialization, a JWS is represented as a JSON object
containing some or all of these four members:

o "protected", with the value BASE64URL(UTF8(JWS Protected Header))
o "header", with the value JWS Unprotected Header
o "payload", with the value BASE64URL(JWS Payload)
o "signature", with the value BASE64URL(JWS Signature)

You still have to base64 decode the payload to get the JSON for the actual artifact.
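
As a rough illustration of that decoding step, here is a minimal sketch that wraps a manifest in the JWS JSON Serialization members listed above and then recovers it. The helper names (`wrap_unsigned`, `extract_manifest`), the `objects` field, and the placeholder signature are assumptions for illustration only; nothing is actually signed or verified here.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """BASE64URL without '=' padding, as used by JWS (RFC 7515)."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url_encode; restores the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def wrap_unsigned(manifest: dict, alg: str = "ES256") -> dict:
    """Build a JWS JSON envelope around a manifest (hypothetical helper).

    The signature is a placeholder: no key is involved and nothing is signed.
    """
    return {
        "protected": b64url_encode(json.dumps({"alg": alg}).encode("utf-8")),
        "payload": b64url_encode(json.dumps(manifest).encode("utf-8")),
        "signature": b64url_encode(b"placeholder-not-a-real-signature"),
    }


def extract_manifest(envelope: dict) -> dict:
    """Recover the manifest JSON from the envelope's payload member.

    This does NOT verify the signature; a real client would do that first.
    """
    return json.loads(b64url_decode(envelope["payload"]))


# Illustrative manifest; the field values are made up.
example_manifest = {"schemaVersion": 1, "objects": []}
example_envelope = wrap_unsigned(example_manifest)
```

The point is only that a client (or server) holding the JSON form has to decode `payload` before it can see any descriptors inside the artifact.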

If we use the JSON version, maybe the media type should be application/vnd.oci.object.manifest.v1+jws+json or something to distinguish.

We could add other inline signature formats, but there are not many existing standards, which is why I am suggesting JWS.

I think I'm still missing something here. Is there another doc or something that gets into the signatures more?

This sort of seems like a cross between the manifest 2 schema 1 stuff and Notary V1, but I'm mostly guessing.

dlorenc · 2021-03-26T21:30:52Z

universal.md

+
+Now let us look at how we map various existing and proposed types to this manifest format.
+
+The simplest type is the Pointer that simply points to another object. This looks useless, but if it is signed it can act as a detached signature for the item that it points at.


In this example, where would the signature be stored?

In the JWS signature, not shown here (see previous comment)

jonjohnsonjr

This would be easier to review if each sentence were on a separate line, as per the markdown style guidelines.

Overall, I think this introduces an additional layer of complexity unnecessarily. I believe it would be possible to introduce the same extension points or functionality to the existing data structures. I'd like to see some discussion of why we can't modify content descriptor instead.

I also don't see any discussion of how to roll this out. It's great that this might prevent us from having to make breaking changes in the future, but this proposal itself is a pretty large change. Given that there are clients around still pushing and pulling schema 1 images, I'm somewhat reluctant to sign up for another migration.

I am fine with making huge changes like this, but only if we go into it with our eyes open. Realistically, this will take 1-5 years before we can take advantage of it without some clever backward-compat story, and if we do have a clever backward-compat story, do we need to make these changes? :)

jonjohnsonjr · 2021-03-29T21:16:57Z

universal.md

+
+So to meet both use cases, we should have a format that has a lot of extensible metadata but it has a very simple structure for the operator to use for management operations. Let us look at the management side data model first.
+
+- **`mediaType`** *string*


Embedding this field is a regression from opencontainers/image-spec#411

There is some good discussion in that PR around whether or not this was a good idea, but I'd like to see some justification for walking back that decision.

cc @vbatts @stevvooe who might have more context here

In the next line:

The server needs this to return the media type to the application requesting this object.

Traditionally, this has been communicated to the registry via the Content-Type header. Why can't we use that?

Happy to remove this if it is not needed, it makes things simpler.

jonjohnsonjr · 2021-03-29T21:19:11Z

universal.md

+
+- **`blobs`** *array of objects*
+
+  An optional set of references to blobs. Each object MUST have a descriptor. From the server point of view the only pieces that matter for the descriptor are **`digest`**, used to find the object in the content store, and **`size`** which is used to check the expected size if the server needs to fetch the object, and potentially for hash collision attack detection. The server data model does not need to know the media type, as it simply needs to know this is a blob so it does not need to track further links. The **`urls`** property in the descriptor, and the **`annotations`** are just for client side use. Note there is no seperate model for `config` here; as far as the server data model is concerned this is simply another blob. Below we will discuss how the client distinguishes the different sorts of blob. The server is also not concerned about the ordering of blobs or references, so it can store these in a relation.


Suggested change

An optional set of references to blobs. Each object MUST have a descriptor. From the server point of view the only pieces that matter for the descriptor are **`digest`**, used to find the object in the content store, and **`size`** which is used to check the expected size if the server needs to fetch the object, and potentially for hash collision attack detection. The server data model does not need to know the media type, as it simply needs to know this is a blob so it does not need to track further links. The **`urls`** property in the descriptor, and the **`annotations`** are just for client side use. Note there is no seperate model for `config` here; as far as the server data model is concerned this is simply another blob. Below we will discuss how the client distinguishes the different sorts of blob. The server is also not concerned about the ordering of blobs or references, so it can store these in a relation.

An optional set of references to blobs. Each object MUST have a descriptor. From the server point of view the only pieces that matter for the descriptor are **`digest`**, used to find the object in the content store, and **`size`** which is used to check the expected size if the server needs to fetch the object, and potentially for hash collision attack detection. The server data model does not need to know the media type, as it simply needs to know this is a blob so it does not need to track further links. The **`urls`** property in the descriptor, and the **`annotations`** are just for client side use. Note there is no separate model for `config` here; as far as the server data model is concerned this is simply another blob. Below we will discuss how the client distinguishes the different sorts of blob. The server is also not concerned about the ordering of blobs or references, so it can store these in a relation.
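
To make the server-side view in this paragraph concrete, here is a minimal sketch of what a registry might keep from `blobs`. It assumes each entry carries its descriptor under a `descriptor` key (the key name is a guess, not something the draft states) and that only `digest` and `size` are retained, as an unordered relation.

```python
from __future__ import annotations


def blob_index(manifest: dict) -> set[tuple[str, int]]:
    """Return the unordered set of (digest, size) pairs found in `blobs`.

    mediaType, urls and annotations are deliberately ignored: per the text
    above they are for client-side use only.
    """
    pairs = set()
    for entry in manifest.get("blobs", []):
        descriptor = entry["descriptor"]  # assumed key name
        pairs.add((descriptor["digest"], descriptor["size"]))
    return pairs


# Two illustrative manifests that differ only in ordering and client metadata.
manifest_a = {
    "blobs": [
        {"descriptor": {"digest": "sha256:aaa", "size": 10, "mediaType": "x/config"}},
        {"descriptor": {"digest": "sha256:bbb", "size": 20, "mediaType": "x/layer"}},
    ]
}
manifest_b = {
    "blobs": [
        {"descriptor": {"digest": "sha256:bbb", "size": 20, "annotations": {"k": "v"}}},
        {"descriptor": {"digest": "sha256:aaa", "size": 10}},
    ]
}
```

Under these assumptions both manifests map to the same relation, which is the sense in which the server does not care about ordering or about which blob is the config.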

jonjohnsonjr · 2021-03-29T21:22:26Z

universal.md

+1. Any number of objects can be stored in a manifest. This can include different versions of the same image format, either for schema upgrades, or different versions for different architectures as we have for image index now. Other use cases are possible such as an index document that simply covers a number of blobs, or a Helm chart that has the chart information and a set of images in the standrad image format in the same manifest.
+2. Clients rank their preference for objects and search for a match, much as happens with multi arch now. Clients will generally prefer newer formats over old ones, and their specific architecture matches over worse ones. Generic clients might have user configuration. Note that multi arch can be included in a single manifest rather than indirecting to image manifests, making fetching more efficient.
+3. A few generic object types are supported. Notably these include "Pointer", "Relation" and "Property". Pointer is simply an object that points at another one which should be dereferenced. A signed Pointer object can be used for a detached signature. Relation is a pointer, a type, and a second object which is related, used for adding metadata such as an SBOM to an existing object; again a client that is not interested in the SBOM should just follow the link to the primary object. "Property" is like a pointer but is stored within a manifest and links to a property of that manifest itself, used if you want to say store an SBOM at build time in the same object, to avoid an extra redirection.
+4. We support items in a manifest that have neither blobs nor references, so that the server side does not care about these. This allows manifests to do things like reference foreign layers or images by tag that are not actually references at all, but make the client side processing more consistent.


Can you add an example somewhere of how your proposal would handle Microsoft layers differently from OCI?

Are you referring to foreign layers, used by Windows images?

jonjohnsonjr · 2021-03-29T21:31:58Z

universal.md

+
+This is all the data that is needed to manage the content store from the registry operators side, and this can be used as the data model. It is easy to see how existing image manifests and index can be mapped down to this data model. However, it is too minimal for the client to be able to process it so we need to enhance it with metadata. This is where the design space gets much more complicated. We will have simple processing rules that let the server simply extract this data model from a format that is more suitable for client processing.
+
+The existing formats assume that one document corresponds to one version of one specification. This has caused all sorts of issues, such as how to add new forms of compression to container images in a way that new clients can get enhancements while old clients continue to work. The only partial upgrade path that has worked has been for multi arch images, where an additional layer of indirection is required for a client to make a choice of which version it supports, although here the specification is very rigid on the set of allowed choices and existing clients only worked if pointed at the single architecture image. Because the registry is a content addressed store, content negotiation does not work well either, as the client often needs to check the content hash; the only place where it has been used was on upgrading to content adressablity, and it should not be used again.


This has caused all sorts of issues, such as how to add new forms of compression to container images

I'd like to present the idea that not all protocol extensibility needs to happen at the level of the manifest format. For this problem in particular, I believe @sargun has presented the most compelling solution for doing this as a transport-level implementation detail, rather than at the artifact level: opencontainers/distribution-spec#235

Of course, that doesn't mean we can't do that at the document level -- I've presented a similar idea of just adding a list of "alternatives" for a given descriptor when discussing zstd: opencontainers/image-spec#803 (comment)

The only partial upgrade path that has worked has been for multi arch images, where an additional layer of indirection is required for a client to make a choice of which version it supports, although here the specification is very rigid on the set of allowed choices

Can you link to this? I don't see the spec as rigid, it says:

The image index is a higher-level manifest which points to specific image manifests, ideal for one or more platforms.

So while the original use case was for multi-platform images, I don't see anything indicating rigidity of client implementation.

Because the registry is a content addressed store, content negotiation does not work well either, as the client often needs to check the content hash; the only place where it has been used was on upgrading to content adressablity, and it should not be used again.

I think this is a bit too strict, and maybe it's because the compression example is a particularly bad one, but it's very possible to do some negotiation of content at transport time without breaking the CAS.

agree to the spec leaving open both transport and artifact level compression/encryption...
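
A minimal sketch of the transport-level idea being discussed: the bytes on the wire are compressed, but the client still verifies the digest of the decoded content, so content addressing is preserved. This uses gzip purely as a stand-in; the function names are hypothetical and this is not the mechanism proposed in opencontainers/distribution-spec#235.

```python
import gzip
import hashlib


def digest_of(content: bytes) -> str:
    """Content address of the uncompressed bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def encode_for_transport(content: bytes) -> bytes:
    """Server side: compress only for the wire; the stored digest is unchanged."""
    return gzip.compress(content)


def receive(wire_bytes: bytes, expected_digest: str) -> bytes:
    """Client side: undo the transport encoding, then check the content hash."""
    content = gzip.decompress(wire_bytes)
    if digest_of(content) != expected_digest:
        raise ValueError("digest mismatch after transport decoding")
    return content


# Illustrative blob and its content address.
blob = b"layer contents " * 100
blob_digest = digest_of(blob)
wire = encode_for_transport(blob)
```

Here the descriptor's digest never changes with the transport encoding, which is why negotiating the encoding does not have to break the CAS.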

jonjohnsonjr · 2021-03-29T21:38:24Z

universal.md

+
+Probably the best developed system along these lines is git. Git makes some different tradeoffs than registries. In particular it is optimised for handling many small (text) files, and serving these on a relatively small scale. There are only four types of content, commits, trees, blobs and annotated tags. Files are aggregated by into larger chunks (packfiles with indexes) in order to serve them more efficiently, and this is done dynamically to reduce network traffic. Git has well developed ways of handling metadata, which will be contrasted later. Like registries, the object model has got more complex over times, with trees now being able to point at commits not just trees or blobs. A registry is generally designed to handle larger files, with fewer links between them, and much higher volumes of traffic so the server is required to do less work. Over time there is likely to be a convergence, as there are use cases for which the git model of more aggressive file deduplication is useful (for example constrained network use cases), and we are likely to see more traffic akin to branch traversals in registries.
+
+Registry formats started off just for a very specific container image use case. Originally there was one special document type, the image manifest, which pointed at a list of blobs, and a configuration blob. To find all the referenced blobs, you only had to track all the image manifests that had tags pointing at them, and then find all the blobs pointed at by those. Then the image index was added, which pointed at image manifests, for multi architecture selection. Technically this could also point at other image manifests, although this has not used much. A garbage collector had to traverse these additional links as well. The issue with these two formats is that they were very specific to container images, and although they have some ability to add generic metadata, it is difficult to adapt them to new types of stored data. The OCI Artifacts specification still requires objects to be defined as a configuration blob and a list of blobs. The registry may be expected to parse the configuration blob for display purposes. However there is no way to define objects that point to other objects, rather than blobs, so the format is not very generic. An [OCI artifact manifest](https://github.com/opencontainers/artifacts/pull/29) has been proposed that supports blobs and references, but even that is not fully generic, and more formats a re likely to be needed in future.


The issue with these two formats is that they were very specific to container images, and although they have some ability to add generic metadata, it is difficult to adapt them to new types of stored data.

I don't know that I really agree with this, given that many folks are happily doing it.

s/although this has not used much/although this has not been used much/

s/a re/are/

s/in future/in the future/

jonjohnsonjr · 2021-03-29T22:08:18Z

universal.md

+
+  A REQUIRED version string for the type. Clients will normally look for an object with the latest version they understand, and fall back to older versions if they support them. Different versions may have different structures.
+
+- **`filters`** *map string - string*


Why wouldn't clients just use annotations for this?

Semantically we could. I do like generic processing tools being able to list the available filters, e.g. for presentation purposes such as a drop-down menu. With annotations it is harder: are all annotations here semantically filters? We could decide they are. Am ok either way.

If filters are "commonly used keys", wouldn't that imply that the spec should also include a list of commonly used keys?
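
For reference while discussing this, here is a minimal sketch of how a client might use `version` and `filters` as described in the draft: try the versions it understands in order of preference, and within each version pick the first object whose filters match. The `type` value `example.image` and both function names are made up; only `org.oci.architecture` comes from the draft text.

```python
def select_object(objects, wanted_type, known_versions, wanted_filters):
    """Pick an object by type, preferred version, and filter match.

    known_versions lists the versions this client understands, most
    preferred first. Returns None when nothing matches.
    """
    for version in known_versions:
        for obj in objects:
            if obj.get("type") != wanted_type or obj.get("version") != version:
                continue
            filters = obj.get("filters", {})
            if all(filters.get(k) == v for k, v in wanted_filters.items()):
                return obj
    return None


def available_filters(objects):
    """Collect every filter key and its values, e.g. to fill a drop-down menu."""
    seen = {}
    for obj in objects:
        for key, value in obj.get("filters", {}).items():
            seen.setdefault(key, set()).add(value)
    return {key: sorted(values) for key, values in seen.items()}


# Illustrative objects; type and version strings are hypothetical.
example_objects = [
    {"type": "example.image", "version": "1", "filters": {"org.oci.architecture": "amd64"}},
    {"type": "example.image", "version": "2", "filters": {"org.oci.architecture": "arm64"}},
    {"type": "example.image", "version": "2", "filters": {"org.oci.architecture": "amd64"}},
]
```

`available_filters` is the part that becomes harder with plain annotations, since a generic tool could not tell which annotation keys are meant for selection.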

jonjohnsonjr · 2021-03-29T22:08:32Z

universal.md

+
+- **`filters`** *map string - string*
+
+  An array of OPTIONAL keys and values for selecting among multiple object options that a client has. These are filters that the client will use to select which version of an image is most appropriate, for example selecting by architecture ro any other appropriate choice. Common keys and values should be used where possible, such as `org.oci.architecture` so that common code and documentation can be reused. These are collected into a single map so a user interface can display them without understanding the object type.


Suggested change

An array of OPTIONAL keys and values for selecting among multiple object options that a client has. These are filters that the client will use to select which version of an image is most appropriate, for example selecting by architecture ro any other appropriate choice. Common keys and values should be used where possible, such as `org.oci.architecture` so that common code and documentation can be reused. These are collected into a single map so a user interface can display them without understanding the object type.

An array of OPTIONAL keys and values for selecting among multiple object options that a client has. These are filters that the client will use to select which version of an image is most appropriate, for example selecting by architecture or any other appropriate choice. Common keys and values should be used where possible, such as `org.oci.architecture` so that common code and documentation can be reused. These are collected into a single map so a user interface can display them without understanding the object type.

jonjohnsonjr · 2021-03-29T22:11:56Z

universal.md

+
+  Standard OCI descriptor. If the component is not a `blob` or `reference` this will be ignored but the client can use it, eg for a foreign layer or for some other reference type that still wants to verify a hash, such as a bridge to another system.
+
+- **`ctype`** *string*


Why have type, rtype, and ctype? I guess descriptor here would have a mediaType embedded as well, if present. I feel like I'm reading some compiler source code or some C++ documentation ;)

I think it will be challenging to talk about these objects with the current terminology, but that's mostly a taste thing and not a technical objection, so I'll not bikeshed on that too much.

Sorry, I am terrible at inventing names for things on the fly, definitely call them something else.

jonjohnsonjr · 2021-03-29T22:14:44Z

universal.md

+
+The `components` of an object correspond to individual parts, potentially of multiple types, for example an image has a config and some layers. The parts that are needed to extract generic data are specified but again any type specific data can be added.
+
+- **`rtype`** *string*


I like the idea of a "reference type", and that's something I'd be interested in adding to OCI Content Descriptors.

The terminology here is a bit confusing: you have rtype as the "reference type", which can be either "blob" or "reference". I got stuck here. Would "reference" imply this points to a non-opaque thing, e.g. another application/vnd.oci.object.manifest.v1+json? What about a manifest? Anything else?

Some diagrams would help me a lot here.

I'd be interested in an example that doesn't include rtype. This would be for things outside the DAG?

jonjohnsonjr · 2021-03-29T22:15:34Z

universal.md

+}
+```
+Note that we should get clients to support this generic manifest for images too so we can transition and support upgradeability.
+```


The indentation from here down is a bit broken.

SteveLasker · 2021-03-30T00:25:19Z

I also don't see any discussion of how to roll this out. It's great that this might prevent us from having to make breaking changes in the future, but this proposal itself is a pretty large change. Given that there are clients around still pushing and pulling schema 1 images, I'm somewhat reluctant to sign up for another migration.

The idea of implementing a new manifest schema is that it allows artifact clients to opt into it.

Here's the inverse question:
How do we enable artifact signing, including image signing, without changing the existing image toolchains?

The premise of signing verification is that it's opt-in. The existing container runtimes continue to function, without changes.
If you want to implement gated execution, you'll need some signing verification to occur.

This can be done:

1. By a gate, before the host is asked to run net-monitor:v1. This is the OPA/Gatekeeper validation. If the gate says it's ok to run (ingress controller validation), the k8s host is requested to run the image. The only change to the existing tool chains is an ingress controller validation, which is new code, new flows, and non-breaking to the existing image runtimes.
2. By a client/host, which means it's new code to be deployed, which would know to check for signatures and verify them. Let's say containerd supported signature validation. A containerd plug-in could be added that does the same thing as the OPA/Gatekeeper validation. As the request comes to containerd, the signature is requested and verified. Only if it passes does the containerd host run the image. The host code that pulls and runs the image didn't change; it was the plug-in that saw the request for net-monitor:v1 and used the /v2/_ext/oci-artifacts/v1/<repo>/manifests/<digest>/references?artifact-type=vnd.cncf.notaryv2 API to find signatures, pull them, and verify them before allowing the host to proceed.

This allows us to proceed with:

- storing signatures
- linking signatures to the thing they're signing
- tracking for garbage collection of content
- image toolchains to have zero change, zero risk of breaking them
- when image toolchains want to start using this format, they can.

Image toolchain updates:

We know folks want to add new compression formats. As those folks want to experiment, they can use this new format. This will likely take more time, and that's ok, as we've at least given them a way forward, as it appears there are more problems than answers for the compression format discussions at present.

There are some thoughts about how we can down-convert the new format to the existing format, but I don't think we know everything here. It's an example of what we need to validate.

However, the new manifest approach allows us to decouple any existing client changes, with the ability for the existing clients to opt-in.

The biggest change is really for existing registries to implement this change.

By adding a new manifest, or anything new including a new version of the image-spec, the registry can make an explicit statement of whether it supports the new behavior.

This is actually the biggest positive, as the most expensive change for a registry to implement is garbage collection. If we're going to implement new garbage collection and ref-counting semantics, it would be nice to have some focus on one change that enables a few scenarios. Today, we have garbage collection loosely defined (at best) for two manifests (image-manifest, image-index). Wouldn't it be nice to start defining the user expectations around GC?
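
To ground the GC discussion, here is a minimal mark-and-sweep sketch over the minimal data model in the proposal (each manifest reduced to `blobs` and `references`, with tags as roots). The dictionary shapes and function names are assumptions for illustration; real registries also have to deal with concurrent pushes, grace periods, and so on, none of which is modelled here.

```python
def reachable(tags, manifests):
    """Mark phase: every digest reachable from some tag.

    tags maps tag name -> manifest digest.
    manifests maps manifest digest -> {"blobs": [...], "references": [...]},
    where both lists hold digests (assumed shape).
    """
    live = set()
    stack = list(tags.values())
    while stack:
        digest = stack.pop()
        if digest in live:
            continue
        live.add(digest)
        doc = manifests.get(digest)
        if doc is None:
            continue  # unknown to the manifest table: nothing to traverse
        live.update(doc.get("blobs", []))         # blobs are leaves
        stack.extend(doc.get("references", []))   # references are followed
    return live


def garbage(tags, manifests, stored):
    """Sweep phase: stored digests that no tag can reach."""
    return set(stored) - reachable(tags, manifests)


# Illustrative store: m1 is tagged and references m2; m3 is untagged.
example_tags = {"v1": "m1"}
example_manifests = {
    "m1": {"blobs": ["b1"], "references": ["m2"]},
    "m2": {"blobs": ["b2"], "references": []},
    "m3": {"blobs": ["b3"], "references": []},
}
example_stored = {"m1", "m2", "m3", "b1", "b2", "b3"}
```

In this example only `m3` and `b3` are collectable, and the traversal never needs to know any media type, only which entries are blobs and which are references.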

fuweid

I like the idea of "For that reason this proposal has a much more generic approach, such that a document can contain several different versions of a single piece of content, suitable for different users or clients." Just added some comments and will wait for the PR to be RFC. Thanks!

fuweid · 2021-03-31T15:58:50Z

universal.md

+
+  An array of OPTIONAL keys and values for selecting among multiple object options that a client has. These are filters that the client will use to select which version of an image is most appropriate, for example selecting by architecture ro any other appropriate choice. Common keys and values should be used where possible, such as `org.oci.architecture` so that common code and documentation can be reused. These are collected into a single map so a user interface can display them without understanding the object type.
+
+- **`components`** *array of components*


It seems ctype is duplicated with descriptor.mediaType.

It also looks like references and blobs can be in the same components array, with no ordering required. Is it possible to separate the references and blobs from components, so that we can remove rtype, which might be duplicated with descriptor.mediaType?
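
For comparison, a minimal sketch of how a server could derive separate blob and reference lists from a mixed `components` array using `rtype` as currently drafted. It assumes `rtype` is either `"blob"` or `"reference"` and that anything else is outside the server's data model; the function name and example digests are made up.

```python
def split_components(components):
    """Partition components into blob digests, reference digests, and the rest.

    Components whose rtype is neither "blob" nor "reference" are returned
    untouched in `other`; under the draft the server would ignore them.
    """
    blobs, references, other = [], [], []
    for component in components:
        rtype = component.get("rtype")
        if rtype == "blob":
            blobs.append(component["descriptor"]["digest"])
        elif rtype == "reference":
            references.append(component["descriptor"]["digest"])
        else:
            other.append(component)
    return blobs, references, other


# Illustrative components; ctype values and digests are hypothetical.
example_components = [
    {"rtype": "blob", "ctype": "config", "descriptor": {"digest": "sha256:c0", "size": 1}},
    {"rtype": "reference", "ctype": "base", "descriptor": {"digest": "sha256:m0", "size": 2}},
    {"ctype": "foreign-layer", "descriptor": {"digest": "sha256:f0", "size": 3}},
    {"rtype": "blob", "ctype": "layer", "descriptor": {"digest": "sha256:l0", "size": 4}},
]
```

Whether this split lives in the document (separate arrays) or in a processing rule like this one is the trade-off being asked about.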

fuweid · 2021-03-31T16:07:27Z

universal.md

+5. User interface specific items, such as links to icons for the image to be displayed can be added as an additional object that the client will ignore but can be used by the registry UI only.
+6. The manifest as a whole does not have a type, only the objects in it.
+
+- **`schemaVersion`** *int*


schemaVersion looks like it conflicts with objects.version. Could you provide an example with different objects.version values but the same schemaVersion?

IMO, it can be described by mediaType because the manifest is managed by object author.

mikebrow

Nice treatise..

mikebrow · 2021-03-31T22:38:43Z

universal.md

+A registry is a data store for content addressed data. At the lowest level it is just a large key (hash) to value (document) store, but one of the things we have learned about data stores is that they are more useful if they allow for structured data (a Merkle tree, technically a dag), with data being allowed to point to (hashes of) other data items. The complexity added by this is that if the data store needs to follow the links in documents it needs to know how to parse them. The primary use case for this parsing is for garbage collection: the usual storage model allows for an object that is not referenced internally or from a tag (a generic name that can be given to items in the store) may be garbage collected. Without this it is difficult to remove any items from the store. Tags exist to give human friendly names and to anchor items into the store while they exist.
+
+Probably the best developed system along these lines is git. Git makes some different tradeoffs than registries. In particular it is optimised for handling many small (text) files, and serving these on a relatively small scale. There are only four types of content, commits, trees, blobs and annotated tags. Files are aggregated by into larger chunks (packfiles with indexes) in order to serve them more efficiently, and this is done dynamically to reduce network traffic. Git has well developed ways of handling metadata, which will be contrasted later. Like registries, the object model has got more complex over times, with trees now being able to point at commits not just trees or blobs. A registry is generally designed to handle larger files, with fewer links between them, and much higher volumes of traffic so the server is required to do less work. Over time there is likely to be a convergence, as there are use cases for which the git model of more aggressive file deduplication is useful (for example constrained network use cases), and we are likely to see more traffic akin to branch traversals in registries.
+


s/by into/into/

mikebrow · 2021-03-31T22:40:09Z

universal.md

+
+Probably the best developed system along these lines is git. Git makes some different tradeoffs than registries. In particular it is optimised for handling many small (text) files, and serving these on a relatively small scale. There are only four types of content, commits, trees, blobs and annotated tags. Files are aggregated by into larger chunks (packfiles with indexes) in order to serve them more efficiently, and this is done dynamically to reduce network traffic. Git has well developed ways of handling metadata, which will be contrasted later. Like registries, the object model has got more complex over times, with trees now being able to point at commits not just trees or blobs. A registry is generally designed to handle larger files, with fewer links between them, and much higher volumes of traffic so the server is required to do less work. Over time there is likely to be a convergence, as there are use cases for which the git model of more aggressive file deduplication is useful (for example constrained network use cases), and we are likely to see more traffic akin to branch traversals in registries.
+
+Registry formats started off just for a very specific container image use case. Originally there was one special document type, the image manifest, which pointed at a list of blobs, and a configuration blob. To find all the referenced blobs, you only had to track all the image manifests that had tags pointing at them, and then find all the blobs pointed at by those. Then the image index was added, which pointed at image manifests, for multi architecture selection. Technically this could also point at other image manifests, although this has not used much. A garbage collector had to traverse these additional links as well. The issue with these two formats is that they were very specific to container images, and although they have some ability to add generic metadata, it is difficult to adapt them to new types of stored data. The OCI Artifacts specification still requires objects to be defined as a configuration blob and a list of blobs. The registry may be expected to parse the configuration blob for display purposes. However there is no way to define objects that point to other objects, rather than blobs, so the format is not very generic. An [OCI artifact manifest](https://github.com/opencontainers/artifacts/pull/29) has been proposed that supports blobs and references, but even that is not fully generic, and more formats a re likely to be needed in future.


s/although this has not used much/although this has not been used much/

mikebrow · 2021-03-31T22:41:26Z

universal.md

+
+Probably the best developed system along these lines is git. Git makes some different tradeoffs than registries. In particular it is optimised for handling many small (text) files, and serving these on a relatively small scale. There are only four types of content, commits, trees, blobs and annotated tags. Files are aggregated by into larger chunks (packfiles with indexes) in order to serve them more efficiently, and this is done dynamically to reduce network traffic. Git has well developed ways of handling metadata, which will be contrasted later. Like registries, the object model has got more complex over times, with trees now being able to point at commits not just trees or blobs. A registry is generally designed to handle larger files, with fewer links between them, and much higher volumes of traffic so the server is required to do less work. Over time there is likely to be a convergence, as there are use cases for which the git model of more aggressive file deduplication is useful (for example constrained network use cases), and we are likely to see more traffic akin to branch traversals in registries.
+
+Registry formats started off just for a very specific container image use case. Originally there was one special document type, the image manifest, which pointed at a list of blobs, and a configuration blob. To find all the referenced blobs, you only had to track all the image manifests that had tags pointing at them, and then find all the blobs pointed at by those. Then the image index was added, which pointed at image manifests, for multi architecture selection. Technically this could also point at other image manifests, although this has not used much. A garbage collector had to traverse these additional links as well. The issue with these two formats is that they were very specific to container images, and although they have some ability to add generic metadata, it is difficult to adapt them to new types of stored data. The OCI Artifacts specification still requires objects to be defined as a configuration blob and a list of blobs. The registry may be expected to parse the configuration blob for display purposes. However there is no way to define objects that point to other objects, rather than blobs, so the format is not very generic. An [OCI artifact manifest](https://github.com/opencontainers/artifacts/pull/29) has been proposed that supports blobs and references, but even that is not fully generic, and more formats a re likely to be needed in future.


s/a re/are/

mikebrow · 2021-03-31T22:41:50Z

universal.md

+
+Probably the best developed system along these lines is git. Git makes some different tradeoffs than registries. In particular it is optimised for handling many small (text) files, and serving these on a relatively small scale. There are only four types of content: commits, trees, blobs and annotated tags. Files are aggregated into larger chunks (packfiles with indexes) in order to serve them more efficiently, and this is done dynamically to reduce network traffic. Git has well developed ways of handling metadata, which will be contrasted later. Like registries, the object model has got more complex over time, with trees now being able to point at commits, not just trees or blobs. A registry is generally designed to handle larger files, with fewer links between them, and much higher volumes of traffic, so the server is required to do less work. Over time there is likely to be a convergence, as there are use cases for which the git model of more aggressive file deduplication is useful (for example constrained network use cases), and we are likely to see more traffic akin to branch traversals in registries.
+
+Registry formats started off just for a very specific container image use case. Originally there was one special document type, the image manifest, which pointed at a list of blobs, and a configuration blob. To find all the referenced blobs, you only had to track all the image manifests that had tags pointing at them, and then find all the blobs pointed at by those. Then the image index was added, which pointed at image manifests, for multi architecture selection. Technically this could also point at other image manifests, although this has not used much. A garbage collector had to traverse these additional links as well. The issue with these two formats is that they were very specific to container images, and although they have some ability to add generic metadata, it is difficult to adapt them to new types of stored data. The OCI Artifacts specification still requires objects to be defined as a configuration blob and a list of blobs. The registry may be expected to parse the configuration blob for display purposes. However there is no way to define objects that point to other objects, rather than blobs, so the format is not very generic. An [OCI artifact manifest](https://github.com/opencontainers/artifacts/pull/29) has been proposed that supports blobs and references, but even that is not fully generic, and more formats a re likely to be needed in future.


s/in future/in the future/

mikebrow · 2021-03-31T23:13:14Z

universal.md

+
+This is all the data that is needed to manage the content store from the registry operator's side, and this can be used as the data model. It is easy to see how existing image manifests and indexes can be mapped down to this data model. However, it is too minimal for the client to be able to process it, so we need to enhance it with metadata. This is where the design space gets much more complicated. We will have simple processing rules that let the server extract this data model from a format that is more suitable for client processing.
+
+The existing formats assume that one document corresponds to one version of one specification. This has caused all sorts of issues, such as how to add new forms of compression to container images in a way that new clients can get enhancements while old clients continue to work. The only partial upgrade path that has worked has been for multi arch images, where an additional layer of indirection is required for a client to make a choice of which version it supports, although here the specification is very rigid on the set of allowed choices and existing clients only worked if pointed at the single architecture image. Because the registry is a content addressed store, content negotiation does not work well either, as the client often needs to check the content hash; the only place where it has been used was on upgrading to content addressability, and it should not be used again.


agree to the spec leaving open both transport and artifact level compression/encryption...

mikebrow · 2021-03-31T23:49:10Z

universal.md

+2. Clients rank their preference for objects and search for a match, much as happens with multi arch now. Clients will generally prefer newer formats over old ones, and their specific architecture matches over worse ones. Generic clients might have user configuration. Note that multi arch can be included in a single manifest rather than indirecting to image manifests, making fetching more efficient.
+3. A few generic object types are supported. Notably these include "Pointer", "Relation" and "Property". Pointer is simply an object that points at another one, which should be dereferenced. A signed Pointer object can be used for a detached signature. Relation is a pointer, a type, and a second object which is related, used for adding metadata such as an SBOM to an existing object; again a client that is not interested in the SBOM should just follow the link to the primary object. "Property" is like a pointer but is stored within a manifest and links to a property of that manifest itself, used if you want to, say, store an SBOM at build time in the same object, to avoid an extra redirection.
+4. We support items in a manifest that have neither blobs nor references, so that the server side does not care about these. This allows manifests to do things like reference foreign layers or images by tag that are not actually references at all, but make the client side processing more consistent.
+5. User interface specific items, such as links to icons for the image to be displayed can be added as an additional object that the client will ignore but can be used by the registry UI only.


by link do you mean uri or pointer
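
To make item 3 easier to discuss, here is a minimal sketch of how a client that does not care about the SBOM might follow Pointer and Relation objects to the primary object. The in-memory store, the placeholder digests, and the field names `type`, `target`, `relationType` and `related` are all assumptions made for illustration; the proposal does not fix them.

```python
# Hypothetical in-memory registry: digest -> parsed object.
# Digests are placeholders and the field names are illustrative assumptions.
STORE = {
    "sha256:aaa": {"type": "Image", "name": "net-monitor"},
    "sha256:bbb": {"type": "Relation", "target": "sha256:aaa",
                   "relationType": "sbom", "related": "sha256:ccc"},
    "sha256:ccc": {"type": "SBOM"},
    "sha256:ddd": {"type": "Pointer", "target": "sha256:bbb"},
}


def resolve_primary(digest, store, max_hops=16):
    """Follow Pointer and Relation links until a primary object is reached."""
    for _ in range(max_hops):
        obj = store[digest]
        if obj["type"] not in ("Pointer", "Relation"):
            return digest, obj
        # Both generic types carry a link to the object they are about.
        digest = obj["target"]
    raise ValueError("too many indirections")


primary_digest, primary = resolve_primary("sha256:ddd", STORE)
```

The hop limit is only a guard against malformed data in this toy store; in a real content-addressed store the link is a digest, so the "uri or pointer" question above still needs an answer from the spec.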

mikebrow · 2021-03-31T23:49:56Z

universal.md

+
+- **`schemaVersion`** *int*
+
+  This is a REQUIRED property. The design of this document is such that this should not need to be bumped.


oh that's funny

sargun · 2021-04-06T16:10:20Z

universal.md

+From the point of view of someone describing a new type of artifact, it is the flexibility of attaching metadata that matters most. You may want links to blobs and other objects, labelled with their types and use cases. Examples of things that people want to construct now are a link to another object and a bill of materials for that object, or a manifest for a piece of software for multiple architectures, all combined in a single object so there is a choice of layers to download. You might also want to add an existing format that has links in the underlying blobs, but you need to include them in the manifest so that they are visible to the registry without teaching it about the underlying format. So the important thing is that there are highly flexible ways to attach metadata to describe every link and blob, and anything else that is useful for the format to include to avoid having to parse more blobs.
+
+So to meet both use cases, we should have a format that has a lot of extensible metadata but it has a very simple structure for the operator to use for management operations. Let us look at the management side data model first.
+


If we're going to redefine the "one format to rule them all", can we:

Not use JSON? (CBOR with CDDL sounds good, due to the fact it has a canonical serialization standard)

Consider using a formal schema language (a la openapi, json schema)
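
A short sketch of why a canonical serialization matters for a content-addressed store: the same logical document can have many byte encodings, and each encoding has a different digest. The `canonical` helper below is an ad hoc form (sorted keys, no insignificant whitespace) chosen only for illustration; it is not a standardized canonicalization scheme and is not what the proposal specifies.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Content address of a byte string."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


doc = {
    "schemaVersion": 1,
    "mediaType": "application/vnd.oci.object.manifest.v1+json",
}

# Two valid JSON encodings of the same object yield two different digests.
compact = json.dumps(doc).encode()
pretty = json.dumps(doc, indent=2).encode()


def canonical(obj) -> bytes:
    # Ad hoc canonical form for illustration only (an assumption, not a standard).
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


# Same content, keys inserted in a different order.
reordered = {"mediaType": doc["mediaType"], "schemaVersion": 1}
```

With a canonical form, `digest(canonical(doc))` and `digest(canonical(reordered))` agree, whereas `digest(compact)` and `digest(pretty)` do not.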

sargun · 2021-04-06T16:23:53Z

universal.md

+
+- **`references`** *array of objects*
+
+  An optional set of references to other artifacts, of this type or of other supported types that may themselves point to blobs or other references. Each object MUST have a descriptor. The descriptor MUST have **`digest`** and **`size`**. These will be traversed for garbage collection. A registry will generally want to reject uploads of manifests with references that point to objects that it cannot parse or interpret as valid.


An aside:

We currently talk about using these for GC. One of the other neat ways we can use these is for HTTP/2 push when someone is downloading this "top level" document.
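
For the GC use mentioned above, a rough sketch of the traversal a registry could run: start from the tags, follow `references` recursively, and treat `blobs` as leaves. The in-memory layout and the placeholder digests are assumptions for illustration; only the `blobs`/`references` split and the `digest` and `size` descriptor fields come from the proposal.

```python
def reachable(tags, manifests):
    """Return every digest reachable from a tag via references and blobs."""
    live = set()
    stack = list(tags.values())
    while stack:
        d = stack.pop()
        if d in live:
            continue
        live.add(d)
        manifest = manifests.get(d)
        if manifest is None:
            continue  # not a manifest we hold: nothing further to traverse
        # Blobs are leaves; references may point at further manifests.
        live.update(b["descriptor"]["digest"] for b in manifest.get("blobs", []))
        stack.extend(r["descriptor"]["digest"] for r in manifest.get("references", []))
    return live


def sweep(tags, manifests, blobs):
    """Digests that are safe to collect: stored but not reachable from any tag."""
    return (set(manifests) | set(blobs)) - reachable(tags, manifests)


# Placeholder digests, for illustration only.
manifests = {
    "sha256:index": {"references": [{"descriptor": {"digest": "sha256:m1", "size": 10}}]},
    "sha256:m1": {"blobs": [{"descriptor": {"digest": "sha256:b1", "size": 5}}]},
    "sha256:old": {"blobs": [{"descriptor": {"digest": "sha256:b2", "size": 5}}]},
}
blobs = {"sha256:b1", "sha256:b2"}
tags = {"v1": "sha256:index"}
garbage = sweep(tags, manifests, blobs)
```

The same `reachable` walk is what a server would need in order to know which objects to offer alongside a top-level document.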

vbatts · 2021-04-06T16:48:12Z

@justincormack might we split sentences out to their own line, please

lumjjb · 2021-05-07T18:58:34Z

Was pointed here from #15 for image encryption. I think this aligns pretty well with image encryption, especially since we do have this requirement of wanting to be able to add more key authorizations via wrapped keys after an image is already encrypted (for example, giving access to an image scanner by wrapping a key with the service's public key)... Right now, it's done via a list in annotations, but this would make it much easier to manage the wrapped keys.

On a totally separate note, I think the story around signing and verification may become a little more complex. Disclaimer: I haven't been on top of the latest discussions on this, so maybe @SteveLasker or someone else can educate me on the thoughts around this

My concerns around signing/verification are:

Signing a manifest would be difficult if it is designed to keep changing (if I understood the proposal correctly). So I'm guessing there would be a new abstraction at which signing an image happens - or a more involved ruleset of verification.
If objects that reference artifacts are meant to be used in influencing how an artifact is consumed, this may become messy - so I'm assuming they have minimal influence on it.

SteveLasker · 2021-05-07T19:24:49Z

Hey @lumjjb,

Signing a manifest would be difficult if it is designed to keep changing (if I understood the proposal correctly).

Are you inferring it can keep changing because a single manifest can support multiple versions of an artifact?

Due to the content-addressable storage, we're still designing around every manifest being a fixed object with a digest. If you want to add new version support, you'd push a new manifest with the previous versioned content (meaning non-encrypted) and the new versioned content, encrypted. You would sign the new digest, and associate it with an existing tag.
So, the user that pulls a specific tag: net-monitor:v1 could, over time, get encrypted content without changing the original content. A downstream client could pull the non-encrypted content, and a client that understands the newer version could opt-into the new content.

The Notary v2 proposal, and I believe cosign's as well, is to push a detached signature that references the artifact it's signing. You can push multiple signatures, or multiple referenced objects as below:

lumjjb · 2021-05-07T19:47:49Z

I see. Yeah, that is the issue I'm thinking about (but not including encryption), and I suppose the validity of a manifest would be based on some value in the metadata/SBOM that is signed, so checking is on that level instead of having snapshot signatures (like in TUF) to invalidate older signatures.

I think this clarifies my question around the manifest. Thanks @SteveLasker !

SteveLasker · 2021-05-07T21:02:44Z

validity of a manifest would be based on some value in the metadata

The Notary v2 model signs the digest of the manifest, which is why it's so important the digest doesn't change. And, you can submit multiple, independent, signatures on the same image. For instance, content promotion within or across registries.

We'll have the ability to invalidate a key or a signature, and the ability to sign a tag.
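
A small sketch of the "sign the digest" idea: each detached signature names the manifest digest as its subject, so several independent signatures can coexist and any change to the manifest bytes invalidates all of them. HMAC is used here purely as a stand-in for a real signature scheme, and the `signer`, `subject` and `signature` field names are hypothetical; this is not the Notary v2 or cosign format.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def sign_detached(manifest_bytes: bytes, key: bytes, signer: str) -> dict:
    """Create a detached signature object over the manifest digest."""
    subject = digest(manifest_bytes)
    # HMAC is only a placeholder for a real asymmetric signature.
    mac = hmac.new(key, subject.encode(), hashlib.sha256).hexdigest()
    return {"signer": signer, "subject": subject, "signature": mac}


def verify(manifest_bytes: bytes, sig: dict, key: bytes) -> bool:
    subject = digest(manifest_bytes)
    expected = hmac.new(key, subject.encode(), hashlib.sha256).hexdigest()
    return sig["subject"] == subject and hmac.compare_digest(sig["signature"], expected)


manifest = b'{"schemaVersion":1}'
# Two independent signatures on the same, unchanged digest.
sig_build = sign_detached(manifest, b"build-key", "build")
sig_promotion = sign_detached(manifest, b"promotion-key", "promotion")
```

Because both signatures carry the same `subject`, adding the second one never requires touching the manifest or the first signature.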

nishakm · 2021-06-03T23:00:31Z

universal.md

+
+- **`rtype`** *string*
+
+  The reference type of the component, which MUST be `blob`, `reference` or not specified if the component has neither a blob nor a reference link. This would be for a case where it is a component that does not have a descriptor, e.g. it refers to something by tag not hash so is not tracked as a reference. I am also wondering if we should add `data` for a "data URI" type reference where there would be a blob but it is inlined into the object instead, useful for small unique objects to save an external blob lookup, but the client can treat them exactly like blobs.


How would "not specified" be represented in JSON?
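
One possible reading, assumed here and not confirmed by the proposal, is that "not specified" means the `rtype` key is simply omitted (with JSON `null` treated the same way). Under that assumption, the server-side extraction of blobs and references from `components` could look like the sketch below. The `descriptor` key name, the `tag` field, the `ctype` value and the digests are illustrative.

```python
import json


def server_model(manifest: dict):
    """Extract the (blobs, references) digests a registry would track."""
    blobs, references = [], []
    for component in manifest.get("components", []):
        rtype = component.get("rtype")  # None if the key is absent or JSON null
        if rtype == "blob":
            blobs.append(component["descriptor"]["digest"])
        elif rtype == "reference":
            references.append(component["descriptor"]["digest"])
        elif rtype is not None:
            raise ValueError(f"unknown rtype: {rtype!r}")
        # rtype unspecified: client-only component, ignored by the server
    return blobs, references


raw = """{"components": [
  {"rtype": "blob", "descriptor": {"digest": "sha256:aaa", "size": 3}},
  {"rtype": "reference", "descriptor": {"digest": "sha256:bbb", "size": 7}},
  {"ctype": "foreign", "tag": "example.com/base:latest"}
]}"""
tracked_blobs, tracked_references = server_model(json.loads(raw))
```

Rejecting unknown `rtype` values is a design choice of this sketch; a server could instead ignore them, at the cost of silently not tracking links it does not understand.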

nishakm · 2021-06-04T14:21:03Z

universal.md

+  }
+}
+```
+If we want to add an SBOM to this image we can add a Property, like a pointer but pointing at the object itself.


I finally got to drawing this out. I am still confused about how embedding a property in the object manifest is different from creating a relation manifest. Does it have to do with upgradability?

got to drawing this out

Do you mean literally drawing something? Can I see it? 😄

Nothing fancy but helps some...

I think you would need a third box for the SBOM that both the sha256:3b0b... descriptors point to, because as shown this has an impossible cycle.

We iterated a bit on this, and here's a PR on the PR that attempted to incorporate SBoMs and some name cleanup: justincormack#1
Here are some examples of how an SBoM could be associated: https://github.com/SteveLasker/artifacts/blob/distribution-manifest/universal.md#sbom-reference-artifact

The rest of the manifest examples are at: https://github.com/SteveLasker/artifacts/tree/distribution-manifest/manifest-examples

Justin had concerns of cycles due to the collections, which is what led to the change of a single subjectManifest in #29

Sorry, I had a hard time following all the proposed manifests. I suppose part of the issue is that the OCI descriptor is a complex type itself. If y'all would like, I can make more drawings for these proposals...

If y'all would like, I can make more drawings for these proposals...

I make drawings to learn as well, so I'm sure it would be helpful to others. It would also be nice to validate that we're all on the same page re: what's actually being proposed, but don't feel pressured to if you're busy :P

cmoulliard · 2021-10-20T12:14:39Z

When is it planned to approve this spec and implement it?

SteveLasker · 2021-10-20T13:23:29Z

Hi @cmoulliard,
Can you tell us a bit about what use cases you’re looking for?

SteveLasker · 2021-10-20T13:30:30Z

ahh, I see. The build pack reference. I’ll read through the above PR. This PR wound up getting paused as it has some challenges. We have started implementing an evolution of pr#29 at https://github.com/oras-project/artifacts-spec
The latest release has links to a distribution and oras based reference implementation

mikebrow · 2023-07-16T00:34:52Z

FYI: the artifacts mission is moving to opencontainers/image-spec and this repo is being archived. Great ideas in here, btw!

mikebrow · 2023-07-17T14:02:55Z

closing for now due to pending archive action.. pls reopen if archive is not completed and/or if you believe this close to be in error

Previously at https://gist.github.com/justincormack/523dc229f0dd7b882…

1fee8b1

…edf19c60aed1581 Here for more detailed comments. Signed-off-by: Justin Cormack <justin@specialbusservice.com>

dlorenc reviewed Mar 26, 2021

jonjohnsonjr reviewed Mar 29, 2021

SteveLasker mentioned this pull request Mar 29, 2021

OCI artifact manifest, Phase 1-Reference Types #29

Closed

SteveLasker mentioned this pull request Mar 30, 2021

How To Add New Capabilities to OCI * opencontainers/tob#94

Closed

fuweid reviewed Mar 31, 2021

mikebrow reviewed Apr 1, 2021

This was referenced Apr 4, 2021

Mixing "manifests" and "layers" in an artifact? #33

Closed

add nydus image artifact SteveLasker/artifacts#3

Closed

sargun reviewed Apr 6, 2021

SteveLasker mentioned this pull request Apr 7, 2021

Pushing an image index oras-project/oras#237

Closed

dlorenc mentioned this pull request Apr 20, 2021

Proposal: Working Group for Reference Types opencontainers/tob#96

Closed

SteveLasker mentioned this pull request May 4, 2021

Added encryption mediatype doc #15

Closed

SteveLasker mentioned this pull request May 12, 2021

Artifacts: RC1 (Reference Type Support) notaryproject/specifications#68

Closed

SteveLasker mentioned this pull request May 20, 2021

Add Index support for artifact type #25

Closed

nishakm reviewed Jun 3, 2021

nishakm reviewed Jun 4, 2021

cmoulliard mentioned this pull request Oct 14, 2021

RFC: Support Dockerfiles buildpacks/rfcs#173

Merged

SteveLasker mentioned this pull request Apr 6, 2022

Add Index Support for Artifact Type opencontainers/image-spec#806

Closed

SteveLasker mentioned this pull request Apr 29, 2023

garbage-collection clarification needed opencontainers/distribution-spec#406

Open

mikebrow closed this Jul 17, 2023


		Now let us look at how we map various existing and proposed types to this manifest format.

		The simplest type is the Pointer that simply points to another object. This looks useless, but if it is signed it can act as a detached signature for the item that it points at.



		- `blobs` array of objects

		An optional set of references to blobs. Each object MUST have a descriptor. From the server's point of view the only pieces that matter for the descriptor are `digest`, used to find the object in the content store, and `size`, which is used to check the expected size if the server needs to fetch the object, and potentially for hash collision attack detection. The server data model does not need to know the media type, as it simply needs to know this is a blob so it does not need to track further links. The `urls` property in the descriptor, and the `annotations`, are just for client side use. Note there is no separate model for `config` here; as far as the server data model is concerned this is simply another blob. Below we will discuss how the client distinguishes the different sorts of blob. The server is also not concerned about the ordering of blobs or references, so it can store these in a relation.
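
A minimal sketch of the check the `digest` and `size` fields enable when fetched content is compared against its descriptor. It assumes the digest uses the `sha256:<hex>` form; the helper name `verify_blob` is hypothetical.

```python
import hashlib


def verify_blob(descriptor: dict, content: bytes) -> bool:
    """Check fetched content against the descriptor's size and digest."""
    # The cheap size check runs first, before any hashing.
    if len(content) != descriptor["size"]:
        return False
    algorithm, _, hexdigest = descriptor["digest"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm}")
    return hashlib.sha256(content).hexdigest() == hexdigest


content = b"hello"
descriptor = {
    "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
    "size": len(content),
}
```

Neither the media type nor the annotations take part in the check, which matches the point that the server data model only needs `digest` and `size`.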



		A REQUIRED version string for the type. Clients will normally look for an object with the latest version they understand, and fall back to older versions if they support them. Different versions may have different structures.

		- `filters` map string - string

		An OPTIONAL map of keys and values for selecting among multiple object options that a client has. These are filters that the client will use to select which version of an image is most appropriate, for example selecting by architecture or any other appropriate choice. Common keys and values should be used where possible, such as `org.oci.architecture`, so that common code and documentation can be reused. These are collected into a single map so a user interface can display them without understanding the object type.
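
A sketch of how a client might combine the version and `filters` fields when choosing among several objects. The matching rule used here (every filter a candidate declares must equal the client's value for that key) and the ordered `supported_versions` list are assumptions; the proposal leaves the ranking to the client. The `name` field and the architecture values are illustrative.

```python
def select(candidates, client_properties, supported_versions):
    """Pick the matching candidate with the most preferred version.

    supported_versions is ordered from most to least preferred.
    """
    best = None
    for candidate in candidates:
        if candidate["version"] not in supported_versions:
            continue
        filters = candidate.get("filters", {})
        # Assumed rule: all declared filters must agree with the client.
        if any(client_properties.get(k) != v for k, v in filters.items()):
            continue
        rank = supported_versions.index(candidate["version"])
        if best is None or rank < best[0]:
            best = (rank, candidate)
    return None if best is None else best[1]


candidates = [
    {"name": "old-amd64", "version": "1", "filters": {"org.oci.architecture": "amd64"}},
    {"name": "new-amd64", "version": "2", "filters": {"org.oci.architecture": "amd64"}},
    {"name": "new-arm64", "version": "2", "filters": {"org.oci.architecture": "arm64"}},
]
client = {"org.oci.architecture": "amd64"}
newest = select(candidates, client, ["2", "1"])
legacy = select(candidates, client, ["1"])
```

An old client that only lists version `"1"` still finds a usable object, which is the upgrade path the single-document formats lack.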


		Standard OCI descriptor. If the component is not a `blob` or `reference` this will be ignored by the server, but the client can use it, e.g. for a foreign layer or for some other reference type that still wants to verify a hash, such as a bridge to another system.

		- `ctype` string


		- `components` array of components

		The `components` of an object correspond to individual parts, potentially of multiple types, for example an image has a config and some layers. The parts that are needed to extract generic data are specified but again any type specific data can be added.

		A registry is a data store for content addressed data. At the lowest level it is just a large key (hash) to value (document) store, but one of the things we have learned about data stores is that they are more useful if they allow for structured data (a Merkle tree, technically a DAG), with data being allowed to point to (hashes of) other data items. The complexity added by this is that if the data store needs to follow the links in documents, it needs to know how to parse them. The primary use case for this parsing is garbage collection: the usual storage model allows an object that is not referenced internally or from a tag (a generic name that can be given to items in the store) to be garbage collected. Without this it is difficult to remove any items from the store. Tags exist to give human friendly names and to anchor items into the store while they exist.
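
The key (hash) to value (document) model with tags can be sketched in a few lines. This is a toy in-memory store for illustration, not any registry's implementation; the class and method names are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: digest -> bytes, plus mutable tags."""

    def __init__(self):
        self.objects = {}  # digest -> content
        self.tags = {}     # human friendly name -> digest

    def put(self, data: bytes) -> str:
        d = "sha256:" + hashlib.sha256(data).hexdigest()
        self.objects[d] = data  # identical content maps to the same key
        return d

    def tag(self, name: str, digest: str) -> None:
        if digest not in self.objects:
            raise KeyError(digest)
        self.tags[name] = digest

    def get(self, ref: str) -> bytes:
        # A ref is either a tag or a digest.
        return self.objects[self.tags.get(ref, ref)]


store = ContentStore()
d1 = store.put(b"manifest v1")
d1_again = store.put(b"manifest v1")  # deduplicated: same digest, one entry
store.tag("net-monitor:v1", d1)
d2 = store.put(b"manifest v2")
store.tag("net-monitor:v1", d2)       # the tag moves; d1 itself is unchanged
```

After the tag moves, `d1` is still retrievable by digest but is no longer anchored by a tag, which is exactly the state a garbage collector looks for.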

