Rook is designed to provide orchestration and management for distributed storage systems to run in cloud-native environments; however, only Ceph is currently supported. Rook could be beneficial to a wider audience if support for orchestrating more storage backends were implemented. For instance, in addition to the existing support for Ceph, we should consider support for Minio as the next storage backend supported by Rook. Minio is a popular distributed object storage solution that could benefit from a custom controller to orchestrate, configure, and manage it in cloud-native environments.
This design document aims to describe how this can be accomplished through common storage abstractions and custom controllers for more types of storage providers.
- To design and describe how Rook will support multiple storage backends
- Consider options and recommend architecture for hosting multiple controllers/operators
- Provide basic guidance on how new storage controllers would integrate with Rook
- Define common abstractions and types across storage providers
- Custom Resources in Kubernetes
- Kubernetes API Conventions
- Use CRDs Whenever Possible
- Writing controllers dev guide
- A Deep Dive Into Kubernetes Controllers
- Metacontroller: Kubernetes extension that enables lightweight lambda (functional) controllers
To provide a native experience in Kubernetes, Rook has so far defined new storage types with Custom Resource Definitions (CRDs) and implemented the operator pattern to manage the instances of those types that users create. When deciding how to expand the experience provided by Rook, we should reevaluate the most current options for extending Kubernetes in order to be confident in our architecture. This is especially important because changing the architecture later on will be more difficult the more storage types we have integrated into Rook.
API server aggregation is the most feature rich and complete extension option for Kubernetes, but it is also the most complicated to deploy and manage. Basically, it allows you to extend the Kubernetes API with your own API server that behaves just like the core API server does. This approach offers a complete and powerful feature set such as rich validation, API versioning, custom business logic, etc. However, using an extension apiserver has some disruptive drawbacks:
- etcd must also be deployed for storage of its API objects, increasing the complexity and adding another point of failure. This can be avoided by using CRDs to store its API objects, but that is awkward and exposes internal storage to the user.
- development cost would be significant to get the deployment working reliably in supported Rook environments and to port our existing CRDs to extension types.
- breaking change for Rook users with no clear migration path, which would be very disruptive to our current user base.
CRDs are what Rook is already using to extend Kubernetes. They are a limited extension mechanism that allows the definition of custom types, but lacks the rich features of API aggregation. For example, validation of a user's CRD is only at the schema level, with simple property value checks that are available via the OpenAPI v3 schema. Also, there is currently no versioning (conversion) support. However, CRDs are being actively supported by the community and more features are being added to them going forward (e.g., a versioning proposal). Finally, CRDs are not a breaking change for Rook users since they are already in use today.
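For illustration, a minimal sketch of what one of Rook's CRDs with schema-level validation could look like today (the names and validation rules below are illustrative, not an existing Rook manifest):

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: clusters.ceph.rook.io
spec:
  group: ceph.rook.io
  version: v1alpha1
  scope: Namespaced
  names:
    plural: clusters
    singular: cluster
    kind: Cluster
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            mon:
              properties:
                count:
                  type: integer
                  minimum: 1
```

This is roughly the extent of the validation available to CRDs: simple schema and property checks, with no custom business logic or conversion between versions.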
The controllers are the entities that will perform orchestration and deployment tasks to ensure the user's desired state is made a reality within their cluster. There are a few options for deploying and hosting the controllers that will perform this work, as explained in the following sections.
Multiple controllers could be deployed within a single process. For instance, Rook could run one controller per domain of expertise, e.g., ceph-controller, minio-controller, etc. Controllers would all watch the same custom resource types via a `SharedInformer` and respond to events via a `WorkQueue` for efficiency and reduced burden on the core apiserver. Even though all controllers are watching, only the applicable controller responds to and handles an event. For example, the user runs `kubectl create -f cluster.yaml` to create a `cluster.rook.io` instance that has Ceph specific properties. All controllers will receive the created event via the `SharedInformer`, but only the Ceph controller will queue and handle it. We can consider only loading the controllers the user specifically asks for, perhaps via an environment variable in `operator.yaml`.

Note that this architecture can be used with either API aggregation or CRDs.
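Below is a rough Go sketch of the dispatch pattern described above: one `SharedInformer` feeding a per-backend `WorkQueue`, with each controller enqueueing only the events it owns. The `backendController` interface and the wiring are hypothetical, not existing Rook code.

```go
// Hypothetical sketch of the single-process layout: one shared informer,
// one workqueue per backend controller. Names are illustrative only.
package controller

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// backendController would be implemented by the ceph, minio, etc. controllers.
type backendController interface {
	// Owns reports whether this controller should handle the object.
	Owns(obj interface{}) bool
	// Reconcile drives the cluster toward the desired state for one object key.
	Reconcile(key string) error
}

// Run wires every backend controller to the same shared informer, so the
// apiserver sees a single watch while each backend keeps its own workqueue.
func Run(informer cache.SharedIndexInformer, ctrls []backendController, stopCh <-chan struct{}) {
	queues := make([]workqueue.RateLimitingInterface, len(ctrls))
	for i, c := range ctrls {
		q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
		queues[i] = q
		ctrl := c
		// All controllers receive every event; only the owning one enqueues it.
		informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				if ctrl.Owns(obj) {
					if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
						q.Add(key)
					}
				}
			},
		})
		// One worker per backend drains its queue and reconciles.
		go func(q workqueue.RateLimitingInterface, ctrl backendController) {
			for {
				key, shutdown := q.Get()
				if shutdown {
					return
				}
				if err := ctrl.Reconcile(key.(string)); err != nil {
					q.AddRateLimited(key) // retry with backoff
				} else {
					q.Forget(key)
				}
				q.Done(key)
			}
		}(q, ctrl)
	}
	go informer.Run(stopCh)
	<-stopCh
	for _, q := range queues {
		q.ShutDown()
	}
}
```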
Pros:

- Slightly easier developer burden for new storage backends since there is no need to create a new deployment to host their controller.
- Less resource burden on the K8s cluster since watchers/caches are being shared.
- Easier migration path to API aggregation in the future, if CRD usage is continued now.

Cons:

- All controllers must use the same base image since they are all running in the same process.
- If a controller needs to access a backend specific tool, it will have to schedule a Job that invokes the appropriate image (see the sketch after this list). This is similar to exec'ing a new process, but at the cluster level.
  - Note this only applies to the controller, not to the backend's daemons. Those daemons will be running a backend specific image and can directly `exec` their tools.
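As an illustration of the Job-based workaround mentioned in the cons above, the shared operator could schedule something like the following to run a backend-specific tool from that backend's image (the image, namespace, and command here are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ceph-pool-create    # hypothetical one-off task
  namespace: rook-ceph
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ceph-tools
        image: rook/ceph:v0.8.0   # backend specific image, tag illustrative
        command: ["ceph", "osd", "pool", "create", "replicapool", "128"]
```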
Each storage backend could have its own operator pod that hosts only that backend's controller, e.g., `ceph-operator.yaml`, `minio-operator.yaml`, etc. The user would decide which operators to deploy based on what storage they want to use. Each operator pod would watch the same custom resource types with its own individual watchers. A sketch of such a per-backend deployment appears after the pros and cons below.
Pros:

- Each operator can use its own image, meaning it has direct access (through `exec`) to its backend specific tools.
- Runtime isolation: one flaky backend does not impact or cause downtime for other backends.
- Privilege isolation: each backend could define its own service account and RBAC that is scoped to just its needs.

Cons:

- More difficult migration path to API aggregation in the future.
- Potentially more resource usage and load on the Kubernetes API since watchers will not be shared, but this is likely not an issue since users will deploy only the operators they need.
- Slightly more developer burden as each backend has to write its own deployment/host to manage its individual pod.
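For reference, the per-backend operator described above could be deployed with a manifest roughly like the following; the image, namespace, and service account names are purely illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-minio-operator
  namespace: rook-minio-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-minio-operator
  template:
    metadata:
      labels:
        app: rook-minio-operator
    spec:
      serviceAccountName: rook-minio-operator   # RBAC scoped to just this backend
      containers:
      - name: rook-minio-operator
        image: rook/minio:v0.8.0                # backend specific image, tag illustrative
        args: ["operator"]
```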
For storage backends that fit the patterns that Metacontroller supports (`CompositeController` and `DecoratorController`), this could be an option to incorporate into Rook. Basically, a storage backend defines its custom types and the parent/child relationships between them. The Metacontroller handles all the K8s API interactions and regularly calls into storage backend defined "hooks". The storage backend is given JSON representing the current state in K8s types and then returns JSON defining, in K8s types, what the desired state should be. The Metacontroller then makes that desired state a reality via the K8s API. This pattern does allow for fairly complicated stateful apps (e.g., Vitess) that have well defined parent/child hierarchies, and it can allow the storage backend operator to perform "imperative" operations that manipulate cluster state by launching Jobs.
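If a backend chose to build on Metacontroller, its registration might look roughly like the sketch below: a `CompositeController` that declares the parent custom resource, the children it manages, and the webhook "hooks" it implements. The resource names and hook URL are hypothetical.

```yaml
apiVersion: metacontroller.k8s.io/v1alpha1
kind: CompositeController
metadata:
  name: minio-objectstore-controller
spec:
  parentResource:
    apiVersion: minio.rook.io/v1alpha1
    resource: objectstores
  childResources:
  - apiVersion: apps/v1
    resource: statefulsets
  - apiVersion: v1
    resource: services
  hooks:
    sync:
      webhook:
        url: http://minio-hooks.rook-minio-system/sync   # backend implemented hook
```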
CRDs with an operator pod per backend: This will not be a breaking change for our current users and does not come with the deployment complexity of API aggregation. It would give each backend's operator the freedom to easily invoke its own tools packaged in its own specific image, avoiding unnecessary image bloat. It also provides both resource and privilege isolation for each backend. We would accept the burden of limited CRD functionality (which is improving over time).
We should also consider integrating metacontroller's functionality for storage backends that are compatible and can benefit from its patterns. Each storage backend can make this decision independently.
Custom resources in Kubernetes use the following naming and versioning convention:
- Group: A collection of several related types that are versioned together and can be enabled/disabled as a unit in the API (e.g., `ceph.rook.io`)
- Version: The API version of the group (e.g., `v1alpha1`)
- Kind: The specific type within the API group (e.g., `cluster`)
Putting this together with an example, the `cluster` kind from the `ceph.rook.io` API group with a version of `v1alpha1` would be referred to in full as `cluster.ceph.rook.io/v1alpha1`.
Versioning of custom resources defined by Rook is important, and we should carefully consider a design that allows resources to be versioned in a sensible way. Let's first review some properties of Rook's resources and versioning scheme that are desirable and we should aim to satisfy with this design:
- Storage backends should be independently versioned, so their maturity can be properly conveyed. For example, the initial implementation of a new storage backend should not be forced to start at a stable `v1` version.
- CRDs should mostly be defined only for resources that can be instantiated. If the user can't create an instance of the resource, then it's likely better off as a `*Spec` type that can potentially be reused across many types.
- Reuse of common types is a very good thing since it unifies the experience across storage types and it reduces the duplication of effort and code. Commonality and sharing of types and implementations is important and is another way Rook provides value to the greater storage community beyond the operators that it implements.
Note that it is not a goal to define a common abstraction that applies to the top level storage backends themselves, for instance a single `Cluster` type that covers both Ceph and Minio. We should not try to force each backend to look the same to storage admins; instead, we should focus on providing the common abstractions and implementations that storage providers can build on top of. This idea will become clearer in the following sections of this document.
With the intent for Rook's resources to fulfill the desirable properties mentioned above, we propose the following API groups:
- `rook.io`: common abstractions and implementations, in the form of `*Spec` types, that have use across multiple storage backends and types. For example, storage, network information, placement, and resource usage.
- `ceph.rook.io`: Ceph specific `Cluster` CRD type that the user can instantiate to have the Ceph controller deploy a Ceph cluster or Ceph resources for them. This Ceph specific API group allows Ceph types to be versioned independently.
- `minio.rook.io`: Similar, but for Minio.
- `cockroachdb.rook.io`: Similar, but for CockroachDB.
- `nexenta.rook.io`: Similar, but for Nexenta.
With this approach, the user experience to create a cluster would look like the following in yaml, where they are declaring and configuring a Ceph specific CRD type (from the `ceph.rook.io` API group), but with many common `*Spec` types that provide configuration and logic that is reusable across storage providers.

`ceph-cluster.yaml`:
```yaml
apiVersion: ceph.rook.io/v1
kind: Cluster
spec:
  mon:
    count: 3
    allowMultiplePerNode: false
  network:
  placement:
  resources:
  storage:
    deviceFilter: "^sd."
    config:
      storeType: "bluestore"
      databaseSizeMB: "1024"
```
Our golang strongly typed definitions would look like the following, where the Ceph specific `Cluster` CRD type has common `*Spec` fields.

`types.go`:
```go
package v1alpha1 // "github.com/rook/rook/pkg/apis/ceph.rook.io/v1alpha1"

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	rook "github.com/rook/rook/pkg/apis/rook.io/v1"
)

type Cluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`
	Spec              ClusterSpec   `json:"spec"`
	Status            ClusterStatus `json:"status"`
}

type ClusterSpec struct {
	Storage   rook.StorageScopeSpec `json:"storage"`
	Network   rook.NetworkSpec      `json:"network"`
	Placement rook.PlacementSpec    `json:"placement"`
	Resources rook.ResourceSpec     `json:"resources"`
	Mon       rook.MonSpec          `json:"mon"`
}
```
Similar to how we will not try to force commonality by defining a single `Cluster` type across all backends, we will also not define single types that describe the deployment and configuration of a backend's storage concepts. For example, both Ceph and Minio present object storage, and both Ceph and Nexenta present shared file systems. However, the implementation details for what components and configuration comprise these storage presentations are very provider specific. Therefore, it is not reasonable to define a common CRD that attempts to normalize how all providers deploy their object or file system presentations. Any commonality that can be reasonably achieved should be in the form of reusable `*Spec` types and their associated libraries.
Each provider can make a decision about how to expose their storage concepts. They could be defined as instantiable top level CRDs or they could be defined as collections underneath the top level storage provider CRD. Below are terse examples to demonstrate the two options.
Top-level CRDs:
```yaml
apiVersion: ceph.rook.io/v1
kind: Cluster
spec:
  ...
---
apiVersion: ceph.rook.io/v1
kind: Pool
spec:
  ...
---
apiVersion: ceph.rook.io/v1
kind: Filesystem
spec:
  ...
```
Collections under storage provider CRD:
```yaml
apiVersion: ceph.rook.io/v1
kind: Cluster
spec:
  pools:
  - name: replicaPool
    replicated:
      size: 1
  - name: ecPool
    erasureCoded:
      dataChunks: 2
      codingChunks: 1
  filesystems:
  - name: filesystem1
    metadataServer:
      activeCount: 1
```
The `StorageScopeSpec` type defines the boundaries or "scope" of the resources that comprise the backing storage substrate for a cluster. This could be devices, filters, directories, nodes, persistent volumes, and others. There are user requested means of selecting storage that Rook doesn't currently support that could be included in this type, such as the ability to select a device by path instead of by name, e.g., `/dev/disk/by-id/`. Also, wildcards/patterns/globbing should be supported on multiple resource types, removing the need for the current `useAllNodes` and `useAllDevices` boolean fields.
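As a sketch of how such selection could be evaluated, assuming a regex-style `deviceFilter` and a glob-style device path entry (the helper and field names are illustrative, not an existing API):

```go
// Hypothetical device selection for StorageScopeSpec; names are illustrative.
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// matchDevices returns the discovered device paths selected by either a
// regular-expression deviceFilter (e.g. "^sd.") applied to the device name,
// or a glob pattern applied to the full path (e.g. "/dev/disk/by-id/ata-*").
func matchDevices(discovered []string, deviceFilter, pathGlob string) ([]string, error) {
	var re *regexp.Regexp
	if deviceFilter != "" {
		var err error
		if re, err = regexp.Compile(deviceFilter); err != nil {
			return nil, err
		}
	}
	var selected []string
	for _, dev := range discovered {
		if re != nil && re.MatchString(filepath.Base(dev)) {
			selected = append(selected, dev)
			continue
		}
		if pathGlob != "" {
			if ok, _ := filepath.Match(pathGlob, dev); ok {
				selected = append(selected, dev)
			}
		}
	}
	return selected, nil
}

func main() {
	devices := []string{"/dev/sda", "/dev/nvme0n1", "/dev/disk/by-id/ata-abc-123"}
	selected, _ := matchDevices(devices, "^sd.", "/dev/disk/by-id/ata-*")
	fmt.Println(selected) // [/dev/sda /dev/disk/by-id/ata-abc-123]
}
```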
By encapsulating this concept as its own type, it can be reused within other custom resources of Rook. For instance, this would enable Rook to support storage of other types that could benefit from orchestration in cloud-native environments beyond distributed storage systems.
`StorageScopeSpec` could also provide an abstraction from such details as device name changes across reboots.
Most storage backends have a need to specify configuration options at the node and device level. Since the `StorageScopeSpec` type is already defining a node/device hierarchy for the cluster, it would be desirable for storage backends to include their configuration options within this same hierarchy, as opposed to having to repeat the hierarchy again elsewhere in the spec. However, this isn't completely straightforward because the `StorageScopeSpec` type is a common abstraction and does not have knowledge of specific storage backends.

A solution would be to allow backend specific properties to be defined inline within a `StorageScopeSpec` as key/value pairs. This allows arbitrary backend properties to be inserted at the node and device level while still reusing the single `StorageScopeSpec` abstraction, but it means that during deserialization these properties are not strongly typed. They would be deserialized into a golang `map[string]string`. However, an operator with knowledge of its specific backend's properties could then take that map and deserialize it into a strong type.

The yaml would look something like this:
```yaml
nodes:
- name: nodeA
  config:
    storeType: "bluestore"
  devices:
  - name: "sda"
    config:
      storeType: "filestore"
```
Note how the Ceph specific properties at the node and device level are string key/values and would be deserialized that way instead of to strong types. For example, this is what the golang struct would look like:
```go
type StorageScopeSpec struct {
	Nodes []Node `json:"nodes"`
}

type Node struct {
	Name    string            `json:"name"`
	Config  map[string]string `json:"config"`
	Devices []Device          `json:"devices"`
}

type Device struct {
	Name   string            `json:"name"`
	Config map[string]string `json:"config"`
}
```
After Kubernetes has done the general deserialization of the `StorageScopeSpec` into a strong type with weakly typed maps of backend specific config properties, the Ceph operator could easily convert this map into a strong config type that it has knowledge of. Other backend operators could do a similar thing for their node/device level config.
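A rough sketch of that conversion, reusing the `storeType` and `databaseSizeMB` keys from the examples in this document (the `StoreConfig` type, defaults, and function name are illustrative):

```go
// Hypothetical conversion of the weakly typed node/device config map into a
// Ceph specific strong type.
package main

import (
	"fmt"
	"strconv"
)

// StoreConfig is an illustrative Ceph specific view of the key/value config.
type StoreConfig struct {
	StoreType      string
	DatabaseSizeMB int
}

// storeConfigFromMap converts the generic map carried by StorageScopeSpec
// into the Ceph specific StoreConfig.
func storeConfigFromMap(config map[string]string) (StoreConfig, error) {
	sc := StoreConfig{StoreType: "bluestore"} // assumed default when unspecified
	if v, ok := config["storeType"]; ok {
		sc.StoreType = v
	}
	if v, ok := config["databaseSizeMB"]; ok {
		size, err := strconv.Atoi(v)
		if err != nil {
			return sc, fmt.Errorf("invalid databaseSizeMB %q: %v", v, err)
		}
		sc.DatabaseSizeMB = size
	}
	return sc, nil
}

func main() {
	cfg := map[string]string{"storeType": "filestore", "databaseSizeMB": "1024"}
	sc, _ := storeConfigFromMap(cfg)
	fmt.Printf("%+v\n", sc) // {StoreType:filestore DatabaseSizeMB:1024}
}
```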
As previously mentioned, the `rook.io` API group will also define some other useful `*Spec` types:

- `PlacementSpec`: Defines placement requirements for components of the storage provider, such as node and pod affinity. This is similar to the existing Ceph focused `PlacementSpec`, but in a generic way that is reusable by all storage providers. A `PlacementSpec` will essentially be a map of placement information structs that are indexed by component name.
- `NetworkSpec`: Defines the network configuration for the storage provider, such as `hostNetwork`.
- `ResourceSpec`: Defines the resource usage of the provider, allowing limits on CPU and memory, similar to the existing Ceph focused `ResourceSpec`.
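To make these shapes more concrete, below is a rough Go sketch of what the common types could look like; the field names and the reuse of core Kubernetes types are assumptions, not a final API:

```go
// Illustrative sketch of common rook.io *Spec types; not the final API.
package v1alpha2

import (
	v1 "k8s.io/api/core/v1"
)

// PlacementSpec maps a component name (e.g. "mon", "osd") to its placement.
type PlacementSpec map[string]Placement

// Placement carries the standard Kubernetes scheduling controls.
type Placement struct {
	NodeAffinity    *v1.NodeAffinity    `json:"nodeAffinity,omitempty"`
	PodAffinity     *v1.PodAffinity     `json:"podAffinity,omitempty"`
	PodAntiAffinity *v1.PodAntiAffinity `json:"podAntiAffinity,omitempty"`
	Tolerations     []v1.Toleration     `json:"tolerations,omitempty"`
}

// NetworkSpec defines provider network configuration such as host networking.
type NetworkSpec struct {
	HostNetwork bool `json:"hostNetwork"`
}

// ResourceSpec maps a component name to its CPU/memory requests and limits.
type ResourceSpec map[string]v1.ResourceRequirements
```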
Rook and the greater community would also benefit from additional types and abstractions. We should work on defining those further, but it is out of scope for this design document that is focusing on support for multiple storage backends. Some potential ideas for additional types to support in Rook:
- Snapshots, backup and policy
- Quality of Service (QoS), resource consumption (I/O and storage limits)
As more storage backends are integrated into Rook, it is preferable that all source code lives within the single `rook/rook` repository. This has a number of benefits, such as easier sharing of build logic, developer iteration when updating shared code, and readability of the full source. Multiple container images can easily be built from the single source repository, similar to how `rook/rook` and `rook/toolbox` are currently built from the same repository.

- `rook/rook` image: defines all custom resource types, generated clients, general cluster discovery information (disks), and any storage operators that do not have special tool dependencies.
- backend specific images: to avoid image bloat in the main `rook/rook` image, each backend will have its own image that contains all of its daemons and tools. These images will be used for the various daemons/components of each backend, e.g., `rook/ceph`, `rook/minio`, etc.
The current `rook/rook` repository layout appears to be sufficiently factored to enable multiple storage backend support. Some additional directories will be added to support new API versions, new custom resource types, and new storage backends. A source code layout that includes these new additions is shown below, annotated with comments about the use of each important directory:
```
- cmd                       # binaries with main entry points
  - rook                    # main command entry points for operators and daemons
    - ceph
    - minio
    - cockroachdb
  - rookflex
- pkg
  - apis
    - rook.io               # rook.io API group of common types, additional groups would be sibling dirs
      - v1alpha1            # existing version of alpha Rook API
      - v1alpha2            # new version of alpha Rook API that includes new types
    - ceph.rook.io          # ceph specific specs for cluster, file, object
      - v1alpha1
    - minio.rook.io         # minio specific specs for cluster, object
      - v1alpha1
    - cockroachdb.rook.io   # cockroachdb specific specs
      - v1alpha1
  - client
    - clientset             # generated strongly typed client code to access Rook APIs
  - daemon                  # daemons for each storage backend
    - ceph
    - minio
    - cockroachdb
  - operator                # all orchestration logic and custom controllers for each storage backend
    - ceph
      - cluster
      - file
      - object
      - pool
    - minio
    - cockroachdb
```
- Rook will enable storage providers to integrate their solutions with cloud-native environments by providing a framework of common abstractions and implementations that helps providers efficiently build reliable and well tested storage controllers: `StorageScopeSpec`, `PlacementSpec`, `ResourceSpec`, `NetworkSpec`, etc.
- Each storage provider will be versioned independently with its own API group (e.g., `ceph.rook.io`) and its own instantiable CRD type(s).
- Each storage provider will have its own operator pod that performs orchestration and management of the storage resources. This operator will use a provider specific container image with any special tools needed.
This section contains concrete examples of storage clusters as a user would define them using yaml. In addition to distributed storage clusters, we will be considering support for additional storage types in the near future.
`ceph-cluster.yaml`:
```yaml
apiVersion: ceph.rook.io/v1alpha1
kind: Cluster
metadata:
  name: ceph
  namespace: rook-ceph
spec:
  mon:
    count: 3
    allowMultiplePerNode: false
  network:
    hostNetwork: false
  placement:
  - name: "mon"
    nodeAffinity:
    podAffinity:
    podAntiAffinity:
    tolerations:
  resources:
  - name: osd
    limits:
      cpu: "500m"
      memory: "1024Mi"
    requests:
      cpu: "500m"
      memory: "1024Mi"
  storage:
    deviceFilter: "^sd."
    location:
    config:
      storeConfig:
        storeType: bluestore
        databaseSizeMB: "1024"
        metadataDevice: nvme01
    directories:
    - path: /rook/storage-dir
    nodes:
    - name: "nodeA"
      directories:
      - path: "/rook/storage-dir"
        config:  # ceph specific config at the directory level via key/value pairs
          storeType: "filestore"
    - name: "nodeB"
      devices:
      - name: "vdx"
      - fullpath: "/dev/disk/by-id/abc-123"
    - name: "machine*"  # wild cards are supported
    volumeClaimTemplates:
    - metadata:
        name: my-pvc-template
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "my-storage-class"
        resources:
          requests:
            storage: "1Gi"
```
`minio-cluster.yaml`:
```yaml
apiVersion: minio.rook.io/v1alpha1
kind: ObjectStore
metadata:
  name: minio
  namespace: rook-minio
spec:
  mode: distributed
  accessKey: AKIAIOSFODNN7EXAMPLE
  secretKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  network:
    hostNetwork: false
  placement:
  resources:
  storage:
    config:
      nodeCount: 4  # use 4 nodes in the cluster to host storage daemons
```