Kubernetes 1.27: StatefulSet Start Ordinal Simplifies Migration
Author: Peter Schuurman (Google)
Kubernetes v1.26 introduced a new, alpha-level feature for StatefulSets that controls the ordinal numbering of Pod replicas. As of Kubernetes v1.27, this feature is now beta. Ordinals can start from arbitrary non-negative numbers. This blog post will discuss how this feature can be used.
Background
StatefulSets ordinals provide sequential identities for pod replicas. When using
OrderedReady
Pod management
Pods are created from ordinal index 0
up to N-1
.
With Kubernetes today, orchestrating a StatefulSet migration across clusters is challenging. Backup and restore solutions exist, but these require the application to be scaled down to zero replicas prior to migration. In today's fully connected world, even planned application downtime may not allow you to meet your business goals. You could use Cascading Delete or On Delete to migrate individual pods, however this is error prone and tedious to manage. You lose the self-healing benefit of the StatefulSet controller when your Pods fail or are evicted.
Kubernetes v1.26 enables a StatefulSet to be responsible for a range of ordinals within a range {0..N-1} (the ordinals 0, 1, ... up to N-1). With it, you can scale down a range {0..k-1} in a source cluster, and scale up the complementary range {k..N-1} in a destination cluster, while maintaining application availability. This enables you to retain at most one semantics (meaning there is at most one Pod with a given identity running in a StatefulSet) and Rolling Update behavior when orchestrating a migration across clusters.
Why would I want to use this feature?
Say you're running your StatefulSet in one cluster, and need to migrate it out to a different cluster. There are many reasons why you would need to do this:
- Scalability: Your StatefulSet has scaled too large for your cluster, and has started to disrupt the quality of service for other workloads in your cluster.
- Isolation: You're running a StatefulSet in a cluster that is accessed by multiple users, and namespace isolation isn't sufficient.
- Cluster Configuration: You want to move your StatefulSet to a different cluster to use some environment that is not available on your current cluster.
- Control Plane Upgrades: You want to move your StatefulSet to a cluster running an upgraded control plane, and can't handle the risk or downtime of in-place control plane upgrades.
How do I use it?
Enable the StatefulSetStartOrdinal
feature gate on a cluster, and create a
StatefulSet with a customized .spec.ordinals.start
.
Try it out
In this demo, I'll use the new mechanism to migrate a StatefulSet from one Kubernetes cluster to another. The redis-cluster Bitnami Helm chart will be used to install Redis.
Tools Required:
Pre-requisites
To do this, I need two Kubernetes clusters that can both access common
networking and storage; I've named my clusters source
and destination
.
Specifically, I need:
- The
StatefulSetStartOrdinal
feature gate enabled on both clusters. - Client configuration for
kubectl
that lets me access both clusters as an administrator. - The same
StorageClass
installed on both clusters, and set as the default StorageClass for both clusters. ThisStorageClass
should provision underlying storage that is accessible from either or both clusters. - A flat network topology that allows for pods to send and receive packets to and from Pods in either clusters. If you are creating clusters on a cloud provider, this configuration may be called private cloud or private network.
-
Create a demo namespace on both clusters:
kubectl create ns kep-3335
-
Deploy a Redis cluster with six replicas in the source cluster:
helm repo add bitnami https://charts.bitnami.com/bitnami helm install redis --namespace kep-3335 \ bitnami/redis-cluster \ --set persistence.size=1Gi \ --set cluster.nodes=6
-
Check the replication status in the source cluster:
kubectl exec -it redis-redis-cluster-0 -- /bin/bash -c \ "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 myself,master - 0 1669764411000 3 connected 10923-16383 7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669764410000 3 connected 961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669764411000 1 connected 7136e37d8864db983f334b85d2b094be47c830e5 10.104.0.15:6379@16379 slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669764412595 2 connected a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669764411592 1 connected 0-5460 2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669764410000 2 connected 5461-10922
-
Deploy a Redis cluster with zero replicas in the destination cluster:
helm install redis --namespace kep-3335 \ bitnami/redis-cluster \ --set persistence.size=1Gi \ --set cluster.nodes=0 \ --set redis.extraEnvVars\[0\].name=REDIS_NODES,redis.extraEnvVars\[0\].value="redis-redis-cluster-headless.kep-3335.svc.cluster.local" \ --set existingSecret=redis-redis-cluster
-
Scale down the
redis-redis-cluster
StatefulSet in the source cluster by 1, to remove the replicaredis-redis-cluster-5
:kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 5}}'
-
Migrate dependencies from the source cluster to the destination cluster:
The following commands copy resources from
source
todestionation
. Details that are not relevant indestination
cluster are removed (eg:uid
,resourceVersion
,status
).Steps for the source cluster
Note: If using a
StorageClass
withreclaimPolicy: Delete
configured, you should patch the PVs insource
withreclaimPolicy: Retain
prior to deletion to retain the underlying storage used indestination
. See Change the Reclaim Policy of a PersistentVolume for more details.kubectl get pvc redis-data-redis-redis-cluster-5 -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .status)' > /tmp/pvc-redis-data-redis-redis-cluster-5.yaml kubectl get pv $(yq '.spec.volumeName' /tmp/pvc-redis-data-redis-redis-cluster-5.yaml) -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .spec.claimRef, .status)' > /tmp/pv-redis-data-redis-redis-cluster-5.yaml kubectl get secret redis-redis-cluster -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion)' > /tmp/secret-redis-redis-cluster.yaml
Steps for the destination cluster
Note: For the PV/PVC, this procedure only works if the underlying storage system that your PVs use can support being copied into
destination
. Storage that is associated with a specific node or topology may not be supported. Additionally, some storage systems may store addtional metadata about volumes outside of a PV object, and may require a more specialized sequence to import a volume.kubectl create -f /tmp/pv-redis-data-redis-redis-cluster-5.yaml kubectl create -f /tmp/pvc-redis-data-redis-redis-cluster-5.yaml kubectl create -f /tmp/secret-redis-redis-cluster.yaml
-
Scale up the
redis-redis-cluster
StatefulSet in the destination cluster by 1, with a start ordinal of 5:kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
-
Check the replication status in the destination cluster:
kubectl exec -it redis-redis-cluster-5 -- /bin/bash -c \ "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
I should see that the new replica (labeled
myself
) has joined the Redis cluster (the IP address belongs to a different CIDR block than the replicas in the source cluster).2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669766684000 2 connected 5461-10922 7136e37d8864db983f334b85d2b094be47c830e5 10.108.0.22:6379@16379 myself,slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669766685609 2 connected 2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 master - 0 1669766684000 3 connected 10923-16383 961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669766683600 1 connected a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669766685000 1 connected 0-5460 7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669766686613 3 connected
-
Repeat steps #5 to #7 for the remainder of the replicas, until the Redis StatefulSet in the source cluster is scaled to 0, and the Redis StatefulSet in the destination cluster is healthy with 6 total replicas.
What's Next?
This feature provides a building block for a StatefulSet to be split up across clusters, but does not prescribe the mechanism as to how the StatefulSet should be migrated. Migration requires coordination of StatefulSet replicas, along with orchestration of the storage and network layer. This is dependent on the storage and connectivity requirements of the application installed by the StatefulSet. Additionally, many StatefulSets are managed by operators, which adds another layer of complexity to migration.
If you're interested in building enhancements to make these processes easier, get involved with SIG Multicluster to contribute!