Troubleshooting StatefulSet and Persistent Volume Issues in Kubernetes After Cloud Migration (4)
TLDR: This blog post discusses common issues faced by DevOps engineers when migrating StatefulSets and Persistent Volumes across different cloud platforms. It provides a detailed troubleshooting guide, focusing on the importance of storage classes and the use of CSI drivers for external storage solutions.
In this fourth episode of the Kubernetes Troubleshooting Zero to Hero series, we look at real-world issues that DevOps engineers encounter with StatefulSets and Persistent Volumes. The post covers StatefulSets, Persistent Volumes, Persistent Volume Claims, Storage Classes, and Container Storage Interface (CSI) drivers, and walks through diagnosing and fixing a failed deployment step by step.
Understanding the Issue
A developer has reported a problem to the DevOps engineering team via a JIRA ticket. The issue arises when a StatefulSet application, which functions correctly on AWS EKS, fails to deploy on other cloud platforms like AKS (Azure Kubernetes Service) or GKE (Google Kubernetes Engine), as well as local clusters such as Minikube. The developer seeks assistance in identifying the root cause of this deployment failure.
The Role of DevOps Engineers
DevOps engineers are responsible for managing Kubernetes resources and often share YAML manifests for StatefulSets or other Kubernetes resources. When developers encounter issues, they typically create JIRA tickets to seek help from the DevOps team.
Setting Up the Environment
To troubleshoot the issue, we will use a Minikube cluster for practice. The example involves a simple StatefulSet YAML file for an Nginx web application, which includes three replicas and a Persistent Volume Claim (PVC) template using the EBS storage class.
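The manifest itself is not reproduced in this post, so the following is a hypothetical reconstruction of the setup described above, written as a Python dictionary and rendered as JSON (kubectl apply -f accepts JSON as well as YAML). The StatefulSet name web is implied by the pod name web-0 seen later; the labels, the claim template name www, the class name ebs, the mount path, and the 1Gi request are assumptions made for illustration.

```python
import json


def build_statefulset(storage_class: str = "ebs") -> dict:
    """Return a StatefulSet manifest resembling the one described in the post."""
    labels = {"app": "nginx"}  # assumed label
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "web"},
        "spec": {
            "serviceName": "nginx",  # assumed headless Service name
            "replicas": 3,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "nginx",
                        "image": "nginx",
                        "ports": [{"containerPort": 80, "name": "web"}],
                        "volumeMounts": [{
                            "name": "www",
                            "mountPath": "/usr/share/nginx/html",
                        }],
                    }],
                },
            },
            # One PVC is created per pod from this template.
            "volumeClaimTemplates": [{
                "metadata": {"name": "www"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,
                    "resources": {"requests": {"storage": "1Gi"}},
                },
            }],
        },
    }


manifest = build_statefulset()
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file gives a manifest that can be applied in the same way as the YAML version.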
Applying the Configuration
After applying the configuration using the command kubectl apply -f sample-statefulset.yaml, we check the status of the StatefulSet with kubectl get statefulset. The output indicates that while the StatefulSet is created, none of the pods are running. This prompts further investigation.
Troubleshooting Steps
Checking Pod Status
Using kubectl get pods, we find that only one pod, web-0, is present, and it is in the Pending state. To understand why, we describe the pod with kubectl describe pod web-0. The Events section shows a FailedScheduling warning stating that the pod has unbound immediate PersistentVolumeClaims: the claim the pod depends on has not been bound to a volume, so the scheduler cannot place the pod.
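The same information that kubectl describe prints is also available in the pod's status when it is fetched as JSON (for example with kubectl get pod web-0 -o json). The sketch below is one possible way to pull the scheduler's message out of that structure. The function name is hypothetical, and the sample pod is illustrative data trimmed to the fields the function reads, not real cluster output.

```python
from typing import Optional


def scheduling_blocker(pod: dict) -> Optional[str]:
    """Return the scheduler's message for a Pending pod, or None if none is found."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for condition in status.get("conditions", []):
        # PodScheduled=False means the scheduler could not place the pod.
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            return condition.get("message")
    return None


# Illustrative data only (assumed shape of a trimmed pod object).
sample_pod = {
    "metadata": {"name": "web-0"},
    "status": {
        "phase": "Pending",
        "conditions": [{
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "pod has unbound immediate PersistentVolumeClaims",
        }],
    },
}

print(scheduling_blocker(sample_pod))
```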
Understanding StatefulSet Behavior
StatefulSets behave differently from Deployments. With the default pod management policy (OrderedReady), the controller creates pods one at a time in ordinal order, and it creates the next replica only after the previous one is Running and Ready. Since web-0 is stuck in Pending, web-1 and web-2 are never created. This ordering matters for applications like databases, where the order of pod creation is significant.
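The ordering rule can be sketched as a small toy model. This is not the controller's actual code, only an illustration of why a single stuck pod blocks every replica after it; the function name and the list-of-flags representation are assumptions.

```python
from typing import List, Optional


def next_ordinal_to_create(replicas: int, ready: List[bool]) -> Optional[int]:
    """Toy model of ordered pod creation.

    `ready` holds one flag per existing pod, in ordinal order.
    Returns the ordinal to create next, or None if the controller has to
    wait or the StatefulSet already has all of its replicas.
    """
    if not all(ready):
        return None  # an existing pod is not Ready yet: wait
    if len(ready) >= replicas:
        return None  # nothing left to create
    return len(ready)


print(next_ordinal_to_create(3, []))       # no pods yet   -> 0
print(next_ordinal_to_create(3, [False]))  # web-0 Pending -> None
print(next_ordinal_to_create(3, [True]))   # web-0 Ready   -> 1
```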
Analyzing the Persistent Volume Claim
The StatefulSet does not reference a volume directly. Its PVC template produces one Persistent Volume Claim per pod, and each claim names the storage class that should provision the volume. In our case, the claim requests 1 GB of storage from the EBS storage class. Minikube typically has no storage class by that name, so nothing provisions a volume and the claim stays unbound.
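To make the relationship between the template and the claim concrete, here is a minimal sketch of the claim that a pod would receive. Claims created from a template are named after the template, the StatefulSet, and the pod's ordinal. The template contents (www, ebs, 1Gi) are assumptions, since the post does not show the manifest.

```python
import copy


def claim_for_pod(template: dict, statefulset_name: str, ordinal: int) -> dict:
    """Build the PVC a StatefulSet pod would get from one claim template."""
    name = f"{template['metadata']['name']}-{statefulset_name}-{ordinal}"
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": copy.deepcopy(template["spec"]),  # the claim inherits the template's spec
    }


# Assumed claim template, matching the description in the text.
template = {
    "metadata": {"name": "www"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "ebs",
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

claim = claim_for_pod(template, "web", 0)
print(claim["metadata"]["name"], claim["spec"]["storageClassName"])
```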
The Importance of Storage Classes
Storage classes are defined per cluster, so a class that exists on one platform may not exist on another. On AWS EKS the storage class might be EBS, while on Minikube the only available storage class is typically "standard". Running kubectl get storageclass lists the classes a cluster actually offers. To resolve the issue, we modify the StatefulSet YAML to use the standard storage class instead of EBS.
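A check like this can be automated before deploying. The sketch below compares the classes a StatefulSet asks for against a storage class list in the shape returned by kubectl get storageclass -o json. The function name is hypothetical and both inputs are illustrative, trimmed data rather than real cluster output.

```python
def missing_storage_classes(statefulset: dict, storage_class_list: dict) -> list:
    """Return the storage classes a StatefulSet requests that the cluster lacks."""
    available = {
        item["metadata"]["name"] for item in storage_class_list.get("items", [])
    }
    requested = {
        template["spec"]["storageClassName"]
        for template in statefulset["spec"].get("volumeClaimTemplates", [])
        if "storageClassName" in template["spec"]
    }
    return sorted(requested - available)


# Illustrative: what a Minikube cluster might report (assumed, trimmed).
minikube_classes = {
    "kind": "List",
    "items": [{
        "metadata": {"name": "standard"},
        "provisioner": "k8s.io/minikube-hostpath",
    }],
}

# Illustrative: only the part of the StatefulSet the function reads.
statefulset = {
    "spec": {
        "volumeClaimTemplates": [{
            "metadata": {"name": "www"},
            "spec": {"storageClassName": "ebs"},
        }],
    },
}

print(missing_storage_classes(statefulset, minikube_classes))
```

An empty result would mean every requested class exists on the target cluster.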
Modifying the YAML Configuration
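We delete the existing StatefulSet with kubectl delete -f sample-statefulset.yaml, because the PVC template of an existing StatefulSet cannot be edited in place, and update the YAML file to change the storage class from EBS to standard. After applying the updated configuration with kubectl apply -f sample-statefulset.yaml, we monitor the pod status again.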
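When the same manifest has to target several clusters, the storage class can be swapped programmatically instead of by hand. This is a minimal sketch that assumes the manifest has been loaded into a Python dictionary; the function name and the sample data are hypothetical.

```python
import copy


def with_storage_class(statefulset: dict, storage_class: str) -> dict:
    """Return a copy of the manifest with every claim template using `storage_class`."""
    updated = copy.deepcopy(statefulset)  # leave the original manifest untouched
    for template in updated["spec"].get("volumeClaimTemplates", []):
        template["spec"]["storageClassName"] = storage_class
    return updated


# Illustrative: only the part of the StatefulSet the function touches.
original = {
    "spec": {
        "volumeClaimTemplates": [{
            "metadata": {"name": "www"},
            "spec": {"storageClassName": "ebs"},
        }],
    },
}

for_minikube = with_storage_class(original, "standard")
print(for_minikube["spec"]["volumeClaimTemplates"][0]["spec"]["storageClassName"])
```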
Deleting Persistent Volume Claims
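By default, deleting a StatefulSet does not delete the Persistent Volume Claims created from its template, so the old claim that still requests the EBS storage class remains. Because the recreated pod reuses a claim with the same name, the stale claim would keep the pod pending. We list the claims with kubectl get pvc and remove them explicitly with kubectl delete pvc <pvc-name>. After ensuring that all PVCs are deleted, we reapply the updated StatefulSet configuration.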
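On a cluster with many claims it helps to select only the ones that belong to the StatefulSet before deleting anything. The sketch below filters claim names by the naming pattern of template name, StatefulSet name, and ordinal, and builds the matching kubectl delete pvc commands without running them. The function names and the sample claim names are hypothetical.

```python
import re
from typing import List


def leftover_claims(pvc_names: List[str], template_name: str, statefulset_name: str) -> List[str]:
    """Select the claims that follow the <template>-<statefulset>-<ordinal> pattern."""
    pattern = re.compile(
        rf"^{re.escape(template_name)}-{re.escape(statefulset_name)}-\d+$"
    )
    return [name for name in pvc_names if pattern.match(name)]


def delete_commands(claims: List[str]) -> List[List[str]]:
    """Build (but do not run) one kubectl delete command per claim."""
    return [["kubectl", "delete", "pvc", name] for name in claims]


# Illustrative claim names, as they might appear in kubectl get pvc.
names = ["www-web-0", "data-mongo-0", "www-web-1"]
stale = leftover_claims(names, "www", "web")
for command in delete_commands(stale):
    print(" ".join(command))
```

Each command list can be passed to subprocess.run once the selection has been reviewed.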
Successful Deployment
Once the correct storage class is specified and the PVCs are cleared, we observe that the first pod transitions to a running state. As expected, the second and third pods are created sequentially, demonstrating the StatefulSet's behavior.
External Storage Solutions and CSI Drivers
If developers wish to use external storage services not natively supported by Kubernetes, they can utilize CSI drivers. These drivers act as provisioners for external storage solutions, allowing Kubernetes to manage Persistent Volumes from various storage providers.
Installing CSI Drivers
To use an external storage service, such as NetApp, developers must install the corresponding CSI driver on their Kubernetes cluster. This driver will facilitate the creation of Persistent Volumes based on the specified storage class in the StatefulSet.
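Once a CSI driver is installed, a storage class has to point at it through its provisioner field so that claims naming that class are served by the driver. The following sketch builds such a StorageClass as a Python dictionary and renders it as JSON. The class name external-storage and the provisioner csi.example.com are placeholders, not the values of any real driver; the actual provisioner name comes from the driver's documentation.

```python
import json


def build_storage_class(name: str, provisioner: str) -> dict:
    """Return a StorageClass manifest that delegates provisioning to a CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,      # must match the installed driver's name
        # Kubernetes defaults, written out explicitly:
        "reclaimPolicy": "Delete",
        "volumeBindingMode": "Immediate",
    }


# Placeholder values (assumptions for illustration only).
storage_class = build_storage_class("external-storage", "csi.example.com")
print(json.dumps(storage_class, indent=2))
```

The StatefulSet's PVC template would then reference this class by its name.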
Conclusion
In this post, we explored how to troubleshoot issues related to StatefulSets and Persistent Volumes when migrating applications across different cloud platforms. We emphasized the importance of using the correct storage classes and the role of CSI drivers for external storage solutions. Understanding these concepts is crucial for DevOps engineers to ensure smooth deployments in diverse environments. Thank you for following along, and we look forward to seeing you in the next episode of our series.