Kubernetes troubleshooting is the process of identifying, diagnosing, and resolving issues in Kubernetes clusters, nodes, pods, or containers.
Error 1: Kubernetes Node Not Ready
Root Cause: When a worker node shuts down or crashes, all stateful pods that reside on it become unavailable, and the node status appears as NotReady.
If a node has a NotReady status for over five minutes (by default), Kubernetes changes the status of pods scheduled on it to Unknown, and attempts to schedule them on another node, where they initially show the status ContainerCreating.
How to identify the issue:
Run the following command:
kubectl get nodes
Output:
NAME STATUS AGE VERSION
mynode-1 NotReady 1h v1.2.0
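To see why the node is not ready, describe it and review the Conditions section, which reports problems such as memory pressure, disk pressure, or a kubelet that has stopped posting status:
kubectl describe node mynode-1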
How to resolve this issue:
If the failed node recovers or is rebooted by the user, the issue will resolve itself. Once the failed node recovers and rejoins the cluster, the following process takes place:
- The pod with Unknown status is deleted, and volumes are detached from the failed node.
- The pod is rescheduled on a new node; its status changes from Unknown to ContainerCreating, and the required volumes are attached.
- Kubernetes uses a five-minute timeout (by default), after which the pod will run on the node, and its status changes from ContainerCreating to Running.
If you have no time to wait, or the node does not recover, you’ll need to help Kubernetes reschedule the stateful pods on another working node. There are two ways to achieve this:
- Remove the failed node from the cluster, using the command:
kubectl delete node <node-name>
- Delete the stateful pods stuck in Unknown status, using the command:
kubectl delete pods <pod-name> --grace-period=0 --force -n <namespace>
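For example, assuming the failed node is mynode-1 (from the output above) and a stateful pod mysql-0 is stuck in the prod namespace (hypothetical pod and namespace names), the commands would look like:
kubectl delete node mynode-1
kubectl delete pods mysql-0 --grace-period=0 --force -n prod
Forcing deletion skips the graceful shutdown, so use it only when you are sure the node is not coming back; otherwise two instances of the pod could briefly run at once.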
Error 2: ImagePullBackOff
What is it?
This error means that Kubernetes attempted to pull the container image specified for a Pod or Deployment but failed; after each failed attempt, the kubelet waits longer before retrying, hence the “BackOff”.
Common reasons for this issue:
- Invalid or non-existent image name: The image name specified in the Pod or Deployment configuration may be incorrect.
- Invalid credentials: If the container registry requires authentication, the credentials may be incorrect or missing.
- Network issues: There may be network connectivity issues between the Kubernetes cluster and the container registry.
- Image permissions: The Kubernetes nodes may not have the necessary permissions to pull the image.
How to Resolve the “ImagePullBackOff” Error?
Step 1: Check Pod or Deployment Status
Start by checking the status of the Pod or Deployment:
kubectl get pods
kubectl get deployments
Step 2: Check Pod or Deployment Logs
View the logs of the Pod or Deployment to look for any error messages:
kubectl logs <pod-name>
kubectl logs deployment/<deployment-name> -c <container-name>
Step 3: Check the following parameters:
- Check Image Name: Ensure that the image name and tag specified in the Pod or Deployment configuration are correct.
- Check imagePullSecrets: If you’re using a private container registry, make sure the necessary imagePullSecrets are configured.
- Check Image Permissions: Make sure the nodes in the Kubernetes cluster have the necessary permissions to pull the image.
- Check for Image Availability: Finally, ensure that the container image is available in the specified repository and that the repository is accessible.
To inspect the configured image, pull secrets, and recent events for the object, describe it:
kubectl describe pod <pod-name>
kubectl describe deployment <deployment-name>
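For reference, this is a minimal sketch of how the image and pull secret fit together in a Deployment manifest; the registry, image tag, and secret name are hypothetical placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        # Must match the registry path and tag exactly
        image: registry.example.com/team/myapp:1.4.2
      imagePullSecrets:
      # Secret of type kubernetes.io/dockerconfigjson (hypothetical name)
      - name: regcred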
Step 4: Check Kubernetes Events:
Review the Kubernetes events for any errors related to image pulling:
kubectl get events
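If the event list is long, you can narrow it to the affected pod and sort by time:
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp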
Step 5: Check Image Registry Authentication
Verify that the credentials for accessing the container registry are correct:
kubectl describe secret <secret-name>
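If the secret is missing or holds stale credentials, you can recreate it. This sketch assumes a Docker-compatible registry; replace the placeholders with your own values:
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password>
The Pod or Deployment must then reference the secret (here named regcred) under spec.imagePullSecrets.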
Step 6: Check Network Connectivity
Ensure that the Kubernetes cluster can reach the container registry:
kubectl run -it --rm --image=busybox --restart=Never busybox -- nslookup <registry-url>
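If DNS resolves but pulls still fail, a direct HTTPS request against the registry's API endpoint can reveal TLS or proxy problems. This sketch assumes the public curlimages/curl image is reachable from your cluster:
kubectl run -it --rm --restart=Never --image=curlimages/curl curl-test --command -- curl -sv https://<registry-url>/v2/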
Step 7: Retry Pulling the Image
If everything else looks correct, try deleting the Pod or Deployment to trigger a fresh attempt to pull the image:
kubectl delete pod <pod-name>
kubectl delete deployment <deployment-name>
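For a Deployment, you can also trigger fresh pulls without deleting the object by restarting its rollout, which recreates the pods:
kubectl rollout restart deployment <deployment-name>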
Error 3: ErrImagePull / ImagePullBackOff
Root Cause: The “ErrImagePull” error occurs when Kubernetes fails to pull the specified container image from the container registry. After repeated failures, the kubelet backs off between retries and the pod status changes to ImagePullBackOff. The issue is usually straightforward to diagnose and resolve.
Why Does the ErrImagePull Error Occur?
- Incorrect Image Name: The image name specified in the pod or deployment configuration might be incorrect.
- Invalid Credentials: If the container registry requires authentication, the credentials might be incorrect or missing.
- Network Connectivity Issues: There might be network issues between the Kubernetes cluster and the container registry.
- Image Permissions: The nodes in the Kubernetes cluster might not have the necessary permissions to pull the image.
How to identify the issue:
Run the command:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
app-pod-1243 0/1 ImagePullBackOff 0 58s
How to Resolve the ErrImagePull Error?
- Wrong Image Name or Tag: This happens when the image name or tag is typed incorrectly in the pod manifest. Verify the correct image name using docker pull, and correct it in the pod manifest.
- Authentication issue with the container registry: The pod could not authenticate with the registry to retrieve the image. This can happen because of an issue in the Secret holding the credentials, or because the pod does not have an RBAC role that allows it to perform the operation. Ensure the pod and node have the appropriate permissions and Secrets, then try the operation manually using docker pull.
- Check Image permissions: Make sure the nodes in the Kubernetes cluster have the necessary permissions to pull the image.
- Check for Image Availability: Ensure that the container image exists in the specified repository and that the repository is accessible; you can confirm this with a manual pull, as shown below. If everything else looks correct, delete the pod or deployment to trigger a fresh pull attempt.
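For example, to confirm availability you can pull the image manually from a machine with registry access; the image reference below is a hypothetical placeholder:
docker pull registry.example.com/team/myapp:1.4.2
If the manual pull fails with a "not found" error, the name or tag is typically wrong; if it fails with "unauthorized", the problem is credentials rather than availability.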
Error 4: CreateContainerConfigError
Root Cause: This error occurs when a Secret or ConfigMap referenced by the pod is missing.
Secrets are Kubernetes objects used to store sensitive information such as database credentials; their values are stored base64-encoded.
ConfigMaps store data as key-value pairs in plain text, and are typically used to hold non-sensitive configuration shared by multiple pods.
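As a minimal illustration, here is a ConfigMap and a Secret of the kind a pod might reference; the names, keys, and values are hypothetical:
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-3
data:
  LOG_LEVEL: info            # plain-text key-value pair
---
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  password: cGFzc3dvcmQ=     # base64-encoded ("password")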
How to identify the issue:
1. Check the pods output:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
pod-missing-config 0/1 CreateContainerConfigError 0 1m23s
2. Get Detailed Information:
To get more information about the issue, run the following command:
kubectl describe pod pod-missing-config
Output:
Warning Failed 34s (x6 over 1m45s) kubelet
Error: configmap "configmap-3" not found
How to resolve this issue:
If the ConfigMap is missing, create it (or fix the reference in the pod spec) so the kubelet can resolve the pod's configuration, as in the example below.
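In the example above the pod references configmap-3, so creating it resolves the error; the key and value here are hypothetical:
kubectl create configmap configmap-3 --from-literal=LOG_LEVEL=info
Once the ConfigMap exists, the kubelet retries container creation and the pod should move to Running without being recreated.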
Error 5: CrashLoopBackOff
Root Cause: This status means a container in the pod starts, crashes, and is repeatedly restarted by Kubernetes with an increasing back-off delay. Common triggers include insufficient resources on the node, failure to mount the requested volumes, or an application error at startup.
How to identify the issue:
Run the following command:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
app-pod-1253 0/1 CrashLoopBackOff 4 58s
How to resolve this issue:
- Insufficient resources: if there are insufficient resources on the node, you can manually evict pods from the node or scale up your cluster to ensure more nodes are available for your pods.
- Volume mounting: if you see the issue is mounting a storage volume, check which volume the pod is trying to mount, ensure it is defined correctly in the pod manifest, and see that a storage volume with those definitions is available.
- Use of hostPort: if you are binding pods to a hostPort, you may only be able to schedule one pod per node. In most cases you can avoid using hostPort and use a Service object to enable communication with your pods, as in the sketch below.
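As a sketch of that last point, a ClusterIP Service exposes the pod without reserving a port on every node; the names and ports are hypothetical:
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp        # must match the pod's labels
  ports:
  - port: 80          # port other workloads connect to
    targetPort: 8080  # containerPort inside the pod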