by M Inamdar
Amazon EKS makes it easier to run Kubernetes workloads in the cloud, but as any platform engineer knows, production-grade reliability still demands deep visibility, sound architecture, and well-drilled troubleshooting.
In this post, I’ll walk you through some real-world EKS incidents I’ve personally resolved. You’ll find:
- Root causes (RCA)
- Troubleshooting steps
- Fixes and lessons learned
- Diagrams and code
Let’s get into it 👇
1. Node in NotReady State
Symptoms:
- kubectl get nodes → shows NotReady
- Pods evicted or stuck
- High node disk usage or kubelet crash
Fix:
# Check disk space
df -h
# Clear logs
sudo truncate -s 0 /var/log/containers/*.log
# Restart kubelet
sudo systemctl restart kubelet
# Drain and replace the node (--delete-emptydir-data supersedes the deprecated --delete-local-data)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node>
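Before clearing disk or restarting the kubelet, it helps to confirm what actually tripped the node. A minimal diagnosis sketch (the node name is a placeholder):
# Check node conditions (DiskPressure, MemoryPressure, Ready)
kubectl describe node <node>
# Recent events referencing the node
kubectl get events --field-selector involvedObject.name=<node>
# On the node itself: kubelet health and recent errors
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 min ago" | tail -n 50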
🌐 Visual: Node in NotReady State
+------------------------+
|        EKS Node        |
|------------------------|
|  Kubelet process       |
|  Disk usage > 85%      |
|  Memory pressure       |
+-----------+------------+
            |
            | Heartbeat to API server fails
            v
+-------------------------+
| Kubernetes Control Plane|
|-------------------------|
| Node marked NotReady    |
| Events generated        |
+-----------+-------------+
            |
            | Admin runs diagnostics
            v
+------------------------+
|  Resolution actions    |
|  - Clear disk/logs     |
|  - Restart kubelet     |
|  - Drain/replace node  |
+------------------------+
2. LoadBalancer Service Stuck in Pending
Symptoms:
- kubectl get svc → EXTERNAL-IP = <pending>
- No ELB visible in AWS Console
Fix:
# Tag subnets so the controller can discover them
aws ec2 create-tags --resources <subnet-id> \
  --tags Key=kubernetes.io/cluster/<cluster>,Value=shared
# Install AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  --set clusterName=<cluster-name> \
  --set serviceAccount.name=aws-load-balancer-controller \
  -n kube-system
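If the EXTERNAL-IP stays pending after this, the Service events and the controller logs usually name the exact failure (missing subnet tags, IAM permissions, and so on). A quick check, with <service-name> as a placeholder:
# Confirm the controller deployment is healthy
kubectl get deployment aws-load-balancer-controller -n kube-system
# Watch controller logs while the Service is pending
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=50
# Service events often state the provisioning error directly
kubectl describe svc <service-name>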
🌐 Visual: EKS LoadBalancer Flow
+--------------------------+
| Kubernetes Service (LB)  |
| Type: LoadBalancer       |
| Exposes app externally   |
+-----------+--------------+
            |
            | Triggers AWS ELB provisioning
            v
+--------------------------+
| aws-load-balancer-       |
| controller               |
| (Installed via Helm)     |
+-----------+--------------+
            |
            | Creates ELB in AWS
            v
+--------------------------+
| AWS Elastic LB           |
| - Internet-facing ELB    |
| - Listens on 80/443      |
+-----------+--------------+
            |
            | Forwards traffic to EKS nodes
            v
+--------------------------+
| EKS Worker Nodes         |
| - Backed by Target Group |
| - Runs app pods          |
+--------------------------+
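For reference, here is a minimal sketch of a Service that kicks off this flow. The annotations assume the AWS Load Balancer Controller managing an internet-facing NLB; the app name and ports are placeholders:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-app   # placeholder
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  selector:
    app: my-app   # placeholder
  ports:
    - port: 80
      targetPort: 8080   # placeholder
EOF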
3. Pods in CrashLoopBackOff
Symptoms:
- Application crashes repeatedly
- kubectl describe pod shows BackOff
- Logs show stack traces or missing configs
Fix:
# Check logs (use --previous to see the crashed container instance)
kubectl logs <pod> -c <container>
kubectl logs <pod> -c <container> --previous
# Edit deployment and fix the root cause
kubectl edit deployment <deployment-name>
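The exit code narrows the cause quickly (137 usually means OOMKilled; 1 is typically an application error). A hedged way to pull it out, with <pod> as a placeholder:
# Show the last termination reason and exit code of the first container
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'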
🌐 Visual: CrashLoopBackOff Lifecycle
+------------------------+
|       Kubernetes       |
|     Deployment/Job     |
+----------+-------------+
           |
           | Schedules Pod
           v
+------------------------+
|       Pod Starts       |
|  init containers run   |
+----------+-------------+
           |
           | Starts Main Container
           v
+------------------------+
| Container Crashes      | <---- App bug, bad config, secret missing
| Exit Code ≠ 0          |
+----------+-------------+
           |
           | Kubernetes Restarts Pod
           v
+------------------------+
| CrashLoopBackOff Timer |
+------------------------+
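The "secret missing" branch of the diagram is worth its own check: a pod referencing a Secret or ConfigMap that does not exist will keep failing until the object appears. A sketch with placeholder names:
# List Secret/ConfigMap names referenced via envFrom (missing fields print blank)
kubectl get pod <pod> -o jsonpath='{range .spec.containers[*].envFrom[*]}{.secretRef.name}{" "}{.configMapRef.name}{"\n"}{end}'
# Then confirm each referenced object exists
kubectl get secret <secret-name>
kubectl get configmap <configmap-name>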
4. API Server Latency & 5xx Errors
Symptoms:
- kubectl commands slow or fail
- High API server metrics in CloudWatch
- Prometheus or controllers hammering the API
Fix:
# Example Prometheus config: scrape less aggressively
global:
  scrape_interval: 60s
- Batch CRD updates
- Add jitter/delay in loops (see the sketch below)
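Where custom scripts or controllers poll the API in a tight loop, adding jitter spreads the load. A minimal shell sketch (the polled resource is a placeholder):
# Poll with random jitter so many clients don't hit the API server in lockstep
while true; do
  kubectl get deployments -A > /dev/null
  sleep $(( 30 + RANDOM % 30 ))   # wait 30-59s between polls
done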
🌐 Visual: API Server Under Stress
+-----------------------+
|        Clients        |
| - kubectl, CI/CD      |
| - Prometheus scrapes  |
| - Controllers         |
+-----------+-----------+
            |
            | High-frequency requests
            v
+----------------------------+
| Kubernetes API Server      |
| - Latency / 5xx errors     |
| - Queued requests          |
+-----------+----------------+
            |
            v
+-----------------------------+
|            etcd             |
| - Slow writes/timeouts      |
+-----------------------------+
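To see which verbs and resources are actually slow, the API server exposes its own metrics endpoint. A quick sample (your user needs RBAC access to /metrics; the grep pattern is just one example):
# Sample request-latency metrics straight from the API server
kubectl get --raw /metrics | grep apiserver_request_duration_seconds | head -n 20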
5. DNS Resolution Fails in Pod
Symptoms:
- App can’t resolve DNS (e.g., can’t reach mydb.default.svc)
- ping or nslookup fails inside the pod
- resolv.conf points to broken CoreDNS
Fix:
kubectl rollout restart deployment coredns -n kube-system
# Edit CoreDNS config if needed
kubectl edit configmap coredns -n kube-system
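To reproduce the failure from inside the cluster and check CoreDNS health in one pass (the busybox image/tag is an assumption; any image with nslookup works):
# Throwaway pod for DNS testing
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup mydb.default.svc.cluster.local
# CoreDNS pod status and recent logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20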
🌐 Visual: DNS Failure Inside Pod
+--------------------+     +---------------------+
| Pod in EKS Node    | --> | CoreDNS Pod         |
| /etc/resolv.conf   |     | Resolves DNS queries|
| nameserver 10.x.x.x|     +----------+----------+
+--------+-----------+                |
         |                            v
         |                 +---------------------+
         |                 | VPC Network Layer   |
         |                 | - NACL / SG rules   |
         +---------------->+---------------------+
                                      |
                                      v
                              DNS Query Fails
                        (Timeout or No Response)
                                      |
                                      v
                       App Failure / Connection Error
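One sanity check that ties the diagram together: the nameserver in the pod's resolv.conf should match the ClusterIP of the kube-dns Service fronting CoreDNS. A sketch with <pod> as a placeholder:
# These two values should agree
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}{"\n"}'
kubectl exec <pod> -- cat /etc/resolv.conf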
💡 Final Takeaways
- Always run postmortems
- Ensure IAM roles, security groups, and subnet tags are configured correctly
- Monitor metrics, logs, and event streams
- Automate node drain and self-healing
About the Author
Mustkhim Inamdar
Cloud-Native DevOps Architect | Platform Engineer | CI/CD Specialist
Passionate about automation, scalability, and next-gen tooling. With years of experience across Big Data, Cloud Operations (AWS), CI/CD, and DevOps for automotive systems, I’ve delivered robust solutions using tools like Terraform, Jenkins, Kubernetes, LDRA, Polyspace, MATLAB/Simulink, and more.
I love exploring emerging tech like GitOps, MLOps, and Generative AI, and sharing practical insights from real-world projects.
🔗 LinkedIn
🔗 GitHub
Do bookmark ⭐ if you found this helpful. Comment below if you want me to share full runbooks or reusable Terraform modules I’ve built for EKS production clusters.