Troubleshooting Real-World AWS EKS Issues in Production

by M Inamdar

Amazon EKS makes it easier to run Kubernetes workloads in the cloud, but as any platform engineer knows, production-grade reliability still demands deep visibility, sound architecture, and well-drilled troubleshooting.

In this post, I’ll walk you through some real-world EKS incidents I’ve personally resolved. You’ll find:

  • Root causes (RCA)
  • Troubleshooting steps
  • Fixes and lessons learned
  • Diagrams and code

Let’s get into it 👇

1. Node in NotReady State

Symptoms:

  • kubectl get nodes → shows NotReady
  • Pods evicted or stuck
  • High node disk usage or kubelet crash

Fix:

# Check disk space
df -h

# Clear logs
sudo truncate -s 0 /var/log/containers/*.log

# Restart kubelet
sudo systemctl restart kubelet

# Replace node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node>
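Before draining, it helps to confirm what actually pushed the node into NotReady. A minimal diagnostic sketch (the function name is mine; it assumes kubectl is configured against your cluster):

```shell
# Sketch: summarise why a node went NotReady. Run as: node_notready_report <node>
node_notready_report() {
  local node="$1"
  # Kubelet-reported conditions: Ready, DiskPressure, MemoryPressure, PIDPressure
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
  # Recent node events usually name the exact failure
  kubectl get events --field-selector involvedObject.name="$node" \
    --sort-by=.lastTimestamp | tail -n 10
}
```

If conditions show `DiskPressure=True`, start with the disk cleanup above; if the node simply stopped posting status, go straight to the kubelet.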

🌐 Visual: Node in NotReady State

+------------------------+
|      EKS Node          |
|------------------------|
|  Kubelet process       |
|  Disk usage > 85%      |
|  Memory pressure       |
+-----------+------------+
            |
            | Heartbeat to API server fails
            v
+--------------------------+
| Kubernetes Control Plane |
|--------------------------|
| Node marked NotReady     |
| Events generated         |
+-----------+--------------+
            |
            | Admin runs diagnostics
            v
+--------------------------+
| Resolution actions       |
| - Clear disk/logs        |
| - Restart kubelet        |
| - Drain/replace node     |
+--------------------------+

2. LoadBalancer Service Stuck in Pending

Symptoms:

  • kubectl get svc → EXTERNAL-IP = <pending>
  • No ELB visible in AWS Console

Fix:

# Tag subnets
aws ec2 create-tags --resources <subnet-id> \
  --tags Key=kubernetes.io/cluster/<cluster>,Value=shared

# Install AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  --set clusterName=<cluster-name> \
  --set serviceAccount.name=aws-load-balancer-controller \
  -n kube-system
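If the service still shows `<pending>` after the controller is installed, the service events and controller logs usually name the blocker. A quick-check sketch (the function name is mine):

```shell
# Sketch: find out why an ELB was not provisioned.
# Run as: lb_pending_debug <service> [namespace]
lb_pending_debug() {
  local svc="$1" ns="${2:-default}"
  # Service events report failures such as missing subnet tags or IAM permissions
  kubectl describe svc "$svc" -n "$ns" | sed -n '/^Events:/,$p'
  # Controller logs show the underlying AWS API errors
  kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=50
}
```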

🌐 Visual: EKS LoadBalancer Flow

+--------------------------+
|  Kubernetes Service (LB) |
|  Type: LoadBalancer      |
|  Exposes app externally  |
+-----------+--------------+
            |
            | Triggers AWS ELB provisioning
            v
+--------------------------+
|   aws-load-balancer-     |
|       controller         |
| (Installed via Helm)     |
+-----------+--------------+
            |
            | Creates ELB in AWS
            v
+--------------------------+
|     AWS Elastic LB       |
| - Internet-facing ELB    |
| - Listens on 80/443      |
+-----------+--------------+
            |
            | Forwards traffic to EKS nodes
            v
+--------------------------+
|     EKS Worker Nodes     |
| - Backed by Target Group |
| - Runs app pods          |
+--------------------------+

3. Pods in CrashLoopBackOff

Symptoms:

  • Application crashes repeatedly
  • kubectl describe pod shows BackOff
  • Logs show stack traces or missing configs

Fix:

# Check logs
kubectl logs <pod> -c <container>

# Edit deployment and fix
kubectl edit deployment <deployment-name>
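When the container restarts faster than you can read its logs, pull the previous instance's logs and the last termination state instead. A sketch (the function name is mine; it assumes the first container in the pod is the crashing one):

```shell
# Sketch: inspect a crash-looping pod. Run as: crashloop_debug <pod> <container>
crashloop_debug() {
  local pod="$1" container="$2"
  # Logs from the previous (crashed) container instance, not the restarting one
  kubectl logs "$pod" -c "$container" --previous
  # Last exit code and reason, e.g. "137 OOMKilled" or "1 Error"
  kubectl get pod "$pod" -o \
    jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
}
```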

🌐 Visual: CrashLoopBackOff Lifecycle

+------------------------+
|      Kubernetes        |
|     Deployment/Job     |
+----------+-------------+
           |
           | Schedules Pod
           v
+------------------------+
|        Pod Starts      |
|  init containers run   |
+----------+-------------+
           |
           | Starts Main Container
           v
+------------------------+
|  Container Crashes     |  <---- App bug, bad config, secret missing
|  Exit Code ≠ 0         |
+----------+-------------+
           |
           | Kubernetes Restarts Pod
           v
+------------------------+
| CrashLoopBackOff Timer |
+------------------------+

4. API Server Latency & 5xx Errors

Symptoms:

  • kubectl commands slow or fail
  • High API server metrics in CloudWatch
  • Prometheus or controllers hammering the API server

Fix:

# Example Prometheus config: lower the scrape frequency
global:
  scrape_interval: 60s

Other mitigations:

  • Batch CRD updates
  • Add jitter/delay in controller loops
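The jitter idea can be sketched in plain shell; a real controller would do the equivalent inside its reconcile loop (values below are illustrative):

```shell
# Sketch: randomise the delay between polls so many clients don't hit the
# API server at the same instant (thundering herd).
jitter_sleep() {
  local max="$1"
  # sleep a random 0..max-1 seconds before the next iteration
  sleep "$(( RANDOM % max ))"
}

# Usage in a polling loop (replace the kubectl call with your real work):
# while true; do
#   kubectl get pods >/dev/null
#   jitter_sleep 30
# done
```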

🌐 Visual: API Server Under Stress

+-----------------------+
|      Clients          |
| - kubectl, CI/CD      |
| - Prometheus scrapes  |
| - Controllers         |
+-----------+-----------+
            |
            | High-frequency requests
            v
+----------------------------+
|   Kubernetes API Server    |
| - Latency / 5xx errors     |
| - Queued requests          |
+-----------+----------------+
            |
            v
+-----------------------------+
|            etcd             |
| - Slow writes/timeouts      |
+-----------------------------+

5. DNS Resolution Fails in Pod

Symptoms:

  • App can’t resolve DNS (e.g., can’t reach mydb.default.svc)
  • ping or nslookup fails inside pod
  • resolv.conf points to broken CoreDNS

Fix:

# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system

# Edit CoreDNS config if needed
kubectl edit configmap coredns -n kube-system
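To confirm whether DNS is actually broken from inside the cluster, a throwaway pod is the quickest probe. A sketch (the function name and busybox image tag are my assumptions):

```shell
# Sketch: probe cluster DNS from a short-lived pod, then check CoreDNS health.
dns_debug() {
  # Resolve the API service name from inside the cluster
  kubectl run dns-test --rm -i --restart=Never --image=busybox:1.36 -- \
    nslookup kubernetes.default.svc.cluster.local
  # CoreDNS pod status and recent logs
  kubectl get pods -n kube-system -l k8s-app=kube-dns
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
}
```

If the lookup succeeds but the app still fails, suspect the app's own resolver config rather than CoreDNS.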

🌐 Visual: DNS Failure Inside Pod

+--------------------+        +---------------------+
|   Pod in EKS Node  |  -->   |    CoreDNS Pod      |
| /etc/resolv.conf   |        | Resolves DNS queries|
| nameserver 10.x.x.x|        +----------+----------+
+--------+-----------+                   |
         |                               v
         |                     +---------------------+
         |                     |  VPC Network Layer  |
         |                     | - NACL / SG rules   |
         +-------------------->+---------------------+
                                |
                                v
                        DNS Query Fails
                      (Timeout or No Response)
                                |
                                v
                    App Failure / Connection Error

💡 Final Takeaways

  • Always run postmortems
  • Ensure IAM/SG/Subnet configs are in place
  • Monitor metrics, logs, and event streams
  • Automate node drain and self-healing

About the Author

Mustkhim Inamdar
Cloud-Native DevOps Architect | Platform Engineer | CI/CD Specialist
Passionate about automation, scalability, and next-gen tooling. With years of experience across Big Data, Cloud Operations (AWS), CI/CD, and DevOps for automotive systems, I’ve delivered robust solutions using tools like Terraform, Jenkins, Kubernetes, LDRA, Polyspace, MATLAB/Simulink, and more.

I love exploring emerging tech like GitOps, MLOps, and Generative AI, and sharing practical insights from real-world projects.
🔗 LinkedIn
🔗 GitHub

Do bookmark ⭐ if you found this helpful. Comment below if you want me to share full runbooks or reusable Terraform modules I’ve built for EKS production clusters.
