Design Highly Available And / Or Fault-Tolerant Architectures

Exam Guide: Solutions Architect – Associate
🧱 Domain 2: Design Resilient Architectures
📘 Task Statement 2.2

🎯 Designing Highly Available And Fault Tolerant Architectures is about keeping workloads running despite failures.

  • High Availability (HA): the system recovers quickly from component failures, with little or no downtime
  • Fault Tolerance (FT): the system keeps operating with no interruption when a component fails

Highly Available usually means Multi-AZ + load balancing + managed services + no single points of failure.
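
A rough way to see why redundancy across AZs helps is to treat each copy of a tier as independently available with some probability. The sketch below assumes that independence (an idealization) and uses made-up availability figures, not AWS SLA values. The helper names `parallel` and `series` are illustrative.

```python
from math import prod

def parallel(availabilities):
    """Redundant copies: the tier is up if at least one copy is up."""
    return 1 - prod(1 - a for a in availabilities)

def series(availabilities):
    """Chained tiers: the workload is up only if every tier is up."""
    return prod(availabilities)

# Hypothetical figure: a single instance that is available 99% of the time.
single = 0.99
two_azs = parallel([single, single])         # one copy in each of two AZs
stack = series([two_azs, two_azs, two_azs])  # web, app, and database tiers

print(f"single instance: {single:.4%}")
print(f"two AZs:         {two_azs:.4%}")
print(f"three tiers:     {stack:.4%}")
```

Redundancy raises the availability of a tier, while every extra tier in the request path lowers the availability of the whole, which is why each critical tier needs its own redundancy.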

Knowledge

1 | AWS Global Infrastructure

AZs, Regions, Route 53

1 Availability Zones (AZs): isolated failure domains within a Region

2 Regions: separate geographic areas for disaster recovery

3 Amazon Route 53: DNS-based routing and health checks (common for regional failover)

“Must survive an AZ failure” → Multi-AZ design.

“Must survive a regional outage” → Multi-region DR + Route 53 failover.

2 | AWS Managed Services With Appropriate Use Cases

This bullet exists because managed services often include built-in HA and scaling, which reduces your operational risk.

Even if services like Comprehend or Polly aren’t HA topics by themselves, the exam tests the principle:

Prefer managed services when you want higher reliability with less custom work.

3 | Basic Networking Concepts

Route Tables

HA and FT depend on correct routing:
1 Public subnets route to an Internet Gateway (IGW)
2 Private subnets may route outbound via NAT Gateway
3 Multi-AZ designs require correct subnet/routing per AZ
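
The routing rules above can be checked mechanically. The following is a minimal sketch over an in-memory model of subnets. The `Subnet` class, the `igw-`/`nat-` id prefixes, and the rule that a private subnet should use a NAT Gateway in its own AZ are assumptions made for illustration; nothing here calls AWS.

```python
from dataclasses import dataclass

DEFAULT_ROUTE = "0.0.0.0/0"

@dataclass
class Subnet:
    name: str
    az: str
    public: bool
    routes: dict  # destination CIDR -> target id, e.g. "igw-1" or "nat-a"

def routing_problems(subnets, nat_az):
    """Return findings as strings; nat_az maps a NAT gateway id to its AZ."""
    problems = []
    for s in subnets:
        target = s.routes.get(DEFAULT_ROUTE)
        if s.public:
            # A public subnet needs a default route to an Internet Gateway.
            if target is None or not target.startswith("igw-"):
                problems.append(f"{s.name}: public subnet has no default route to an IGW")
        elif target is not None and target.startswith("nat-"):
            # A private subnet that depends on a NAT in another AZ
            # loses outbound access if that AZ fails.
            if nat_az.get(target) != s.az:
                problems.append(f"{s.name}: default route uses {target} in another AZ")
    return problems
```

A private subnet with no default route at all is accepted here, since outbound internet access is optional for private subnets.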

4 | Disaster Recovery Strategies

RPO/RTO, backup-restore, pilot light, warm standby, active-active

Know these by cost vs recovery speed:

DR strategy | What it is | Typical RTO/RPO | Cost
Backup & Restore | Restore from backups into a new environment | Slow RTO, higher RPO | Lowest
Pilot Light | Minimal core services running (e.g., DB + minimal infra) | Medium RTO, medium RPO | Low–Medium
Warm Standby | Scaled-down but fully functional stack always running | Faster RTO, low RPO | Medium–High
Active-Active | Both Regions serve traffic | Lowest RTO/RPO | Highest

If RTO/RPO are strict, the answer moves toward warm standby / active-active.
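
The cost versus recovery-speed trade-off can be written as a small decision function. This is only a sketch: the minute thresholds are illustrative assumptions, not AWS definitions, and a real decision would also weigh budget and data replication options.

```python
def pick_dr_strategy(rto_minutes, rpo_minutes):
    """Map recovery targets to the cheapest strategy that plausibly meets them.

    The cut-off values are hypothetical and exist only to show the ordering.
    """
    tightest = min(rto_minutes, rpo_minutes)
    if tightest <= 1:
        return "Active-Active"     # near-zero downtime and data loss
    if tightest <= 10:
        return "Warm Standby"      # minutes
    if tightest <= 60:
        return "Pilot Light"       # tens of minutes
    return "Backup & Restore"      # hours

for rto, rpo in [(0.5, 0.5), (5, 15), (45, 60), (24 * 60, 24 * 60)]:
    print(f"RTO {rto} min, RPO {rpo} min -> {pick_dr_strategy(rto, rpo)}")
```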

5 | Distributed Design Patterns

Common Resilience Patterns

1 Retry with backoff: avoid thundering herd
2 Timeouts: prevent resource exhaustion
3 Circuit breaker / bulkhead: limit cascade failures
4 Queue-based load leveling: SQS
5 Idempotency: safe retries
6 Multi-AZ deployment: for every critical tier
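
As an illustration of the first pattern, here is a minimal sketch of retry with exponential backoff and random jitter. `TransientError` is a hypothetical exception standing in for throttling or timeout errors, and the delay values are arbitrary. The `sleep` and `rng` parameters exist only so the behavior can be tested without waiting.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (throttling, timeouts)."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(); on TransientError wait a random, growing delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # The upper bound doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: a random delay below the cap spreads clients out,
            # so they do not all retry at the same instant.
            sleep(cap * rng())
```

Retrying is only safe when the operation is idempotent, which is why the two patterns appear together in the list.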

6 | Failover Strategies

Ways Failover Happens On AWS

1 Load balancer failover across targets in multiple AZs within a Region
2 Database failover: RDS Multi-AZ
3 DNS failover: Route 53 health checks across Regions
4 Client-side failover: apps try secondary endpoints

“Fail over between Regions” → Route 53 failover routing (or latency-based + health checks).
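
To make DNS failover concrete, the sketch below builds the change batch for a primary/secondary pair of failover records. The dictionary keys follow the Route 53 `ChangeResourceRecordSets` request shape. The domain name, IP addresses (taken from documentation ranges), set identifiers, and health check id are placeholders, and plain A records are used for simplicity where a real setup would often use alias records pointing at load balancers.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one change for a failover A record; role is PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,   # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,                # a short TTL lets clients notice failover sooner
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

# Placeholder values throughout.
change_batch = {"Changes": [
    failover_record("app.example.com.", "192.0.2.10", "PRIMARY",
                    "primary-region", health_check_id="hc-placeholder"),
    failover_record("app.example.com.", "198.51.100.10", "SECONDARY",
                    "secondary-region"),
]}

# With boto3 installed and credentials configured, this could be applied as:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<hosted zone id>", ChangeBatch=change_batch)
```

Route 53 answers with the primary record while its health check passes and switches to the secondary when it fails.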

7 | Immutable Infrastructure

Immutable means you don’t patch servers in place, you replace them:
1 Build a new AMI/container image
2 Deploy new instances/tasks
3 Terminate old ones

Benefits:

  • Consistency
  • Faster recovery
  • Lower configuration drift
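
The build, deploy, terminate cycle can be sketched as a small simulation. Nothing here calls AWS: the fleet is a list of dictionaries, `is_healthy` stands in for a load balancer health check, and the instance ids are made up.

```python
def rolling_replace(fleet, new_image, is_healthy, batch_size=1):
    """Replace instances running an old image with new ones, batch by batch.

    fleet is a list of dicts such as {"id": "i-1", "image": "ami-old"}.
    Returns a new list; the input list is not modified.
    Raises RuntimeError if a replacement fails its health check.
    """
    fleet = list(fleet)
    old = [inst for inst in fleet if inst["image"] != new_image]
    launched = 0
    for start in range(0, len(old), batch_size):
        batch = old[start:start + batch_size]
        replacements = []
        for _ in batch:
            launched += 1
            replacements.append({"id": f"new-{launched}", "image": new_image})
        if not all(is_healthy(r) for r in replacements):
            raise RuntimeError("replacement unhealthy; old instances kept")
        fleet.extend(replacements)   # bring new capacity into service first
        for inst in batch:
            fleet.remove(inst)       # then terminate the old instances
    return fleet
```

No instance is ever modified in place, so every running server matches a known image, which is where the consistency and low drift come from.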

“Ensure infrastructure integrity and repeatability” → IaC + immutable deployments.

8 | Load Balancing Concepts

Application Load Balancer

  • ALB spreads traffic across targets in multiple AZs
  • Helps remove single-instance failure as a SPOF
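
The idea of spreading traffic while skipping failed targets can be sketched in a few lines. This is a toy round-robin picker, not how ALB is implemented. The class name and target fields are assumptions made for illustration.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotate through targets, skipping unhealthy ones."""

    def __init__(self, targets):
        # Each target is a dict such as {"id": "a", "az": "az-1", "healthy": True}.
        self.targets = list(targets)
        self._next = 0

    def pick(self):
        """Return the next healthy target, or raise if none is healthy."""
        for _ in range(len(self.targets)):
            target = self.targets[self._next % len(self.targets)]
            self._next += 1
            if target["healthy"]:
                return target
        raise RuntimeError("no healthy targets")
```

If every target in one AZ is marked unhealthy, requests keep flowing to the targets in the other AZ, which is the property the exam questions are looking for.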

9 | Proxy Concepts

Amazon RDS Proxy

RDS Proxy improves reliability, especially for spiky or serverless workloads, by:
1 Pooling and reusing DB connections
2 Reducing DB overload due to connection storms
3 Improving failover behavior for some patterns

“Lambda causes too many DB connections” → RDS Proxy.
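
Connection pooling is the core idea here. The sketch below shows the general technique with a fixed-size pool, not RDS Proxy's actual implementation. `connect` is a hypothetical factory that opens one database connection.

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Fixed-size pool: callers borrow a connection and hand it back."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())   # open all connections up front

    @contextmanager
    def connection(self, timeout=1.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # so a burst of callers waits instead of opening new connections.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)        # always return it to the pool
```

However many callers arrive, the database only ever sees `size` connections, which is what protects it from a connection storm.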

10 | Service Quotas And Throttling

Standby Environments

In DR scenarios, your standby Region or account must have enough quota to scale up.

Know that you can:

  • Check and adjust Service Quotas
  • Design for throttling with retries, backoff, and buffering
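
A simple pre-failover check is to compare what the standby would need at full scale against its quotas. The sketch below works on plain dictionaries. The quota names and numbers are made up, and in practice the values would come from the Service Quotas console or API.

```python
def quota_gaps(required, quotas, in_use=None):
    """Return {quota_name: shortfall} for quotas that cannot cover failover demand."""
    in_use = in_use or {}
    gaps = {}
    for name, needed in required.items():
        available = quotas.get(name, 0) - in_use.get(name, 0)
        if needed > available:
            gaps[name] = needed - available
    return gaps

# Hypothetical figures for a standby Region.
required = {"on_demand_vcpus": 256, "elastic_ips": 6}
quotas = {"on_demand_vcpus": 128, "elastic_ips": 10}
in_use = {"on_demand_vcpus": 16, "elastic_ips": 2}

print(quota_gaps(required, quotas, in_use))
```

Any non-empty result is a quota increase to request before a disaster, not during one.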

11 | Storage Options And Characteristics

Durability And Replication

Storage durability affects architecture choices:
1 S3 is highly durable and regional with options like versioning and replication
2 EBS is replicated within an AZ and you can send snapshots to S3 for durability
3 EFS is regional and multi-AZ within a Region

12 | Workload Visibility

AWS X-Ray

Visibility supports HA by helping you detect and diagnose failures:

  • CloudWatch metrics/alarms for health and scaling
  • X-Ray for tracing distributed requests and finding bottlenecks

Skills

A | Determine Automation Strategies To Ensure Infrastructure Integrity

Look for:
1 Infrastructure as Code (CloudFormation/CDK/Terraform)
2 Automated deployments (blue/green, rolling)
3 Auto Scaling + health checks
4 Automated recovery actions (replace unhealthy instances/tasks)
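
Items 3 and 4 amount to a reconcile loop: compare healthy capacity with desired capacity, then replace what is missing. The sketch below shows that logic on in-memory data. It is an illustration of the idea, not the Auto Scaling algorithm, and the function name and fields are assumptions.

```python
from collections import Counter

def reconcile(instances, desired, azs):
    """Return (ids_to_terminate, launches_per_az) to restore healthy capacity.

    instances is a list of dicts such as
    {"id": "i-1", "az": "az-a", "healthy": True}.
    """
    unhealthy = [inst["id"] for inst in instances if not inst["healthy"]]
    per_az = Counter({az: 0 for az in azs})
    for inst in instances:
        if inst["healthy"]:
            per_az[inst["az"]] += 1
    launches = Counter()
    missing = max(0, desired - sum(per_az.values()))
    for _ in range(missing):
        # Launch into the AZ with the fewest healthy instances to stay balanced.
        az = min(azs, key=lambda a: per_az[a])
        per_az[az] += 1
        launches[az] += 1
    return unhealthy, dict(launches)
```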

B | Determine Services Required For HA/FT Across Regions or AZs

Common choices:

  • Multi-AZ: ALB + Auto Scaling + Multi-AZ database (RDS Multi-AZ)
  • Multi-region: Route 53 + replicated data + standby/active environment

“AZ outage must not cause downtime” → Multi-AZ everything.

C | Identify Metrics Based On Business Requirements

Tie Monitoring To User-Impacting KPIs:

1 Availability / error rate (5xx)
2 Latency p95/p99
3 Queue depth / age (SQS)
4 CPU/memory/connections (compute/DB)
5 RPO/RTO compliance signals (backup success, replication lag)
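
The first two KPIs are easy to compute from raw request data. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions, so monitoring tools may report slightly different values.

```python
import math

def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Tail percentiles matter because an average can look healthy while the slowest 1% of users wait many times longer.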

D | Implement Designs To Mitigate Single Points Of Failure

Remove Single Points of Failure

1 Multi-AZ deployments
2 Redundant NAT Gateways (one per AZ as a best practice)
3 Multi-AZ databases
4 Avoid single instance “pet” servers
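
A quick way to spot single points of failure in a design is to list, per tier, the AZs that hold a copy. The sketch below flags any tier present in fewer than two AZs. The tier and AZ names are hypothetical.

```python
def single_points_of_failure(deployment):
    """deployment maps a tier name to the list of AZs holding a copy of it."""
    return sorted(tier for tier, azs in deployment.items() if len(set(azs)) < 2)

design = {
    "web": ["az-a", "az-b"],
    "nat": ["az-a"],            # one NAT Gateway shared by all AZs
    "db": ["az-a", "az-a"],     # two instances, but in the same AZ
}
print(single_points_of_failure(design))
```

Two instances in the same AZ still count as a single point of failure here, because one AZ outage takes out both.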

E | Ensure Durability And Availability Of Data

Backups

1 Automated backups (RDS)
2 Snapshots (EBS, RDS)
3 S3 versioning + replication where required
4 AWS Backup policies when asked for centralized backup
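
Backups only protect an RPO if they are recent enough. The sketch below flags resources whose newest backup is older than the RPO allows. The resource names and timestamps are made up, and the backup times would normally come from the backup service rather than a hand-written dictionary.

```python
from datetime import datetime, timedelta, timezone

def rpo_violations(last_backup, rpo, now=None):
    """Return resources whose newest backup is older than their RPO allows.

    last_backup maps resource name -> datetime of the most recent backup.
    rpo maps resource name -> timedelta of acceptable data loss.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, taken in last_backup.items()
        if now - taken > rpo[name]
    )
```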

F | Select An Appropriate DR Strategy To Meet Business Requirements

Use RTO/RPO to pick:
1 Backup/Restore (cheap, slow)
2 Pilot Light (medium)
3 Warm Standby (faster)
4 Active-Active (fastest, expensive)

G | Improve Reliability Of Legacy Apps

When app changes are not possible, use infrastructure patterns:

1 Put app behind ALB
2 Use Auto Scaling groups to replace failed instances
3 Use RDS Proxy to stabilize DB connections
4 Use caching to reduce backend load
5 Use DNS failover (Route 53) for regional DR

H | Use Purpose-Built AWS Services

Use managed services to reduce failure modes:

  • ALB, Auto Scaling, Route 53
  • RDS Multi-AZ, DynamoDB (managed HA)
  • SQS/SNS for decoupling spikes and failures
  • CloudFront for edge caching and origin protection
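
To see how a queue decouples a traffic spike from the backend, consider this small simulation of load leveling. The arrival counts and worker rate are invented, and a real SQS consumer would poll and delete messages rather than subtract numbers.

```python
def simulate_queue(arrivals, worker_rate):
    """Return the queue depth after each tick.

    Per tick, arrivals[t] messages are enqueued and the workers
    process up to worker_rate of the messages waiting.
    """
    backlog = 0
    depths = []
    for arriving in arrivals:
        backlog += arriving
        backlog -= min(backlog, worker_rate)
        depths.append(backlog)
    return depths

# A spike of 50 messages hits workers that handle 20 per tick.
print(simulate_queue([10, 50, 5, 0, 0], worker_rate=20))
```

The backend never sees more than 20 messages per tick. The spike shows up as temporary queue depth instead of dropped requests, which is also why queue depth and message age are useful alarms.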

Cheat Sheet

Requirement | Direction
Survive an instance failure | Auto Scaling + health checks + ALB
Survive an AZ failure | Multi-AZ for each tier (ALB targets across AZs, Multi-AZ DB)
Survive a Region failure | DR strategy + Route 53 failover + replicated data
Strict RTO/RPO | Warm standby or active-active
Lambda overwhelms RDS with connections | RDS Proxy
Need to see bottlenecks across microservices | X-Ray (plus CloudWatch)
Standby must scale during failover | Plan Service Quotas + scaling policies

Recap Checklist ✅

1. [ ] Every critical tier is deployed across multiple AZs

2. [ ] Traffic is distributed via ALB/NLB and unhealthy targets are replaced automatically

3. [ ] Databases use HA features (e.g., RDS Multi-AZ or managed HA services)

4. [ ] DR strategy matches business RTO/RPO (backup/restore vs pilot light vs warm standby vs active-active)

5. [ ] Regional failover uses Route 53 health checks/routing (when required)

6. [ ] Data durability is addressed (backups, snapshots, replication)

7. [ ] Quotas and throttling are considered for failover/standby scaling

8. [ ] Monitoring and tracing exist (CloudWatch + X-Ray)

AWS Whitepapers and Official Documentation

These are the primary AWS documents behind Task Statement 2.2.

You do not need to memorize them; use them to understand why highly available and fault-tolerant architectures work the way they do.

Global Infrastructure and DNS

Route 53

Disaster Recovery

Disaster Recovery on AWS

Networking Foundations

VPC Route Tables

Load Balancing and Reliability

1. Application Load Balancer
2. Auto Scaling (EC2)

Database Reliability

1. RDS Multi-AZ

2. RDS Proxy

Quotas and Limits

Service Quotas

Storage Durability / Replication

1. S3 Replication
2. EBS Snapshots

3. EFS Overview

Visibility

1. AWS X-Ray
2. CloudWatch

Managed AI Services (examples from blueprint)

1. Amazon Comprehend
2. Amazon Polly
