How I Designed GPU vs CPU Pipelines for a High-Volume ML Classification System
TL;DR: I architected a dual-pipeline system that routes ML inference workloads to either GPU or CPU based on batch size. GPUs process large batches in milliseconds per item with significant economies of scale. CPUs handle small batches and demos with zero cold start. The result: optimal cost efficiency across all workload sizes while maintaining SLA guarantees.
The Problem
I joined a company that processed millions of text documents through ML classification models. Every document needed to pass through multiple NLP models for categorization, entity extraction, and scoring. Think of it as a pipeline where raw text goes in and structured classifications come out.
The original architecture used a single pipeline with SageMaker Serverless endpoints. It worked fine at low volume, but cracks appeared as the client base grew:
- Processing time scaled linearly with batch size. A 10,000-document batch took 10,000x longer than a single document. Large enterprise clients waited hours for results.
- Cost per item stayed flat. We paid the same rate per document whether processing 1 or 100,000. No economies of scale.
- SageMaker Serverless has a 200-request concurrency limit per endpoint. Large batches would queue up and time out.
- Trial users and demos competed with production workloads. A sales demo could get stuck behind a large batch job.
The business needed two things simultaneously: fast turnaround for demos and small requests, plus cost-effective processing for enterprise batch jobs. These requirements seemed contradictory until I realized they needed different infrastructure entirely.
What I Tried First (And Why It Didn’t Work)
Attempt 1: Vertical Scaling
The obvious first move was bigger instances. We upgraded the SageMaker endpoint instance types.
Result: Marginal improvement. The bottleneck wasn’t compute power; it was the synchronous request pattern. Each Lambda function waited for SageMaker to respond before processing the next item. Faster instances just meant faster waiting.
Attempt 2: Horizontal Scaling with More Endpoints
We deployed multiple SageMaker endpoints and load-balanced across them.
Result: Better throughput, but costs scaled linearly. Processing 10x more volume cost 10x more money. Enterprise clients with large batches were becoming unprofitable.
Attempt 3: Understanding the Workload Characteristics
I spent two weeks analyzing our traffic patterns. The data revealed something interesting:
| Batch Size | % of Requests | % of Total Volume | Latency Sensitivity |
|---|---|---|---|
| 1-10 | 73% | 2% | High (demos, trials) |
| 11-100 | 18% | 8% | Medium |
| 101-1000 | 7% | 15% | Low |
| 1000+ | 2% | 75% | Low (batch jobs) |
Most requests were small, but most volume came from large batches. Small requests needed fast response times. Large batches could tolerate some startup latency if it meant lower cost per item.
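The bucketing itself is simple to reproduce. Here is a minimal sketch of that kind of analysis, assuming a JSON-lines request log with a `batch_size` field (the log format and field name are illustrative, not our actual schema):

```python
import json
from collections import Counter


def bucket(batch_size: int) -> str:
    """Map a batch size onto the buckets used in the table above."""
    if batch_size <= 10:
        return "1-10"
    if batch_size <= 100:
        return "11-100"
    if batch_size <= 1000:
        return "101-1000"
    return "1000+"


def summarize(log_path: str) -> None:
    """Print request share and volume share per bucket from a JSON-lines request log."""
    requests_per_bucket = Counter()
    items_per_bucket = Counter()
    with open(log_path) as fh:
        for line in fh:
            size = json.loads(line)["batch_size"]
            b = bucket(size)
            requests_per_bucket[b] += 1
            items_per_bucket[b] += size

    total_requests = sum(requests_per_bucket.values())
    total_items = sum(items_per_bucket.values())
    for name in ("1-10", "11-100", "101-1000", "1000+"):
        print(f"{name:>8}: {100 * requests_per_bucket[name] / total_requests:5.1f}% of requests, "
              f"{100 * items_per_bucket[name] / total_items:5.1f}% of volume")
```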
This insight led to the dual-pipeline architecture.
The Solution
I designed two separate pipelines optimized for different workload characteristics:
Architecture Overview
GPU Pipeline (Large Batches)
- SageMaker Async Inference Endpoints
- Scales to zero when idle, spins up on demand
- 6-7 minute cold start, but milliseconds per inference
- Cost-effective at scale due to GPU parallelism
CPU Pipeline (Small Batches)
- SageMaker Serverless Inference Endpoints
- Always warm, synchronous processing
- Seconds per inference, but no cold start
- Handles demos, trials, and small production requests
- Higher cost per item, but predictable latency
Shared Infrastructure
- DynamoDB for real-time state management
- Aurora RDS Serverless for analytics
- S3 for inference payloads and results
- SQS queues between every microservice
- Slack alerting at critical points
The Routing Logic
Every request hits an API Gateway that examines the batch size and routes accordingly:
```python
def route_request(batch_size: int, is_demo: bool, is_trial: bool) -> str:
    """
    Route incoming requests to appropriate pipeline.
    Returns the API Gateway URL for the target pipeline.
    """
    # Demos and trials always go to CPU for fast response
    if is_demo or is_trial:
        return CPU_PIPELINE_GATEWAY

    # Large batches go to GPU for cost efficiency
    if batch_size > BATCH_THRESHOLD:
        return GPU_PIPELINE_GATEWAY

    # Small production requests go to CPU
    return CPU_PIPELINE_GATEWAY
```
The threshold was determined empirically. Below ~500 items, the GPU cold start time (6-7 minutes) made the total processing time worse than CPU. Above that threshold, GPU’s millisecond-per-item processing won decisively.
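That break-even point also falls out of a simple model: GPU time is a fixed cold start plus a tiny per-item cost, while CPU time is essentially per-item. A rough sketch, ignoring CPU-side parallelism (the per-item timings below are illustrative placeholders, not measured values):

```python
def breakeven_batch_size(cold_start_s: float, gpu_s_per_item: float, cpu_s_per_item: float) -> float:
    """
    Smallest batch size where GPU total time beats CPU total time:
    cold_start + n * gpu_s_per_item < n * cpu_s_per_item
    """
    return cold_start_s / (cpu_s_per_item - gpu_s_per_item)


# Illustrative numbers only: ~6.5 min cold start, ~2 ms/item on GPU, ~0.8 s/item on CPU
print(breakeven_batch_size(cold_start_s=390, gpu_s_per_item=0.002, cpu_s_per_item=0.8))
# -> roughly 490 items, in line with the ~500-item threshold above
```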
GPU Pipeline Implementation
The GPU pipeline uses SageMaker Async Inference, which works differently from synchronous endpoints:
- Upload inference payload to S3
- Send notification to endpoint’s internal queue
- GPU fetches payload when ready
- Results written to S3
- SNS notification triggers downstream processing
```python
# Lambda: GPU Pipeline Entry Point
import json
import os

import boto3

s3 = boto3.client('s3')
sagemaker_runtime = boto3.client('sagemaker-runtime')

INFERENCE_BUCKET = os.environ['INFERENCE_BUCKET']  # payload bucket, assumed to come from the Lambda environment


def invoke_gpu_inference(event, context):
    """
    Send batch to GPU pipeline via async inference.
    """
    batch_id = event['batch_id']
    documents = event['documents']

    # Upload payload to S3 (required for async inference)
    payload_key = f"inference-payloads/{batch_id}.json"
    s3.put_object(
        Bucket=INFERENCE_BUCKET,
        Key=payload_key,
        Body=json.dumps({'documents': documents})
    )

    # Invoke async inference
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName='classifier-gpu',
        InputLocation=f"s3://{INFERENCE_BUCKET}/{payload_key}",
        ContentType='application/json',
        InvocationTimeoutSeconds=3600
    )

    # Store job metadata for tracking (save_job_status writes to the DynamoDB state table)
    save_job_status(batch_id, 'processing', {
        'output_location': response['OutputLocation']
    })

    return {'statusCode': 202, 'body': json.dumps({'batch_id': batch_id})}
```
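The last step of that flow, the SNS notification, is what kicks off downstream processing. The production handler isn't shown here, but a minimal sketch looks something like this, assuming the standard async-inference success notification format, which carries the result's S3 location under `responseParameters.outputLocation`:

```python
# Sketch of the SNS-triggered result handler (illustrative, not the production code).
import json
from urllib.parse import urlparse

import boto3

s3 = boto3.client('s3')


def handle_gpu_result(event, context):
    for record in event['Records']:
        notification = json.loads(record['Sns']['Message'])
        if notification.get('invocationStatus') != 'Completed':
            continue  # error notifications can be routed to a separate SNS topic

        # Fetch the inference output that SageMaker wrote to S3
        output_uri = notification['responseParameters']['outputLocation']
        parsed = urlparse(output_uri)  # s3://bucket/key
        obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip('/'))
        results = json.loads(obj['Body'].read())

        # From here the production handler persists `results` to the state store
        # and enqueues the next microservice, as in the CPU pipeline below.
```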
GPU Auto-Scaling (The Semi-Serverless Part)
GPUs are expensive. Keeping them running 24/7 wasn’t financially viable, but we needed them available when large batches arrived. The solution was CloudWatch-triggered scaling:
```hcl
# Terraform: CloudWatch Alarm to Wake Up GPUs
resource "aws_cloudwatch_metric_alarm" "gpu_pipeline_traffic" {
  alarm_name          = "gpu-pipeline-incoming-traffic"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    QueueName = aws_sqs_queue.gpu_pipeline_entry.name
  }

  alarm_actions = [aws_lambda_function.scale_up_gpu.arn]
}

# EventBridge rule to check if GPUs can be turned off
resource "aws_cloudwatch_event_rule" "gpu_idle_check" {
  name                = "gpu-idle-check"
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "gpu_idle_check_target" {
  rule      = aws_cloudwatch_event_rule.gpu_idle_check.name
  target_id = "check-gpu-idle"
  arn       = aws_lambda_function.scale_down_gpu.arn
}
```
The scale-down Lambda checks if all queues are empty and no jobs are in progress before scaling the endpoint to zero instances.
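That Lambda isn't shown in full, but the check is straightforward. Here is a minimal sketch, assuming the GPU endpoint variant is registered with Application Auto Scaling and in-progress jobs are tracked in the DynamoDB state table (queue URLs, table name, schema, and capacities are placeholders):

```python
# Sketch of the scale-down Lambda (names, table layout, and capacities are placeholders).
import boto3
from boto3.dynamodb.conditions import Attr

sqs = boto3.client('sqs')
autoscaling = boto3.client('application-autoscaling')
dynamodb = boto3.resource('dynamodb')

GPU_QUEUE_URLS = ['https://sqs.us-east-1.amazonaws.com/123456789012/gpu-pipeline-entry']  # placeholder
jobs_table = dynamodb.Table('processing-jobs')  # placeholder table name


def queues_are_empty() -> bool:
    """True when no messages are visible or in flight on any GPU-pipeline queue."""
    for url in GPU_QUEUE_URLS:
        attrs = sqs.get_queue_attributes(
            QueueUrl=url,
            AttributeNames=['ApproximateNumberOfMessagesVisible',
                            'ApproximateNumberOfMessagesNotVisible']
        )['Attributes']
        if int(attrs['ApproximateNumberOfMessagesVisible']) or int(attrs['ApproximateNumberOfMessagesNotVisible']):
            return False
    return True


def jobs_in_progress() -> bool:
    """Rough check against the job-state table; the real schema may differ."""
    resp = jobs_table.scan(FilterExpression=Attr('status').eq('processing'), Select='COUNT')
    return resp['Count'] > 0


def scale_down_gpu(event, context):
    if not queues_are_empty() or jobs_in_progress():
        return
    # Lower MinCapacity so the Application Auto Scaling policy can scale the
    # async endpoint variant in to zero instances.
    autoscaling.register_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId='endpoint/classifier-gpu/variant/AllTraffic',
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        MinCapacity=0,
        MaxCapacity=2  # placeholder ceiling
    )
```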
CPU Pipeline Implementation
The CPU pipeline uses synchronous inference for immediate response:
```python
# Lambda: CPU Pipeline Classifier
import boto3
import json

sagemaker_runtime = boto3.client('sagemaker-runtime')


def classify_documents_cpu(event, context):
    """
    Process documents through CPU inference endpoint.
    Synchronous - waits for response before returning.
    """
    for record in event['Records']:
        message = json.loads(record['body'])
        doc_id = message['doc_id']
        text = message['text']

        # Synchronous inference call
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName='classifier-cpu-serverless',
            ContentType='application/json',
            Body=json.dumps({'text': text})
        )
        result = json.loads(response['Body'].read())

        # Write result and trigger next step
        save_classification(doc_id, result)
        trigger_next_microservice(doc_id, 'entity_extraction')
```
Queue Design for Flow Control
SQS queues between microservices served two purposes: modularity and backpressure.
```hcl
# Terraform: Pipeline Queues with Flow Control
resource "aws_sqs_queue" "classifier_queue" {
  name                       = "classifier-queue-${var.pipeline_type}"
  visibility_timeout_seconds = var.pipeline_type == "gpu" ? 900 : 120
  message_retention_seconds  = 86400
  receive_wait_time_seconds  = 20

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.classifier_dlq.arn
    maxReceiveCount     = 3
  })
}

# Lambda event source mapping with concurrency control
resource "aws_lambda_event_source_mapping" "classifier_trigger" {
  event_source_arn                   = aws_sqs_queue.classifier_queue.arn
  function_name                      = aws_lambda_function.classifier.arn
  batch_size                         = var.pipeline_type == "gpu" ? 100 : 10
  maximum_batching_window_in_seconds = 5

  scaling_config {
    maximum_concurrency = var.pipeline_type == "gpu" ? 50 : 200
  }
}
```
The maximum_concurrency setting was critical. Without it, a traffic spike would spawn hundreds of Lambda invocations simultaneously, overwhelming downstream services and causing cascading failures.
Why DynamoDB Over RDS
Early in the project, we used Aurora RDS for everything. It couldn’t handle the write concurrency:
```text
# What we observed with RDS under load
ERROR: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found
ERROR: too many connections for role "app_user"
```
Even with connection pooling and read replicas, RDS hit a ceiling around 5,000 writes per second. DynamoDB’s horizontal scaling solved this:
```python
# DynamoDB write with automatic scaling
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('processing-results')


def save_result(job_id: str, result: dict):
    """
    Write result to DynamoDB.
    On-demand capacity scales automatically.
    """
    table.put_item(
        Item={
            'job_id': job_id,
            'result': result,
            'timestamp': int(time.time()),
            'ttl': int(time.time()) + (90 * 24 * 60 * 60)  # 90-day retention
        }
    )
```
We kept Aurora RDS for analytics only, fed by eventually consistent replication from DynamoDB. Analytics queries don’t need real-time data, and RDS’s SQL capabilities made complex reporting much easier.
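The replication mechanism itself isn't spelled out above. One common way to get that eventually consistent copy is a DynamoDB Streams–triggered Lambda that upserts into Aurora through the RDS Data API; here is a sketch under those assumptions (it assumes Aurora PostgreSQL, and every name here, from ARNs to the table and columns, is a placeholder):

```python
# Sketch of DynamoDB -> Aurora replication via DynamoDB Streams and the RDS Data API.
# Illustrative only: assumes Aurora PostgreSQL; ARNs, database, and table names are placeholders.
import json
import os

import boto3
from boto3.dynamodb.types import TypeDeserializer

rds_data = boto3.client('rds-data')
deserializer = TypeDeserializer()

CLUSTER_ARN = os.environ['AURORA_CLUSTER_ARN']
SECRET_ARN = os.environ['AURORA_SECRET_ARN']


def replicate_to_aurora(event, context):
    """Triggered by the DynamoDB stream on the results table."""
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        image = record['dynamodb']['NewImage']
        job_id = image['job_id']['S']
        result = deserializer.deserialize(image['result'])   # DynamoDB-typed map -> plain dict
        result_json = json.dumps(result, default=str)        # stringify Decimals and friends

        rds_data.execute_statement(
            resourceArn=CLUSTER_ARN,
            secretArn=SECRET_ARN,
            database='analytics',
            sql="""
                INSERT INTO classification_results (job_id, result)
                VALUES (:job_id, CAST(:result AS JSONB))
                ON CONFLICT (job_id) DO UPDATE SET result = EXCLUDED.result
            """,
            parameters=[
                {'name': 'job_id', 'value': {'stringValue': job_id}},
                {'name': 'result', 'value': {'stringValue': result_json}},
            ]
        )
```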
Monitoring and Alerting
Every critical point in the pipeline sends alerts to Slack on failure:
```python
import json
import os

import requests


def send_slack_alert(service: str, error: str, context: dict):
    """
    Send alert to Slack channel for pipeline failures.
    """
    webhook_url = os.environ['SLACK_WEBHOOK_URL']
    message = {
        'text': f"Pipeline Alert: {service}",
        'attachments': [{
            'color': 'danger',
            'fields': [
                {'title': 'Service', 'value': service, 'short': True},
                {'title': 'Error', 'value': error, 'short': False},
                {'title': 'Context', 'value': json.dumps(context, indent=2), 'short': False}
            ]
        }]
    }
    requests.post(webhook_url, json=message, timeout=5)
```
We also built CloudWatch dashboards showing queue depths, processing latency, and GPU utilization side by side. When the GPU queue depth climbed but GPU utilization stayed at zero, we knew the cold start was happening. When CPU latency spiked, we knew a large batch had accidentally been routed to the wrong pipeline.
Results
After three months of production traffic:
| Metric | Before (Single Pipeline) | After (Dual Pipeline) |
|---|---|---|
| Large batch cost per item | $0.012 | $0.003 |
| Small batch latency (p99) | 45s | 8s |
| Demo/trial response time | 2-3 min | 15s |
| Max throughput | 50K items/hour | 500K items/hour |
| Failed jobs per day | 23 | 2 |
The GPU pipeline achieved 75% cost reduction for large batches through economies of scale. The CPU pipeline improved small batch latency by 80% by eliminating queue competition with batch jobs.
Where the Savings Came From
- GPU parallelism. Processing 10,000 items on GPU takes roughly the same time as processing 100. The cold start is constant; the marginal cost per item approaches zero (a toy cost model follows this list).
- Right-sized infrastructure. Small requests no longer spun up expensive GPU instances. Large batches no longer occupied CPU endpoints for hours.
- Scale-to-zero GPUs. We only pay for GPU time when actually processing. Nights and weekends cost nearly nothing.
- Reduced failures. Fewer timeouts and deadlocks meant less re-processing and manual intervention.
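The first point is just amortization of a fixed cost. A toy model makes it concrete (the hourly rate and timings below are illustrative placeholders, not real pricing or measured values):

```python
def gpu_cost_per_item(batch_size: int,
                      hourly_rate: float = 4.00,     # illustrative GPU instance price, $/hour
                      cold_start_s: float = 390.0,   # fixed spin-up time
                      gpu_s_per_item: float = 0.002) -> float:
    """Effective cost per item once the fixed cold start is amortized over the batch."""
    billed_seconds = cold_start_s + batch_size * gpu_s_per_item
    return hourly_rate * billed_seconds / 3600 / batch_size


print(f"{gpu_cost_per_item(100):.5f}")     # cold start dominates: pricey per item
print(f"{gpu_cost_per_item(10_000):.5f}")  # cold start amortized: per-item cost collapses
```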
Lessons Learned
1. Cold start isn’t always bad
The 6-7 minute GPU cold start seemed like a dealbreaker initially. But for batch jobs that run for hours anyway, adding 7 minutes to a 4-hour job is negligible. The key is routing requests appropriately based on latency tolerance.
2. SQS visibility timeout must exceed processing time
We lost jobs early on because visibility timeout was set to 30 seconds while some GPU inference calls took 2 minutes. The message would return to the queue and get processed again. Set visibility timeout to at least 2x your worst-case processing time.
3. DynamoDB for writes, RDS for reads
This pattern served us well. DynamoDB handles high-throughput writes without breaking a sweat. RDS handles complex analytical queries that would be expensive or impossible in DynamoDB. Use each for what it’s good at.
4. Separate queues enable independent scaling
Having dedicated queues for each microservice meant we could tune concurrency independently. The classifier might handle 50 concurrent invocations while the entity extractor handles 200. This granularity prevented bottlenecks from propagating.
5. Alert on queue depth, not just errors
Errors tell you something broke. Queue depth tells you something is about to break. We set CloudWatch alarms for queue depths exceeding historical norms, giving us time to investigate before customers noticed.
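Concretely, that means one alarm per queue with a threshold derived from historical depth. A minimal boto3 sketch (queue name, threshold, and SNS topic are placeholders; the same alarm can of course be expressed in Terraform alongside the ones shown earlier):

```python
# Sketch: alarm when queue depth exceeds its historical norm (values are placeholders).
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='classifier-queue-depth-high',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'classifier-queue-gpu'}],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=3,                 # sustained for 15 minutes, not a momentary spike
    Threshold=5000,                      # set from historical queue depth
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:pipeline-alerts'],  # placeholder topic
    TreatMissingData='notBreaching'
)
```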
What I’d Do Differently
If starting over:
- Use Step Functions for orchestration. The current Lambda-to-SQS-to-Lambda chain works, but Step Functions would provide better visibility into job progress and easier retry logic (a hypothetical sketch follows this list).
- Implement circuit breakers earlier. We added them after an outage cascaded through the entire pipeline. Should have been there from day one.
- Build the cost tracking dashboard first. We spent weeks estimating savings manually before building proper cost attribution. Real-time cost visibility would have accelerated optimization decisions.
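For the first point, a sketch of what the Step Functions version might look like: each stage becomes a Task state with declarative retries instead of a Lambda-to-SQS hop. This is a hypothetical design, not something we shipped, and the function ARNs, role, and names are placeholders:

```python
# Hypothetical Step Functions orchestration sketch; ARNs and names are placeholders.
import json

import boto3

definition = {
    "Comment": "Document classification pipeline (sketch)",
    "StartAt": "Classify",
    "States": {
        "Classify": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:classifier",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 10, "MaxAttempts": 3, "BackoffRate": 2.0}],
            "Next": "ExtractEntities"
        },
        "ExtractEntities": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:entity-extractor",
            "End": True
        }
    }
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='document-classification-pipeline',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/pipeline-step-functions'  # placeholder role
)
```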
Questions? Find me on LinkedIn.



