How I Designed GPU vs CPU Pipelines for a High-Volume ML Classification System
TL;DR: I architected a dual-pipeline system that routes ML inference workloads to either GPU or CPU based on batch size. GPUs process large batches in milliseconds per item with significant economies of scale. CPUs handle small batches and demos with zero cold start. The result: optimal cost efficiency across all workload sizes while maintaining SLA guarantees.
The Problem
I joined a company that processed millions of text documents through ML classification models. Every document needed to pass through multiple NLP models for categorization, entity extraction, and scoring. Think of it as a pipeline where raw text goes in and structured classifications come out.
The original architecture used a single pipeline with SageMaker Serverless endpoints. It worked fine at low volume, but cracks appeared as the client base grew:
- Processing time scaled linearly with batch size. A 10,000-document batch took 10,000x longer than a single document. Large enterprise clients waited hours for results.
- Cost per item stayed flat. We paid the same rate per document whether processing 1 or 100,000. No economies of scale.
- SageMaker Serverless has a 200-request concurrency limit per endpoint. Large batches would queue up and time out.
- Trial users and demos competed with production workloads. A sales demo could get stuck behind a large batch job.
The business needed two things simultaneously: fast turnaround for demos and small requests, plus cost-effective processing for enterprise batch jobs. These requirements seemed contradictory until I realized they needed different infrastructure entirely.
What I Tried First (And Why It Didn’t Work)
Attempt 1: Vertical Scaling
The obvious first move was bigger instances. We upgraded the SageMaker endpoint instance types.
Result: Marginal improvement. The bottleneck wasn’t compute power; it was the synchronous request pattern. Each Lambda function waited for SageMaker to respond before processing the next item. Faster instances just meant faster waiting.
Attempt 2: Horizontal Scaling with More Endpoints
We deployed multiple SageMaker endpoints and load-balanced across them.
Result: Better throughput, but costs scaled linearly. Processing 10x more volume cost 10x more money. Enterprise clients with large batches were becoming unprofitable.
Attempt 3: Understanding the Workload Characteristics
I spent two weeks analyzing our traffic patterns. The data revealed something interesting:
| Batch Size | % of Requests | % of Total Volume | Latency Sensitivity |
|---|---|---|---|
| 1-10 | 73% | 2% | High (demos, trials) |
| 11-100 | 18% | 8% | Medium |
| 101-1000 | 7% | 15% | Low |
| 1000+ | 2% | 75% | Low (batch jobs) |
Most requests were small, but most volume came from large batches. Small requests needed fast response times. Large batches could tolerate some startup latency if it meant lower cost per item.
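The bucketing itself is simple to reproduce. Here is a minimal sketch of that kind of analysis, assuming a JSON-lines request log with a `batch_size` field (the log format and field name are illustrative, not our actual schema):

```python
import json
from collections import Counter


def bucket(batch_size: int) -> str:
    """Map a batch size onto the buckets used in the table above."""
    if batch_size <= 10:
        return "1-10"
    if batch_size <= 100:
        return "11-100"
    if batch_size <= 1000:
        return "101-1000"
    return "1000+"


def summarize(log_path: str) -> None:
    """Print request share and volume share per bucket from a JSON-lines request log."""
    requests_per_bucket = Counter()
    items_per_bucket = Counter()
    with open(log_path) as fh:
        for line in fh:
            size = json.loads(line)["batch_size"]
            b = bucket(size)
            requests_per_bucket[b] += 1
            items_per_bucket[b] += size

    total_requests = sum(requests_per_bucket.values())
    total_items = sum(items_per_bucket.values())
    for name in ("1-10", "11-100", "101-1000", "1000+"):
        print(f"{name:>8}: {100 * requests_per_bucket[name] / total_requests:5.1f}% of requests, "
              f"{100 * items_per_bucket[name] / total_items:5.1f}% of volume")
```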
This insight led to the dual-pipeline architecture.
The Solution
I designed two separate pipelines optimized for different workload characteristics:
Architecture Overview
GPU Pipeline (Large Batches)
- SageMaker Async Inference Endpoints
- Scales to zero when idle, spins up on demand
- 6-7 minute cold start, but milliseconds per inference
- Cost-effective at scale due to GPU parallelism
CPU Pipeline (Small Batches)
- SageMaker Serverless Inference Endpoints
- Always warm, synchronous processing
- Seconds per inference, but no cold start
- Handles demos, trials, and small production requests
- Higher cost per item, but predictable latency
Shared Infrastructure
- DynamoDB for real-time state management
- Aurora RDS Serverless for analytics
- S3 for inference payloads and results
- SQS queues between every microservice
- Slack alerting at critical points
The Routing Logic
Every request hits an API Gateway that examines the batch size and routes accordingly:
```python
def route_request(batch_size: int, is_demo: bool, is_trial: bool) -> str:
    """
    Route incoming requests to appropriate pipeline.
    Returns the API Gateway URL for the target pipeline.
    """
    # Demos and trials always go to CPU for fast response
    if is_demo or is_trial:
        return CPU_PIPELINE_GATEWAY

    # Large batches go to GPU for cost efficiency
    if batch_size > BATCH_THRESHOLD:
        return GPU_PIPELINE_GATEWAY

    # Small production requests go to CPU
    return CPU_PIPELINE_GATEWAY
```
The threshold was determined empirically. Below ~500 items, the GPU cold start time (6-7 minutes) made the total processing time worse than CPU. Above that threshold, GPU’s millisecond-per-item processing won decisively.
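That break-even point also falls out of a simple model: GPU time is a fixed cold start plus a tiny per-item cost, while CPU time is essentially per-item. A rough sketch, ignoring CPU-side parallelism (the per-item timings below are illustrative placeholders, not measured values):

```python
def breakeven_batch_size(cold_start_s: float, gpu_s_per_item: float, cpu_s_per_item: float) -> float:
    """
    Smallest batch size where GPU total time beats CPU total time:
    cold_start + n * gpu_s_per_item < n * cpu_s_per_item
    """
    return cold_start_s / (cpu_s_per_item - gpu_s_per_item)


# Illustrative numbers only: ~6.5 min cold start, ~2 ms/item on GPU, ~0.8 s/item on CPU
print(breakeven_batch_size(cold_start_s=390, gpu_s_per_item=0.002, cpu_s_per_item=0.8))
# -> roughly 490 items, in line with the ~500-item threshold above
```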
GPU Pipeline Implementation
The GPU pipeline uses SageMaker Async Inference, which works differently from synchronous endpoints:
- Upload inference payload to S3
- Send notification to endpoint’s internal queue
- GPU fetches payload when ready
- Results written to S3
- SNS notification triggers downstream processing
```python
# Lambda: GPU Pipeline Entry Point
import json
import os

import boto3

s3 = boto3.client('s3')
sagemaker_runtime = boto3.client('sagemaker-runtime')

INFERENCE_BUCKET = os.environ['INFERENCE_BUCKET']  # payload bucket, assumed to come from the Lambda environment


def invoke_gpu_inference(event, context):
    """
    Send batch to GPU pipeline via async inference.
    """
    batch_id = event['batch_id']
    documents = event['documents']

    # Upload payload to S3 (required for async inference)
    payload_key = f"inference-payloads/{batch_id}.json"
    s3.put_object(
        Bucket=INFERENCE_BUCKET,
        Key=payload_key,
        Body=json.dumps({'documents': documents})
    )

    # Invoke async inference
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName='classifier-gpu',
        InputLocation=f"s3://{INFERENCE_BUCKET}/{payload_key}",
        ContentType='application/json',
        InvocationTimeoutSeconds=3600
    )

    # Store job metadata for tracking (save_job_status writes to the DynamoDB state table)
    save_job_status(batch_id, 'processing', {
        'output_location': response['OutputLocation']
    })

    return {'statusCode': 202, 'body': json.dumps({'batch_id': batch_id})}
```
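The last step of that flow, the SNS notification, is what kicks off downstream processing. The production handler isn't shown here, but a minimal sketch looks something like this, assuming the standard async-inference success notification format, which carries the result's S3 location under `responseParameters.outputLocation`:

```python
# Sketch of the SNS-triggered result handler (illustrative, not the production code).
import json
from urllib.parse import urlparse

import boto3

s3 = boto3.client('s3')


def handle_gpu_result(event, context):
    for record in event['Records']:
        notification = json.loads(record['Sns']['Message'])
        if notification.get('invocationStatus') != 'Completed':
            continue  # error notifications can be routed to a separate SNS topic

        # Fetch the inference output that SageMaker wrote to S3
        output_uri = notification['responseParameters']['outputLocation']
        parsed = urlparse(output_uri)  # s3://bucket/key
        obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip('/'))
        results = json.loads(obj['Body'].read())

        # From here the production handler persists `results` to the state store
        # and enqueues the next microservice, as in the CPU pipeline below.
```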
GPU Auto-Scaling (The Semi-Serverless Part)
GPUs are expensive. Keeping them running 24/7 wasn’t financially viable, but we needed them available when large batches arrived. The solution was CloudWatch-triggered scaling:
```hcl
# Terraform: CloudWatch Alarm to Wake Up GPUs
resource "aws_cloudwatch_metric_alarm" "gpu_pipeline_traffic" {
  alarm_name          = "gpu-pipeline-incoming-traffic"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    QueueName = aws_sqs_queue.gpu_pipeline_entry.name
  }

  alarm_actions = [aws_lambda_function.scale_up_gpu.arn]
}

# EventBridge rule to check if GPUs can be turned off
resource "aws_cloudwatch_event_rule" "gpu_idle_check" {
  name                = "gpu-idle-check"
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "gpu_idle_check_target" {
  rule      = aws_cloudwatch_event_rule.gpu_idle_check.name
  target_id = "check-gpu-idle"
  arn       = aws_lambda_function.scale_down_gpu.arn
}
```
The scale-down Lambda checks if all queues are empty and no jobs are in progress before scaling the endpoint to zero instances.
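That Lambda isn't shown in full, but the check is straightforward. Here is a minimal sketch, assuming the GPU endpoint variant is registered with Application Auto Scaling and in-progress jobs are tracked in the DynamoDB state table (queue URLs, table name, schema, and capacities are placeholders):

```python
# Sketch of the scale-down Lambda (names, table layout, and capacities are placeholders).
import boto3
from boto3.dynamodb.conditions import Attr

sqs = boto3.client('sqs')
autoscaling = boto3.client('application-autoscaling')
dynamodb = boto3.resource('dynamodb')

GPU_QUEUE_URLS = ['https://sqs.us-east-1.amazonaws.com/123456789012/gpu-pipeline-entry']  # placeholder
jobs_table = dynamodb.Table('processing-jobs')  # placeholder table name


def queues_are_empty() -> bool:
    """True when no messages are visible or in flight on any GPU-pipeline queue."""
    for url in GPU_QUEUE_URLS:
        attrs = sqs.get_queue_attributes(
            QueueUrl=url,
            AttributeNames=['ApproximateNumberOfMessagesVisible',
                            'ApproximateNumberOfMessagesNotVisible']
        )['Attributes']
        if int(attrs['ApproximateNumberOfMessagesVisible']) or int(attrs['ApproximateNumberOfMessagesNotVisible']):
            return False
    return True


def jobs_in_progress() -> bool:
    """Rough check against the job-state table; the real schema may differ."""
    resp = jobs_table.scan(FilterExpression=Attr('status').eq('processing'), Select='COUNT')
    return resp['Count'] > 0


def scale_down_gpu(event, context):
    if not queues_are_empty() or jobs_in_progress():
        return
    # Lower MinCapacity so the Application Auto Scaling policy can scale the
    # async endpoint variant in to zero instances.
    autoscaling.register_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId='endpoint/classifier-gpu/variant/AllTraffic',
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        MinCapacity=0,
        MaxCapacity=2  # placeholder ceiling
    )
```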
CPU Pipeline Implementation
The CPU pipeline uses synchronous inference for immediate response:
```python
# Lambda: CPU Pipeline Classifier
import boto3
import json

sagemaker_runtime = boto3.client('sagemaker-runtime')


def classify_documents_cpu(event, context):
    """
    Process documents through CPU inference endpoint.
    Synchronous - waits for response before returning.
    """
    for record in event['Records']:
        message = json.loads(record['body'])
        doc_id = message['doc_id']
        text = message['text']

        # Synchronous inference call
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName='classifier-cpu-serverless',
            ContentType='application/json',
            Body=json.dumps({'text': text})
        )
        result = json.loads(response['Body'].read())

        # Write result and trigger next step
        save_classification(doc_id, result)
        trigger_next_microservice(doc_id, 'entity_extraction')
```
Queue Design for Flow Control
SQS queues between microservices served two purposes: modularity and backpressure.
```hcl
# Terraform: Pipeline Queues with Flow Control
resource "aws_sqs_queue" "classifier_queue" {
  name                       = "classifier-queue-${var.pipeline_type}"
  visibility_timeout_seconds = var.pipeline_type == "gpu" ? 900 : 120
  message_retention_seconds  = 86400
  receive_wait_time_seconds  = 20

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.classifier_dlq.arn
    maxReceiveCount     = 3
  })
}

# Lambda event source mapping with concurrency control
resource "aws_lambda_event_source_mapping" "classifier_trigger" {
  event_source_arn                   = aws_sqs_queue.classifier_queue.arn
  function_name                      = aws_lambda_function.classifier.arn
  batch_size                         = var.pipeline_type == "gpu" ? 100 : 10
  maximum_batching_window_in_seconds = 5

  scaling_config {
    maximum_concurrency = var.pipeline_type == "gpu" ? 50 : 200
  }
}
```
The maximum_concurrency setting was critical. Without it, a traffic spike would spawn hundreds of Lambda invocations simultaneously, overwhelming downstream services and causing cascading failures.
Why DynamoDB Over RDS
Early in the project, we used Aurora RDS for everything. It couldn’t handle the write concurrency:
```text
# What we observed with RDS under load
ERROR: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found
ERROR: too many connections for role "app_user"
```
Even with connection pooling and read replicas, RDS hit a ceiling around 5,000 writes per second. DynamoDB’s horizontal scaling solved this:
```python
# DynamoDB write with automatic scaling
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('processing-results')


def save_result(job_id: str, result: dict):
    """
    Write result to DynamoDB.
    On-demand capacity scales automatically.
    """
    table.put_item(
        Item={
            'job_id': job_id,
            'result': result,
            'timestamp': int(time.time()),
            'ttl': int(time.time()) + (90 * 24 * 60 * 60)  # 90-day retention
        }
    )
```
We kept Aurora RDS for analytics only, fed by eventually consistent replication from DynamoDB. Analytics queries don’t need real-time data, and RDS’s SQL capabilities made complex reporting much easier.
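The replication mechanism itself isn't spelled out above. One common way to get that eventually consistent copy is a DynamoDB Streams–triggered Lambda that upserts into Aurora through the RDS Data API; here is a sketch under those assumptions (it assumes Aurora PostgreSQL, and every name here, from ARNs to the table and columns, is a placeholder):

```python
# Sketch of DynamoDB -> Aurora replication via DynamoDB Streams and the RDS Data API.
# Illustrative only: assumes Aurora PostgreSQL; ARNs, database, and table names are placeholders.
import json
import os

import boto3
from boto3.dynamodb.types import TypeDeserializer

rds_data = boto3.client('rds-data')
deserializer = TypeDeserializer()

CLUSTER_ARN = os.environ['AURORA_CLUSTER_ARN']
SECRET_ARN = os.environ['AURORA_SECRET_ARN']


def replicate_to_aurora(event, context):
    """Triggered by the DynamoDB stream on the results table."""
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        image = record['dynamodb']['NewImage']
        job_id = image['job_id']['S']
        result = deserializer.deserialize(image['result'])   # DynamoDB-typed map -> plain dict
        result_json = json.dumps(result, default=str)        # stringify Decimals and friends

        rds_data.execute_statement(
            resourceArn=CLUSTER_ARN,
            secretArn=SECRET_ARN,
            database='analytics',
            sql="""
                INSERT INTO classification_results (job_id, result)
                VALUES (:job_id, CAST(:result AS JSONB))
                ON CONFLICT (job_id) DO UPDATE SET result = EXCLUDED.result
            """,
            parameters=[
                {'name': 'job_id', 'value': {'stringValue': job_id}},
                {'name': 'result', 'value': {'stringValue': result_json}},
            ]
        )
```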
Monitoring and Alerting
Every critical point in the pipeline sends alerts to Slack on failure:
```python
import json
import os

import requests


def send_slack_alert(service: str, error: str, context: dict):
    """
    Send alert to Slack channel for pipeline failures.
    """
    webhook_url = os.environ['SLACK_WEBHOOK_URL']
    message = {
        'text': f"Pipeline Alert: {service}",
        'attachments': [{
            'color': 'danger',
            'fields': [
                {'title': 'Service', 'value': service, 'short': True},
                {'title': 'Error', 'value': error, 'short': False},
                {'title': 'Context', 'value': json.dumps(context, indent=2), 'short': False}
            ]
        }]
    }
    requests.post(webhook_url, json=message, timeout=5)
```
We also built CloudWatch dashboards showing queue depths, processing latency, and GPU utilization side by side. When the GPU queue depth climbed but GPU utilization stayed at zero, we knew the cold start was happening. When CPU latency spiked, we knew a large batch had accidentally been routed to the wrong pipeline.
Results
After three months of production traffic:
| Metric | Before (Single Pipeline) | After (Dual Pipeline) |
|---|---|---|
| Large batch cost per item | $0.012 | $0.003 |
| Small batch latency (p99) | 45s | 8s |
| Demo/trial response time | 2-3 min | 15s |
| Max throughput | 50K items/hour | 500K items/hour |
| Failed jobs per day | 23 | 2 |
The GPU pipeline achieved 75% cost reduction for large batches through economies of scale. The CPU pipeline improved small batch latency by 80% by eliminating queue competition with batch jobs.
Where the Savings Came From
- GPU parallelism. Processing 10,000 items on GPU takes roughly the same time as processing 100. The cold start is constant; the marginal cost per item approaches zero (a toy cost model follows this list).
- Right-sized infrastructure. Small requests no longer spun up expensive GPU instances. Large batches no longer occupied CPU endpoints for hours.
- Scale-to-zero GPUs. We only pay for GPU time when actually processing. Nights and weekends cost nearly nothing.
- Reduced failures. Fewer timeouts and deadlocks meant less re-processing and manual intervention.
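The first point is just amortization of a fixed cost. A toy model makes it concrete (the hourly rate and timings below are illustrative placeholders, not real pricing or measured values):

```python
def gpu_cost_per_item(batch_size: int,
                      hourly_rate: float = 4.00,     # illustrative GPU instance price, $/hour
                      cold_start_s: float = 390.0,   # fixed spin-up time
                      gpu_s_per_item: float = 0.002) -> float:
    """Effective cost per item once the fixed cold start is amortized over the batch."""
    billed_seconds = cold_start_s + batch_size * gpu_s_per_item
    return hourly_rate * billed_seconds / 3600 / batch_size


print(f"{gpu_cost_per_item(100):.5f}")     # cold start dominates: pricey per item
print(f"{gpu_cost_per_item(10_000):.5f}")  # cold start amortized: per-item cost collapses
```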
Lessons Learned
1. Cold start isn’t always bad
The 6-7 minute GPU cold start seemed like a dealbreaker initially. But for batch jobs that run for hours anyway, adding 7 minutes to a 4-hour job is negligible. The key is routing requests appropriately based on latency tolerance.
2. SQS visibility timeout must exceed processing time
We lost jobs early on because visibility timeout was set to 30 seconds while some GPU inference calls took 2 minutes. The message would return to the queue and get processed again. Set visibility timeout to at least 2x your worst-case processing time.
3. DynamoDB for writes, RDS for reads
This pattern served us well. DynamoDB handles high-throughput writes without breaking a sweat. RDS handles complex analytical queries that would be expensive or impossible in DynamoDB. Use each for what it’s good at.
4. Separate queues enable independent scaling
Having dedicated queues for each microservice meant we could tune concurrency independently. The classifier might handle 50 concurrent invocations while the entity extractor handles 200. This granularity prevented bottlenecks from propagating.
5. Alert on queue depth, not just errors
Errors tell you something broke. Queue depth tells you something is about to break. We set CloudWatch alarms for queue depths exceeding historical norms, giving us time to investigate before customers noticed.
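Concretely, that means one alarm per queue with a threshold derived from historical depth. A minimal boto3 sketch (queue name, threshold, and SNS topic are placeholders; the same alarm can of course be expressed in Terraform alongside the ones shown earlier):

```python
# Sketch: alarm when queue depth exceeds its historical norm (values are placeholders).
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='classifier-queue-depth-high',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'classifier-queue-gpu'}],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=3,                 # sustained for 15 minutes, not a momentary spike
    Threshold=5000,                      # set from historical queue depth
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:pipeline-alerts'],  # placeholder topic
    TreatMissingData='notBreaching'
)
```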
What I’d Do Differently
If starting over:
- Use Step Functions for orchestration. The current Lambda-to-SQS-to-Lambda chain works, but Step Functions would provide better visibility into job progress and easier retry logic (a hypothetical sketch follows this list).
- Implement circuit breakers earlier. We added them after an outage cascaded through the entire pipeline. Should have been there from day one.
- Build the cost tracking dashboard first. We spent weeks estimating savings manually before building proper cost attribution. Real-time cost visibility would have accelerated optimization decisions.
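For the first point, a sketch of what the Step Functions version might look like: each stage becomes a Task state with declarative retries instead of a Lambda-to-SQS hop. This is a hypothetical design, not something we shipped, and the function ARNs, role, and names are placeholders:

```python
# Hypothetical Step Functions orchestration sketch; ARNs and names are placeholders.
import json

import boto3

definition = {
    "Comment": "Document classification pipeline (sketch)",
    "StartAt": "Classify",
    "States": {
        "Classify": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:classifier",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 10, "MaxAttempts": 3, "BackoffRate": 2.0}],
            "Next": "ExtractEntities"
        },
        "ExtractEntities": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:entity-extractor",
            "End": True
        }
    }
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='document-classification-pipeline',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/pipeline-step-functions'  # placeholder role
)
```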
Questions? Find me on LinkedIn.



