The Gaming Analytics Tech Stack: From Ingestion to Insights

Production-grade event pipeline architecture for processing 5 billion daily events

Processing 5 billion events per day from millions of concurrent players requires an architecture designed for fault tolerance and operational excellence. A single point of failure means lost revenue data. This article presents the production architecture proven at Call of Duty Mobile (100M+ downloads in its first week) and Amazon Games (5B+ events daily).

Two Event Types, One Critical Distinction

Gaming analytics processes two fundamentally different event sources:

Client Events (diagnostic data):

  • Performance metrics, crash reports, UI interactions
  • High volume with 10x traffic spikes during releases
  • Important for product quality, not revenue-blocking

Server Events (business-critical data):

  • Player transactions, purchases, progression
  • Steady traffic tied to concurrent player count
  • Source of truth—cannot trust client-side data for revenue

This distinction shapes every architectural decision. Client-side timestamps can be manipulated. Server events are authoritative.
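To make the distinction concrete, here is a minimal sketch of the "server time is authoritative" rule. The helper name, field names, and the five-minute skew tolerance are illustrative assumptions, not the production code:

```python
# Sketch: keep the client timestamp for diagnostics, but never trust it
# for revenue. The server's receipt time is the source of truth.
from datetime import datetime, timezone, timedelta

MAX_SKEW = timedelta(minutes=5)  # assumed tolerance for client clock drift

def authoritative_timestamp(event: dict, received_at: datetime) -> dict:
    claimed = datetime.fromisoformat(event["client_ts"])
    skew = abs(received_at - claimed)
    return {
        **event,
        "server_ts": received_at.isoformat(),   # authoritative
        "clock_skew_flagged": skew > MAX_SKEW,  # surfaces drift/manipulation
    }

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
evt = {"client_ts": "2025-01-01T11:00:00+00:00", "event_type": "purchase"}
print(authoritative_timestamp(evt, now)["clock_skew_flagged"])  # True: 1h skew
```

A purchase stamped an hour in the past by the client still lands at the server's clock; the flag lets analysts audit suspicious skew without blocking the event.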

The Architecture: Fault Isolation as a Design Principle

A critical decision: unified processing or separate paths? A single Lambda is simpler, but it creates a single point of failure.

The Production Reality: A game client release introduces a schema change. Client events fail validation. With unified processing, this blocks all events—including revenue-critical server data.

The Solution: Isolated failure domains through separate processing paths.

Layer 1: Dual Ingestion Paths

Client Path:

  • API Gateway /client-events → Client PII Lambda
  • Handles performance diagnostics and crash telemetry
  • Scales independently for traffic spikes

Server Path:

  • API Gateway /server-events → Server PII Lambda
  • Processes transactions and progression events
  • Revenue-critical, requires higher reliability
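The routing idea behind the two paths can be sketched in a few lines. The handler bodies and return values here are illustrative stand-ins, not the actual Lambda code; only the two API Gateway paths come from the article:

```python
# Sketch: each API Gateway path maps to its own handler, so a failure
# in the client path never touches the server path.

def handle_client_events(event: dict) -> dict:
    # Diagnostics path: tolerate bursts, degrade gracefully
    return {"path": "client", "status": "accepted"}

def handle_server_events(event: dict) -> dict:
    # Revenue path: strict validation, higher reliability target
    return {"path": "server", "status": "accepted"}

ROUTES = {
    "/client-events": handle_client_events,
    "/server-events": handle_server_events,
}

def ingest(path: str, event: dict) -> dict:
    return ROUTES[path](event)

print(ingest("/server-events", {"event_type": "purchase"}))
```

Because each route owns its handler, concurrency limits, deployments, and failures are scoped per path.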

Layer 2: PII Removal (Compliance Layer)

Both Lambdas strip personally identifiable information before storage, ensuring GDPR/CCPA compliance:

# Shared PII Library (Lambda Layer)
import hashlib

def hash_value(value):
    """One-way hash: count users without storing identity (salt/key it in production)."""
    if value is None:
        return None
    return hashlib.sha256(str(value).encode()).hexdigest()

def extract_geo(ip_address):
    """Resolve an IP to a country code via a GeoIP lookup (elided here)."""
    ...

def remove_pii(event: dict, source: str) -> dict:
    return {
        'event_source': source,
        # Pseudonymize identifiers
        'user_id_hash': hash_value(event.get('user_id')),
        'email_hash': hash_value(event.get('email')),
        # Coarsen location to country granularity
        'geo_country': extract_geo(event.get('ip_address')),
        # Remove PII entirely
        'player_name': None,
        'ip_address': None,
        # Preserve analytics data
        'game_id': event.get('game_id'),
        'event_type': event.get('event_type')
    }

Why Separate Lambdas?

  • Client failure doesn’t block server events
  • Independent deployment and scaling
  • Shared code library ensures consistency

What happens on failure?
Events route to Dead Letter Queues (DLQs) after 2 retries. CloudWatch alarms trigger at different thresholds:

  • Client DLQ: 50 messages → Medium priority
  • Server DLQ: 10 messages → Critical priority (revenue at risk)

Layer 3: Convergence and Dual Processing

After PII removal, events converge at Kinesis Data Streams, then split based on latency requirements:

Hot Path (1-5 minute latency):

Kinesis → Lambda → DynamoDB → Real-time Dashboard

Use cases: Concurrent users, fraud detection, server health

Cold Path (hourly updates):

Kinesis → Firehose → S3 → Airflow → Redshift → Analytics

Use cases: Retention analysis, revenue reporting, A/B testing
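One way to picture the split: Kinesis fans out to both consumers, the cold path archives everything, and the hot-path consumer filters down to the handful of operational event types it needs. The event-type names below are assumptions for illustration:

```python
# Sketch: the hot-path Lambda keeps only events that back real-time
# dashboards; the cold path (Firehose -> S3) receives the full stream.

HOT_EVENT_TYPES = {"session_start", "session_end", "purchase"}  # assumed

def hot_path_filter(batch: list) -> list:
    """Keep only the operational events the real-time dashboard needs."""
    return [e for e in batch if e.get("event_type") in HOT_EVENT_TYPES]

batch = [
    {"event_type": "session_start"},
    {"event_type": "settings_changed"},
    {"event_type": "purchase"},
]
print(len(hot_path_filter(batch)))  # 2 of 3 events take the hot path
```

Keeping the hot path narrow is what makes the cost math below work: most events never touch DynamoDB at all.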

Cost Reality: Streaming 5B events/day to DynamoDB costs ~$50K/month. Batch processing via S3/Redshift costs <$5K/month for 95% of use cases.

Infrastructure as Code

Complete stack deployed via AWS CDK:

  • Version controlled in Git
  • Reproducible dev/staging/prod environments
  • New game analytics stack deploys in under 60 minutes

Production Case Study: Traffic Spike Resilience

Scenario: Content release drives 8x traffic spike (150K events/sec vs 20K baseline).

System Response:

  1. Client PII Lambda hits its 3,000-concurrent-execution limit
  2. Client DLQ accumulates 50+ messages → Alert triggers
  3. Server Lambda operates normally—revenue tracking unaffected
  4. Kinesis buffers events successfully
  5. Hot path dashboards show player surge as expected

Resolution:

  • Increased Client Lambda concurrency to 10,000
  • Replayed failed events from DLQ
  • Zero data loss

Why it worked: Fault isolation prevented client issues from impacting server event processing.
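The replay step in the resolution can be sketched with in-memory queues standing in for SQS (a real replay would use boto3 against the DLQ and the fixed Lambda; everything here is an illustrative stand-in):

```python
# Sketch: drain the DLQ through the now-fixed processor, re-queuing any
# event that still fails — this is how "zero data loss" is achieved.
from collections import deque

def replay(dlq: deque, process) -> int:
    replayed = 0
    for _ in range(len(dlq)):
        event = dlq.popleft()
        try:
            process(event)
            replayed += 1
        except Exception:
            dlq.append(event)  # keep for the next attempt
    return replayed

dlq = deque([{"id": 1}, {"id": 2}, {"id": 3}])
print(replay(dlq, lambda e: None))  # 3 — all events reprocessed, DLQ empty
```

Because failed events were parked rather than dropped, fixing the concurrency limit and replaying the queue recovered every event.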

Key Architectural Decisions

Why Kinesis instead of Kafka?

A managed service eliminates cluster operations, offers native AWS integration, and auto-scales. At 1B-10B events/day scale, managed services provide better TCO once operational overhead is factored in.

Why separate Lambdas?

Fault isolation and independent deployment outweigh the cost of managing multiple functions. Shared Lambda Layer provides code reuse without deployment coupling.

Why dual processing paths?

Cost optimization. 95% of analytics (retention, revenue, A/B tests) work with hourly updates. Reserve real-time processing for operational metrics requiring sub-5-minute latency.

Results

This architecture has processed:

  • 100M+ user game launches without data loss
  • 8x traffic spikes during content releases
  • Multiple schema migration incidents with zero downtime
  • 5B+ events daily across multiple game titles

The key patterns—fault isolation, comprehensive error handling, differentiated monitoring—enable reliable processing at scale while maintaining operational simplicity.

What’s Next in This Series

  • Part 1: Breaking into Gaming Analytics and Data Engineering in 2025
  • Part 2: The Gaming Analytics Tech Stack: From Ingestion to Insights (Current Article)
  • Part 3: Building Event Pipelines at Scale
  • Part 4: Near Real-Time Analytics (why not “true” real-time)
  • Part 5: Data Modeling for Games (Star Schema)
  • Part 6: Player Retention & Engagement Analytics
  • Part 7: Game Economy & Monetization
  • Part 8: Infrastructure as Code with AWS CDK
  • Part 9: Data Quality & Governance at Scale
  • Part 10: Leading Analytics Teams & Scaling Impact

Each post will include real architectures, code examples, and lessons from systems serving millions of players.

About Me: I’m Sai Krishna Chaitanya Chigili, Data Engineer at Amazon Games, where I architect infrastructure processing 5B+ daily player events. Previously founding analyst for Call of Duty Mobile at Activision Blizzard.

Connect: LinkedIn | GitHub | chigili.dev

Questions about breaking into gaming analytics? Drop a comment or DM me on LinkedIn. I try to respond to everyone.

Don’t miss the series: Follow me here on DEV for Parts 2–10.

Originally published on Medium

If this post helped you, give it a ❤️ and share it with someone considering a career in gaming analytics. And if you’re already in the industry, I’d love to hear your journey in the comments.
