System Design Interview: Autocomplete / Type-ahead System (Part 2)
In Part 1, we covered the problem statement, requirements, and capacity estimation.
In this part, we’ll deep-dive into the core data structures, ranking optimizations, and high-level system architecture that make autocomplete systems fast and scalable at Google-level traffic.
📌 Table of Contents (Part 2)
- Choosing the Core Data Structure
- Optimizing the Trie Structure
- How Top K Suggestions Are Retrieved
- Building and Maintaining Top K Lists
- High Level System Architecture
- Storage, Replication, and Availability
Choosing the Core Data Structure
Autocomplete is fundamentally a prefix-matching problem.
Given a prefix, return the most relevant strings that start with it — fast.
This immediately eliminates naive solutions like scanning a list or database table. We need a structure that is prefix-aware by design.
Why a Trie Works Well
A Trie (Prefix Tree) stores strings character by character.
- Each node represents a prefix
- All descendants share that prefix
Key Advantages
- Lookup time depends only on prefix length
- No scanning of unrelated strings
- Naturally groups similar queries
- Perfect fit for prefix search
Example
Stored queries:
be, bee, bell, bent, belt
Trie traversal for prefix "be" leads to a subtree containing:
bee, bell, bent, belt
This is exactly what autocomplete needs — fast prefix isolation.
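Here's a minimal Python sketch of this idea (class and method names are illustrative, and the full-subtree collection shown here is only for demonstration; later sections replace it with pre-computed top-K lists):

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # char -> child node
        self.is_terminal = False  # True if a stored query ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.is_terminal = True

    def starts_with(self, prefix):
        """Walk down O(len(prefix)) nodes, then collect the subtree."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        return results

    def _collect(self, node, path, results):
        if node.is_terminal:
            results.append(path)
        for ch, child in node.children.items():
            self._collect(child, path + ch, results)

trie = Trie()
for q in ["be", "bee", "bell", "bent", "belt"]:
    trie.insert(q)
print(trie.starts_with("be"))  # ['be', 'bee', 'bell', 'belt', 'bent']
```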
Optimizing the Trie Structure
A naive trie becomes extremely large when storing billions of queries.
Problem with a Naive Trie
- Too many nodes
- Deep chains with single children
- High memory overhead
- Excessive pointer chasing
Compressed Trie (Radix Tree)
In many cases, nodes have only one child.
These chains can be merged into a single edge storing multiple characters.
Benefits
- Reduced memory usage
- Fewer pointer traversals
- Faster cache-friendly lookups
This optimized trie is often called a Radix Tree or Compact Trie.
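As a rough illustration, assuming the TrieNode class from the sketch above, single-child chains can be collapsed after the fact into multi-character edge labels:

```python
# Sketch: collapse single-child chains of a plain trie into multi-character edges.
# Reuses the TrieNode fields from the earlier sketch (children dict, is_terminal flag).
def compress(node):
    new_children = {}
    for ch, child in node.children.items():
        label, cur = ch, child
        # Keep absorbing the next node while the chain is unbranched
        # and no stored query terminates mid-chain.
        while len(cur.children) == 1 and not cur.is_terminal:
            next_ch, next_child = next(iter(cur.children.items()))
            label += next_ch
            cur = next_child
        compress(cur)
        new_children[label] = cur
    node.children = new_children
```

After compression, the b → e chain from the earlier example becomes a single "be" edge; lookups must then match multi-character labels instead of single characters.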
Storing Popularity Information
How popular is a full search query?
Each complete query ends at a terminal node in the trie.
At that node, we store:
- Search frequency
- Or a weighted score combining frequency + recency
Example
User searches:
"apple" → 100 searches
"application" → 60 searches
"app store" → 200 searches
Trie view:
app
├── apple (freq=100)
├── application (freq=60)
└── app store (freq=200)
At this stage:
- We know popularity of full queries
- We cannot yet answer prefix queries fast
This layer answers:
“How popular is each query?”
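A minimal sketch of this layer (field and function names are illustrative; in practice the counts come from offline log aggregation, not the live request path):

```python
# Sketch: terminal nodes carry an aggregated popularity score.
class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_terminal = False
        self.frequency = 0      # search count, or a frequency + recency weighted score

def record_query(root, query, count):
    """Offline update, e.g. applied from aggregated search logs."""
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True
    node.frequency += count

root = TrieNode()
for q, c in [("apple", 100), ("application", 60), ("app store", 200)]:
    record_query(root, q, c)
```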
How Top K Suggestions Are Retrieved
Given a prefix, how do we return the best suggestions instantly?
Naive Approach ❌
For prefix "be":
- Traverse to prefix node
- Traverse entire subtree
- Collect all terminal queries
- Sort by frequency
- Return top K
🚫 This is far too slow for real-time systems.
Pre-Computed Top K Optimization ✅
To guarantee fast responses:
👉 Each trie node stores the top K suggestions for that prefix.
So at query time:
Traverse to prefix node → return stored top-K
No subtree traversal.
No sorting.
No computation on request path.
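A hedged sketch of that serving-time lookup, assuming every node already carries a pre-computed top_k list (built offline, as described in the next section):

```python
# Sketch: every node caches its own top-K (query, frequency) pairs,
# so serving a prefix is just a short walk plus a list return.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.top_k = []      # [(query, frequency), ...], pre-sorted by frequency

def suggest(root, prefix):
    node = root
    for ch in prefix:                # O(len(prefix)) descent
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.top_k                # no subtree traversal, no sorting
```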
Storage Optimization
To reduce memory overhead, each node's top-K list can:
- Store references to terminal nodes instead of copies of the full query strings
- Store the frequency alongside each reference
This is a classic space vs latency trade-off — and autocomplete is extremely read-heavy, so it’s worth it.
Building and Maintaining Top K Lists
Top-K lists are built bottom-up.
Build Process
- Leaf nodes know their own frequency
- Parent nodes merge children’s top-K lists
- Only top K entries are retained
- Process propagates upward to root
This ensures:
- Every prefix node has pre-sorted suggestions
- Lookup time is O(prefix length)
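A minimal sketch of this offline, bottom-up pass (the value of K, the field names, and the insert helper are all illustrative):

```python
import heapq

K = 5  # number of suggestions retained per prefix node

class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.query = None    # full query string if a query ends here
        self.frequency = 0   # popularity of that query
        self.top_k = []      # pre-computed [(query, frequency), ...]

def insert(root, query, frequency):
    node = root
    for ch in query:
        node = node.children.setdefault(ch, Node())
    node.query, node.frequency = query, frequency

def build_top_k(node):
    """Offline bottom-up pass: merge children's top-K lists, keep the best K."""
    candidates = []
    if node.query is not None:
        candidates.append((node.frequency, node.query))
    for child in node.children.values():
        build_top_k(child)
        candidates.extend((f, q) for q, f in child.top_k)
    node.top_k = [(q, f) for f, q in heapq.nlargest(K, candidates)]

root = Node()
for q, f in [("apple", 100), ("application", 60), ("app store", 200)]:
    insert(root, q, f)
build_top_k(root)
# The "app" node's top_k is now:
# [('app store', 200), ('apple', 100), ('application', 60)]
```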
Two-Layer Mental Model
Think of autocomplete as two separate layers:
Layer 1: Popularity Tracking (Data Layer)
- Tracks how often full queries are searched
- Stored at terminal nodes
- Updated offline via logs and aggregation
Layer 2: Fast Retrieval (Serving Layer)
- Uses popularity data to compute rankings
- Stores only top-K per prefix
- Optimized purely for read latency
High Level System Architecture
Autocomplete is a read-heavy, latency-critical pipeline.
Serve from memory first → fall back to structured storage → never compute on the request path
Autocomplete is not a compute problem — it’s a data access problem.
HLD Goal
Show how traffic flows end-to-end and where latency is optimized.
At HLD level, interviewers want to see:
- Clear separation of responsibilities
- Read-heavy optimization
- Scalability and availability
🔹 High-Level Architecture Diagram
+-------------------+
| User (Browser) |
+-------------------+
|
v
+-------------------+
| CDN / Edge |
| (Optional Cache) |
+-------------------+
|
v
+-------------------+
| Global Load |
| Balancer |
+-------------------+
|
v
+-------------------+
| Stateless App |
| Servers |
+-------------------+
|
v
+-------------------+
| In-Memory Cache |
| (Redis) |
+-------------------+
|
Cache Miss
v
+-------------------+
| Distributed Trie |
| Storage |
+-------------------+
🔹 HLD Component Responsibilities
1. Client (Browser / Mobile)
- Sends only prefix, not full query
- Debounces keystrokes (e.g., 100–150 ms)
- Reduces unnecessary backend calls
2. CDN / Edge Cache (Optional but Powerful)
- Caches extremely hot prefixes ("a", "how", "the")
- Reduces round-trip latency
- Shields backend during traffic spikes
3. Global Load Balancer
- Routes user to nearest region
- Distributes traffic evenly
- Removes unhealthy instances
4. Stateless Application Servers
- Input normalization ("How " → "how")
- Cache lookup orchestration (sketched after this list)
- Fallback to trie storage
- No state → easy horizontal scaling
5. Redis (Hot Prefix Cache)
- Stores top-K suggestions per prefix
- Sub-millisecond response
- TTL auto-expires stale results
6. Distributed Trie Storage
- Stores pre-computed top-K per prefix
- Read-only on serving path
- Partitioned + replicated
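To make the serving path concrete, here is a rough sketch of the app server's orchestration. It assumes the redis-py client; the key scheme, TTL, and the fetch_from_trie_storage helper are purely illustrative:

```python
import json
import redis  # redis-py client, assumed available

r = redis.Redis(host="localhost", port=6379)   # illustrative connection
CACHE_TTL_SECONDS = 300                         # illustrative TTL

def fetch_from_trie_storage(prefix):
    """Placeholder for a read against the distributed trie store."""
    return []  # real implementation returns the pre-computed top-K for this prefix

def suggestions(raw_prefix):
    prefix = raw_prefix.strip().lower()          # input normalization ("How " -> "how")
    cached = r.get("ac:" + prefix)
    if cached is not None:                       # hot path: serve straight from memory
        return json.loads(cached)
    top_k = fetch_from_trie_storage(prefix)      # cache miss: read pre-computed results
    r.setex("ac:" + prefix, CACHE_TTL_SECONDS, json.dumps(top_k))
    return top_k
```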
🎯 HLD Interview Explanation (1–2 lines)
“Autocomplete is a read-heavy system, so we always try to serve from memory first.
Only on cache miss do we hit distributed trie storage, which already has pre-computed top-K results.”
Storage, Replication, and Availability
A full autocomplete trie cannot live on a single machine.
Distributed Trie Storage
The trie is:
- Logically one structure
- Physically distributed across many servers
Partitioning Strategy
Common approach: prefix-based partitioning
Examples:
- a–c → Shard 1
- d–f → Shard 2
- Hot prefixes (a, s, th) further subdivided
This keeps shards balanced and scalable.
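A toy routing function in this spirit (the ranges, shard names, and hot-prefix handling are all illustrative):

```python
HOT = ("th", "a", "s")   # hot prefixes from the example above, longest first

def shard_for(prefix):
    """Illustrative prefix-to-shard routing; names and ranges are made up."""
    for hot in HOT:
        if prefix.startswith(hot) and len(prefix) > len(hot):
            # Hot prefixes get subdivided by their next character ("ap" vs "an").
            return f"shard-{hot}-{prefix[len(hot)]}"
    first = prefix[0]
    if "a" <= first <= "c":
        return "shard-1"
    if "d" <= first <= "f":
        return "shard-2"
    return f"shard-{first}"   # remaining ranges omitted for brevity

print(shard_for("bent"))    # shard-1
print(shard_for("apple"))   # shard-a-p
```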
Replication Strategy
Each trie partition is replicated across multiple servers.
Why Replication Matters
- High Availability: node failures don't impact service
- Fault Tolerance: handles hardware failures and deployments
- Read Scalability: replicas share massive QPS load
Storage Technology Choices
Cassandra (Durable Storage)
Best fit when:
- Large trie partitions
- Very high read throughput
- Cross-region replication needed
Usage pattern:
- Trie partitions stored as serialized blocks
- Loaded into memory on startup
- Cassandra acts as durable backing store
Strengths:
- Horizontal scalability
- High availability
- Tunable consistency
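A hedged sketch of the startup load described above, assuming the DataStax Python driver; the keyspace, table, and column names are hypothetical:

```python
import pickle
from cassandra.cluster import Cluster  # DataStax Python driver, assumed available

def load_partition(shard_id):
    """Sketch: pull one serialized trie partition from Cassandra at process startup."""
    cluster = Cluster(["cassandra-host"])          # illustrative contact point
    session = cluster.connect("autocomplete")      # hypothetical keyspace
    row = session.execute(
        "SELECT trie_blob FROM trie_partitions WHERE shard_id = %s", (shard_id,)
    ).one()
    return pickle.loads(row.trie_blob)             # rebuild the in-memory trie
```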
ZooKeeper (Metadata & Coordination)
ZooKeeper is not used to store trie data.
It is used for:
- Partition metadata
- Prefix-to-server mappings
- Leader election (if required)
- Atomic version switching during trie rebuilds
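A minimal sketch of that atomic version switch, assuming the kazoo client; the znode path and version string are hypothetical:

```python
from kazoo.client import KazooClient  # kazoo client, assumed available

zk = KazooClient(hosts="zk-host:2181")   # illustrative ensemble address
zk.start()
zk.ensure_path("/autocomplete/active_trie_version")
# One atomic write flips every serving node to the freshly built trie;
# readers watch this znode and reload the matching partition set.
zk.set("/autocomplete/active_trie_version", b"build-2024-06-01")
```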
Replication Guarantees
Replication ensures:
- ✅ High availability
- ✅ Fault tolerance
- ✅ Read scalability
🎯 Final Takeaway (Part 2)
A real-world autocomplete system succeeds because it:
- Uses tries for prefix isolation
- Pre-computes top-K suggestions
- Serves requests from memory first
- Pushes all heavy computation offline
- Scales via partitioning and replication
This is the level of clarity interviewers look for.
