A Transformer is a neural network architecture that processes sequences by learning which parts of the input to pay attention to. The architecture has two main blocks:
- Encoder: reads and understands the input
- Decoder: generates the output
We are going to build each component of the Transformer one by one.
Task 1: Input Embeddings + Positional Encoding
Transformers take words as input, but neural networks need numbers. So we convert each word into a vector using an embedding layer. But here’s the problem: unlike RNNs, transformers process all words at once and have no sense of order. To fix this, we add positional encoding, which tells the model where each word sits in the sequence.
The formula for positional encoding (d_model = number of features in each embedding):
for pos in sequence:
    for i in range(0, d_model, 2):
        # even dimensions use sine
        PE(pos, i) = sin(pos / 10000**(i / d_model))
        # odd dimensions use cosine
        PE(pos, i + 1) = cos(pos / 10000**(i / d_model))
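As a quick sanity check of the formula, here are a few hand-computable values (a small sketch, assuming d_model = 4):
import math

d_model, pos = 4, 1
# even dimension i = 0 pairs with odd dimension i + 1 = 1
print(math.sin(pos / 10000 ** (0 / d_model)))  # PE(1, 0) = sin(1)    ≈ 0.8415
print(math.cos(pos / 10000 ** (0 / d_model)))  # PE(1, 1) = cos(1)    ≈ 0.5403
# for i = 2 the denominator is 10000**0.5 = 100, so the wave oscillates more slowly
print(math.sin(pos / 10000 ** (2 / d_model)))  # PE(1, 2) = sin(0.01) ≈ 0.0100
print(math.cos(pos / 10000 ** (2 / d_model)))  # PE(1, 3) = cos(0.01) ≈ 0.9999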
Complete the code below:
import torch
import torch.nn as nn
import math
import numpy as np
class InputEmbedding(nn.Module):
def __init__(self, vocab_size, d_model):
super().__init__()
# d_model = size of each token vector
self.embedding = nn.Embedding(vocab_size, d_model)
self.d_model = d_model
def forward(self, x):
# Scale embeddings by sqrt(d_model)
return self.embedding(x) * math.sqrt(self.d_model)
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_seq_len=512, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(dropout)
# TODO: Create a matrix of shape (max_seq_len, d_model)
# fill it using the sin/cos formula above
# Register it as a buffer (not a learnable parameter)
pass
def forward(self, x):
# TODO: Add positional encoding to x
# x shape: (batch, seq_len, d_model)
pass
Verify: Print the PE matrix and check that every value lies between -1 and 1.
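One way to run this check, assuming you registered the matrix as a buffer named pe (the name is an assumption; adjust it to whatever you used):
pe_layer = PositionalEncoding(d_model=64, max_seq_len=100)
pe = pe_layer.pe                          # assumed buffer name
print(pe.shape)                           # should contain the dimensions (100, 64)
print(pe.min().item(), pe.max().item())   # both values should lie in [-1, 1]
assert pe.abs().max().item() <= 1.0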
Task 2: Calculating Attention Dot-Product
Attention is the core idea of transformers. It lets each word look at other words and decide how much to “focus” on them.
We compute attention using three vectors for each word:
- Q (Query): What am I looking for?
- K (Key): What do I contain?
- V (Value): What do I actually give?
The formula for Attention is:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
In PyTorch, we often work with 4D tensors for the attention mechanism. The shape looks like:
[Batch Size, Num Heads, Sequence Length, Head Dimension]
Complete the code below:
def scaled_dot_product_attention(Q, K, V, mask=None):
d_k = Q.shape[-1] # retrieve the last dimension of the tensor, which is the Head Dimension
# TODO: Step 1 - compute scores: QK^T / sqrt(d_k)
'''
Hint:
To perform a matrix multiplication QK^T, the inner dimensions must match
- Q has shape (..., Sequence Length, Head Dimension)
- K originally has shape (..., Sequence Length, Head Dimension)
- By transposing the last two axes, K^T becomes (..., Head Dimension, Sequence Length)
'''
# TODO: Step 2 - apply mask (set masked positions to -1e9 before softmax)
# TODO: Step 3 - softmax over last dimension
# TODO: Step 4 - multiply by V
return output, attention_weights
Verify: Pass in random Q, K, V tensors and confirm output shape matches V’s shape. Confirm attention weights sum to 1 across the last dimension.
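A possible verification sketch, assuming you have filled in scaled_dot_product_attention above:
batch, heads, seq_len, d_k = 2, 4, 10, 16
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)
V = torch.randn(batch, heads, seq_len, d_k)

output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # expected: torch.Size([2, 4, 10, 16]), same as V
print(attn.shape)     # expected: torch.Size([2, 4, 10, 10])
# every row of attention weights should sum to (approximately) 1
assert torch.allclose(attn.sum(dim=-1), torch.ones(batch, heads, seq_len))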
Task 3: Multi-Head Attention
Instead of running attention once, we run it h times in parallel with different learned projections. Each “head” learns to attend to different things.
Idea:
- Linearly project Q, K, V into h smaller versions
- Run attention in each head
- Concatenate all head outputs
- Pass through a final linear layer
Complete the code below:
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super().__init__()
self.d_model = d_model
self.d_k = d_model // num_heads
self.num_heads = num_heads
self.W_Q = nn.Linear(d_model, d_model)
self.W_K = nn.Linear(d_model, d_model)
self.W_V = nn.Linear(d_model, d_model)
self.W_O = nn.Linear(d_model, d_model)
def forward(self, Q, K, V, mask=None):
batch_size = Q.shape[0]
# TODO: Step 1 - pass Q, K, V through their linear layers
'''
Hint: This will give outputs of shape (batch, seq_len, d_model)
for all of Q, K and V
'''
# TODO: Step 2 - reshape to (batch, num_heads, seq_len, d_k)
# TODO: Step 3 - call scaled_dot_product_attention
# TODO: Step 4 - reshape output back to (batch, seq_len, d_model)
# TODO: Step 5 - pass through W_O
pass
Verify: Both the input and output shapes should be (batch, seq_len, d_model).
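A quick shape check you could run once the TODOs are filled in (a sketch using arbitrary sizes):
d_model, num_heads = 64, 4
mha = MultiHeadAttention(d_model, num_heads)
x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out = mha(x, x, x)                # self-attention: Q = K = V = x
print(out.shape)                  # expected: torch.Size([2, 10, 64])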
Task 4: Feed Forward Network + Residual Connection
After attention, each word passes through a small feed forward network independently. The hidden dimension of the feed forward network (d_ff) is usually 4x larger than the model dimension (d_model).
Imagine attention as the step where you gather information from the other words. The feed forward network is then where you think about and learn from the information you have gathered. While learning, you also need to remember what you started with; the residual connection is what carries that memory forward.
Complete the code below:
class FeedForward(nn.Module):
def __init__(self, d_model, d_ff, dropout=0.1):
super().__init__()
# TODO: Define two linear layers and a dropout
pass
def forward(self, x):
# TODO: Implement the forward pass with ReLU and dropout between the linear layers
pass
class ResidualConnection(nn.Module):
def __init__(self, d_model, dropout=0.1):
super().__init__()
self.norm = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, sublayer):
# TODO: Apply: x + dropout(sublayer(norm(x)))
pass
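Verify: Both modules should preserve the (batch, seq_len, d_model) shape. A small check you could run once the TODOs are done (a sketch):
d_model, d_ff = 64, 256               # d_ff is typically 4 * d_model
ff = FeedForward(d_model, d_ff)
res = ResidualConnection(d_model)
x = torch.randn(2, 10, d_model)

print(ff(x).shape)        # expected: torch.Size([2, 10, 64])
print(res(x, ff).shape)   # residual connection wrapped around the feed forward, same shape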
Task 5: Encoder
Let’s stack what we have built so far in order to build an EncoderBlock.
Complete the code below:
class EncoderBlock(nn.Module):
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
self.attention = MultiHeadAttention(d_model, num_heads)
self.ff = FeedForward(d_model, d_ff, dropout)
self.residual1 = ResidualConnection(d_model, dropout)
self.residual2 = ResidualConnection(d_model, dropout)
def forward(self, x, mask=None):
# TODO: Pass x through attention with residual connection
'''
Hint:
Each token attends to all tokens
'''
# TODO: Pass result through feed-forward with residual connection
pass
We stack multiple EncoderBlocks together in order to form the Encoder.
Complete the code below:
class Encoder(nn.Module):
def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
# TODO: Stack num_layers EncoderBlocks using nn.ModuleList
# TODO: Add a final LayerNorm
pass
def forward(self, x, mask=None):
# TODO: Pass x through each block, then the final norm
pass
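Verify: A shape sanity check for the full encoder stack (a sketch with arbitrary hyperparameters):
encoder = Encoder(num_layers=2, d_model=64, num_heads=4, d_ff=256)
x = torch.randn(2, 10, 64)    # already embedded input: (batch, seq_len, d_model)
print(encoder(x).shape)       # expected: torch.Size([2, 10, 64])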
Task 6: Decoder
The DecoderBlock is similar to the EncoderBlock, but it has an extra attention step.
Self-Attention
Each token can only see the tokens before it. We mask out the tokens after the current token to prevent cheating during training. This ensures that the model has no way of seeing future tokens.
E.g.: You are writing sentence by sentence. You can only look back at what you already wrote. You cannot peek ahead at future sentences.
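The standard way to build such a mask is a lower-triangular matrix: position i may attend to positions 0..i only. A minimal sketch using torch.tril:
seq_len = 5
tgt_mask = torch.tril(torch.ones(seq_len, seq_len))
print(tgt_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# 1 = allowed to attend, 0 = masked out (set to -1e9 before the softmax)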
Cross Attention
In this attention mechanism, you are asking which words from the source you should focus on, given the word you are currently generating.
- Q: What word are you currently writing?
- K, V: The original sentence from the encoder.
Complete the code below:
class DecoderBlock(nn.Module):
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
'''
You are writing sentence by sentence.
You can look back at what you already wrote.
You cannot peek ahead at future sentences.
The mask prevents cheating -- you cannot see future words
'''
self.self_attention = MultiHeadAttention(d_model, num_heads)
'''
Q = What word am I writing now?
K, V = The original sentence from the encoder
You are asking: which word should I focus on, given the word I am currently writing?
'''
self.cross_attention = MultiHeadAttention(d_model, num_heads)
self.ff = FeedForward(d_model, d_ff, dropout)
self.residual1 = ResidualConnection(d_model, dropout)
self.residual2 = ResidualConnection(d_model, dropout)
self.residual3 = ResidualConnection(d_model, dropout)
def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
# TODO: Masked self-attention (Q=K=V=x, use tgt_mask)
# TODO: Cross-attention (Q=x, K=V=encoder_output, use src_mask)
# TODO: Feed Forward
pass
class Decoder(nn.Module):
def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
# TODO: Stack num_layers DecoderBlocks using nn.ModuleList
# TODO: Add a final LayerNorm
pass
def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
# TODO: pass x through each block (with encoder_output), then final norm
pass
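Verify: As with the encoder, a quick shape check helps before moving on (a sketch; the encoder output here is just a random placeholder):
decoder = Decoder(num_layers=2, d_model=64, num_heads=4, d_ff=256)
x = torch.randn(2, 8, 64)          # embedded target tokens: (batch, tgt_seq_len, d_model)
enc_out = torch.randn(2, 10, 64)   # stand-in for the encoder output
print(decoder(x, enc_out).shape)   # expected: torch.Size([2, 8, 64])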
Task 7: Transformer
Now that we have built all the components, let’s assemble them together to build a Transformer.
src → Embedding + PE → Encoder ──────────────────┐
tgt → Embedding + PE → Decoder (+ encoder output) → Linear → Softmax → Prediction
Complete the code below:
class Transformer(nn.Module):
def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
num_layers=6, d_ff=2048, max_seq_len=512, dropout=0.1):
super().__init__()
# TODO: Create src and tgt embedding layers
# TODO: Create src and tgt positional encodings
# TODO: Create Encoder and Decoder
# TODO: Create final linear projection layer (d_model → tgt_vocab)
pass
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
# TODO: Embed + encode src
# TODO: Embed + decode tgt using encoder output
# TODO: Project to vocabulary
pass
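Verify: Once the TODOs are complete, a single forward pass makes a good first smoke test (a sketch with small, arbitrary sizes):
demo_model = Transformer(src_vocab=100, tgt_vocab=100, d_model=64, num_heads=4,
                         num_layers=2, d_ff=256)
src = torch.randint(0, 100, (2, 10))   # (batch, src_seq_len)
tgt = torch.randint(0, 100, (2, 8))    # (batch, tgt_seq_len)
out = demo_model(src, tgt)
print(out.shape)                       # expected: torch.Size([2, 8, 100]) -- one score per tgt_vocab token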
Task 8: Test your transformer
Complete the code below:
# Hyperparameters
VOCAB_SIZE = 20
D_MODEL = 64
NUM_HEADS = 4
NUM_LAYERS = 2
D_FF = 128
SEQ_LEN = 10
BATCH_SIZE = 32
EPOCHS = 300
def generate_batch(batch_size, seq_len, vocab_size):
# Source: random token sequences
src = torch.randint(1, vocab_size, (batch_size, seq_len))
tgt = src.clone() # Target is the same as source (copy task)
return src, tgt
model = Transformer(VOCAB_SIZE, VOCAB_SIZE, D_MODEL, NUM_HEADS, NUM_LAYERS, D_FF)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# TODO: Write the training loop
# For each epoch:
# 1. Generate a batch
# 2. Forward pass (feed src and tgt[:, :-1] as decoder input)
# 3. Compare output to tgt[:, 1:] (shifted by one - next token prediction)
# 4. Backprop and update
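One possible way to write this loop (a sketch, not the only valid version; it assumes your Transformer returns raw logits of shape (batch, seq_len, vocab_size) with no softmax, which is what nn.CrossEntropyLoss expects, so output and target are flattened before the loss):
for epoch in range(EPOCHS):
    model.train()
    src, tgt = generate_batch(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)

    decoder_input = tgt[:, :-1]    # everything except the last token
    expected = tgt[:, 1:]          # everything except the first token

    seq = decoder_input.shape[1]
    tgt_mask = torch.tril(torch.ones(seq, seq)).unsqueeze(0).unsqueeze(0)  # causal mask

    logits = model(src, decoder_input, tgt_mask=tgt_mask)   # (batch, seq, VOCAB_SIZE)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), expected.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")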
Play around: Try tuning the EPOCHS and Learning Rate to see how it changes the training.
You can test the model after training using the code below:
model.eval()
with torch.no_grad():
test_src = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
test_tgt = torch.zeros((1, SEQ_LEN), dtype=torch.long)
for i in range(SEQ_LEN):
tgt_mask = torch.tril(torch.ones(i+1, i+1)).unsqueeze(0).unsqueeze(0)
output = model(test_src, test_tgt[:, :i+1], tgt_mask=tgt_mask)
next_token = output[:, -1, :].argmax(dim=-1)
if i < SEQ_LEN - 1:
test_tgt[:, i+1] = next_token
print("Input: ", test_src)
print("Output:", test_tgt)
Task 9: Optimization
After running this experiment, you should notice that the model isn’t learning to copy correctly and the output is random. We are going to introduce delimiters: a start token and an end token.
Problem without the start token:
Source: [1, 2, 3, 4, 5]
Target: [1, 2, 3, 4, 5] ← What should the FIRST prediction be?
Decoder sees: [] (nothing!)
Should predict: 1
But how does it know to output “1” when it hasn’t seen ANY context yet?
The model has no anchor point; it is randomly guessing the first token every time.
After we introduce start token:
Source: [1, 2, 3, 4, 5]
Decoder Input: [0, 1, 2, 3, 4] ← Starts with 0
Target Output: [1, 2, 3, 4, 5] ← Shifted by one
Position 0: Sees [0] → predicts 1
Position 1: Sees [0, 1] → predicts 2
Position 2: Sees [0, 1, 2] → predicts 3
Now the decoder always has context to work with.
Complete the code:
# Hyperparameters
VOCAB_SIZE = 20
D_MODEL = 64
NUM_HEADS = 4
NUM_LAYERS = 2
D_FF = 128
SEQ_LEN = 10
BATCH_SIZE = 32
EPOCHS = 1000
def generate_batch(batch_size, seq_len, vocab_size):
src = torch.randint(1, vocab_size, (batch_size, seq_len))
# Decoder input: prepend the start token 0 -> [0, src...]
tgt_input = torch.cat([torch.zeros(batch_size, 1, dtype=torch.long), src], dim=1)
# Target output: append the end token 0 -> [src..., 0]
tgt_output = torch.cat([src, torch.zeros(batch_size, 1, dtype=torch.long)], dim=1)
return src, tgt_input, tgt_output
model = Transformer(VOCAB_SIZE, VOCAB_SIZE, D_MODEL, NUM_HEADS, NUM_LAYERS, D_FF)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
# TODO: Write the training loop
# For each epoch:
# 1. Generate a batch
# 2. Forward pass (feed src and tgt_input as decoder input)
# 3. Compare output to tgt_output (shifted by one - next token prediction)
# 4. Backprop and update
You can test the model after training with the code below:
# Post Training
model.eval()
with torch.no_grad():
test_src = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
output_seq = [0] # ← Start with explicit start token
for i in range(SEQ_LEN):
tgt_input = torch.tensor([output_seq]) # ← Build sequence progressively
tgt_mask = torch.tril(torch.ones(len(output_seq), len(output_seq))).unsqueeze(0).unsqueeze(0)
output = model(test_src, tgt_input, tgt_mask=tgt_mask)
next_token = output[:, -1, :].argmax().item() # ← Get single value
output_seq.append(next_token) # ← Append to list
print("Input: ", test_src)
print("Output:", output_seq[1:]) # ← Skip start token [0]
Conclusion
Congratulations, you have built a functioning Transformer model from scratch. You can refer to the complete code in this Colab: Link to code
