Building a Transformer from Scratch

A Transformer is a neural network architecture that processes sequences by learning which parts of the input to pay attention to. The architecture has two main blocks:

  • Encoder: reads and understands the input
  • Decoder: generates the output

We are going to build each component of the transformer one by one.

Task 1: Input Embeddings + Positional Encoding

Transformers take words as input, but neural networks need numbers. So we convert each word into a vector using an embedding layer. But here’s the problem: unlike RNNs, transformers process all words at once and have no sense of order. To fix this, we add positional encoding, which tells the model where each word sits in the sequence.

The formula for positional encoding is:

for pos in range(seq_len):                 # position of the word in the sequence
  for i in range(0, d_model, 2):           # d_model = number of features in each embedding
    # even index
    PE(pos, i)   = sin(pos / 10000**(i / d_model))
    # odd index
    PE(pos, i+1) = cos(pos / 10000**(i / d_model))

Complete the code below:

import torch
import torch.nn as nn
import math
import numpy as np

class InputEmbedding(nn.Module):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    # d_model = size of each token vector
    self.embedding = nn.Embedding(vocab_size, d_model)
    self.d_model = d_model

  def forward(self, x):
    # Scale embeddings by sqrt(d_model)
    return self.embedding(x) * math.sqrt(self.d_model)

class PositionalEncoding(nn.Module):
  def __init__(self, d_model, max_seq_len=512, dropout=0.1):
    super().__init__()
    self.dropout = nn.Dropout(dropout)
    # TODO: Create a matrix of shape (max_seq_len, d_model)
    # fill it using the sin/cos formula above
    # Register it as a buffer (not a learnable parameter)
    pass

  def forward(self, x):
    # TODO: Add positional encoding to x
    # x shape: (batch, seq_len, d_model)
    pass

Verify: Print the PE matrix and verify that all values lie between -1 and 1.
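
If you get stuck, here is one possible sketch of the TODOs (it reuses the torch and math imports from above; the variable names are only a suggestion, not the definitive implementation):

class PositionalEncoding(nn.Module):
  def __init__(self, d_model, max_seq_len=512, dropout=0.1):
    super().__init__()
    self.dropout = nn.Dropout(dropout)
    # (max_seq_len, d_model) matrix that will hold the encodings
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)  # (max_seq_len, 1)
    # 10000**(i/d_model) for the even indices i, computed in log space
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd indices
    # Register as a buffer: saved with the model, but not a learnable parameter
    self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_seq_len, d_model)

  def forward(self, x):
    # x: (batch, seq_len, d_model) -- add the encodings for the first seq_len positions
    x = x + self.pe[:, :x.size(1), :]
    return self.dropout(x)

# Quick check: every entry of the PE matrix should lie between -1 and 1
pos_enc = PositionalEncoding(d_model=64, max_seq_len=128)
print(pos_enc.pe.min().item(), pos_enc.pe.max().item())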

Task 2: Calculating Attention Dot-Product

Attention is the core idea of transformers. It lets each word look at other words and decide how much to “focus” on them.

We compute attention using three vectors for each word:

  • Q (Query): What am I looking for?
  • K (Key): What do I contain?
  • V (Value): What do I actually give?

The formula for Attention is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

In PyTorch, we often work with 4D tensors for the attention mechanism. The shape looks like:

[Batch Size, Num Heads, Sequence Length, Head Dimension]

Complete the code below:

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1] # retrieve the last dimension of the tensor, which is the Head Dimension

    # TODO: Step 1 - compute scores: QK^T / sqrt(d_k)
    '''
    Hint:
      To perform a matrix multiplication QK^T, the inner dimensions must match
      - Q has shape (..., Sequence Length, Head Dimension)
      - K originally has shape (..., Sequence Length, Head Dimension)
      - By transposing the last two axes, K^T becomes (..., Head Dimension, Sequence Length)
    '''
    # TODO: Step 2 - apply mask (set masked positions to -1e9 before softmax)
    # TODO: Step 3 - softmax over last dimension
    # TODO: Step 4 - multiply by V

    return output, attention_weights

Verify: Pass in random Q, K, V tensors and confirm output shape matches V’s shape. Confirm attention weights sum to 1 across the last dimension.
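
For reference, one possible sketch of the function together with the verification (same names and shapes as above; treat it as a sketch, not the only way to write it):

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]

    # Step 1: QK^T / sqrt(d_k) -> (..., seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Step 2: hide masked positions before the softmax
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 3: softmax over the last dimension (each row sums to 1)
    attention_weights = torch.softmax(scores, dim=-1)

    # Step 4: weighted sum of the values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Quick check with random tensors: (batch=2, heads=4, seq_len=5, head_dim=8)
Q = torch.randn(2, 4, 5, 8)
K = torch.randn(2, 4, 5, 8)
V = torch.randn(2, 4, 5, 8)
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)            # torch.Size([2, 4, 5, 8]) -- matches V's shape
print(weights.sum(dim=-1))  # all ones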

Task 3: Multi-Head Attention

Instead of running attention once, we run it h times in parallel with different learned projections. Each “head” learns to attend to different things.

Idea:

  1. Linearly project Q, K, V into h smaller versions (see the reshaping sketch after this list)
  2. Run attention in each head
  3. Concatenate all head outputs
  4. Pass through a final linear layer
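
To make the split and merge concrete, here is a small self-contained sketch of the reshaping (the dimensions below are just an example, not values from the tasks):

# Example dimensions: batch=2, seq_len=5, d_model=64, num_heads=4
batch, seq_len, d_model, num_heads = 2, 5, 64, 4
d_k = d_model // num_heads

x = torch.randn(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 4, 5, 16])

# Merge back: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(merged.shape)  # torch.Size([2, 5, 64])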

Complete the code below:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_model // num_heads
        self.num_heads = num_heads

        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.shape[0]

        # TODO: Step 1 - pass Q, K, V through their linear layers
        '''
           Hint: This gives outputs of shape (batch, seq_len, d_model)
           for Q, K, and V
        '''
        # TODO: Step 2 - reshape to (batch, num_heads, seq_len, d_k)
        # TODO: Step 3 - call scaled_dot_product_attention
        # TODO: Step 4 - reshape output back to (batch, seq_len, d_model)
        # TODO: Step 5 - pass through W_O
        pass

Verify: Both the input and output shapes should be (batch, seq_len, d_model).

Task 4: Feed Forward Network + Residual Connection

After attention, each word passes through a small feed forward network independently. The hidden dimension of the feed forward network (d_ff) is usually 4x larger than the model dimension (d_model).
Imagine that attention is where you gather information from the other words. The Feed Forward network is then where you think about and learn from the information you have gathered. While learning, you still need to remember what you already knew; the Residual Connection is what carries that memory forward.
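
For example, with d_model = 512 the hidden layer would typically have d_ff = 2048. A minimal sketch of the expand-then-contract shape (illustrative values only):

d_model, d_ff = 512, 2048   # d_ff is typically 4 * d_model

ff = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # contract back to the model dimension
)

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
print(ff(x).shape)               # torch.Size([2, 10, 512]) -- shape is preserved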

Complete the code below:

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        # TODO: Define two linear layers and a dropout
        pass

    def forward(self, x):
        # TODO: Implement the forward pass with ReLU and dropout between the linear layers
        pass

class ResidualConnection(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # TODO: Apply: x + dropout(sublayer(norm(x)))
        pass

Task 5: Encoder

Let’s stack what we have built so far to form an EncoderBlock.

Complete the code below:

class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.residual1 = ResidualConnection(d_model, dropout)
        self.residual2 = ResidualConnection(d_model, dropout)

    def forward(self, x, mask=None):
        # TODO: Pass x through attention with residual connection
        '''
        Hint:
          Each token attends to all tokens
        '''
        # TODO: Pass result through feed-forward with residual connection
        pass

We stack several Encoder Blocks on top of each other to form the Encoder.
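
The stacking itself is just a loop over an nn.ModuleList. A minimal sketch of the pattern (using nn.Linear as a stand-in layer, not the EncoderBlock above):

class Stack(nn.Module):
    def __init__(self, num_layers, d_model):
        super().__init__()
        # nn.ModuleList registers every layer so their parameters get trained
        self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.norm(x)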

Complete the code below:

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # TODO: Stack num_layers EncoderBlocks using nn.ModuleList
        # TODO: Add a final LayerNorm
        pass

    def forward(self, x, mask=None):
        # TODO: Pass x through each block, then the final norm
        pass

Task 6: Decoder

The Decoder Block is similar to the encoder, but it has an extra attention step.

Self-Attention

Each token can only see the tokens before it. We mask out the tokens after the current one to prevent cheating during training. This ensures that the model has no way of seeing future tokens.

E.g.: You are writing sentence by sentence. You can only look back at what you have already written. You cannot peek ahead at future sentences.
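
The mask itself is just a lower-triangular matrix (the same convention as the test code later: 1 = may attend, 0 = hidden). A small sketch:

seq_len = 4
tgt_mask = torch.tril(torch.ones(seq_len, seq_len))
print(tgt_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# Row i has 1s only up to column i, so token i may attend to tokens 0..i;
# the 0 positions are the ones set to -1e9 before the softmax.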

Cross Attention

In this attention mechanism, you are asking which words in the source sentence you should focus on, based on the word you are currently generating.

  • Q: What word are you currently writing?
  • K, V: The original sentence from the encoder.

Complete the code below:

class DecoderBlock(nn.Module):
  def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
    super().__init__()
    '''
    You are writing sentence by sentence.
    You can look back at what you already wrote.
    You cannot peek ahead at future sentences.
    The mask prevents cheating -- you cannot see future words
    '''
    self.self_attention = MultiHeadAttention(d_model, num_heads)
    '''
    Q = What word am I writing now?
    K, V = The original sentence from the encoder
    You are asking: which words should I focus on, given the word I am currently writing?
    '''
    self.cross_attention = MultiHeadAttention(d_model, num_heads) 
    self.ff = FeedForward(d_model, d_ff, dropout)
    self.residual1 = ResidualConnection(d_model, dropout)
    self.residual2 = ResidualConnection(d_model, dropout)
    self.residual3 = ResidualConnection(d_model, dropout)

  def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
    # TODO: Masked self-attention (Q=K=V=x, use tgt_mask)
    # TODO: Cross-attention (Q=x, K=V=encoder_output, use src_mask)
    # TODO: Feed Forward
    pass

class Decoder(nn.Module):
  def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
    super().__init__()
    # TODO: Stack num_layers DecoderBlocks using nn.ModuleList
    # TODO: Add a final LayerNorm
    pass

  def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
    # TODO: pass x through each block (with encoder_output), then final norm
    pass

Task 7: Transformer

Now that we have built all the components, let’s assemble them into a complete Transformer.

src → Embedding + PE → Encoder ──────────────────┐
tgt → Embedding + PE → Decoder (+ encoder output) → Linear → Softmax → Prediction

Complete the code below:

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_seq_len=512, dropout=0.1):
        super().__init__()
        # TODO: Create src and tgt embedding layers
        # TODO: Create src and tgt positional encodings
        # TODO: Create Encoder and Decoder
        # TODO: Create final linear projection layer (d_model → tgt_vocab)
        pass

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # TODO: Embed + encode src
        # TODO: Embed + decode tgt using encoder output
        # TODO: Project to vocabulary
        pass

Task 8: Test your transformer

Complete the code below:

# Hyperparameters
VOCAB_SIZE = 20
D_MODEL = 64
NUM_HEADS = 4
NUM_LAYERS = 2
D_FF = 128
SEQ_LEN = 10
BATCH_SIZE = 32
EPOCHS = 300

def generate_batch(batch_size, seq_len, vocab_size):
    # Source: random token sequences
    src = torch.randint(1, vocab_size, (batch_size, seq_len))
    tgt = src.clone()  # Target is the same as source (copy task)
    return src, tgt

model = Transformer(VOCAB_SIZE, VOCAB_SIZE, D_MODEL, NUM_HEADS, NUM_LAYERS, D_FF)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# TODO: Write the training loop
# For each epoch:
#   1. Generate a batch
#   2. Forward pass (feed src and tgt[:, :-1] as decoder input)
#   3. Compare output to tgt[:, 1:] (shifted by one - next token prediction)
#   4. Backprop and update
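
One possible sketch of the loop (it assumes your Transformer from Task 7 is complete and that your attention broadcasts a (seq_len, seq_len) mask; decoder_input and target_output are just names chosen here):

for epoch in range(EPOCHS):
    src, tgt = generate_batch(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)

    decoder_input = tgt[:, :-1]   # everything except the last token
    target_output = tgt[:, 1:]    # everything except the first token

    # Causal mask so each position only attends to earlier positions
    seq = decoder_input.size(1)
    tgt_mask = torch.tril(torch.ones(seq, seq))

    output = model(src, decoder_input, tgt_mask=tgt_mask)  # (batch, seq_len-1, vocab)
    loss = criterion(output.reshape(-1, VOCAB_SIZE), target_output.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch + 1}, loss {loss.item():.4f}")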
Play around: Try tuning EPOCHS and the learning rate to see how they change the training.

You can test the model after training using the code below:

model.eval()
with torch.no_grad():
    test_src = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    test_tgt = torch.zeros((1, SEQ_LEN), dtype=torch.long)

    for i in range(SEQ_LEN):
        tgt_mask = torch.tril(torch.ones(i+1, i+1)).unsqueeze(0).unsqueeze(0)
        output = model(test_src, test_tgt[:, :i+1], tgt_mask=tgt_mask)
        next_token = output[:, -1, :].argmax(dim=-1)
        if i < SEQ_LEN - 1:
            test_tgt[:, i+1] = next_token

    print("Input: ", test_src)
    print("Output:", test_tgt)

Task 9: Optimization

After doing this experiment, you should notice that the model isn’t learning to copy correctly and the output looks random. We are going to introduce delimiters: a start token and an end token.

Problem without the start token:

Source:  [1, 2, 3, 4, 5]
Target:  [1, 2, 3, 4, 5]  ← What should the FIRST prediction be?

Decoder sees: [] (nothing!)
Should predict: 1

But how does it know to output “1” when it hasn’t seen ANY context yet?

The model has no anchor point; it is randomly guessing the first token every time.

After we introduce start token:

Source:          [1, 2, 3, 4, 5]
Decoder Input:   [0, 1, 2, 3, 4]  ← Starts with 0
Target Output:   [1, 2, 3, 4, 5]  ← Shifted by one

Position 0: Sees [0]       → predicts 1
Position 1: Sees [0, 1]    → predicts 2
Position 2: Sees [0, 1, 2] → predicts 3

Now the decoder always has context to work with.

Complete the code:

# Hyperparameters
VOCAB_SIZE = 20
D_MODEL = 64
NUM_HEADS = 4
NUM_LAYERS = 2
D_FF = 128
SEQ_LEN = 10
BATCH_SIZE = 32
EPOCHS = 1000

def generate_batch(batch_size, seq_len, vocab_size):
    src = torch.randint(1, vocab_size, (batch_size, seq_len))
    # Decoder input: prepend the start token 0 -> [0, src...]
    tgt_input = torch.cat([torch.zeros(batch_size, 1, dtype=torch.long), src], dim=1)
    # Target output: append the end token 0 -> [src..., 0]
    tgt_output = torch.cat([src, torch.zeros(batch_size, 1, dtype=torch.long)], dim=1)
    return src, tgt_input, tgt_output

model = Transformer(VOCAB_SIZE, VOCAB_SIZE, D_MODEL, NUM_HEADS, NUM_LAYERS, D_FF)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# TODO: Write the training loop
# For each epoch:
#   1. Generate a batch
#   2. Forward pass (feed src and tgt_input as decoder input)
#   3. Compare output to tgt_output (shifted by one - next token prediction)
#   4. Backprop and update
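
The loop is almost identical to Task 8; only the decoder input and the target now come straight from generate_batch (again a sketch under the same assumptions as before):

for epoch in range(EPOCHS):
    src, tgt_input, tgt_output = generate_batch(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)

    seq = tgt_input.size(1)
    tgt_mask = torch.tril(torch.ones(seq, seq))  # causal mask for the decoder

    output = model(src, tgt_input, tgt_mask=tgt_mask)  # (batch, seq_len + 1, vocab)
    loss = criterion(output.reshape(-1, VOCAB_SIZE), tgt_output.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1}, loss {loss.item():.4f}")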
You can test the model after training with the code below:
# Post Training
model.eval()
with torch.no_grad():
    test_src = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    output_seq = [0]  # ← Start with explicit start token

    for i in range(SEQ_LEN):
        tgt_input = torch.tensor([output_seq])  # ← Build sequence progressively
        tgt_mask = torch.tril(torch.ones(len(output_seq), len(output_seq)))
        output = model(test_src, tgt_input, tgt_mask=tgt_mask)
        next_token = output[:, -1, :].argmax().item()  # ← Get single value
        output_seq.append(next_token)  # ← Append to list

    print("Input: ", test_src)
    print("Output:", output_seq[1:])  # ← Skip start token [0]

Conclusion

Congratulations, you have built a functioning transformer model from scratch. You can refer to the complete code in this Colab: Link to code
