Complete Tutorial: Building the Full Transformer Model from Scratch in PyTorch

Tech3Space | 08 Apr 2026

Learn how to implement every component of the Transformer architecture — the model that powers modern AI like ChatGPT, BERT, and Google Translate. This step-by-step guide explains the theory behind each part and provides clean, runnable PyTorch code.

Keywords: Transformer from scratch, PyTorch Transformer tutorial, Attention is All You Need, self-attention mechanism, multi-head attention, positional encoding, encoder decoder Transformer.

Published: April 2026 | Reading time: 25 minutes | Difficulty: Intermediate (Python + PyTorch basics recommended)

Introduction to the Transformer Architecture

The Transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It revolutionized NLP (and later vision, audio, etc.) by replacing recurrent networks (RNNs/LSTMs) with a fully attention-based design.

Why Transformers Beat RNNs

Parallelization: Process entire sequences at once (no sequential bottlenecks).
Long-range dependencies: Attention captures relationships between distant tokens easily.
Scalability: Because every position is computed in parallel, training large models on massive datasets becomes practical on modern accelerators.

The original Transformer uses an encoder-decoder structure:

Encoder: Processes input sequence (e.g., English sentence).
Decoder: Generates output sequence (e.g., French translation), attending to both its own previous outputs and the encoder's output.

Each consists of stacked identical layers with self-attention, feed-forward networks, residual connections, and layer normalization.

In this tutorial, we'll build:

Input Embeddings
Positional Encoding
Scaled Dot-Product Attention
Multi-Head Attention
Feed-Forward Network
Encoder Layer
Decoder Layer
Full Transformer Model

We'll use PyTorch for implementation.

1. Input Embeddings

Words/tokens must be converted to dense vectors. We use nn.Embedding for this.

import torch
import torch.nn as nn
import math

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
    
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        # Output shape: (batch_size, seq_len, d_model); scaled by sqrt(d_model) as in the paper
        return self.embedding(x) * math.sqrt(self.d_model)

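Explanation: The paper multiplies the embeddings by √d_model. A common reading is that this keeps the embedding values from being swamped by the positional encodings, whose entries lie in [-1, 1], once the two are added.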
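
To see what the scaling does numerically, here is a minimal sketch. The sizes are illustrative assumptions, and the snippet uses a bare nn.Embedding rather than the InputEmbeddings class so that it runs on its own.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 512, 1000          # illustrative sizes, not from the tutorial
embedding = nn.Embedding(vocab_size, d_model)

tokens = torch.randint(0, vocab_size, (2, 5))   # (batch_size, seq_len)
raw = embedding(tokens)                         # (batch_size, seq_len, d_model)
scaled = raw * math.sqrt(d_model)

raw_std = raw.std().item()
scaled_std = scaled.std().item()
print(scaled.shape)
print(f"std before scaling: {raw_std:.3f}, after: {scaled_std:.3f}")
```

The spread of the values grows by exactly a factor of √d_model, while the positional encodings added afterwards stay within [-1, 1].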

2. Positional Encoding

Transformers have no recurrence or convolution, so they need explicit position information.

The original paper uses sine and cosine functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        
        pe = pe.unsqueeze(0)  # (1, max_seq_length, d_model)
        self.register_buffer('pe', pe)  # Not a parameter, but saved with model
    
    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

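Why sine/cosine? For any fixed offset k, PE(pos + k) is a linear function of PE(pos): each sin/cos pair is simply rotated by an angle that depends only on k. The paper's authors hypothesized that this makes it easy for the model to attend by relative position.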
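
The linear relationship between positions can be checked numerically. The sketch below rebuilds the encoding table with the same formula as the class above (small illustrative sizes, float64 for a tight comparison) and constructs a hypothetical helper, shift_matrix, from the angle-addition identities.

```python
import math
import torch

d_model, max_len = 16, 100               # small illustrative sizes
position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model)
)
pe = torch.zeros(max_len, d_model, dtype=torch.float64)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

def shift_matrix(k):
    """Block-diagonal matrix M_k such that pe[pos + k] == M_k @ pe[pos] for every pos."""
    m = torch.zeros(d_model, d_model, dtype=torch.float64)
    for i, w in enumerate(div_term.tolist()):
        c, s = math.cos(w * k), math.sin(w * k)
        # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
        m[2 * i, 2 * i], m[2 * i, 2 * i + 1] = c, s
        # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        m[2 * i + 1, 2 * i], m[2 * i + 1, 2 * i + 1] = -s, c
    return m

k = 7
shifted = pe[:-k] @ shift_matrix(k).T    # apply M_k to every row at once
max_err = (shifted - pe[k:]).abs().max().item()
print(f"max abs error: {max_err:.2e}")
```

The matrix depends only on the offset k, not on the absolute position, which is the property the explanation above refers to.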

3. Scaled Dot-Product Attention

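The core of the Transformer. It scores each query (Q) against every key (K), turns the scores into weights with a softmax, and uses those weights to take a weighted sum of the values (V). Dividing by √d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.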

Formula:
Attention(Q, K, V) = softmax( (Q K^T) / √d_k ) V

def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    d_k = query.size(-1)
    
    # Compute scores: (batch, heads, seq_len_q, seq_len_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        # A large finite negative (rather than -inf) avoids NaN when a row is fully masked
        scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    
    attn = torch.softmax(scores, dim=-1)
    
    if dropout is not None:
        attn = dropout(attn)
    
    output = torch.matmul(attn, value)
    return output, attn

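Masking: Positions where mask == 0 are excluded from the softmax. The decoder uses a causal mask so that each position attends only to itself and earlier tokens (autoregressive generation), and padding masks hide pad tokens in both the encoder and the decoder.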
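
A small sketch of what a causal mask does to the attention weights. The tensors are random and the sizes are illustrative; the attention math is written inline so the snippet runs on its own. Using -inf is safe here because every row keeps at least its diagonal entry.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 4, 8                      # illustrative sizes
q = torch.randn(1, 1, seq_len, d_k)      # (batch, heads, seq_len, d_k)
k = torch.randn(1, 1, seq_len, d_k)

# 1 = may attend, 0 = blocked (same convention as the mask == 0 test above)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)

print(weights[0, 0])   # upper triangle is zero; each row still sums to 1
```

Row i of the result holds the weights that position i places on positions 0..i; everything to the right of the diagonal is zero.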

4. Multi-Head Attention

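Instead of a single attention function, we project Q, K, and V into num_heads lower-dimensional subspaces (d_k = d_model / num_heads each), run attention in all of them in parallel ("heads"), and concatenate the results.

Each head can learn to track a different kind of relationship between tokens.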

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(p=dropout)
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Linear projections and reshape to (batch, heads, seq_len, d_k)
        Q = self.w_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask, self.dropout)
        
        # Concatenate heads and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        return self.w_o(attn_output)
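
The view/transpose calls are where most shape bugs hide, so here is a minimal sketch of just the split-and-merge step, with illustrative sizes and no learned projections.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4   # illustrative sizes
d_k = d_model // num_heads
x = torch.randn(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
heads = x.view(batch_size, -1, num_heads, d_k).transpose(1, 2)

# Merge: (batch, heads, seq_len, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, d_model)

# Head h holds features h*d_k .. (h+1)*d_k - 1 of every token
same_slice = torch.equal(heads[:, 1], x[:, :, d_k:2 * d_k])
round_trip = torch.equal(merged, x)
print(heads.shape, same_slice, round_trip)
```

Splitting and merging are exact inverses, so the only thing that changes the data between them is the attention computation itself.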

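The full Transformer uses attention in three ways:

Encoder self-attention: Q, K, and V all come from the encoder input.
Decoder masked self-attention: Q, K, and V all come from the target sequence, with a causal mask.
Encoder-decoder (cross) attention: Q comes from the decoder; K and V come from the encoder output.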
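
In encoder-decoder attention the output length follows the query (decoder) side, while the weights span the key (encoder) side. The sketch below illustrates this with PyTorch's built-in nn.MultiheadAttention as a stand-in for the class above, purely so the snippet is self-contained; the sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 32, 4               # illustrative sizes
batch_size, src_len, tgt_len = 2, 7, 5

# Stand-in module (not the tutorial's MultiHeadAttention class)
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

enc_output = torch.randn(batch_size, src_len, d_model)   # keys and values
dec_state = torch.randn(batch_size, tgt_len, d_model)    # queries

out, weights = cross_attn(query=dec_state, key=enc_output, value=enc_output)
print(out.shape)       # one output vector per decoder position
print(weights.shape)   # one weight per (decoder position, encoder position) pair
```

By default the built-in module returns weights averaged over heads, which is why the weight tensor has no head dimension here.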

5. Position-wise Feed-Forward Network

Simple two-layer MLP applied independently to each position.

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p=dropout)
        self.activation = nn.ReLU()
    
    def forward(self, x):
        return self.linear2(self.dropout(self.activation(self.linear1(x))))
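
"Applied independently to each position" can be verified directly. This sketch uses a plain nn.Sequential stand-in without dropout (an assumption made to keep the comparison deterministic) and checks that running the network on the whole sequence matches running it one position at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32                    # illustrative sizes
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)           # (batch, seq_len, d_model)
with torch.no_grad():
    full = ffn(x)                        # whole sequence at once
    per_position = torch.stack(
        [ffn(x[:, t]) for t in range(x.size(1))], dim=1
    )                                    # one position at a time

matches = torch.allclose(full, per_position, atol=1e-5)
print(matches)
```

No information moves between positions in this sublayer; mixing across the sequence happens only in attention.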

6. Encoder Layer

Each encoder layer has:

Multi-head self-attention
Feed-forward
Residual connections + LayerNorm around each

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)
    
    def forward(self, x, src_mask=None):
        # Self-attention sublayer
        attn_output = self.self_attn(x, x, x, src_mask)
        x = x + self.dropout(attn_output)
        x = self.norm1(x)
        
        # Feed-forward sublayer
        ff_output = self.feed_forward(x)
        x = x + self.dropout(ff_output)
        x = self.norm2(x)
        
        return x

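Residual connection: x + sublayer(x) gives gradients a direct path through deep stacks. LayerNorm is applied after the addition, which is the post-LN arrangement used in the original paper.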
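
A minimal sketch of one add-and-norm step, assuming a single nn.Linear as a hypothetical stand-in for the attention or feed-forward sublayer. It shows that, at initialization, LayerNorm leaves every token vector with roughly zero mean and unit variance across its features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16                             # illustrative size
sublayer = nn.Linear(d_model, d_model)   # hypothetical stand-in for a sublayer
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)
with torch.no_grad():
    out = norm(x + sublayer(x))          # add the residual, then normalize

mean_per_token = out.mean(dim=-1)
var_per_token = out.var(dim=-1, unbiased=False)
print(mean_per_token.abs().max().item(), var_per_token.mean().item())
```

Once training starts, the learnable scale and shift inside LayerNorm can move the statistics away from these values.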

7. Decoder Layer

Decoder layer has three sublayers:

Masked self-attention
Encoder-decoder attention
Feed-forward

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)
    
    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention
        self_attn_output = self.self_attn(x, x, x, tgt_mask)
        x = x + self.dropout(self_attn_output)
        x = self.norm1(x)
        
        # Cross-attention (attend to encoder output)
        cross_attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = x + self.dropout(cross_attn_output)
        x = self.norm2(x)
        
        # Feed-forward
        ff_output = self.feed_forward(x)
        x = x + self.dropout(ff_output)
        x = self.norm3(x)
        
        return x

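tgt_mask: Prevents the decoder from attending to future tokens, so teacher forcing during training cannot leak the answer. src_mask hides padding positions of the encoder output during cross-attention.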
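
One way to convince yourself the causal mask works is to change a future token and confirm that earlier outputs do not move. The sketch below uses a stripped-down single-head self-attention (no projections, a simplification for illustration) rather than the full DecoderLayer.

```python
import math
import torch

torch.manual_seed(0)

def causal_self_attention(x):
    """Single-head self-attention with a lower-triangular (causal) mask."""
    seq_len, d_k = x.size(-2), x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d_k)
    keep = torch.tril(torch.ones(seq_len, seq_len)).bool()
    scores = scores.masked_fill(~keep, float('-inf'))
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(1, 5, 8)                 # (batch, seq_len, features)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(8)        # alter only the last token

out = causal_self_attention(x)
out_changed = causal_self_attention(x_changed)

prefix_equal = torch.allclose(out[:, :-1], out_changed[:, :-1])
last_differs = not torch.allclose(out[:, -1], out_changed[:, -1])
print(prefix_equal, last_differs)
```

Only the final position reacts to the change, which is what lets the decoder train on all positions of a sequence in one pass.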

8. Full Transformer Model

Putting it all together.

class Transformer(nn.Module):
    def __init__(self, src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512,
                 num_heads: int = 8, num_layers: int = 6, d_ff: int = 2048,
                 max_seq_length: int = 5000, dropout: float = 0.1):
        super().__init__()
        
        self.src_embedding = InputEmbeddings(d_model, src_vocab_size)
        self.tgt_embedding = InputEmbeddings(d_model, tgt_vocab_size)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length, dropout)
        
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])
        
        # Separate final norms so the encoder and decoder stacks do not share LayerNorm parameters
        self.encoder_norm = nn.LayerNorm(d_model)
        self.decoder_norm = nn.LayerNorm(d_model)
        self.final_linear = nn.Linear(d_model, tgt_vocab_size)
    
    def encode(self, src, src_mask):
        src = self.src_embedding(src)
        src = self.pos_encoding(src)
        for layer in self.encoder_layers:
            src = layer(src, src_mask)
        return self.encoder_norm(src)
    
    def decode(self, tgt, enc_output, src_mask, tgt_mask):
        tgt = self.tgt_embedding(tgt)
        tgt = self.pos_encoding(tgt)
        for layer in self.decoder_layers:
            tgt = layer(tgt, enc_output, src_mask, tgt_mask)
        return self.decoder_norm(tgt)
    
    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_output = self.encode(src, src_mask)
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        return self.final_linear(dec_output)

Helper Functions: Mask Creation

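def create_src_mask(src):
    # src: (batch, seq_len); token id 0 is assumed to be padding
    # Result: (batch, 1, 1, seq_len), broadcast over heads and query positions
    return (src != 0).unsqueeze(1).unsqueeze(2)

def create_tgt_mask(tgt):
    # Causal mask combined with a padding mask (token id 0 assumed to be padding)
    seq_len = tgt.size(1)
    causal_mask = torch.tril(torch.ones((seq_len, seq_len))).bool().to(tgt.device)  # (seq_len, seq_len)
    padding_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, seq_len)
    return causal_mask & padding_mask  # (batch, 1, seq_len, seq_len)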
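
To make the broadcasting concrete, here is a tiny hand-checkable case: one target sequence of four ids whose last id is padding. The mask expressions are repeated inline so the snippet runs on its own; the token ids are made up for illustration.

```python
import torch

tgt = torch.tensor([[5, 7, 9, 0]])       # one sequence; the last id (0) is padding
seq_len = tgt.size(1)

causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # (seq_len, seq_len)
padding_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)             # (batch, 1, 1, seq_len)
tgt_mask = causal_mask & padding_mask                           # (batch, 1, seq_len, seq_len)

print(tgt_mask.shape)
print(tgt_mask[0, 0].int())
```

The padding mask blanks out the last column, so the final row can see positions 0 to 2 but not the pad token; the other rows show the usual causal staircase.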

How to Use the Model (Example)

# Hyperparameters (base model from paper)
d_model = 512
num_heads = 8
num_layers = 6

model = Transformer(src_vocab_size=10000, tgt_vocab_size=10000,
                    d_model=d_model, num_heads=num_heads, num_layers=num_layers)

# Dummy data (ids start at 1 because the mask helpers treat 0 as padding)
src = torch.randint(1, 10000, (32, 50))  # batch=32, seq_len=50
tgt = torch.randint(1, 10000, (32, 49))  # decoder input, one token shorter (shifted for teacher forcing)

src_mask = create_src_mask(src)
tgt_mask = create_tgt_mask(tgt)

output = model(src, tgt, src_mask, tgt_mask)
print(output.shape)  # (32, 49, 10000)

Training tip: Use teacher forcing during training (feed ground-truth previous tokens to decoder). For inference, use autoregressive generation with greedy or beam search.
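
Greedy autoregressive generation can be sketched as the loop below. The names greedy_decode and toy_logits are hypothetical, and toy_logits is a stand-in for a trained model so the snippet runs by itself. In real use, the callable would wrap model.encode (called once) and model.decode (called per step), returning the logits of the last position.

```python
import torch

def greedy_decode(next_token_logits, bos_id, eos_id, max_len):
    """Generate one token at a time, always taking the highest-scoring next token."""
    tokens = torch.tensor([[bos_id]])                    # (batch=1, current_len)
    for _ in range(max_len - 1):
        logits = next_token_logits(tokens)               # (1, vocab_size)
        next_id = logits.argmax(dim=-1, keepdim=True)    # (1, 1)
        tokens = torch.cat([tokens, next_id], dim=1)     # append and feed back in
        if next_id.item() == eos_id:
            break
    return tokens

def toy_logits(tokens, vocab_size=10):
    # Hypothetical stand-in for a model: favors (last token + 1) mod vocab_size
    logits = torch.zeros(tokens.size(0), vocab_size)
    logits[torch.arange(tokens.size(0)), (tokens[:, -1] + 1) % vocab_size] = 1.0
    return logits

generated = greedy_decode(toy_logits, bos_id=1, eos_id=5, max_len=20)
print(generated)
```

Beam search follows the same loop but keeps several candidate sequences per step instead of a single argmax.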

Next Steps & Improvements

Add training loop with cross-entropy loss (ignore padding).
Use learning rate scheduler (warmup + decay) as in the paper.
Experiment with decoder-only (like GPT) or encoder-only (like BERT) variants.
Add label smoothing for better generalization.
Scale up: More layers, larger d_model, bigger datasets.
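
Two of the items above fit in a few lines. This is a sketch under assumptions: padding id 0, a smoothing value of 0.1, and a hypothetical helper name warmup_lr. The schedule follows the paper's formula lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

```python
import torch
import torch.nn as nn

def warmup_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay proportional to step ** -0.5."""
    step = max(step, 1)                  # avoid 0 ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Cross-entropy that skips padding targets (id 0 assumed) and smooths the labels
criterion = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)

torch.manual_seed(0)
vocab_size = 20                                          # illustrative size
logits = torch.randn(2, 4, vocab_size)                   # (batch, tgt_len, vocab)
targets = torch.tensor([[3, 8, 2, 0], [5, 1, 0, 0]])     # 0 = padding

loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())
print([round(warmup_lr(s), 6) for s in (1, 1000, 4000, 16000)])
```

The learning rate peaks at step == warmup_steps. In a training loop you would set the optimizer's learning rate from warmup_lr(step) each step, for example through torch.optim.lr_scheduler.LambdaLR with a base learning rate of 1.0.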

Common Pitfalls & Debugging Tips

Shape mismatches in attention (use .transpose() carefully).
Forgetting to scale embeddings or apply masks.
Gradient vanishing/exploding → LayerNorm and residuals help.
High memory usage → Start with small batch size and seq_len.

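This implementation closely follows the original "Attention Is All You Need" paper (post-LN layers, sinusoidal positions, scaled embeddings) while staying clean and educational. The extra LayerNorm applied after the encoder and decoder stacks is an addition that the paper does not describe.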

Full code repository style: You can combine all classes into one file and train on a translation dataset (e.g., IWSLT or WMT via Hugging Face Datasets).

Want to extend this to a decoder-only model for text generation? Or add RoPE (Rotary Positional Embeddings) for better length generalization? Let me know in the comments!

References & Further Reading

Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
PyTorch nn.Transformer (for comparison)
Illustrated Transformer (Jay Alammar's blog — highly recommended visuals)

Share this tutorial if it helped you understand and code the Transformer from scratch!