Complete Tutorial: Building the Full Transformer Model from Scratch in PyTorch
Learn how to implement every component of the Transformer architecture — the model that powers modern AI like ChatGPT, BERT, and Google Translate. This step-by-step guide explains the theory behind each part and provides clean, runnable PyTorch code.
Keywords: Transformer from scratch, PyTorch Transformer tutorial, Attention is All You Need, self-attention mechanism, multi-head attention, positional encoding, encoder decoder Transformer.
Published: April 2026 | Reading time: 25 minutes | Difficulty: Intermediate (Python + PyTorch basics recommended)
Introduction to the Transformer Architecture
The Transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It revolutionized NLP (and later vision, audio, etc.) by replacing recurrent networks (RNNs/LSTMs) with a fully attention-based design.
Why Transformers Beat RNNs
- Parallelization: Process entire sequences at once (no sequential bottlenecks).
- Long-range dependencies: Attention captures relationships between distant tokens easily.
- Scalability: Easier to train on massive datasets.
The original Transformer uses an encoder-decoder structure:
- Encoder: Processes input sequence (e.g., English sentence).
- Decoder: Generates output sequence (e.g., French translation), attending to both its own previous outputs and the encoder's output.
Each consists of stacked identical layers with self-attention, feed-forward networks, residual connections, and layer normalization.
In this tutorial, we'll build:
- Input Embeddings
- Positional Encoding
- Scaled Dot-Product Attention
- Multi-Head Attention
- Feed-Forward Network
- Encoder Layer
- Decoder Layer
- Full Transformer Model
We'll use PyTorch for implementation.
1. Input Embeddings
Words/tokens must be converted to dense vectors. We use nn.Embedding for this.
import math

import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_len) -> output: (batch_size, seq_len, d_model)
        return self.embedding(x) * math.sqrt(self.d_model)  # scale by sqrt(d_model), as in the paper
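Explanation: The paper multiplies the embeddings by √d_model. A common reading is that this keeps the embedding values from being dwarfed by the positional encodings, which always lie in [-1, 1], once the two are added together.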
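To make the shapes and the scaling concrete, here is a minimal sketch. The vocabulary size, batch size, and sequence length are arbitrary values chosen for illustration, not settings from the paper.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 512, 1000          # illustrative sizes (assumption)

embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, 5))   # (batch_size=2, seq_len=5)

raw = embedding(tokens)                  # (2, 5, 512)
scaled = raw * math.sqrt(d_model)        # same shape, values larger by sqrt(512)

print(raw.shape, scaled.shape)
print(raw.std().item(), scaled.std().item())
```

The scaled tensor has the same shape as the raw lookup; only its magnitude changes, by a factor of √512 ≈ 22.6.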
2. Positional Encoding
Transformers have no recurrence or convolution, so they need explicit position information.
The original paper uses sine and cosine functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # (1, max_seq_length, d_model)
        self.register_buffer('pe', pe)  # Not a parameter, but saved with model

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
Why sine/cosine? For any fixed offset k, PE(pos + k) is a linear function (a rotation) of PE(pos), so the model can learn to attend by relative position through linear transformations.
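The linear-transformation property can be checked numerically. The sketch below rebuilds a small encoding table with the same formula and verifies that the encoding at pos + k follows from the encoding at pos via the angle-addition identities. The sizes and the offset k = 3 are arbitrary choices for illustration.

```python
import math

import torch

d_model, max_len, k = 8, 50, 3           # small illustrative sizes (assumption)

position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Rotate each (sin, cos) pair by the angle k * frequency
sin_k, cos_k = torch.sin(k * div_term), torch.cos(k * div_term)
shifted = torch.zeros(max_len - k, d_model)
shifted[:, 0::2] = pe[:-k, 0::2] * cos_k + pe[:-k, 1::2] * sin_k  # sin(a+b)
shifted[:, 1::2] = pe[:-k, 1::2] * cos_k - pe[:-k, 0::2] * sin_k  # cos(a+b)

max_error = (shifted - pe[k:]).abs().max().item()
print(max_error)  # small; only float32 rounding error remains
```

The rotation depends only on k, not on pos, which is what makes a relative offset expressible as one fixed linear map.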
3. Scaled Dot-Product Attention
The core of the Transformer. Each query (Q) is compared with every key (K) to produce attention weights, and those weights are then used to take a weighted average of the values (V).
Formula:
Attention(Q, K, V) = softmax( (Q K^T) / √d_k ) V
def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    d_k = query.size(-1)

    # Compute scores: (batch, heads, seq_len_q, seq_len_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    attn = torch.softmax(scores, dim=-1)

    if dropout is not None:
        attn = dropout(attn)

    output = torch.matmul(attn, value)
    return output, attn
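Masking: Positions where the mask is 0 get a score of -inf, so softmax assigns them zero weight. The decoder uses a causal mask to prevent attending to future tokens during autoregressive generation; padding masks hide pad tokens in both the encoder and the decoder.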
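A tiny worked example may help here. The sketch below runs the same formula on random tensors with a causal mask; the sizes and seed are arbitrary, and the computation is inlined so the snippet runs on its own.

```python
import math

import torch

torch.manual_seed(0)
seq_len, d_k = 4, 8                      # illustrative sizes (assumption)
q = torch.randn(1, 1, seq_len, d_k)      # (batch, heads, seq_len, d_k)
k = torch.randn(1, 1, seq_len, d_k)
v = torch.randn(1, 1, seq_len, d_k)

causal = torch.tril(torch.ones(seq_len, seq_len)).bool()  # True = may attend

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(~causal, float('-inf'))
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ v

print(weights[0, 0])  # upper triangle is exactly zero
```

Because the first position can only attend to itself, its output row equals the first value vector.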
4. Multi-Head Attention
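Instead of a single attention operation, we run several in parallel ("heads"), each on a slice of size d_model / num_heads of the projected vectors, then concatenate the results and project them back to d_model.
Each head can learn different relationships.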
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape to (batch, heads, seq_len, d_k)
        Q = self.w_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask, self.dropout)

        # Concatenate heads and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.w_o(attn_output)
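Attention is used in three ways in the full Transformer:
- Encoder self-attention (Q, K, and V all come from the encoder input)
- Decoder masked self-attention (Q, K, and V all come from the target sequence, with a causal mask)
- Encoder-decoder (cross) attention (Q from the decoder, K/V from the encoder output)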
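The head split and merge in the forward pass are pure reshapes, which can be checked in isolation. This sketch uses small, arbitrary sizes and skips the linear projections so that only the view/transpose logic is exercised.

```python
import torch

batch, seq_len, d_model, num_heads = 2, 5, 16, 4   # illustrative sizes (assumption)
d_k = d_model // num_heads

x = torch.randn(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

# Merge: (batch, heads, seq_len, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

print(heads.shape, merged.shape)
print(torch.equal(merged, x))  # the round trip restores the input
```

Head h simply sees features h * d_k to (h + 1) * d_k of each position, so the split costs no extra parameters.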
5. Position-wise Feed-Forward Network
Simple two-layer MLP applied independently to each position.
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p=dropout)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.dropout(self.activation(self.linear1(x))))
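"Applied independently to each position" can be verified directly. The sketch below uses a small stand-in MLP without dropout (so results are deterministic) and compares one pass over the whole sequence with separate passes per position. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for the feed-forward block, without dropout (assumption for determinism)
ffn = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

x = torch.randn(2, 5, 8)                 # (batch, seq_len, d_model)

full = ffn(x)                            # all positions at once
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-6))
```

Positions never interact inside this block; mixing information across positions is left entirely to attention.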
6. Encoder Layer
Each encoder layer has:
- Multi-head self-attention
- Feed-forward
- Residual connections + LayerNorm around each
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, src_mask=None):
        # Self-attention sublayer
        attn_output = self.self_attn(x, x, x, src_mask)
        x = x + self.dropout(attn_output)
        x = self.norm1(x)

        # Feed-forward sublayer
        ff_output = self.feed_forward(x)
        x = x + self.dropout(ff_output)
        x = self.norm2(x)
        return x
Residual connection: x + sublayer(x) helps with gradient flow in deep networks.
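One way to see the gradient-flow benefit is an extreme case where the sublayer passes no signal at all. In this sketch a linear layer with zeroed weights stands in for the sublayer (an assumption for illustration); without the skip connection no gradient reaches the input, while with it the gradient passes through unchanged.

```python
import torch
import torch.nn as nn

d_model = 16
sublayer = nn.Linear(d_model, d_model)   # hypothetical stand-in for attention or FFN
nn.init.zeros_(sublayer.weight)          # extreme case: the sublayer blocks all signal
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 5, d_model, requires_grad=True)

# Without the residual: gradient w.r.t. x is zero everywhere
grad_plain = torch.autograd.grad(sublayer(x).sum(), x)[0]

# With the residual: the identity path carries the gradient
grad_residual = torch.autograd.grad((x + sublayer(x)).sum(), x)[0]

print(grad_plain.abs().max().item(), grad_residual.min().item())
```

Real sublayers are not zero, but the same identity path is what keeps gradients from shrinking as layers are stacked.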
7. Decoder Layer
Decoder layer has three sublayers:
- Masked self-attention
- Encoder-decoder attention
- Feed-forward
class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention
        self_attn_output = self.self_attn(x, x, x, tgt_mask)
        x = x + self.dropout(self_attn_output)
        x = self.norm1(x)

        # Cross-attention (attend to encoder output)
        cross_attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = x + self.dropout(cross_attn_output)
        x = self.norm2(x)

        # Feed-forward
        ff_output = self.feed_forward(x)
        x = x + self.dropout(ff_output)
        x = self.norm3(x)
        return x
tgt_mask: Prevents each decoder position from attending to later positions, so during training the model cannot peek at the tokens it is supposed to predict.
8. Full Transformer Model
Putting it all together.
class Transformer(nn.Module):
    def __init__(self, src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512,
                 num_heads: int = 8, num_layers: int = 6, d_ff: int = 2048,
                 max_seq_length: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.src_embedding = InputEmbeddings(d_model, src_vocab_size)
        self.tgt_embedding = InputEmbeddings(d_model, tgt_vocab_size)
        # No learned parameters, so one instance can serve both source and target
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length, dropout)

        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)
        ])

        # Separate final norms so the encoder and decoder do not share parameters
        self.encoder_norm = nn.LayerNorm(d_model)
        self.decoder_norm = nn.LayerNorm(d_model)
        self.final_linear = nn.Linear(d_model, tgt_vocab_size)

    def encode(self, src, src_mask):
        src = self.src_embedding(src)
        src = self.pos_encoding(src)
        for layer in self.encoder_layers:
            src = layer(src, src_mask)
        return self.encoder_norm(src)

    def decode(self, tgt, enc_output, src_mask, tgt_mask):
        tgt = self.tgt_embedding(tgt)
        tgt = self.pos_encoding(tgt)
        for layer in self.decoder_layers:
            tgt = layer(tgt, enc_output, src_mask, tgt_mask)
        return self.decoder_norm(tgt)

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_output = self.encode(src, src_mask)
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        return self.final_linear(dec_output)  # raw logits: (batch, tgt_len, tgt_vocab_size)
Helper Functions: Mask Creation
def create_src_mask(src):
    # src: (batch, seq_len) -> mask: (batch, 1, 1, seq_len)
    return (src != 0).unsqueeze(1).unsqueeze(2)  # padding mask example (token id 0 = padding)


def create_tgt_mask(tgt):
    # Causal mask + padding mask -> (batch, 1, seq_len, seq_len)
    seq_len = tgt.size(1)
    causal_mask = torch.tril(torch.ones((seq_len, seq_len))).bool().to(tgt.device)
    padding_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)
    return causal_mask & padding_mask
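To see what the combined target mask looks like, here is a small sketch on a single four-token sequence whose last token is padding. The token ids are made up, and the mask logic is inlined so the snippet runs on its own.

```python
import torch

tgt = torch.tensor([[5, 7, 9, 0]])       # made-up ids; 0 is the padding id

seq_len = tgt.size(1)
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()   # (seq_len, seq_len)
padding = (tgt != 0).unsqueeze(1).unsqueeze(2)             # (batch, 1, 1, seq_len)
tgt_mask = causal & padding                                # broadcasts to (batch, 1, seq_len, seq_len)

print(tgt_mask.shape)
print(tgt_mask[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0]], dtype=torch.int32)
```

Row i lists the keys that query position i may attend to. The last column is all zeros because the padded token is hidden from every query.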
How to Use the Model (Example)
# Hyperparameters (base model from paper)
d_model = 512
num_heads = 8
num_layers = 6

model = Transformer(src_vocab_size=10000, tgt_vocab_size=10000,
                    d_model=d_model, num_heads=num_heads, num_layers=num_layers)

# Dummy data (ids start at 1 because the mask helpers treat 0 as padding)
src = torch.randint(1, 10000, (32, 50))  # batch=32, seq_len=50
tgt = torch.randint(1, 10000, (32, 49))  # shifted for teacher forcing

src_mask = create_src_mask(src)
tgt_mask = create_tgt_mask(tgt)

output = model(src, tgt, src_mask, tgt_mask)
print(output.shape)  # torch.Size([32, 49, 10000])
Training tip: Use teacher forcing during training (feed ground-truth previous tokens to decoder). For inference, use autoregressive generation with greedy or beam search.
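Greedy autoregressive generation could be sketched as follows. The function only assumes that the model exposes encode, decode, and final_linear like the class above. The names greedy_decode, start_id, end_id, and the TinyStub class are hypothetical and exist only so the snippet runs on its own.

```python
import torch
import torch.nn as nn


def greedy_decode(model, src, src_mask, max_len, start_id, end_id):
    """Generate ids step by step, always picking the most likely next token."""
    model.eval()
    with torch.no_grad():
        enc_output = model.encode(src, src_mask)
        ys = torch.full((src.size(0), 1), start_id, dtype=torch.long, device=src.device)
        for _ in range(max_len - 1):
            seq_len = ys.size(1)
            tgt_mask = torch.tril(torch.ones(seq_len, seq_len, device=src.device)).bool()
            dec_output = model.decode(ys, enc_output, src_mask, tgt_mask)
            logits = model.final_linear(dec_output[:, -1])   # only the last position matters
            next_token = logits.argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_token], dim=1)
            if (next_token == end_id).all():                 # stop once every sequence has ended
                break
    return ys


class TinyStub(nn.Module):
    """Hypothetical stand-in with the same interface, used only to exercise the loop."""
    def __init__(self, vocab_size=12, d_model=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.final_linear = nn.Linear(d_model, vocab_size)

    def encode(self, src, src_mask):
        return self.emb(src)

    def decode(self, tgt, enc_output, src_mask, tgt_mask):
        return self.emb(tgt) + enc_output.mean(dim=1, keepdim=True)


torch.manual_seed(0)
stub = TinyStub()
src = torch.randint(1, 12, (2, 5))
out = greedy_decode(stub, src, None, max_len=6, start_id=1, end_id=2)
print(out.shape)
```

This version re-runs the decoder on the whole prefix at every step, which is simple but wasteful; caching keys and values is a common optimization that is left out here.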
Next Steps & Improvements
- Add training loop with cross-entropy loss (ignore padding).
- Use learning rate scheduler (warmup + decay) as in the paper.
- Experiment with decoder-only (like GPT) or encoder-only (like BERT) variants.
- Add label smoothing for better generalization.
- Scale up: More layers, larger d_model, bigger datasets.
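The loss and scheduler items above might look like the sketch below. It assumes padding id 0 (matching the mask helpers), the paper's warmup of 4000 steps and Adam settings, and a placeholder nn.Linear in place of the real model. The helper name noam_lr is made up for this example.

```python
import torch
import torch.nn as nn


def noam_lr(step, d_model=512, warmup_steps=4000):
    """Warmup-then-decay factor: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)                  # LambdaLR starts at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


model = nn.Linear(512, 100)              # placeholder for the Transformer (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# With base lr=1.0, the effective learning rate equals noam_lr(step)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

# Cross-entropy that skips padding positions and applies label smoothing
criterion = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)

logits = torch.randn(2, 4, 100)                       # (batch, tgt_len, vocab)
targets = torch.tensor([[5, 9, 3, 0], [7, 2, 0, 0]])  # 0 = padding
loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
print(loss.item())
```

In a training loop, scheduler.step() would be called after each optimizer.step() so the rate rises linearly for 4000 steps and then decays with the inverse square root of the step count.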
Common Pitfalls & Debugging Tips
- Shape mismatches in attention (use .transpose() carefully).
- Forgetting to scale embeddings or apply masks.
- Gradient vanishing/exploding → LayerNorm and residuals help.
- High memory usage → Start with small batch size and seq_len.
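One more mask-related failure is worth checking for when debugging: if every key is masked for some query (for example, a sequence that consists only of padding), all of its scores become -inf and softmax returns NaN. A minimal reproduction, with made-up scores:

```python
import torch

scores = torch.tensor([[0.5, 1.0, -0.3],
                       [0.2, 0.1, 0.4]])
mask = torch.tensor([[1, 1, 0],
                     [0, 0, 0]]).bool()  # second row: every key is masked

weights = torch.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1)
has_nan = torch.isnan(weights).any().item()

print(weights)
print(has_nan)  # True: the fully masked row is all NaN
```

If NaNs show up in the loss, checking torch.isnan on the attention weights is a quick way to find out whether a fully masked row is the cause.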
This implementation closely follows the original "Attention Is All You Need" paper while being clean and educational.
To use this as a single script, combine all classes into one file and train on a translation dataset (e.g., IWSLT or WMT via Hugging Face Datasets).
Want to extend this to a decoder-only model for text generation? Or add RoPE (Rotary Positional Embeddings) for better length generalization? Let me know in the comments!
References & Further Reading
- Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
- PyTorch nn.Transformer (for comparison)
- Illustrated Transformer (Jay Alammar's blog, highly recommended visuals)
Share this tutorial if it helped you understand and code the Transformer from scratch!