Differences in Decoder-Only Transformer Architectures
Tech3Space, 03 Jun 2026
1. Overview of Decoder-Only Architecture
Most modern LLMs (GPT, LLaMA, Mistral, etc.) are decoder-only Transformers.
They use only the decoder stack (no encoder). Each layer has:
- Self-Attention (Causal / Masked)
- Feed-Forward Network (FFN)
- Residual Connections + Normalization
The models differ mainly in how they optimize inference speed, memory use, and output quality.
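To make "causal" concrete, here is a minimal single-head sketch. The function name, the random projection matrices, and the sizes are assumptions chosen for illustration and are not taken from any of the models above. It checks the defining property of a decoder: changing a later token leaves the outputs of earlier tokens unchanged.

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over x of shape (seq, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    seq = x.shape[0]
    # True above the diagonal marks future positions, which are hidden
    future = torch.triu(torch.ones(seq, seq), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
dim = 8
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
x = torch.randn(5, dim)
out = causal_self_attention(x, w_q, w_k, w_v)

# Perturb only the last token and recompute
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed = causal_self_attention(x_changed, w_q, w_k, w_v)

# Earlier positions never attend to the last token, so their outputs match
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-6))
```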
2. Key Architectural Innovations Comparison
| Model | Main Innovation | Normalization | Activation | Attention Type | Positional Encoding | Other Key Features |
|---|---|---|---|---|---|---|
| GPT | Original Decoder-Only | LayerNorm | GELU | Multi-Head Attention | Learned Absolute | Basic Transformer |
| LLaMA | RMSNorm + SwiGLU | RMSNorm | SwiGLU | Multi-Head (GQA in LLaMA 2 70B and LLaMA 3) | RoPE | Pre-normalization, gated FFN |
| Falcon | Multi-Query Attention (MQA) | LayerNorm | GELU | Multi-Query | RoPE | One shared KV head shrinks the KV cache for faster inference |
| Mistral | Sliding Window + GQA + RoPE | RMSNorm | SwiGLU | Grouped-Query (GQA) | RoPE | Sliding Window Attention |
| Mixtral | Mixture of Experts (MoE) | RMSNorm | SwiGLU | Grouped-Query | RoPE | Sparse MoE (8 experts, 2 active) |
| Qwen | Long Context + Dynamic NTK | RMSNorm | SwiGLU | GQA | RoPE + Dynamic NTK scaling | Context extension via RoPE scaling |
| Phi | Small High-Quality Models | LayerNorm/RMS | GELU/SwiGLU | Multi-Head | RoPE | High-quality training data focus |
| Gemma | Lightweight + Efficient | RMSNorm | GeGLU | Multi-Head/GQA | RoPE | Optimized for on-device |
3. Major Code & Algorithm Differences
A. Normalization
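- GPT (LayerNorm): subtracts the mean and divides by the standard deviation, then applies a learned scale and bias
- LLaMA / Mistral (RMSNorm): divides by the root mean square only (no mean subtraction and no bias, so it is cheaper to compute)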
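import torch
import torch.nn as nn
import torch.nn.functional as F

# GPT style (normalized_shape, weight, and bias are set in an __init__ omitted here)
class LayerNorm(nn.Module):
    def forward(self, x):
        return F.layer_norm(x, self.normalized_shape, self.weight, self.bias)

# Mistral / LLaMA style (faster: no mean subtraction, no bias)
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature scale
        self.eps = eps  # keeps the division stable when x is near zero
    def forward(self, x):
        rms = torch.sqrt(torch.mean(x.pow(2), dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight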
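The practical difference between the two norms can be checked numerically. The sketch below uses a hypothetical helper `rms_norm` and arbitrary sizes. It relies on one fact: on zero-mean input the RMS equals the standard deviation, so the two norms agree there, while a constant offset is removed by LayerNorm but not by RMSNorm.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, weight, eps=1e-6):
    # Scale by the root mean square of the last dimension; no mean subtraction
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

torch.manual_seed(0)
dim = 16
x = torch.randn(4, dim)
weight = torch.ones(dim)

# On zero-mean input both norms give the same result
x_centered = x - x.mean(dim=-1, keepdim=True)
same_on_centered = torch.allclose(
    rms_norm(x_centered, weight),
    F.layer_norm(x_centered, (dim,), eps=1e-6),
    atol=1e-5,
)

# A constant offset is removed by LayerNorm but changes the RMSNorm output
x_shifted = x + 3.0
ln_shift_invariant = torch.allclose(
    F.layer_norm(x, (dim,)), F.layer_norm(x_shifted, (dim,)), atol=1e-4
)
rms_shift_invariant = torch.allclose(
    rms_norm(x, weight), rms_norm(x_shifted, weight), atol=1e-4
)
print(same_on_centered, ln_shift_invariant, rms_shift_invariant)
```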
B. Feed Forward Network
- GPT: two linear layers with a GELU in between
- LLaMA / Mistral: SwiGLU, a gated FFN with three projections (reported to improve quality at a similar parameter count)
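# GPT style: up-projection, GELU, down-projection
class FFN(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
    def forward(self, x):
        return self.down_proj(F.gelu(self.up_proj(x)))

# Mistral / LLaMA style: SwiGLU
class MistralMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
    def forward(self, x):
        gate = F.silu(self.gate_proj(x))  # Swish (SiLU)
        up = self.up_proj(x)
        return self.down_proj(gate * up)  # element-wise gating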
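SwiGLU needs three weight matrices instead of two, so models that use it usually shrink the hidden width to keep the parameter count close to that of a 4x GELU FFN. The arithmetic below is a sketch: the function names are hypothetical, biases are ignored, and the 2/3 scaling with rounding up to a multiple of 256 follows the convention used in LLaMA-style models.

```python
def gelu_ffn_params(d_model, expansion=4):
    hidden = expansion * d_model
    return 2 * d_model * hidden  # up_proj + down_proj

def swiglu_hidden(d_model, expansion=4, multiple_of=256):
    # Take 2/3 of the usual hidden width, then round up to a multiple of `multiple_of`
    hidden = int(2 * expansion * d_model / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def swiglu_ffn_params(d_model, hidden):
    return 3 * d_model * hidden  # gate_proj + up_proj + down_proj

d_model = 4096
hidden = swiglu_hidden(d_model)
gelu_params = gelu_ffn_params(d_model)
swiglu_params = swiglu_ffn_params(d_model, hidden)

print(hidden)         # 11008
print(gelu_params)    # 134217728
print(swiglu_params)  # 135266304
```

With these assumptions the two FFNs differ by less than 1% in parameter count, so any quality difference is not simply a matter of size.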
C. Attention Mechanisms
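- Standard Multi-Head Attention (GPT, LLaMA): every Query head has its own Key/Value head
- Multi-Query Attention (Falcon): a single Key/Value head is shared by all Query heads
- Grouped-Query Attention (Mistral): e.g., 8 KV heads for 32 Query heads, a middle ground between MHA quality and MQA memory savings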
# Falcon: Multi-Query Attention (very memory efficient)
k = self.k_proj(x).view(bs, seq, 1, head_dim)  # only 1 KV head
v = self.v_proj(x).view(bs, seq, 1, head_dim)
# Mistral: Grouped-Query Attention (GQA)
k = self.k_proj(x).view(bs, seq, num_kv_heads, head_dim)  # e.g., 8 KV heads
v = self.v_proj(x).view(bs, seq, num_kv_heads, head_dim)
# Repeat each KV head so that num_heads // num_kv_heads query heads share it
k = k.repeat_interleave(num_heads // num_kv_heads, dim=2)
v = v.repeat_interleave(num_heads // num_kv_heads, dim=2)
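The memory saving comes from the KV cache, which stores one key and one value vector per KV head, per layer, per token. The estimate below is a rough sketch with assumed dimensions (32 layers, head size 128, 4096 cached tokens, 16-bit values). The helper `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_kv_heads, head_dim=128, num_layers=32,
                   seq_len=4096, bytes_per_value=2):
    # Factor 2 accounts for storing both keys and values
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_value

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    print(f"{name}: {kv_cache_bytes(kv_heads) / 2**20:.0f} MiB per sequence")
# MHA: 2048 MiB per sequence
# GQA: 512 MiB per sequence
# MQA: 64 MiB per sequence
```

The cache size scales linearly with the number of KV heads, so going from 32 to 8 KV heads cuts it by 4x under these assumptions.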
D. Rotary Embeddings (RoPE) - Used in LLaMA, Mistral, etc.
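# RoPE: rotates q and k by a position-dependent angle, so scores depend on relative position
def rotate_half(x):
    # Split the last dimension in half and map (x1, x2) to (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_emb(q, k, cos, sin):
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed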
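The relative-position property can be verified directly. In the sketch below, `rope_tables` and `apply_rope` are hypothetical helpers, and base 10000 with the half-split layout is an assumption. The same query and key vectors are placed at every position, and pairs with the same offset are compared.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rope_tables(seq_len, head_dim, base=10000.0):
    # One rotation frequency per pair of features
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float64) / head_dim
    inv_freq = 1.0 / (base ** exponents)
    positions = torch.arange(seq_len, dtype=torch.float64)
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim / 2)
    angles = torch.cat((angles, angles), dim=-1)   # match rotate_half's half-split layout
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    return x * cos + rotate_half(x) * sin

torch.manual_seed(0)
seq_len, head_dim = 16, 8
cos, sin = rope_tables(seq_len, head_dim)

# Use the same q and k vectors at every position to isolate the effect of position
q = torch.randn(head_dim, dtype=torch.float64).expand(seq_len, head_dim)
k = torch.randn(head_dim, dtype=torch.float64).expand(seq_len, head_dim)
q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)

# Both pairs are 3 positions apart, so their attention scores should match
score_a = q_rot[5] @ k_rot[2]
score_b = q_rot[12] @ k_rot[9]
print(torch.allclose(score_a, score_b))
```

The rotation also preserves vector norms, so RoPE changes only the angle between q and k, not their magnitudes.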
E. Sliding Window Attention (Mistral Specific)
Limits each token's attention to the last N tokens (e.g., 4096) instead of the full sequence, which caps attention memory and compute for long contexts.
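# In MistralAttention forward() (1 marks a position that is hidden):
ones = torch.ones(seq_len, seq_len)
causal_mask = torch.triu(ones, diagonal=1)  # hide future tokens (j > i)
# Sliding window: also hide tokens that are self.sliding_window or more positions back
too_old = torch.tril(ones, diagonal=-self.sliding_window)  # j <= i - window
mask = (causal_mask + too_old).bool()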
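A small example makes the window shape easier to see. This is a sketch with a hypothetical helper and toy sizes (6 tokens, window of 3). Each token can attend to itself and the 2 tokens before it.

```python
import torch

def sliding_window_mask(seq_len, window):
    """Boolean mask where True means the position is hidden from attention."""
    ones = torch.ones(seq_len, seq_len)
    future = torch.triu(ones, diagonal=1).bool()         # j > i
    too_old = torch.tril(ones, diagonal=-window).bool()  # j <= i - window
    return future | too_old

mask = sliding_window_mask(seq_len=6, window=3)
visible = (~mask).int()
print(visible)
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [0, 1, 1, 1, 0, 0],
#         [0, 0, 1, 1, 1, 0],
#         [0, 0, 0, 1, 1, 1]], dtype=torch.int32)
```

Passing only a larger `diagonal` to `torch.triu` would not produce this band. It would expose future tokens instead of hiding old ones.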
F. Mixture of Experts (Mixtral)
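class MoELayer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.experts = nn.ModuleList([MistralMLP(...) for _ in range(8)])
        self.router = nn.Linear(hidden_size, 8)  # one score per expert
    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = self.router(x)
        topk_weights, topk_indices = torch.topk(scores, k=2, dim=-1)  # choose top-2 experts
        topk_weights = F.softmax(topk_weights, dim=-1)  # the two selected scores sum to 1
        out = torch.zeros_like(x)
        # Dispatch tokens to selected experts (simple loop; optimized kernels batch this)
        for i, expert in enumerate(self.experts):
            token_idx, slot = torch.where(topk_indices == i)
            if token_idx.numel() > 0:
                out[token_idx] += topk_weights[token_idx, slot, None] * expert(x[token_idx])
        return out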
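Sparse routing means a token touches only a fraction of the stored FFN weights. The arithmetic below is illustrative only. The dimensions (hidden size 4096, intermediate size 14336, 32 layers) are assumptions, not values from the text above, and attention, embedding, and router weights are ignored.

```python
def expert_ffn_params(hidden_size, intermediate_size):
    # SwiGLU expert: gate_proj + up_proj + down_proj, no biases
    return 3 * hidden_size * intermediate_size

def moe_ffn_params(hidden_size, intermediate_size, num_layers,
                   num_experts=8, top_k=2):
    per_expert = expert_ffn_params(hidden_size, intermediate_size)
    total = per_expert * num_experts * num_layers  # weights held in memory
    active = per_expert * top_k * num_layers       # weights used for one token
    return total, active

total, active = moe_ffn_params(4096, 14336, 32)
print(f"stored: {total / 1e9:.1f}B, used per token: {active / 1e9:.1f}B")
# stored: 45.1B, used per token: 11.3B
```

Under these assumptions each token runs through a quarter of the expert weights, which is why an MoE model can have the compute cost of a much smaller dense model while still needing memory for all experts.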
4. Complete Code Comparison Structure
A modular base layer that can switch between these architectures might look like this:
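class TransformerDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Choose normalization (one norm before attention, one before the FFN)
        norm_cls = RMSNorm if config.use_rms else nn.LayerNorm
        self.norm1 = norm_cls(config.hidden_size)
        self.norm2 = norm_cls(config.hidden_size)
        # Choose attention
        if config.attention_type == "gqa":
            self.self_attn = MistralAttention(config)  # GQA + RoPE
        elif config.attention_type == "mqa":
            self.self_attn = FalconAttention(config)  # MQA
        else:
            self.self_attn = StandardMultiHeadAttention(config)
        # Choose FFN
        if config.use_swiglu:
            self.mlp = MistralMLP(config.hidden_size, config.intermediate_size)
        else:
            self.mlp = FFN(config.hidden_size, config.intermediate_size)  # GELU FFN from section B

    def forward(self, x):
        # Pre-norm residual layout; assumes each attention class maps
        # (batch, seq, hidden) to (batch, seq, hidden)
        x = x + self.self_attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x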
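The layer above depends on attention classes that are not defined in this post. As a self-contained sketch of the same idea, the version below folds MHA, GQA, and MQA into one `Attention` class controlled by `num_kv_heads`. `LayerConfig`, its field names, and the toy sizes are assumptions for illustration, and RoPE and KV caching are left out to keep it short.

```python
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class LayerConfig:  # hypothetical config; field names are assumptions
    hidden_size: int = 64
    intermediate_size: int = 128
    num_heads: int = 8
    num_kv_heads: int = 8  # 8 = MHA, 2 = GQA, 1 = MQA
    use_rms: bool = True
    use_swiglu: bool = True

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class Attention(nn.Module):
    """Causal attention; num_kv_heads selects MHA, GQA, or MQA."""
    def __init__(self, cfg):
        super().__init__()
        self.h, self.kv = cfg.num_heads, cfg.num_kv_heads
        self.d = cfg.hidden_size // cfg.num_heads
        self.q_proj = nn.Linear(cfg.hidden_size, self.h * self.d, bias=False)
        self.k_proj = nn.Linear(cfg.hidden_size, self.kv * self.d, bias=False)
        self.v_proj = nn.Linear(cfg.hidden_size, self.kv * self.d, bias=False)
        self.o_proj = nn.Linear(self.h * self.d, cfg.hidden_size, bias=False)
    def forward(self, x):
        bs, seq, _ = x.shape
        q = self.q_proj(x).view(bs, seq, self.h, self.d).transpose(1, 2)
        k = self.k_proj(x).view(bs, seq, self.kv, self.d).transpose(1, 2)
        v = self.v_proj(x).view(bs, seq, self.kv, self.d).transpose(1, 2)
        # Share each KV head across h // kv query heads
        k = k.repeat_interleave(self.h // self.kv, dim=1)
        v = v.repeat_interleave(self.h // self.kv, dim=1)
        scores = (q @ k.transpose(-2, -1)) / (self.d ** 0.5)
        future = torch.triu(torch.ones(seq, seq, device=x.device), diagonal=1).bool()
        attn = scores.masked_fill(future, float("-inf")).softmax(dim=-1)
        return self.o_proj((attn @ v).transpose(1, 2).reshape(bs, seq, -1))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.swiglu = cfg.use_swiglu
        self.up_proj = nn.Linear(cfg.hidden_size, cfg.intermediate_size, bias=False)
        self.down_proj = nn.Linear(cfg.intermediate_size, cfg.hidden_size, bias=False)
        if self.swiglu:
            self.gate_proj = nn.Linear(cfg.hidden_size, cfg.intermediate_size, bias=False)
    def forward(self, x):
        if self.swiglu:
            return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
        return self.down_proj(F.gelu(self.up_proj(x)))

class DecoderLayer(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        norm = RMSNorm if cfg.use_rms else nn.LayerNorm
        self.norm1, self.norm2 = norm(cfg.hidden_size), norm(cfg.hidden_size)
        self.self_attn, self.mlp = Attention(cfg), FeedForward(cfg)
    def forward(self, x):
        x = x + self.self_attn(self.norm1(x))  # pre-norm residual
        return x + self.mlp(self.norm2(x))

torch.manual_seed(0)
x = torch.randn(2, 10, 64)
gpt_like = DecoderLayer(LayerConfig(use_rms=False, use_swiglu=False))
mistral_like = DecoderLayer(LayerConfig(num_kv_heads=2))
falcon_like = DecoderLayer(LayerConfig(num_kv_heads=1, use_rms=False, use_swiglu=False))
outputs = [layer(x) for layer in (gpt_like, mistral_like, falcon_like)]
print([tuple(out.shape) for out in outputs])
```

All three variants map a `(batch, seq, hidden)` tensor to the same shape, so they can be swapped without touching the rest of the model. Only the size of the K/V projections changes with `num_kv_heads`.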