Differences in Decoder-Only Transformer Architectures

Tech3Space, 03 Jun 2026


1. Overview of Decoder-Only Architecture

Most modern LLMs (GPT, LLaMA, Mistral, etc.) are decoder-only Transformers.
They use only the decoder stack, with no encoder. Each layer has:

Self-Attention (Causal / Masked)
Feed-Forward Network (FFN)
Residual Connections + Normalization
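
To make the "causal" part concrete, here is a minimal plain-Python sketch of single-head masked self-attention. It is a simplification under stated assumptions: queries, keys, and values are the raw token vectors (real models apply learned projections first), and the function name `causal_self_attention` is illustrative only.

```python
import math

def causal_self_attention(xs):
    """Single-head attention over a list of vectors; token t sees tokens 0..t only."""
    dim = len(xs[0])
    outputs = []
    for t, query in enumerate(xs):
        visible = xs[: t + 1]                       # causal mask: no future tokens
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        peak = max(scores)                          # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(tokens)
# Replacing the last token must not change the outputs of earlier positions.
changed = causal_self_attention(tokens[:2] + [[9.0, -9.0]])
```

Because each position only reads earlier positions, `out[:2]` and `changed[:2]` are identical, which is the property that lets a decoder generate text one token at a time.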

The models differ mainly in how they optimize these components for inference speed, memory use, and output quality.

2. Key Architectural Innovations Comparison

Model	Main Innovation	Normalization	Activation	Attention Type	Positional Encoding	Other Key Features
GPT	Original Decoder-Only	LayerNorm	GELU	Multi-Head Attention	Learned Absolute	Basic Transformer
LLaMA	RMSNorm + SwiGLU	RMSNorm	SwiGLU	Multi-Head (GQA in LLaMA 2 70B and LLaMA 3)	RoPE	Pre-normalization, Large FFN
Falcon	Multi-Query Attention (MQA)	LayerNorm	GELU	Multi-Query	RoPE	Strong MQA for faster inference
Mistral	Sliding Window + GQA + RoPE	RMSNorm	SwiGLU	Grouped-Query (GQA)	RoPE	Sliding Window Attention
Mixtral	Mixture of Experts (MoE)	RMSNorm	SwiGLU	Grouped-Query	RoPE	Sparse MoE (8 experts, 2 active)
Qwen	Long Context + Dynamic NTK	RMSNorm	SwiGLU	GQA	RoPE + Dynamic	Excellent long context
Phi	Small High-Quality Models	LayerNorm/RMS	GELU/SwiGLU	Multi-Head	RoPE	High-quality training data focus
Gemma	Lightweight + Efficient	RMSNorm	GeGLU	Multi-Head/GQA	RoPE	Optimized for on-device
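
Some of the table rows can be encoded as configuration presets, which makes the differences between two models easy to query. This is only a sketch: the field names `use_rms`, `use_swiglu`, and `attention_type` mirror the config flags used in Section 4, while `ArchConfig`, `PRESETS`, `positional`, and the `"mha"` label are hypothetical names introduced here.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ArchConfig:
    use_rms: bool          # RMSNorm instead of LayerNorm
    use_swiglu: bool       # SwiGLU instead of a GELU FFN
    attention_type: str    # "mha", "mqa", or "gqa"
    positional: str        # "learned" or "rope"

# Values copied from the comparison table (base variants only).
PRESETS = {
    "gpt":     ArchConfig(False, False, "mha", "learned"),
    "llama":   ArchConfig(True,  True,  "mha", "rope"),
    "falcon":  ArchConfig(False, False, "mqa", "rope"),
    "mistral": ArchConfig(True,  True,  "gqa", "rope"),
}

def differences(a, b):
    """Return the field names on which two presets disagree."""
    ca, cb = PRESETS[a], PRESETS[b]
    return [f.name for f in fields(ca) if getattr(ca, f.name) != getattr(cb, f.name)]

print(differences("llama", "mistral"))   # ['attention_type']
print(differences("gpt", "falcon"))      # ['attention_type', 'positional']
```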

3. Major Code & Algorithm Differences

A. Normalization

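GPT (LayerNorm): subtracts the mean and divides by the standard deviation, then applies a learned scale and bias
LLaMA / Mistral (RMSNorm): divides by the root mean square only (no mean subtraction, no bias), so it is cheaper to compute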

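import torch
import torch.nn as nn
import torch.nn.functional as F
# GPT Style: subtract mean, divide by std, then learned scale and bias
class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.normalized_shape, self.eps = (dim,), eps
        self.weight, self.bias = nn.Parameter(torch.ones(dim)), nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)

# Mistral / LLaMA Style (Faster): scale by the root mean square only
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps, self.weight = eps, nn.Parameter(torch.ones(dim))
    def forward(self, x):
        rms = torch.sqrt(torch.mean(x.pow(2), dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight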
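
A small worked example shows what each normalization does to the same vector. It is a plain-Python sketch with the learned scale and bias left out (an assumption made to keep the arithmetic visible); the helper names are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Divide by the root mean square; the mean is left untouched."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)   # centered: values sum to roughly zero
rn = rms_norm(x)     # not centered: every value keeps its sign
```

LayerNorm centers the vector, so `ln` contains negative values, while RMSNorm only rescales, so `rn` stays positive and its mean square is close to 1.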

B. Feed Forward Network

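GPT: Linear up-projection, GELU, Linear down-projection
LLaMA / Mistral: SwiGLU, a gated FFN with a third (gate) projection, generally reported to improve quality at a similar parameter count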

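# GPT Style: up-projection, GELU, down-projection
class FFN(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        return self.down_proj(F.gelu(self.up_proj(x)))

# Mistral / LLaMA Style - SwiGLU (bias-free projections)
class MistralMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))   # Swish(SiLU)
        up = self.up_proj(x)
        return self.down_proj(gate * up)   # Element-wise multiplication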
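
SwiGLU adds a third weight matrix, so at the same intermediate width it costs 1.5x the FFN parameters. The arithmetic below is a sketch with illustrative sizes: the hidden size of 4096 and the intermediate size of 11008 (roughly two-thirds of 4 x 4096, rounded up) are assumptions chosen to show how a narrower SwiGLU layer lands near the GELU FFN budget.

```python
def gelu_ffn_params(hidden, intermediate):
    """Weights in up_proj + down_proj (biases ignored)."""
    return 2 * hidden * intermediate

def swiglu_ffn_params(hidden, intermediate):
    """Weights in gate_proj + up_proj + down_proj (no biases)."""
    return 3 * hidden * intermediate

hidden = 4096                                          # illustrative hidden size
gpt_style = gelu_ffn_params(hidden, 4 * hidden)
swiglu_same_width = swiglu_ffn_params(hidden, 4 * hidden)
swiglu_two_thirds = swiglu_ffn_params(hidden, 11008)   # about (2/3) * 4 * hidden

print(gpt_style)            # 134217728
print(swiglu_same_width)    # 201326592
print(swiglu_two_thirds)    # 135266304
```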

C. Attention Mechanisms

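Standard Multi-Head Attention (GPT, LLaMA): every query head has its own Key/Value head
Multi-Query Attention (Falcon): a single Key/Value head is shared by all query heads
Grouped-Query Attention (Mistral): each group of query heads shares one Key/Value head, e.g., 8 KV heads for 32 query heads, a middle ground between MHA quality and MQA memory savings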

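# Falcon: Multi-Query (one K/V head shared by every query head, very memory efficient)
k = self.k_proj(x).view(bs, seq, 1, head_dim)           # Only 1 KV head
v = self.v_proj(x).view(bs, seq, 1, head_dim)

# Mistral: Grouped-Query Attention (GQA)
k = self.k_proj(x).view(bs, seq, num_kv_heads, head_dim)   # e.g., 8 KV heads
v = self.v_proj(x).view(bs, seq, num_kv_heads, head_dim)
# Repeat each KV head so it lines up with its group of query heads
k = k.repeat_interleave(num_heads // num_kv_heads, dim=2)
v = v.repeat_interleave(num_heads // num_kv_heads, dim=2)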
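
The memory benefit of fewer KV heads shows up in the KV cache. The following back-of-the-envelope sketch assumes 32 layers, a head dimension of 128, a 4096-token sequence, and 2-byte (fp16) values; these sizes and the helper names are illustrative, not taken from a specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes to cache K and V for one sequence (factor 2 = one K plus one V tensor)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query(query_head, num_heads, num_kv_heads):
    """KV head read by a query head, matching the repeat_interleave layout above."""
    return query_head // (num_heads // num_kv_heads)

shape = dict(num_layers=32, head_dim=128, seq_len=4096)   # illustrative sizes
mha = kv_cache_bytes(num_kv_heads=32, **shape)
gqa = kv_cache_bytes(num_kv_heads=8, **shape)
mqa = kv_cache_bytes(num_kv_heads=1, **shape)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

With 32 query heads and 8 KV heads, query heads 0 to 3 read KV head 0, heads 4 to 7 read KV head 1, and so on.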

D. Rotary Embeddings (RoPE) - Used in LLaMA, Mistral, etc.

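# RoPE: rotates q and k by position-dependent angles, so scores depend on relative position
def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)            # split the head dimension in half
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_emb(q, k, cos, sin):
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed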
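
The key property of RoPE is that the query-key dot product depends only on the distance between the two positions. The plain-Python sketch below checks this numerically. It assumes the half-split pairing used by `rotate_half` and the common frequency base of 10000; the function names are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate vector x as if it sat at position pos (half-split pairing)."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        angle = pos * base ** (-2 * i / len(x))   # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

near = dot(rope(q, 5), rope(k, 3))        # positions 5 and 3, distance 2
far = dot(rope(q, 105), rope(k, 103))     # shifted by 100, same distance 2
other = dot(rope(q, 5), rope(k, 0))       # distance 5
```

Here `near` and `far` agree up to floating-point error because both pairs are two positions apart, while `other` generally differs.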

E. Sliding Window Attention (Mistral Specific)

Limits each token's attention to the last N tokens (e.g., 4096) instead of the full sequence, which caps attention memory for long contexts.

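# In MistralAttention forward(): True marks key positions a query must not attend to
ones = torch.ones(seq_len, seq_len)
mask = torch.triu(ones, diagonal=1).bool()                    # causal: hide future tokens
# Sliding window: also hide keys that are sliding_window or more positions in the past
mask = mask | torch.tril(ones, diagonal=-self.sliding_window).bool()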
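
A tiny plain-Python version of the mask makes the window easy to inspect by hand. This is a sketch: `sliding_window_mask` and `visible_keys` are illustrative helpers, and the window is assumed to include the current token.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i must NOT attend to key j."""
    return [[j > i or i - j >= window for j in range(seq_len)]
            for i in range(seq_len)]

def visible_keys(mask, query):
    """Key positions that the given query position is allowed to read."""
    return [j for j, blocked in enumerate(mask[query]) if not blocked]

mask = sliding_window_mask(seq_len=6, window=3)
print(visible_keys(mask, 0))   # [0]
print(visible_keys(mask, 2))   # [0, 1, 2]
print(visible_keys(mask, 5))   # [3, 4, 5]
```

Every query sees at most `window` keys, so per-token attention cost stops growing once the sequence is longer than the window.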

F. Mixture of Experts (Mixtral)

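class MoELayer(nn.Module):
    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            [MistralMLP(hidden_size, intermediate_size) for _ in range(num_experts)])
        self.router = nn.Linear(hidden_size, num_experts, bias=False)   # one score per expert

    def forward(self, x):                                  # x: (num_tokens, hidden_size)
        scores = self.router(x)
        topk_weights, topk_indices = torch.topk(scores, k=self.top_k, dim=-1)
        topk_weights = F.softmax(topk_weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):          # simple loop; real kernels batch this
            token_idx, slot = torch.where(topk_indices == e)
            if token_idx.numel() > 0:
                out[token_idx] += topk_weights[token_idx, slot, None] * expert(x[token_idx])
        return out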
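
The routing step for a single token can be traced by hand. The sketch below is plain Python with toy scalar "experts"; it assumes the mixing weights come from a softmax over only the selected scores, and `route_top_k` and `moe_output` are hypothetical helper names.

```python
import math

def route_top_k(scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into mixing weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def moe_output(x, scores, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    indices, weights = route_top_k(scores, k)
    return sum(w * experts[i](x) for i, w in zip(indices, weights))

# Four toy experts that just scale their input by 1, 2, 3, or 4.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 1.0]

indices, weights = route_top_k(router_scores)
y = moe_output(10.0, router_scores, experts)
print(indices)   # [1, 3]
```

Only experts 1 and 3 run for this token, so the compute per token scales with the number of active experts rather than the total number of experts.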

4. Complete Code Comparison Structure

A modular base layer can switch between these architectures through config flags:

class TransformerDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        
        # Choose Normalization
        self.norm1 = RMSNorm(config.hidden_size) if config.use_rms else nn.LayerNorm(config.hidden_size)
        
        # Choose Attention
        if config.attention_type == "gqa":
            self.self_attn = MistralAttention(config)      # GQA + RoPE
        elif config.attention_type == "mqa":
            self.self_attn = FalconAttention(config)       # MQA
        else:
            self.self_attn = StandardMultiHeadAttention(config)
        
        # Choose FFN
        if config.use_swiglu:
            self.mlp = MistralMLP(config.hidden_size, config.intermediate_size)
        else:
            self.mlp = FFN(config.hidden_size, config.intermediate_size)   # GPT-style FFN from Section 3.B
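
The constructor above selects the sub-layers; how they are wired together also differs between models. As a rough illustration, the plain-Python sketch below contrasts pre-normalization (the ordering listed for LLaMA in the table) with post-normalization. Plain callables stand in for the attention and FFN modules, each acting on a single vector, and all names here are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def pre_norm_layer(x, attn, mlp, norm=rms_norm):
    """x + attn(norm(x)), then h + mlp(norm(h)): the residual path is never normalized."""
    h = add(x, attn(norm(x)))
    return add(h, mlp(norm(h)))

def post_norm_layer(x, attn, mlp, norm=rms_norm):
    """norm(x + attn(x)), then norm(h + mlp(h)): the residual sum itself is normalized."""
    h = norm(add(x, attn(x)))
    return norm(add(h, mlp(h)))

def zero(v):
    """Stand-in sub-layer that contributes nothing."""
    return [0.0] * len(v)

x = [3.0, 4.0]
pre = pre_norm_layer(x, zero, zero)     # input passes through unchanged
post = post_norm_layer(x, zero, zero)   # input is rescaled even with silent sub-layers
```

With sub-layers that output zeros, the pre-norm layer returns its input untouched, which illustrates the clean residual path often cited as a reason deep pre-norm stacks train more stably.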