Training a **high-quality LLM** depends on **many factors**, not just the positional encoding. The architecture, data, optimization, and compute budget all matter.
Tech3Space, 14 Jun 2026
1. Which positional encoding is best for high-quality models?
| Method | Model Quality | Memory Usage | Compute Cost | Long Context | Typical Use |
|---|---|---|---|---|---|
| Learned Positional Embeddings | Good | Medium | Medium | ❌ Limited | GPT-2, GPT-Neo |
| Sinusoidal | Good | Low | Low | Moderate | Original Transformer |
| Relative Positional Encoding | Very Good | Medium–High | Medium–High | Good | T5 |
| ALiBi | Very Good | Low | Low | Excellent | BLOOM, MPT |
| RoPE | Excellent | Low | Low | Excellent | Llama, Qwen, Mistral, Gemma |
| xPos | Excellent | Low | Medium | Excellent | Specialized research |
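RoPE is the most common choice in modern decoder-only LLMs. It encodes position by rotating query and key vectors through position-dependent angles, so it adds no learned parameters and very little memory overhead while still performing strongly.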
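To make the rotation idea concrete, here is a minimal sketch of RoPE on a single vector in plain Python. The function names (`rope_rotate`, `dot`), the base of 10000, and the choice to rotate adjacent pairs of dimensions are assumptions for illustration. Real implementations work on batched tensors, and some pair the dimensions differently.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    if len(vec) % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query and key vectors (made-up values).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Both pairs of positions are 3 tokens apart, so the scores should match.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The attention score depends only on the distance between the two positions, which is why RoPE behaves like a relative encoding even though it is applied to each token separately.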
2. Which components use the most memory?
(a) Multi-Head Attention (MHA)

    Q1 K1 V1
    Q2 K2 V2
    Q3 K3 V3
    Q4 K4 V4

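- Stores separate keys and values for every head.
- Largest KV cache at inference time, because nothing is shared between heads.
- Highest memory-bandwidth cost during decoding, because every head's cached keys and values are read for each new token.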
Memory: 🔴 High
(b) Grouped Query Attention (GQA)

    Q1 Q2 -> K1 V1
    Q3 Q4 -> K2 V2

- Shares keys and values across groups of query heads.
- Lower memory usage than MHA.
Memory: 🟢 Lower
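The grouping in the diagram comes down to an integer division. The sketch below maps each query head to the key/value head it reads from. The function name `kv_head_for_query` is hypothetical, and indices are 0-based, so Q1 in the diagram is head 0.

```python
def kv_head_for_query(query_head, n_query_heads, n_kv_heads):
    """Return the index of the K/V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Four query heads sharing two KV heads, as in the diagram above.
gqa = [kv_head_for_query(h, 4, 2) for h in range(4)]  # [0, 0, 1, 1]

# The two extremes fall out of the same formula.
mha = [kv_head_for_query(h, 4, 4) for h in range(4)]  # [0, 1, 2, 3]
mqa = [kv_head_for_query(h, 4, 1) for h in range(4)]  # [0, 0, 0, 0]
print(gqa, mha, mqa)
```

Setting the number of KV heads equal to the number of query heads recovers MHA, and setting it to one gives the fully shared layout.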
(c) Multi-Query Attention (MQA)

    Q1
    Q2
    Q3
    Q4
     │
    Shared K
    Shared V

- Shares a single key head and a single value head across all query heads.
- Very memory efficient, though sharing this aggressively can cost some model quality.
Memory: 🟢 Lowest
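The memory ranking above follows from the size of the KV cache, which scales linearly with the number of key/value heads. The arithmetic below is a rough sketch for a hypothetical model with 32 layers, 32 query heads, a head dimension of 128, an 8192-token context, and 16-bit cache values. None of these numbers describe a specific released model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # The leading 2 counts one tensor for keys plus one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model shape (assumed values, for illustration only).
shape = dict(n_layers=32, head_dim=128, seq_len=8192)

mha_bytes = kv_cache_bytes(n_kv_heads=32, **shape)  # one KV head per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=8, **shape)   # 4 query heads per KV head
mqa_bytes = kv_cache_bytes(n_kv_heads=1, **shape)   # one KV head for all queries

for name, size in [("MHA", mha_bytes), ("GQA", gqa_bytes), ("MQA", mqa_bytes)]:
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

Under these assumptions the cache is 4 GiB for MHA, 1 GiB for GQA, and 0.125 GiB for MQA per sequence, so the savings grow directly with the sharing ratio.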
3. Which feed-forward activation is heavier?
| Activation | Quality | Compute | Memory | Common Use |
|---|---|---|---|---|
| Sigmoid | Lower as a main FFN activation | Low | Low | Gates, binary outputs |
| GELU | High | Medium | Medium | GPT-Neo, BERT |
| SwiGLU | Very High | Higher | Higher | Llama, Qwen, Mistral |
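A SwiGLU feed-forward block uses three weight matrices (gate, up, and down projections) where a plain GELU block uses two, so at the same hidden width it has about 50% more parameters. In practice the hidden width is often reduced to roughly two thirds to keep the parameter count comparable, and SwiGLU still tends to give better modeling quality.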
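The following sketch shows one way a SwiGLU block can be written for a single token vector, together with a bias-free parameter count. The helper names (`silu`, `matvec`, `swiglu_ffn`, `ffn_params`) and the model width of 4096 are assumptions for illustration. Real models use batched matrix multiplies and often round the hidden width to a hardware-friendly multiple.

```python
import math

def silu(x):
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ), with no bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def ffn_params(d_model, d_hidden, gated):
    """Weight count without biases: 3 matrices if gated, otherwise 2."""
    return (3 if gated else 2) * d_model * d_hidden

d_model = 4096  # assumed width, for illustration only
gelu_params = ffn_params(d_model, 4 * d_model, gated=False)
swiglu_same_width = ffn_params(d_model, 4 * d_model, gated=True)
swiglu_scaled = ffn_params(d_model, (8 * d_model) // 3, gated=True)

print(gelu_params, swiglu_same_width, swiglu_scaled)
```

With the hidden width cut to about two thirds of `4 * d_model`, the gated block lands at nearly the same parameter count as the two-matrix GELU block.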
4. Which normalization is heavier?
| Normalization | Memory | Compute | Modern Usage |
|---|---|---|---|
| LayerNorm | Medium | Medium | Older and many existing models |
| RMSNorm | Slightly lower | Slightly lower | Common in recent decoder-only LLMs |
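RMSNorm is often chosen for its simplicity and efficiency: it skips LayerNorm's mean subtraction and bias term and only rescales each vector by its root mean square.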
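The difference between the two normalizations is easiest to see side by side. This is a minimal sketch on a single vector in plain Python. The function names and the `eps` value of 1e-5 are assumptions, and real implementations operate on tensors with learned gain parameters.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-5):
    """Divide by the root mean square, then scale. No mean, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

ln_out = layer_norm(x, ones, zeros)  # centered around zero
rms_out = rms_norm(x, ones)          # rescaled only, signs unchanged
print(ln_out)
print(rms_out)
```

RMSNorm needs one reduction over the vector instead of two and carries one learned vector instead of two, which is where the small savings come from.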
5. Overall architecture comparison
| Component | Older approach | Modern approach | Relative weight |
|---|---|---|---|
| Positional Encoding | Learned | RoPE | RoPE is efficient |
| Attention | MHA | GQA | GQA is lighter |
| Feed-forward | GELU | SwiGLU | SwiGLU is somewhat heavier |
| Normalization | LayerNorm | RMSNorm | RMSNorm is slightly lighter |
6. A common modern LLM recipe
Many recent open-weight models use something close to:

    Input
      │
    Token Embeddings
      │
    RMSNorm
      │
    Grouped Query Attention (GQA), with RoPE applied to queries and keys
      │
    Residual Connection
      │
    RMSNorm
      │
    SwiGLU Feed-Forward Network
      │
    Residual Connection
      │
    Repeat for many decoder layers
      │
    Output Head

This style is used because it balances quality, inference speed, and memory efficiency.
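The norm, sublayer, residual pattern in this recipe can be sketched in a few lines. The code below is only a structural illustration on a single vector. `decoder_layer` and `run_stack` are hypothetical names, the attention and feed-forward sublayers are passed in as placeholder functions, and the learned gain of RMSNorm is omitted for brevity.

```python
import math

def rms_norm(x, eps=1e-5):
    """RMSNorm without a learned gain, to keep the sketch short."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def decoder_layer(x, attention, feed_forward):
    """One pre-norm layer: normalize, transform, then add the input back."""
    x = add(x, attention(rms_norm(x)))     # RMSNorm -> attention -> residual
    x = add(x, feed_forward(rms_norm(x)))  # RMSNorm -> feed-forward -> residual
    return x

def run_stack(x, layers):
    """Apply a list of (attention, feed_forward) pairs in order."""
    for attention, feed_forward in layers:
        x = decoder_layer(x, attention, feed_forward)
    return x

# Placeholder sublayers standing in for GQA and SwiGLU (not real attention).
def zero(h):
    return [0.0] * len(h)

def halve(h):
    return [0.5 * v for v in h]

hidden = [1.0, -2.0, 3.0]
unchanged = run_stack(hidden, [(zero, zero)] * 4)
updated = decoder_layer(hidden, zero, halve)
print(unchanged, updated)
```

Because each sublayer only adds to the running hidden state, a sublayer that outputs zeros leaves the input untouched. This residual path is a large part of what makes deep stacks of such layers trainable.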
Final takeaway
- Best positional encoding for many modern LLMs: ✅ RoPE
- Most memory-intensive attention: 🔴 Multi-Head Attention (MHA)
- More memory-efficient attention: ✅ Grouped Query Attention (GQA) or Multi-Query Attention (MQA)
- Most capable feed-forward activation in many recent models: ✅ SwiGLU (at the cost of somewhat higher compute and parameter usage than GELU)
- Preferred normalization in many recent decoder-only LLMs: ✅ RMSNorm