Training a high-quality LLM depends on many factors, not just the positional encoding. The architecture, data, optimization, and compute budget all matter.

Tech3Space, 14 Jun 2026

1. Which positional encoding is best for high-quality models?

Method	Model Quality	Memory Usage	Compute Cost	Long Context	Typical Use
Learned Positional Embeddings	Good	Medium	Medium	❌ Limited	GPT-2, GPT-Neo
Sinusoidal	Good	Low	Low	Moderate	Original Transformer
Relative Positional Encoding	Very Good	Medium–High	Medium–High	Good	T5
ALiBi	Very Good	Low	Low	Excellent	Some long-context models
RoPE	Excellent	Low	Low	Excellent	Llama, Qwen, Mistral, Gemma
xPos	Excellent	Low	Medium	Excellent	Specialized research

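For most modern decoder-only LLMs, RoPE is a common choice: it rotates query and key vectors by position-dependent angles, so attention scores depend on relative position, and it adds no learned parameters or significant memory overhead.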
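To make the rotation idea concrete, here is a minimal NumPy sketch of RoPE for a single attention head. The function name `rope`, the `base` value of 10000, and the "split in half" pairing of features are assumptions made for illustration; real implementations differ in how they pair features and cache the angles.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles.

    x: array of shape (seq_len, head_dim); head_dim must be even.
    positions: integer array of shape (seq_len,).
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per feature pair, decreasing geometrically.
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair feature i with feature i + half (an illustrative convention).
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2-D rotation applied to every (x1, x2) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same relative offset (2) at two different absolute positions.
score_a = rope(q, np.array([5])) @ rope(k, np.array([3])).T
score_b = rope(q, np.array([12])) @ rope(k, np.array([10])).T
print(np.allclose(score_a, score_b))
```

The two scores match because the dot product of two rotated vectors depends only on the difference between their rotation angles, which is why RoPE behaves like a relative positional encoding inside attention.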

2. Which attention variant uses the most memory?

(a) Multi-Head Attention (MHA)

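Q1 K1 V1
Q2 K2 V2
Q3 K3 V3
Q4 K4 V4

Stores separate keys and values for every head.
High memory usage, mainly because the key/value cache kept during inference grows with the number of heads.
Highest memory traffic of the three variants when generating tokens.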

Memory: 🔴 High

(b) Grouped Query Attention (GQA)

Q1 Q2 -> K1 V1
Q3 Q4 -> K2 V2

Shares keys and values across groups of query heads.
Lower memory usage than MHA.

Memory: 🟢 Lower
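The sharing pattern in the diagram can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions (one sequence, no batching, causal masking, hypothetical function names), not the code of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k and v: (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each KV head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, seq, seq)
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 5, 8))  # 4 query heads
k = rng.normal(size=(2, 5, 8))  # 2 shared key heads
v = rng.normal(size=(2, 5, 8))  # 2 shared value heads
out = grouped_query_attention(q, k, v)
print(out.shape)
```

With `n_kv_heads` equal to `n_q_heads` the same function behaves like standard multi-head attention, and with `n_kv_heads = 1` it reduces to multi-query attention.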

(c) Multi-Query Attention (MQA)

Q1
Q2
Q3
Q4
  │
Shared K
Shared V

Shares a single set of keys and values.
Very memory efficient.

Memory: 🟢 Lowest
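A rough way to compare the three variants is to estimate the size of the key/value cache kept during generation. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per value); the numbers are illustrative and do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model configuration, chosen only for illustration.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv_heads, **cfg) / 2**30
    print(f"{name}: {gib:.4f} GiB")
# MHA: 2.0000 GiB
# GQA: 0.5000 GiB
# MQA: 0.0625 GiB
```

The cache scales linearly with the number of key/value heads, so moving from 32 to 8 KV heads cuts it by a factor of four under these assumptions.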

3. Which feed-forward activation is heavier?

Activation	Quality	Compute	Memory	Common Use
Sigmoid	Lower as a main FFN activation	Low	Low	Gates, binary outputs
GELU	High	Medium	Medium	GPT-Neo, BERT
SwiGLU	Very High	Higher	Higher	Llama, Qwen, Mistral

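A SwiGLU feed-forward block uses three weight matrices (gate, up, and down projections) instead of the two in a simple GELU block, so at the same hidden width it has about 50% more parameters and compute. Many models shrink the hidden width, often to roughly two-thirds, to keep the parameter count comparable. In exchange, the gated form often provides better modeling capacity.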
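The difference between the two feed-forward styles is easier to see side by side. The following NumPy sketch uses hypothetical function names and small toy dimensions, omits biases, and uses the tanh approximation of GELU; the two-thirds width for SwiGLU is a common convention assumed here for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU (also called Swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu_ffn(x, w_in, w_out):
    # Two matrices: expand, activate, project back.
    return gelu(x @ w_in) @ w_out

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Three matrices: a SiLU-activated gate multiplies a linear "up" branch.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model = 64
d_ff_gelu = 4 * d_model              # 256
d_ff_swiglu = (2 * d_ff_gelu) // 3   # 170, about two-thirds of the GELU width

params_gelu = 2 * d_model * d_ff_gelu        # 32768
params_swiglu = 3 * d_model * d_ff_swiglu    # 32640

rng = np.random.default_rng(0)
x = rng.normal(size=(3, d_model))
y_gelu = gelu_ffn(x, rng.normal(size=(d_model, d_ff_gelu)),
                  rng.normal(size=(d_ff_gelu, d_model)))
y_swiglu = swiglu_ffn(x, rng.normal(size=(d_model, d_ff_swiglu)),
                      rng.normal(size=(d_model, d_ff_swiglu)),
                      rng.normal(size=(d_ff_swiglu, d_model)))
print(params_gelu, params_swiglu, y_gelu.shape, y_swiglu.shape)
```

With the reduced hidden width the two blocks end up with nearly the same parameter count, which is why the extra matrix does not have to make the model larger overall.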

4. Which normalization is heavier?

Normalization	Memory	Compute	Modern Usage
LayerNorm	Medium	Medium	Older and many existing models
RMSNorm	Slightly lower	Slightly lower	Common in recent decoder-only LLMs

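RMSNorm is often chosen for its simplicity and efficiency: it skips the mean subtraction and bias term of LayerNorm and rescales each vector by its root mean square only.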
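The two normalizations differ by only a few operations, as this minimal NumPy sketch shows. The function names and the `eps` value are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Center by the mean, scale by the standard deviation, then apply gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-5):
    # No centering and no bias: divide by the root mean square, then apply gain.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 16))
ln = layer_norm(x, np.ones(16), np.zeros(16))
rn = rms_norm(x, np.ones(16))

print(np.allclose(ln.mean(axis=-1), 0.0, atol=1e-6))                   # LayerNorm output is centered
print(np.allclose(np.sqrt((rn**2).mean(axis=-1)), 1.0, atol=1e-3))     # RMSNorm output has unit RMS
```

RMSNorm keeps one learned vector (the gain) per layer instead of two, and it avoids computing the mean, which accounts for the slightly lower memory and compute in the table.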

5. Overall architecture comparison

Component	Older approach	Modern approach	Relative weight
Positional Encoding	Learned	RoPE	RoPE is efficient
Attention	MHA	GQA	GQA is lighter
Feed-forward	GELU	SwiGLU	SwiGLU is somewhat heavier
Normalization	LayerNorm	RMSNorm	RMSNorm is slightly lighter

6. A common modern LLM recipe

Many recent open-weight models use something close to:

Input
   │
Token Embeddings
   │
RMSNorm
   │
Grouped Query Attention (GQA), with RoPE applied to queries and keys
   │
Residual Connection
   │
RMSNorm
   │
SwiGLU Feed-Forward Network
   │
Residual Connection
   │
Repeat for many decoder layers
   │
Output Head

This style is used because it balances quality, inference speed, and memory efficiency.
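As a rough illustration of how these pieces fit together, here is a compact NumPy sketch of a single decoder layer. Everything in it is an assumption for demonstration: the function and parameter names are hypothetical, the dimensions are tiny, weights are random, and there is no batching or KV cache. RoPE is applied to the queries and keys inside the attention step, which is where common implementations place it.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-5):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * gain

def rope(x, base=10000.0):
    # x: (heads, seq, head_dim); rotate feature pairs by position-dependent angles.
    _, seq, d = x.shape
    half = d // 2
    angles = np.arange(seq)[:, None] * base ** (-np.arange(half) / half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(x, p, n_q, n_kv):
    seq, d_model = x.shape
    hd = d_model // n_q
    split = lambda t, h: t.reshape(seq, h, hd).transpose(1, 0, 2)  # -> (h, seq, hd)
    q = rope(split(x @ p["wq"], n_q))
    k = rope(split(x @ p["wk"], n_kv))
    v = split(x @ p["wv"], n_kv)
    # Share each KV head across a group of query heads.
    k, v = (np.repeat(t, n_q // n_kv, axis=0) for t in (k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)  # causal mask
    out = softmax(np.where(future, -np.inf, scores)) @ v   # (n_q, seq, hd)
    return out.transpose(1, 0, 2).reshape(seq, d_model) @ p["wo"]

def swiglu(x, p):
    g = x @ p["w_gate"]
    return (g / (1.0 + np.exp(-g)) * (x @ p["w_up"])) @ p["w_down"]

def decoder_block(x, p, n_q=4, n_kv=2):
    # Pre-norm layout: normalize, transform, then add the residual.
    x = x + gqa(rms_norm(x, p["norm1"]), p, n_q, n_kv)
    x = x + swiglu(rms_norm(x, p["norm2"]), p)
    return x

def init_params(d_model=32, n_q=4, n_kv=2, d_ff=64, seed=0):
    rng = np.random.default_rng(seed)
    hd = d_model // n_q
    w = lambda a, b: rng.normal(scale=a ** -0.5, size=(a, b))
    return {
        "wq": w(d_model, n_q * hd), "wk": w(d_model, n_kv * hd),
        "wv": w(d_model, n_kv * hd), "wo": w(d_model, d_model),
        "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff),
        "w_down": w(d_ff, d_model),
        "norm1": np.ones(d_model), "norm2": np.ones(d_model),
    }

params = init_params()
x = np.random.default_rng(1).normal(size=(6, 32))  # 6 tokens, d_model = 32
y = decoder_block(x, params)
print(y.shape)
```

A full model would stack many such layers between the token embeddings and the output head; the sketch only shows the order of operations within one layer.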

Final takeaway

Best positional encoding for many modern LLMs: ✅ RoPE
Most memory-intensive attention: 🔴 Multi-Head Attention (MHA)
More memory-efficient attention: ✅ Grouped Query Attention (GQA) or Multi-Query Attention (MQA)
Most capable feed-forward activation in many recent models: ✅ SwiGLU (at the cost of somewhat higher compute and parameter usage than GELU)
Preferred normalization in many recent decoder-only LLMs: ✅ RMSNorm

Training a high-quality LLM depends on many factors, not just the positional encoding. The architecture, data, optimization, and compute budget all matter.