Components of Generative AI
Here’s an expanded explanation of the key components of Generative AI, now with simple, runnable Python code examples for many of them (mostly using PyTorch and popular libraries such as tiktoken and sentence-transformers, plus minimal from-scratch implementations).
These examples are kept minimal and educational: they show the core idea in working code, not full production-scale training.
1. Data (Training Corpus)
Huge text/image/code datasets.
No code example is needed here; just imagine loading billions of tokens from Common Crawl, The Pile, GitHub, LAION-5B, etc.
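One practical detail worth showing: corpora at this scale never fit in memory, so training pipelines stream them. A minimal sketch of lazy streaming in plain Python (the file name and chunk size are illustrative, not from any specific pipeline):

```python
def stream_corpus(path, chunk_chars=1_000_000):
    """Yield a large text file one chunk at a time, so the whole
    corpus never has to be loaded into memory at once."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk

# Usage: process the corpus incrementally, e.g. count characters
# total_chars = sum(len(chunk) for chunk in stream_corpus("corpus.txt"))
```

Real pipelines layer shuffling, deduplication, and tokenization on top of this same streaming pattern.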
2. Tokenization
Breaking text → list of token IDs.
# pip install tiktoken
import tiktoken
# GPT-4 / cl100k_base tokenizer (very common in 2025–2026)
encoding = tiktoken.get_encoding("cl100k_base")
text = "Generative AI is transforming technology in 2026!"
tokens = encoding.encode(text)
print("Tokens (IDs) :", tokens)
print("Token count :", len(tokens))
print("Decoded back :", encoding.decode(tokens))
print("Decoded tokens :", [encoding.decode([t]) for t in tokens])
# Example output (the exact IDs depend on the tokenizer; treat these as a pattern):
# Tokens (IDs) : a list of integer token IDs
# Token count : the length of that list
# Decoded back : Generative AI is transforming technology in 2026!
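To build intuition for what a tokenizer does, here is a toy word-level tokenizer from scratch. This is purely illustrative: real tokenizers like cl100k_base use byte-pair encoding over raw bytes, not whitespace splitting.

```python
def build_vocab(corpus):
    """Assign an integer ID to each unique whitespace-separated word."""
    vocab = {"<unk>": 0}  # reserve ID 0 for unknown words
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs; unseen words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

def decode(ids, vocab):
    """Token IDs -> text, by inverting the vocabulary."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

vocab = build_vocab("generative ai is transforming technology")
ids = encode("ai is transforming", vocab)
print(ids)                 # [2, 3, 4]
print(decode(ids, vocab))  # ai is transforming
```

BPE improves on this by merging frequent character pairs, so it handles unseen words gracefully instead of collapsing them all to `<unk>`.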
3. Embeddings
Tokens → dense vectors (capturing meaning).
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import torch
model = SentenceTransformer("all-MiniLM-L6-v2") # ~80 MB, fast & good quality
sentences = [
    "The king is strong",
    "The queen is powerful",
    "Apple is a fruit",
    "Apple released iPhone 17",
]
embeddings = model.encode(sentences)  # shape: (4, 384)
print("Embedding shape :", embeddings.shape)
print("Similarity king ↔ queen:", torch.nn.functional.cosine_similarity(
    torch.tensor(embeddings[0]), torch.tensor(embeddings[1]), dim=0
).item())  # usually ~0.65–0.75
# Same word, different context → different embeddings in contextual models
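Under the hood, the first step of an embedding layer is just a lookup table: each token ID indexes one dense vector. A from-scratch sketch in plain Python (the vectors here are random and untrained; real models learn them during training):

```python
import math
import random

random.seed(0)
vocab_size, d_model = 10, 8  # toy sizes; GPT-scale models use 50k+ x 1000+

# One d_model-dimensional vector per token ID
embedding_table = [[random.gauss(0, 1) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one dense vector per token ID."""
    return [embedding_table[i] for i in token_ids]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

vectors = embed([1, 2, 3])
print(len(vectors), len(vectors[0]))   # 3 tokens, 8 dims each
print(cosine(vectors[0], vectors[0]))  # 1.0: a vector matches itself
```

This is equivalent to `torch.nn.Embedding(vocab_size, d_model)`: an index into a learnable matrix, trained so that similar tokens end up with similar vectors.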
4–5. Neural Networks + Attention Mechanisms (Transformer core)
Minimal self-attention from scratch (very simplified):
import torch
import torch.nn.functional as F
def simple_self_attention(x):
    # x shape: (batch=1, seq_len, d_model)
    d_k = x.size(-1)
    # In real models: three separate learned projections produce Q, K, V
    Q = K = V = x  # naive for demo: reuse the input for all three
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # scaled dot-product
    attn_weights = F.softmax(scores, dim=-1)  # each row sums to 1
    output = torch.matmul(attn_weights, V)    # weighted mix of value vectors
    return output
# Tiny example
torch.manual_seed(42)
x = torch.randn(1, 5, 64) # 5 tokens, 64-dim embeddings
out = simple_self_attention(x)
print("Input shape :", x.shape)
print("Output shape:", out.shape) # same shape
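The demo above lets every token attend to every other token. GPT-style decoders add a causal mask so each position can only attend to itself and earlier positions. A minimal sketch of the masked-softmax step in plain Python (operating on a small score matrix, to keep the mechanics visible):

```python
import math

def causal_softmax(scores):
    """Apply a causal mask to a (seq_len x seq_len) score matrix, then
    softmax each row: position i may only attend to positions <= i."""
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        # Mask out future positions (j > i) with -infinity
        masked = [row[j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked[: i + 1])  # subtract the max for numerical stability
        exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

w = causal_softmax([[0.0, 9.0, 9.0],
                    [1.0, 2.0, 9.0],
                    [1.0, 1.0, 1.0]])
print(w[0])  # [1.0, 0.0, 0.0]: the first token can only see itself
```

In PyTorch this corresponds to adding a `-inf` upper-triangular mask to `scores` before the `F.softmax` call in `simple_self_attention`; masked positions get exactly zero attention weight, which is what lets the model generate text left to right.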