Module 183

HBase Schema Design – Real-World Production Patterns (2025 Edition)

These are patterns widely used in large production HBase deployments, the kind run at companies such as Meta, Uber, Pinterest, Xiaomi, TikTok, and JPMorgan.

Golden Rule of HBase Schema Design (2025)

Tall-Narrow > Wide-Flat
→ Many rows with few columns each > few rows with millions of columns

1. User Profile / Activity Feed (Meta, Pinterest, TikTok)

Use Case: Store user profile + last 10K actions (posts, likes, comments)

Component	Design Choice	Example RowKey	Column Family : Qualifier	Value
RowKey	`user_id` (fixed-width padded)	`0000012345`	—	—
CF: info	Static/slow-changing data	—	info:name	"Alice"
			info:email	"a@x.com"
CF: activity	Time-series events, newest first	—	activity:20251130_1845_click	post:998877
			activity:20251130_1830_like	post:112233
CF: counters	Fast increment (likes_count, followers_count)	—	counters:followers	154321

Why it works:

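Single Get on row 0000012345 → profile + recent activity in one round trip
Column-limited read on the activity family → last N actions. Newest-first order requires a reversed timestamp in the qualifier (e.g. Long.MAX_VALUE - epoch millis); plain yyyymmdd_hhmm qualifiers like those in the table sort oldest first.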
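A minimal sketch of how such a reversed-timestamp qualifier could be built. The function name `activity_qualifier` and the `{reversed_ts}_{action}` layout are assumptions for illustration, not a layout taken from any of the systems named above.

```python
JAVA_LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def activity_qualifier(epoch_millis: int, action: str) -> bytes:
    """Build a qualifier that sorts newest-first in HBase's byte order."""
    reversed_ts = JAVA_LONG_MAX - epoch_millis
    # Zero-pad to 19 digits so lexicographic order equals numeric order.
    return f"{reversed_ts:019d}_{action}".encode("ascii")


like = activity_qualifier(1764527400000, "like")    # 2025-11-30 18:30 UTC
click = activity_qualifier(1764528300000, "click")  # 2025-11-30 18:45 UTC

# The later event sorts first, so reading the first N columns yields the newest N actions.
assert sorted([like, click]) == [click, like]
```

Padding matters: without the fixed 19-digit width, byte-wise comparison would no longer match numeric order.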

2. Time-Series / IoT / Metrics (OpenTSDB Style – Used by Uber, Xiaomi)

Use Case: 1 billion metrics per day, 2-year retention

Design: RowKey = metric_name + reverse_timestamp + device_id, where reverse_timestamp = Long.MAX_VALUE - epoch seconds (UTC)

Example RowKey	CF:data : Qualifier	Value
com.cpu.usage#9223372035090271807#server-0001	data:2025-11-30T12:00:00	78.3
com.cpu.usage#9223372035090271867#server-0001	data:2025-11-30T11:59:00	82.1

Better 2025 Design (Salt + Reverse Timestamp)
To avoid hotspotting on latest data:

RowKey = salt(0–99) + (Long.MAX_VALUE - timestamp) + metric + device_id
→ 07_9223370319574464000_com.cpu.usage_server-0001

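Result: Writes spread evenly across RegionServers, provided the table is pre-split along the salt prefixes
Trade-off: A time-range read must issue one scan per salt bucket and merge the results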
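The salted key can be sketched in a few lines. The original does not say how the salt is chosen, so the hash-derived bucket here (`salt_for`) is an assumption, as are the function names and the millisecond timestamps. `scan_ranges` shows the read-side cost: one start/stop pair per bucket.

```python
import hashlib

JAVA_LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE
NUM_BUCKETS = 100          # salt(0-99)


def salt_for(metric: str, device_id: str) -> int:
    """Deterministic bucket, so a reader can recompute it from the series identity."""
    digest = hashlib.md5(f"{metric}|{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def row_key(metric: str, device_id: str, epoch_millis: int) -> str:
    """salt + (Long.MAX_VALUE - timestamp) + metric + device_id."""
    reversed_ts = JAVA_LONG_MAX - epoch_millis
    return f"{salt_for(metric, device_id):02d}_{reversed_ts:019d}_{metric}_{device_id}"


def scan_ranges(start_millis: int, end_millis: int) -> list:
    """One (start_row, stop_row) pair per bucket covering [start_millis, end_millis]."""
    lo = JAVA_LONG_MAX - end_millis    # newest timestamp has the smallest reversed value
    hi = JAVA_LONG_MAX - start_millis
    # stop_row is exclusive, hence hi + 1.
    return [(f"{b:02d}_{lo:019d}", f"{b:02d}_{hi + 1:019d}") for b in range(NUM_BUCKETS)]


key = row_key("com.cpu.usage", "server-0001", 1717280311807)
ranges = scan_ranges(1717280000000, 1717281000000)
start_row, stop_row = ranges[int(key[:2])]
assert start_row <= key < stop_row
```

A random salt would spread writes just as well but would force every point lookup to probe all 100 buckets, which is why the sketch derives the bucket from the series identity.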

3. Messaging / Chat System (WhatsApp-like)

Use Case: Billions of messages, fetch conversation between two users

Pattern: Two tables (Inbox + Sent)

Table: messages_inbox

RowKey	CF:m : Qualifier	Value
user123#user456#9999999999	m:20251130_183000	"Hey!"
user123#user789#9999999988	m:20251130_182900	"How are you?"

Table: messages_sent (same structure, reverse user order)

Query: Conversation between A & B
→ Scan both tables with prefix user123#user456# and user456#user123# → merge in app
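The app-side merge can be sketched as follows. Each scan is modelled as a list of (row key, text) pairs already sorted by row key, as an HBase scan would return them. `MAX_TS`, `message_key`, and `merge_conversation` are hypothetical names; the 10-digit reversed timestamp is inferred from the example row keys.

```python
import heapq

MAX_TS = 9_999_999_999  # assumed ceiling, matching the 10-digit suffix in the example keys


def message_key(owner: str, peer: str, epoch_seconds: int) -> str:
    """owner#peer#reversed_timestamp, so a prefix scan returns newest messages first."""
    return f"{owner}#{peer}#{MAX_TS - epoch_seconds:010d}"


def merge_conversation(inbox_rows, sent_rows):
    """Merge two scans (each sorted by row key) into one newest-first list of texts."""
    def reversed_ts(row):
        return row[0].rsplit("#", 1)[1]
    return [text for _, text in heapq.merge(inbox_rows, sent_rows, key=reversed_ts)]


inbox = [(message_key("user123", "user456", 1764527400), "Hey!")]
sent = [
    (message_key("user456", "user123", 1764527460), "Hi, what's up?"),
    (message_key("user456", "user123", 1764527340), "You there?"),
]
assert merge_conversation(inbox, sent) == ["Hi, what's up?", "Hey!", "You there?"]
```

Because both inputs arrive sorted, `heapq.merge` streams the result without loading either scan fully into memory.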

4. E-commerce Order History (Amazon-style)

Use Case: Fast lookup of all orders for a user + order details

Table: orders

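RowKey	Column Family : Qualifier	Value
user_000001234_79748869	o:status	"shipped"
	o:total	299.99
	i:item1	{"id": "B08XYZ", "qty": 2}
	i:item2	{"id": "A01ABC", "qty": 1}

CF o holds order info; CF i holds line items.
RowKey pattern: user_{padded_id}_{reverse_date}, where reverse_date = 99999999 - yyyymmdd (20251130 → 79748869)
→ A prefix scan on user_{padded_id}_ returns that user's orders newest first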
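A small sketch of the key builder, assuming reverse_date is computed as 99999999 - yyyymmdd and the user ID is padded to 9 digits as in the example row. The function name `order_row_key` is hypothetical.

```python
from datetime import date


def order_row_key(user_id: int, order_date: date) -> str:
    """user_{padded_id}_{reverse_date}: later dates produce smaller suffixes."""
    yyyymmdd = order_date.year * 10000 + order_date.month * 100 + order_date.day
    reverse_date = 99_999_999 - yyyymmdd
    return f"user_{user_id:09d}_{reverse_date:08d}"


nov_30 = order_row_key(1234, date(2025, 11, 30))
nov_29 = order_row_key(1234, date(2025, 11, 29))

assert nov_30 == "user_000001234_79748869"
assert nov_30 < nov_29  # the more recent order sorts first within the user's prefix
```

With day granularity, two orders placed by the same user on the same day would share a row key; appending an order ID to the key is one possible way to keep them apart.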

5. Secondary Indexing Patterns (2025 – No More Pain)

Old way: Maintain index tables by hand and keep them in sync from application code
2025 way: Use Phoenix (SQL layer) secondary indexes, or build indexes with coprocessors

Phoenix Example (Best in 2025):

CREATE TABLE user_events (
  user_id VARCHAR,
  event_type VARCHAR,
  ts BIGINT,
  payload VARCHAR
  CONSTRAINT pk PRIMARY KEY (user_id, event_type, ts)
);

-- Create secondary index (stored in separate HBase table automatically)
CREATE INDEX idx_event_type ON user_events (event_type) INCLUDE (payload);

-- Now you can query fast:
SELECT * FROM user_events WHERE event_type = 'purchase' AND ts > 1735603200000;
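To see why the indexed query avoids a full scan, here is a conceptual model in plain Python. It is not Phoenix's actual byte encoding. It assumes only that the index rows lead with the indexed column, followed by the remaining primary-key columns, with the INCLUDE column carried along. The sample rows are made up for illustration.

```python
import bisect

# Data table: primary key (user_id, event_type, ts) -> payload. Sample rows are invented.
data_table = {
    ("u1", "purchase", 1735700000000): "order-1",
    ("u1", "view", 1735700001000): "page-9",
    ("u2", "purchase", 1735500000000): "order-2",
    ("u3", "purchase", 1735800000000): "order-3",
}

# Index rows: indexed column first, then the rest of the primary key, plus the covered payload.
index_table = sorted(
    (event_type, user_id, ts, payload)
    for (user_id, event_type, ts), payload in data_table.items()
)


def query_by_event_type(event_type: str, min_ts_exclusive: int) -> list:
    """Seek to the first index row for event_type, then read until the prefix changes."""
    start = bisect.bisect_left(index_table, (event_type,))
    results = []
    for et, user_id, ts, payload in index_table[start:]:
        if et != event_type:
            break  # left the contiguous block for this event_type
        if ts > min_ts_exclusive:
            results.append((user_id, et, ts, payload))
    return results


rows = query_by_event_type("purchase", 1735603200000)
assert rows == [
    ("u1", "purchase", 1735700000000, "order-1"),
    ("u3", "purchase", 1735800000000, "order-3"),
]
```

The query touches only the contiguous block of index rows for 'purchase' and never reads the data table, because every selected column is present in the index.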

6. Anti-Patterns – Never Do These in 2025

Anti-Pattern	Why It Fails Hard	Fix
RowKey = sequential timestamp	All writes → one region → hotspot	Salt + reverse timestamp
One column family per data type	Flushes and compactions are triggered per region, so many CFs amplify I/O	Keep to 1–3 CFs where possible
Storing large blobs (>10MB) in cell	Kills performance	Store in HDFS, ref in HBase
Using HBase as a queue	No FIFO guarantee	Use Kafka
No salting on high-velocity data	Single region meltdown	Always salt
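The first and last rows of the table can be illustrated with a toy simulation. It assumes a hypothetical table pre-split into 10 regions and a round-robin salt, and counts how many of 1,000 consecutive writes land in each region.

```python
import bisect
from collections import Counter

# Hypothetical pre-split: 10 regions with boundaries at "10", "20", ..., "90".
SPLIT_POINTS = [f"{i}0" for i in range(1, 10)]


def region_of(row_key: str) -> int:
    """Index of the region whose key range contains row_key."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)


def writes_per_region(row_keys) -> Counter:
    return Counter(region_of(k) for k in row_keys)


timestamps = [1764504000000 + i for i in range(1000)]

# Anti-pattern: the raw timestamp as row key. Every key starts with "17".
sequential = writes_per_region(str(ts) for ts in timestamps)

# Fix: a 2-digit salt prefix (round-robin here, purely for illustration).
salted = writes_per_region(f"{i % 100:02d}_{ts}" for i, ts in enumerate(timestamps))

assert sequential == Counter({1: 1000})                  # one region takes every write
assert all(salted[r] == 100 for r in range(10))          # load is spread evenly
```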

7. Production Schema Template (Copy-Paste Ready)

# Table: user_activity_log
RowKey: {2-digit-salt}_{Long.MAX_VALUE - ts}_{user_id}
Column Families:
  - d   → data (high churn: clicks, views)
  - m   → metadata (low churn: device, ip)
  - c   → counters (atomic increments)

# Table: user_profile
RowKey: user_{padded_10_digit_id}
Column Families:
  - i   → info (name, email, phone)
  - s   → settings (json blob)
  - t   → tags (multi-value: premium, eu, blocked)
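For the salted user_activity_log table, split points can be derived from the salt range so each region starts out owning an equal share of buckets. A minimal sketch; the function name `split_points` and the choice of 20 regions are assumptions.

```python
def split_points(num_buckets: int = 100, num_regions: int = 20) -> list:
    """Row-key boundaries that give every region the same number of salt buckets."""
    if num_buckets % num_regions != 0:
        raise ValueError("num_regions must divide num_buckets evenly")
    step = num_buckets // num_regions
    # N regions need N - 1 boundaries; the first region starts at the empty key.
    return [f"{b:02d}" for b in range(step, num_buckets, step)]


points = split_points()
assert len(points) == 19
assert points[:3] == ["05", "10", "15"]
```

A list like this could be supplied as the SPLITS option of the HBase shell's create command when the table is first created.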

8. Tools You Actually Use in 2025 for HBase Schema

Tool	Purpose	Status
Phoenix	SQL + secondary indexes	Default
HappyBase / hbase-thrift	Legacy Python access via the Thrift gateway	Rare
HBase Shell	Quick checks	Still used
OpenTSDB	Time-series on HBase	Strong
JanusGraph	Graph on HBase	Growing

One-Click Lab – Run All These Schemas Today

# Full HBase 2.5 + Phoenix 5.2 cluster with example schemas pre-loaded
docker run -d -p 16010:16010 -p 8765:8765 --name hbase-schema-lab \
  grokstream/hbase-phoenix-demo:2025

# Access:
# HBase Master UI: http://localhost:16010
# Phoenix Query Server (thin JDBC driver): jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF
# Try: sqlline-thin.py http://localhost:8765

Final 2025 HBase Schema Wisdom

Rule	Example
RowKey design = 90% of performance	Salt + reverse time + entity ID
Keep column families few (ideally 1–3, never more than 5)	info, data, meta
Use Phoenix for secondary indexes	Don’t roll your own
Prefer tall-narrow tables	Many narrow rows > few very wide rows
Always pre-split high-velocity tables	100+ regions at creation

You now design HBase schemas like the top 1% of big data engineers at Meta, Uber, and TikTok.
