Module 183

HBase Schema Design – Real-World Production Patterns (2025 Edition)

These patterns are typical of large production HBase deployments of the kind run at companies such as Meta, Uber, Pinterest, Xiaomi, TikTok, and JPMorgan.

Golden Rule of HBase Schema Design (2025)

Tall-Narrow > Wide-Flat
→ Millions of rows with few columns > few rows with millions of columns

1. User Profile / Activity Feed (Meta, Pinterest, TikTok)

Use Case: Store user profile + last 10K actions (posts, likes, comments)

| Component | Design Choice | Example RowKey | Column Family : Qualifier | Value |
|---|---|---|---|---|
| RowKey | user_id (fixed-width padded) | 0000012345 | | |
| CF: info | Static/slow-changing data | | info:name | "Alice" |
| | | | info:email | "a@x.com" |
| CF: activity | Time-series events, newest first | | activity:20251130_1845_click | post:998877 |
| | | | activity:20251130_1830_like | post:112233 |
| CF: counters | Fast increment (likes_count, followers_count) | | counters:followers | 154321 |

Why it works:

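  • Single Get → profile plus the entire recent activity (everything lives in one row)
  • Get on row 0000012345 with a column limit → last N actions, provided the qualifier leads with a reverse timestamp so the newest events sort first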
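A minimal sketch of how the padded row key and a reverse-timestamp qualifier could be built. The helper names, the 10-digit padding width, and the `Long.MAX_VALUE - epoch_ms` qualifier layout are assumptions for illustration, not a format prescribed by this module.

```python
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def profile_row_key(user_id, width=10):
    """Zero-pad the numeric user id so keys sort numerically as strings."""
    return f"{user_id:0{width}d}"


def activity_qualifier(epoch_ms, action):
    """Lead with a reversed, fixed-width timestamp so newer events sort first."""
    return f"{LONG_MAX - epoch_ms:019d}_{action}"


# Hypothetical sample events (epoch milliseconds, UTC).
events = [
    (1764527400000, "like"),   # 2025-11-30 18:30 UTC
    (1764528300000, "click"),  # 2025-11-30 18:45 UTC
]

# HBase stores qualifiers in ascending byte order; sorted() mimics that here.
qualifiers = sorted(activity_qualifier(ts, action) for ts, action in events)

print(profile_row_key(12345))
print(qualifiers[0])  # the most recent event comes first
```

Because the reversed value shrinks as time advances, the newest event always has the smallest qualifier, so reading the first N columns of the row yields the last N actions.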

2. Time-Series / IoT / Metrics (OpenTSDB Style – Used by Uber, Xiaomi)

Use Case: 1 billion metrics per day, 2-year retention

Design: RowKey = metric_name + timestamp (epoch seconds, UTC) + device_id

| Example RowKey | CF:data : Qualifier | Value |
|---|---|---|
| com.cpu.usage#1764503940#server-0001 | data:2025-11-30T11:59:00 | 82.1 |
| com.cpu.usage#1764504000#server-0001 | data:2025-11-30T12:00:00 | 78.3 |

Better 2025 Design (Salt + Reverse Timestamp)
To avoid hotspotting on latest data:

RowKey = salt(0–99) + (Long.MAX_VALUE - timestamp) + metric + device_id
→ 07_9223370319574464000_com.cpu.usage_server-0001

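Result: Writes spread evenly across all RegionServers. The trade-off is that a time-range read must scan every salt bucket and merge the results, so derive the salt deterministically (for example, from a hash of metric and device_id) rather than at random.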
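The key layout above can be sketched as follows. This is an illustration under stated assumptions: the CRC32-based salt, the `|` separator inside the hash input, and the function names are hypothetical choices, and timestamps are taken to be epoch milliseconds.

```python
import zlib

LONG_MAX = 2**63 - 1   # same value as Java's Long.MAX_VALUE
SALT_BUCKETS = 100     # matches salt(0-99) in the layout above


def salted_row_key(metric, device_id, epoch_ms):
    """Build salt + reversed timestamp + metric + device_id."""
    # Deterministic salt: the same series always lands in the same bucket,
    # so a reader can recompute it instead of guessing.
    salt = zlib.crc32(f"{metric}|{device_id}".encode("utf-8")) % SALT_BUCKETS
    reverse_ts = LONG_MAX - epoch_ms  # newer data gets a smaller value
    return f"{salt:02d}_{reverse_ts:019d}_{metric}_{device_id}"


def scan_prefixes():
    """Prefixes a time-range reader has to scan: one per salt bucket."""
    return [f"{salt:02d}_" for salt in range(SALT_BUCKETS)]


key = salted_row_key("com.cpu.usage", "server-0001", 1717280311807)
print(key)
print(len(scan_prefixes()))
```

The fixed-width formatting (`02d`, `019d`) matters: without zero padding, string ordering of the keys would no longer match numeric ordering.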

3. Messaging / Chat System (WhatsApp-like)

Use Case: Billions of messages, fetch conversation between two users

Pattern: Two tables (Inbox + Sent)

Table: messages_inbox

| RowKey | CF:m : Qualifier | Value |
|---|---|---|
| user123#user456#9999999999 | m:20251130_183000 | "Hey!" |
| user123#user789#9999999988 | m:20251130_182900 | "How are you?" |

Table: messages_sent (same structure, reverse user order)

Query: Conversation between A & B
→ Scan both tables with prefix user123#user456# and user456#user123# → merge in app
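The "merge in app" step could look roughly like this. It is a sketch that stands in for real scans with in-memory lists; the helper names are hypothetical, and it assumes the last key component is a reversed sequence number, so a smaller value means a newer message and each scan returns rows newest first.

```python
import heapq


def conversation_prefixes(user_a, user_b):
    """Row-key prefixes for the two directions of one conversation."""
    return f"{user_a}#{user_b}#", f"{user_b}#{user_a}#"


def reverse_seq(row):
    """Extract the reversed sequence number from a (row_key, text) pair."""
    row_key, _text = row
    return int(row_key.rsplit("#", 1)[1])


def merge_conversation(rows_a, rows_b):
    """Merge two scans that are each already sorted by row key."""
    return list(heapq.merge(rows_a, rows_b, key=reverse_seq))


# Hypothetical scan results, one list per prefix, in HBase key order.
rows_a = [
    ("user123#user456#9999999997", "See you"),
    ("user123#user456#9999999999", "Hey!"),
]
rows_b = [
    ("user456#user123#9999999998", "Hi Alice"),
]

for row_key, text in merge_conversation(rows_a, rows_b):
    print(row_key, text)
```

`heapq.merge` streams the two sorted inputs instead of concatenating and re-sorting them, which keeps memory flat when a conversation is long.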

4. E-commerce Order History (Amazon-style)

Use Case: Fast lookup of all orders for a user + order details

Table: orders

| RowKey | CF:o (order info) | CF:i (items) |
|---|---|---|
| user_000001234_79748869 (99999999 - 20251130) | o:status = "shipped" | i:item1 = {"id": "B08XYZ", "qty": 2} |
| | o:total = 299.99 | i:item2 = {"id": "A01ABC", "qty": 1} |

RowKey pattern: user_{padded_id}_{reverse_date}, where reverse_date = 99999999 - yyyyMMdd
→ Each user's orders cluster together, newest first
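One way to derive the reverse_date component, assuming it is computed as 99999999 minus the date written as yyyyMMdd. That formula, the 9-digit user id padding, and the function name are assumptions for illustration; the module does not define them explicitly.

```python
from datetime import date


def order_row_key(user_id, order_date):
    """Build user_{padded_id}_{reverse_date} so newer orders sort first."""
    yyyymmdd = int(order_date.strftime("%Y%m%d"))
    reverse_date = 99999999 - yyyymmdd  # later dates give smaller values
    return f"user_{user_id:09d}_{reverse_date:08d}"


newer = order_row_key(1234, date(2025, 11, 30))
older = order_row_key(1234, date(2025, 11, 29))
print(newer)
print(newer < older)  # newer orders sort ahead of older ones
```

A prefix scan on `user_000001234_` then returns that user's orders with the most recent date first, and a row limit gives the latest N.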

5. Secondary Indexing Patterns (2025 – No More Pain)

Old way: Hand-maintain duplicate index tables from application code
2025 way: Use Phoenix (SQL layer), which creates and maintains the index tables for you, or a coprocessor-based secondary index

Phoenix Example (Best in 2025):

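CREATE TABLE user_events (
  user_id VARCHAR NOT NULL,
  event_type VARCHAR NOT NULL,
  ts BIGINT NOT NULL,
  payload VARCHAR,
  -- Composite primary key becomes the HBase row key, in this column order
  CONSTRAINT pk PRIMARY KEY (user_id, event_type, ts)
);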

-- Create secondary index (stored in separate HBase table automatically)
CREATE INDEX idx_event_type ON user_events (event_type) INCLUDE (payload);

-- Now you can query fast:
SELECT * FROM user_events WHERE event_type = 'purchase' AND ts > 1735603200000;

6. Anti-Patterns – Never Do These in 2025

| Anti-Pattern | Why It Fails Hard | Fix |
|---|---|---|
| RowKey = sequential timestamp | All writes → one region → hotspot | Salt + reverse timestamp |
| One column family per data type | 100+ CFs → slow compactions | Max 3–5 CFs |
| Storing large blobs (>10MB) in cell | Kills performance | Store in HDFS, ref in HBase |
| Using HBase as a queue | No FIFO guarantee | Use Kafka |
| No salting on high-velocity data | Single region meltdown | Always salt |

7. Production Schema Template (Copy-Paste Ready)

# Table: user_activity_log
RowKey: {2-digit-salt}_{Long.MAX_VALUE - ts}_{user_id}
Column Families:
  - d   → data (high churn: clicks, views)
  - m   → metadata (low churn: device, ip)
  - c   → counters (atomic increments)

# Table: user_profile
RowKey: user_{padded_10_digit_id}
Column Families:
  - i   → info (name, email, phone)
  - s   → settings (json blob)
  - t   → tags (multi-value: premium, eu, blocked)

8. Tools You Actually Use in 2025 for HBase Schema

| Tool | Purpose | Status |
|---|---|---|
| Phoenix | SQL + secondary indexes | Default |
| HappyBase / hbase-thrift | Legacy Python access via the Thrift gateway | Rare |
| HBase Shell | Quick checks | Still used |
| OpenTSDB | Time-series on HBase | Strong |
| JanusGraph | Graph on HBase | Growing |

One-Click Lab – Run All These Schemas Today

# Full HBase 2.5 + Phoenix 5.2 cluster with example schemas pre-loaded
docker run -d -p 16010:16010 -p 8765:8765 --name hbase-schema-lab \
  grokstream/hbase-phoenix-demo:2025

# Access:
# HBase Master UI: http://localhost:16010
# Phoenix Query Server (thin client): jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF
# Try: sqlline-thin.py http://localhost:8765

Final 2025 HBase Schema Wisdom

| Rule | Example |
|---|---|
| RowKey design = 90% of performance | Salt + reverse time + entity ID |
| Keep column families < 5 | info, data, meta |
| Use Phoenix for secondary indexes | Don’t roll your own |
| Prefer tall-narrow tables | Millions of rows with few columns > few rows with millions of columns |
| Always pre-split high-velocity tables | 100+ regions at creation |
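For the pre-splitting rule, a small sketch of generating split points that line up with a 2-digit salt prefix, plus the matching HBase shell `create` statement. The function names are hypothetical; the table and column family names come from the user_activity_log template in section 7, and one region per salt bucket is an assumption, not a sizing recommendation.

```python
def salt_split_points(buckets=100):
    """Split points '01'..'99' give one region per 2-digit salt bucket."""
    width = len(str(buckets - 1))
    # No split point for the first bucket: it starts at the empty start key.
    return [f"{i:0{width}d}" for i in range(1, buckets)]


def create_statement(table, families, splits):
    """Render an HBase shell create command with explicit SPLITS."""
    cf_list = ", ".join(f"'{cf}'" for cf in families)
    split_list = ", ".join(f"'{s}'" for s in splits)
    return f"create '{table}', {cf_list}, SPLITS => [{split_list}]"


splits = salt_split_points(100)
statement = create_statement("user_activity_log", ["d", "m", "c"], splits)
print(len(splits) + 1, "regions")
print(statement[:80] + "...")
```

N split points produce N + 1 regions, so 99 split points yield the 100 regions that correspond to salts 00 through 99.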

You now design HBase schemas like the top 1% of big data engineers at Meta, Uber, and TikTok.
