Statistics & Math for Data Science

Phase 2: Statistics & Math for Data Science

(Months 2–3 | 8 Weeks | 5–7 hrs/day)

Goal: Don’t just run models — understand them.
Master the math & stats behind ML, A/B tests, and causal inference.

Why?

Most DS interviews test stats intuition

Avoid p-hacking, overfitting, spurious correlations

Explain "Why did the model predict X?"

Week-by-Week Roadmap

Week	Focus	Hours
1	Descriptive Stats + Distributions	30
2	Probability & Bayes	30
3	Hypothesis Testing & p-values	35
4	Confidence Intervals & Power	30
5	A/B Testing Deep Dive	35
6	Correlation vs Causation	30
7	Linear Algebra for ML	35
8	Capstone: A/B Test Report	25

Week 1: Descriptive Statistics & Distributions

Core Concepts

Concept	Formula	Intuition
Mean	`μ = Σx / n`	Average
Median	Middle value	Robust to outliers
Variance	`σ² = Σ(x-μ)²/n`	Spread
Std Dev	`σ = √σ²`	Typical deviation
Skewness	`3(mean - median)/σ` (Pearson's approximation)	Tail direction
Kurtosis	`Σ(x-μ)⁴ / (nσ⁴)`	Tail heaviness (outlier proneness)
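
The table above can be checked numerically. A minimal sketch, assuming a small made-up array `x` with one outlier (the data is illustrative, not from any real dataset). It uses the population formulas (divide by n), plus SciPy's moment-based skewness and excess kurtosis.

```python
import numpy as np
from scipy import stats

# Made-up data with one large outlier (25) to show mean vs. median
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 25.0])

mean = x.mean()              # Σx / n
median = np.median(x)        # middle value, robust to the outlier
var_pop = x.var()            # Σ(x-μ)²/n  (ddof=0, population variance)
std_pop = x.std()            # √σ²

# Pearson's median-based approximation vs. the moment-based definition
pearson_skew = 3 * (mean - median) / std_pop
moment_skew = stats.skew(x)

# Fisher definition: a normal distribution has excess kurtosis 0
excess_kurtosis = stats.kurtosis(x)

print(f"mean={mean:.2f}, median={median:.2f}, std={std_pop:.2f}")
print(f"Pearson skew={pearson_skew:.2f}, moment skew={moment_skew:.2f}")
print(f"excess kurtosis={excess_kurtosis:.2f}")
```

The outlier pulls the mean above the median, so both skewness measures come out positive (right tail).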

Distributions

Distribution	When	PMF/PDF
Normal	Heights, errors	Bell curve
Binomial	Coin flips	`P(k) = C(n,k)p^k(1-p)^(n-k)`
Poisson	Events in time	`P(k) = λ^k e^(-λ)/k!`
Exponential	Time between events	`f(x) = λe^(-λx)`
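
To connect the formulas above to library calls, here is a small sketch that evaluates each PMF/PDF by hand and compares it with `scipy.stats`. The parameter values (`n`, `p`, `k`, `lam`, `x`) are arbitrary choices for illustration.

```python
import math
from scipy import stats

n, p, k = 10, 0.5, 3   # arbitrary illustration values
lam, x = 2.0, 1.5

# Binomial: P(k) = C(n,k) p^k (1-p)^(n-k)
binom_manual = math.comb(n, k) * p**k * (1 - p) ** (n - k)
binom_scipy = stats.binom.pmf(k, n, p)

# Poisson: P(k) = λ^k e^(-λ) / k!
pois_manual = lam**k * math.exp(-lam) / math.factorial(k)
pois_scipy = stats.poisson.pmf(k, lam)

# Exponential: f(x) = λ e^(-λx); SciPy parameterises by scale = 1/λ
expon_manual = lam * math.exp(-lam * x)
expon_scipy = stats.expon.pdf(x, scale=1 / lam)

print(binom_manual, binom_scipy)
print(pois_manual, pois_scipy)
print(expon_manual, expon_scipy)
```

Note the `scale=1 / lam` argument: passing λ directly as the scale is a common mistake.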

Practice

import numpy as np
import seaborn as sns

data = np.random.normal(100, 15, 1000)
sns.histplot(data, kde=True)
print(f"Mean: {data.mean():.1f}, Std: {data.std():.1f}")

Resources:

StatQuest: Descriptive Stats
Kaggle: Statistics Course

Week 2: Probability & Bayes’ Theorem

Key Rules

Rule	Formula
Addition	`P(A∪B) = P(A) + P(B) - P(A∩B)`
Multiplication	`P(A∩B) = P(A)P(B|A)`
Complement	`P(A') = 1 - P(A)`
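
A quick way to build intuition for these rules is to enumerate a tiny sample space. The sketch below assumes two fair dice and two events chosen only for illustration (`A`: first die is 6, `B`: sum is at least 10), and uses exact fractions so the identities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a predicate over outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o[0] == 6            # first die shows 6

def B(o):
    return o[0] + o[1] >= 10    # sum is at least 10

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_or_b = prob(lambda o: A(o) or B(o))
p_not_a = prob(lambda o: not A(o))
p_b_given_a = p_a_and_b / p_a   # definition of conditional probability

print(f"P(A∪B) = {p_a_or_b}, P(A)+P(B)-P(A∩B) = {p_a + p_b - p_a_and_b}")
print(f"P(A∩B) = {p_a_and_b}, P(A)P(B|A) = {p_a * p_b_given_a}")
print(f"P(A') = {p_not_a}, 1-P(A) = {1 - p_a}")
```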

Bayes’ Theorem

P(A|B) = [P(B|A) * P(A)] / P(B)

Example:

Spam filter:

P(Spam) = 20%

P("win" | Spam) = 80%

P("win" | Ham) = 5%
→ P(Spam | "win") = ?

p_spam = 0.2
p_win_spam = 0.8
p_win_ham = 0.05
p_win = p_win_spam * p_spam + p_win_ham * (1 - p_spam)

p_spam_win = (p_win_spam * p_spam) / p_win
print(f"P(Spam|'win') = {p_spam_win:.1%}")
# → 80.0%  (0.16 / 0.20)

Resources:

Khan Academy: Probability
3Blue1Brown: Bayes Video

Week 3: Hypothesis Testing & p-values

Framework

Null (H₀): No effect
Alternative (H₁): Effect exists
Test Statistic → p-value
α = 0.05 → reject H₀ if p < 0.05
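
The four steps above can be sketched by hand for a one-sample t-test. This is an illustration under assumptions: the `sample` values and the null mean `mu0` are made up, and the manual result is compared with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements; H₀: population mean equals mu0
sample = np.array([102.0, 98.0, 110.0, 105.0, 99.0, 107.0, 103.0, 101.0])
mu0 = 100.0
alpha = 0.05

n = sample.size
# Test statistic: distance of the sample mean from mu0 in standard errors
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test via SciPy, as a cross-check
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)

reject_h0 = p_manual < alpha
print(f"t = {t_manual:.3f}, p = {p_manual:.4f}, reject H₀: {reject_h0}")
```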

Common Tests

Test	Use
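t-test	Compare means (σ unknown, estimated from the sample; essential for small n)
z-test	Compare means or proportions (σ known, or n large enough for the normal approximation)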
Chi-square	Categorical data
ANOVA	3+ groups
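
The t-test is shown in the snippet that follows. For the last two rows of the table, here is a hedged sketch using SciPy's `chi2_contingency` and `f_oneway`. All counts and group values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Chi-square test of independence on a made-up 2x2 contingency table
# rows: variant A / variant B, columns: clicked / did not click
table = np.array([[30, 70],
                  [45, 55]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_chi2:.4f}")

# One-way ANOVA on three made-up groups
g1 = [23, 25, 27, 22]
g2 = [30, 31, 29, 33]
g3 = [24, 26, 25, 27]
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p_anova:.4f}")
```

ANOVA only says that at least one group mean differs. Finding which one requires a follow-up (post hoc) comparison.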

from scipy import stats
group_a = [25, 30, 28, 35]
group_b = [20, 22, 19, 25]
# Two-sample t-test (assumes equal variances by default)
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_val:.4f}")  # → ≈ 0.018 → reject H₀ at α = 0.05

Resources:

StatQuest: p-values
Book: Practical Statistics for Data Scientists (Ch 3–4)

Week 4: Confidence Intervals & Statistical Power

Confidence Interval (95%)

mean ± 1.96 * (σ / √n)

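import numpy as np
data = np.random.normal(100, 15, 100)
mean = data.mean()
# Standard error estimated from the sample (σ is usually unknown in practice)
se = data.std(ddof=1) / np.sqrt(len(data))
ci = (mean - 1.96*se, mean + 1.96*se)
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}]")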
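
What "95%" means is easiest to see by simulation: repeat the experiment many times and count how often the interval contains the true mean. A minimal sketch, assuming the same normal population (mean 100, σ 15, n 100); the seed and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed for reproducibility
true_mean, sigma, n, trials = 100.0, 15.0, 100, 2000

# Each row is one simulated experiment of n observations
samples = rng.normal(true_mean, sigma, size=(trials, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Does each experiment's interval contain the true mean?
covered = (means - 1.96 * ses <= true_mean) & (true_mean <= means + 1.96 * ses)
coverage = covered.mean()
print(f"Share of intervals covering the true mean: {coverage:.3f}")
```

The coverage should land close to 0.95. The 95% describes the procedure over many repetitions, not the probability that one specific interval contains the mean.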

Power = 1 - β

Probability of detecting an effect if it exists
80% power is standard

Factors:

Effect size ↑ → Power ↑
Sample size ↑ → Power ↑
α ↑ → Power ↑ (but more false positives)
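
The three factors above can be checked with a normal-approximation power formula. This is a sketch, not a replacement for G*Power: `approx_power` is a hypothetical helper for a two-sided, two-sample z-test with equal group sizes, and the effect size is Cohen's d.

```python
import numpy as np
from scipy import stats

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (equal n).

    effect_size is Cohen's d: (mean difference) / (common std dev).
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Expected value of the z statistic under the alternative
    shift = effect_size * np.sqrt(n_per_group / 2)
    # Probability the statistic lands in either rejection region
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

base = approx_power(0.5, 64)
print(f"d=0.5, n=64, α=0.05 → power ≈ {base:.2f}")
print(f"larger effect   → {approx_power(0.8, 64):.2f}")
print(f"larger sample   → {approx_power(0.5, 128):.2f}")
print(f"larger α (0.10) → {approx_power(0.5, 64, alpha=0.10):.2f}")
```

For small samples a t-based calculation gives slightly lower power than this approximation.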

Resources:

StatQuest: Power
G*Power (free software)

Week 5: A/B Testing Deep Dive

End-to-End Process

graph TD
    A[Define Metric] --> B[Random Split]
    B --> C[Run Test]
    C --> D[Check AA]
    D --> E[t-test / z-test]
    E --> F[p < 0.05?]
    F -->|Yes| G[Winner]
    F -->|No| H[Inconclusive]

Practical Example

Goal: Does new checkout button increase conversion?

Group	Users	Conversions	Rate
A (Control)	10,000	420	4.20%
B (Variant)	10,000	485	4.85%

import numpy as np
from statsmodels.stats.proportion import proportions_ztest
count = np.array([485, 420])      # conversions: variant, control
nobs = np.array([10000, 10000])   # users per group
z_stat, p_val = proportions_ztest(count, nobs)  # two-sided by default
print(f"p-value: {p_val:.4f}")  # → ≈ 0.027 → significant at α = 0.05
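
A p-value alone does not say how large the effect is. The sketch below computes the absolute difference, the relative lift, and a 95% Wald interval for the difference in conversion rates from the table above. The unpooled standard error is one common choice, not the only one.

```python
import numpy as np
from scipy import stats

conv_a, n_a = 420, 10_000   # control
conv_b, n_b = 485, 10_000   # variant

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a            # absolute difference in conversion rate
lift = diff / p_a           # relative lift over control

# Unpooled standard error of the difference (Wald interval)
se_diff = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = stats.norm.ppf(0.975)   # ≈ 1.96
ci_low, ci_high = diff - z * se_diff, diff + z * se_diff

print(f"Difference: {diff:.2%}, lift: {lift:.1%}")
print(f"95% CI for the difference: [{ci_low:.2%}, {ci_high:.2%}]")
```

The interval excludes zero, which agrees with the z-test, but its lower end is close to zero, so the true gain could be small.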

Resources:

Google A/B Testing Course (free)
Evan Miller’s Calculator (online)

Week 6: Correlation ≠ Causation

Common Pitfalls

Example	Correlation (illustrative)	Causation?
Ice cream sales ↑ → Shark attacks ↑	0.9	No (both caused by summer)
Storks → Babies	0.8	No (both in rural areas)
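
The ice cream example can be reproduced with simulated data. In this sketch the coefficients, noise, and seed are all invented: `temperature` drives both variables, they never influence each other, and the correlation largely disappears once temperature is regressed out of both.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n = 5_000

# Confounder: temperature drives both variables; they never affect each other
temperature = rng.normal(size=n)
ice_cream = 2.0 * temperature + rng.normal(size=n)
shark_attacks = 1.5 * temperature + rng.normal(size=n)

raw_corr = np.corrcoef(ice_cream, shark_attacks)[0, 1]

def residualize(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left after controlling for temperature
partial_corr = np.corrcoef(
    residualize(ice_cream, temperature),
    residualize(shark_attacks, temperature),
)[0, 1]

print(f"Raw correlation: {raw_corr:.2f}")
print(f"After controlling for temperature: {partial_corr:.2f}")
```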

Tools to Infer Causation

Method	Use
RCT	Gold standard
Propensity Score Matching	Observational
Difference-in-Differences	Policy changes
Instrumental Variables	Natural experiments
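
As one concrete case from the table, Difference-in-Differences reduces to simple arithmetic on four group means. The numbers below are hypothetical, `did_estimate` is an illustrative helper, and the estimate is only causal if both groups would have followed parallel trends without the policy.

```python
def did_estimate(pre_treated, post_treated, pre_control, post_control):
    """Difference-in-differences on four group means.

    Assumes parallel trends: without treatment, the treated group
    would have changed by the same amount as the control group.
    """
    change_treated = post_treated - pre_treated
    change_control = post_control - pre_control
    return change_treated - change_control

# Hypothetical average outcomes before and after a policy change
effect = did_estimate(
    pre_treated=10.0, post_treated=14.0,
    pre_control=9.0, post_control=11.0,
)
print(f"Estimated treatment effect: {effect:.1f}")  # (14-10) - (11-9) = 2.0
```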

Resources:

Causal Inference Book (free PDF)
StatQuest: Correlation vs Causation

Week 7: Linear Algebra for ML

Why It Matters

ML Concept	Linear Algebra
Features	Vectors
Dataset	Matrix
Weights	Vector
Prediction	Dot product
PCA	Eigenvectors

Key Operations

import numpy as np
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])
x = np.dot(A, b)        # Matrix-vector product
eigvals, eigvecs = np.linalg.eig(A)  # Eigendecomposition (PCA applies this to a covariance matrix)
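
To tie the operations above back to the "Why It Matters" table, here is a minimal sketch of a dot-product prediction and of PCA via the eigenvectors of a covariance matrix. The data is simulated and the `weights` are made up. `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
# Dataset as a matrix: 200 rows (samples) x 2 columns (correlated features)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Prediction = dot product of each feature vector with a weight vector
weights = np.array([0.5, -1.0])  # made-up weights
predictions = X @ weights

# PCA: eigendecomposition of the covariance matrix of centered data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # returns ascending eigenvalues

# Sort so the first component explains the most variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Project the data onto the principal components
scores = X_centered @ eigvecs
print(f"Explained variance ratio: {explained.round(3)}")
```

After projection the components are uncorrelated: the covariance matrix of `scores` is diagonal, with the eigenvalues on the diagonal.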

Resources:

3Blue1Brown: Essence of Linear Algebra
MIT 18.06 (free)

Week 8: Capstone – A/B Test Report

Deliverable: `ab_test_report.pdf`

# A/B Test: New Checkout Button

## Hypothesis
H₀: Conversion rate same  
H₁: Conversion rates differ (two-sided)

## Results
| Group | n | Conversions | Rate |
|-------|----|--------------|------|
| A     | 10,000 | 420 | 4.20% |
| B     | 10,000 | 485 | 4.85% |

- **Lift**: +15.5%  
- **p-value**: 0.027 (two-sided z-test)  
- **95% CI** (difference in rates): [0.07, 1.23] percentage points  
- **Power**: ≈ 60% at the observed effect size  
→ **Reject H₀**

## Recommendation
Roll out new button → **≈ +6,500 conversions/year**, assuming 1,000,000 users/year at the observed +0.65 pp lift

GitHub Repo: yourname/ab-test-capstone

Daily Schedule

Time	Task
9–10 AM	Watch video (StatQuest / 3B1B)
10 AM–12 PM	Code + solve 10 problems
1–3 PM	Read book chapter
3–4 PM	Explain concept aloud
4–5 PM	Apply to dataset

Practice Problems (Solve 100+)

Platform	Link
StrataScratch	stratascratch.com
DataCamp	Stats Track
HackerRank	SQL + Stats
LeetCode	Medium SQL

Assessment: Can You Explain?

Question	Yes/No
Why is p < 0.05 not proof?	☐
Bayes: P(A|B) vs P(B|A)	☐
95% CI interpretation	☐
t-test vs z-test	☐
Matrix multiplication in NN	☐

All Yes → You passed Phase 2!

Free Resources Summary

Topic	Link
StatQuest	youtube.com/c/joshstarmer
3Blue1Brown	youtube.com/c/3blue1brown
Khan Academy	khanacademy.org
Practical Stats Book	PDF
A/B Calculator	evanmiller.org/ab-testing

Pro Tips

Teach it → record yourself explaining p-values
Use real data → analyze your own A/B test
Build a cheat sheet → stats_cheat_sheet.pdf
Interview prep → “Explain t-test in 2 mins”

Next: Phase 3 – Data Visualization

You understand the why → now show it.

Start Today:

Watch StatQuest: Mean, Variance, Std Dev
Open Jupyter:

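import numpy as np
data = np.random.normal(100, 15, 1000)
# Half-width of the 95% CI for the mean, using the sample standard error
half_width = 1.96 * data.std(ddof=1) / np.sqrt(len(data))
print(f"Mean: {data.mean():.1f}, 95% CI for the mean: [{data.mean()-half_width:.1f}, {data.mean()+half_width:.1f}]")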

Tag me when you finish your A/B report!
You now think like a Data Scientist.
