Statistics & Math for Data Science

Phase 2: Statistics & Math for Data Science

(Months 2–3 | 8 Weeks | 5–7 hrs/day)

Goal: Don’t just run models — understand them.
Master the math & stats behind ML, A/B tests, and causal inference.

Why?

  • Most DS interviews test stats intuition
  • Avoid p-hacking, overfitting, spurious correlations
  • Explain "Why did the model predict X?"

Week-by-Week Roadmap

| Week | Focus | Hours |
|------|-------|-------|
| 1 | Descriptive Stats + Distributions | 30 |
| 2 | Probability & Bayes | 30 |
| 3 | Hypothesis Testing & p-values | 35 |
| 4 | Confidence Intervals & Power | 30 |
| 5 | A/B Testing Deep Dive | 35 |
| 6 | Correlation vs Causation | 30 |
| 7 | Linear Algebra for ML | 35 |
| 8 | Capstone: A/B Test Report | 25 |

Week 1: Descriptive Statistics & Distributions

Core Concepts

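| Concept | Formula | Intuition |
|---------|---------|-----------|
| Mean | μ = Σx / n | Average |
| Median | Middle value of the sorted data | Robust to outliers |
| Variance | σ² = Σ(x-μ)²/n | Spread |
| Std Dev | σ = √σ² | Typical deviation |
| Skewness | Σ(x-μ)³/(nσ³); rough proxy: (mean - median)/σ | Tail direction |
| Kurtosis | Σ(x-μ)⁴/(nσ⁴) | Heavy tails, outlier proneness |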
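
To make the table concrete, here is a minimal sketch that computes each quantity with the Python standard library. The `describe` helper and the sample values are hypothetical, and the skewness and kurtosis use the moment-based definitions, with 3 subtracted so that a normal distribution scores 0 on kurtosis.

```python
import math
import statistics

def describe(values):
    """Return the summary statistics from the table for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    variance = sum((x - mean) ** 2 for x in values) / n  # population variance
    std = math.sqrt(variance)
    # Moment-based skewness and excess kurtosis (both 0 for a normal distribution)
    skew = sum((x - mean) ** 3 for x in values) / (n * std ** 3)
    excess_kurtosis = sum((x - mean) ** 4 for x in values) / (n * std ** 4) - 3
    return {
        "mean": mean,
        "median": median,
        "variance": variance,
        "std": std,
        "skew": skew,
        "excess_kurtosis": excess_kurtosis,
    }

sample = [2, 3, 3, 4, 4, 4, 5, 5, 40]  # hypothetical data with one large outlier
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The single outlier pulls the mean well above the median and produces a positive skew, which is the "robust to outliers" point in the table.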

Distributions

| Distribution | When | PMF/PDF |
|--------------|------|---------|
| Normal | Heights, errors | Bell curve |
| Binomial | Coin flips | P(k) = C(n,k)p^k(1-p)^(n-k) |
| Poisson | Events in time | P(k) = λ^k e^(-λ)/k! |
| Exponential | Time between events | f(x) = λe^(-λx) |
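
The formulas in the table translate almost directly into code. The sketch below uses only the standard library; the function names are illustrative, not part of any package.

```python
import math

def binomial_pmf(k, n, p):
    """P(k successes in n trials), each with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(k events) when events arrive at an average rate of lam per interval."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def exponential_pdf(x, lam):
    """Density of the waiting time x between events arriving at rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

p_three_heads = binomial_pmf(3, 10, 0.5)  # exactly 3 heads in 10 fair flips
p_no_events = poisson_pmf(0, 2.0)         # no events when 2 are expected
print(f"P(3 heads in 10 flips) = {p_three_heads:.4f}")
print(f"P(0 events | λ = 2)    = {p_no_events:.4f}")
print(f"f(0.5 | λ = 2)         = {exponential_pdf(0.5, 2.0):.4f}")
```

A quick sanity check on any PMF is that the probabilities over all possible k sum to 1.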

Practice

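import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 1,000 draws from a normal distribution with mean 100 and std dev 15
data = np.random.normal(100, 15, 1000)
sns.histplot(data, kde=True)
plt.show()  # needed when running outside a notebook
print(f"Mean: {data.mean():.1f}, Std: {data.std():.1f}")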


Week 2: Probability & Bayes’ Theorem

Key Rules

RuleFormula
AdditionP(A∪B) = P(A) + P(B) - P(A∩B)
MultiplicationP(A∩B) = P(A)P(B|A)
ComplementP(A') = 1 - P(A)
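
One way to convince yourself of these rules is to enumerate a small sample space. The sketch below checks all three on two fair dice using exact fractions; the events `first_is_six` and `total_at_least_ten` are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(event):
    """Exact probability of an event given as a predicate on (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_six(o):
    return o[0] == 6

def total_at_least_ten(o):
    return o[0] + o[1] >= 10

p_a = prob(first_is_six)
p_b = prob(total_at_least_ten)
p_and = prob(lambda o: first_is_six(o) and total_at_least_ten(o))
p_or = prob(lambda o: first_is_six(o) or total_at_least_ten(o))
p_b_given_a = p_and / p_a  # definition of conditional probability

print(f"Addition:       {p_or} == {p_a + p_b - p_and}")
print(f"Multiplication: {p_and} == {p_a * p_b_given_a}")
print(f"Complement:     {prob(lambda o: not first_is_six(o))} == {1 - p_a}")
```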

Bayes’ Theorem

P(A|B) = [P(B|A) * P(A)] / P(B)

Example:

Spam filter:

  • P(Spam) = 20%
  • P("win" | Spam) = 80%
  • P("win" | Ham) = 5%

→ P(Spam | "win") = ?

p_spam = 0.2
p_win_spam = 0.8
p_win_ham = 0.05
# Law of total probability: "win" can appear in spam or in ham
p_win = p_win_spam * p_spam + p_win_ham * (1 - p_spam)

p_spam_win = (p_win_spam * p_spam) / p_win
print(f"P(Spam|'win') = {p_spam_win:.1%}")
# → 80.0%


Week 3: Hypothesis Testing & p-values

Framework

  1. Null (H₀): No effect
  2. Alternative (H₁): Effect exists
  3. Test statistic → p-value
  4. α = 0.05 → reject H₀ if p < 0.05
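
The framework above can be sketched as a small simulation that shows what α = 0.05 means in practice: when H₀ is true, about 5% of experiments still reject it. This is an illustrative sketch, assuming a one-sample z-test with a known σ; the helper `z_test_p_value` and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
rng = random.Random(42)  # fixed seed so the run is repeatable

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean equals mu0 (sigma known)."""
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Simulate many experiments in which H0 is actually true (the mean really is 100).
n_experiments = 2000
false_positives = 0
for _ in range(n_experiments):
    sample = [rng.gauss(100, 15) for _ in range(30)]
    if z_test_p_value(sample, mu0=100, sigma=15) < ALPHA:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"Rejected a true H0 in {false_positive_rate:.1%} of experiments")
```

The rejection rate should land near α, which is why a single p < 0.05 is evidence rather than proof.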

Common Tests

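| Test | Use |
|------|-----|
| t-test | Compare means (small n, σ unknown) |
| z-test | Compare means (large n or known σ) |
| Chi-square | Categorical data |
| ANOVA | 3+ groups |

from scipy import stats
group_a = [25, 30, 28, 35]
group_b = [20, 22, 19, 25]
# Two-sample t-test (two-sided, equal variances assumed by default)
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_val:.4f}")  # ≈ 0.018 → reject H₀ at α = 0.05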
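
To see what the t-test computes under the hood, the sketch below rebuilds the test statistic by hand for the same two groups, assuming the pooled-variance (equal variances) form that `ttest_ind` uses by default. It stops at the statistic; turning it into a p-value requires the t distribution, which the standard library does not provide.

```python
import math
from statistics import fmean, variance

group_a = [25, 30, 28, 35]
group_b = [20, 22, 19, 25]

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2  # degrees of freedom for the pooled test

# statistics.variance is the sample variance (divides by n - 1)
pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (fmean(group_a) - fmean(group_b)) / std_err

print(f"t = {t_stat:.2f} with {dof} degrees of freedom")  # t = 3.22 with 6 degrees of freedom
```

The statistic is the difference in means measured in units of its standard error, so a larger gap or a smaller spread both push it further from zero.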

Resources:

  • StatQuest: p-values
  • Book: Practical Statistics for Data Scientists (Ch 3–4)

Week 4: Confidence Intervals & Statistical Power

Confidence Interval (95%)

mean ± 1.96 * (σ / √n)

import numpy as np
data = np.random.normal(100, 15, 100)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))  # standard error estimated from the sample
ci = (mean - 1.96*se, mean + 1.96*se)
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}]")
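
The 1.96 in the formula is the 97.5th percentile of the standard normal distribution. The sketch below derives it and wraps the interval in a reusable helper; `mean_confidence_interval` is a hypothetical name, and the normal approximation it relies on is an assumption that suits large samples (for small n, a t critical value is more appropriate).

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(len(sample))
    center = fmean(sample)
    return center - margin, center + margin

z_95 = NormalDist().inv_cdf(0.975)
print(f"z for 95% confidence: {z_95:.2f}")

rng = random.Random(7)  # fixed seed so the run is repeatable
sample = [rng.gauss(100, 15) for _ in range(200)]  # simulated data
low, high = mean_confidence_interval(sample)
print(f"95% CI: [{low:.1f}, {high:.1f}]")
```

Passing `confidence=0.99` widens the interval, since more confidence requires a larger critical value.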

Power = 1 - β

Probability of detecting an effect if it exists.
80% power is the conventional target.

Factors:

  • Effect size ↑ → Power ↑
  • Sample size ↑ → Power ↑
  • α ↑ → Power ↑ (at the cost of more false positives)
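
These three relationships can be checked numerically. The sketch below assumes a two-sided, two-sample z-test with a standardized effect size (Cohen's d), which is a normal approximation; the function names are hypothetical and a t-based tool such as G*Power will give slightly larger sample sizes.

```python
import math
from statistics import NormalDist

_norm = NormalDist()

def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability that the test statistic lands in either rejection region
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)

def sample_size_per_group(effect_size, power=0.80, alpha=0.05):
    """Smallest n per group that reaches the requested power (approximate)."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / effect_size) ** 2)

print(f"Power at d=0.5, n=32:         {power_two_sample(0.5, 32):.2f}")
print(f"Power at d=0.5, n=64:         {power_two_sample(0.5, 64):.2f}")
print(f"Power at d=0.8, n=32:         {power_two_sample(0.8, 32):.2f}")
print(f"Power at d=0.5, n=32, α=0.10: {power_two_sample(0.5, 32, alpha=0.10):.2f}")
print(f"n per group for 80% power at d=0.5: {sample_size_per_group(0.5)}")  # → 63
```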

Resources:

  • StatQuest: Power
  • G*Power (free software)

Week 5: A/B Testing Deep Dive

End-to-End Process

graph TD
    A[Define Metric] --> B[Random Split]
    B --> C[Run Test]
    C --> D[A/A sanity check]
    D --> E[t-test / z-test]
    E --> F[p < 0.05?]
    F -->|Yes| G[Winner]
    F -->|No| H[Inconclusive]
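
The "Random Split" and sanity-check steps can be sketched as follows. This is one possible approach, not a prescribed one: it assumes users are bucketed by hashing their ID so that each user always sees the same variant, and it checks the resulting split for a sample ratio mismatch. The names `assign_variant`, `sample_ratio_p_value`, and the experiment label are hypothetical.

```python
import hashlib
import math
from statistics import NormalDist

def assign_variant(user_id, experiment="checkout_button"):
    """Deterministically map a user to 'A' or 'B' with roughly equal odds."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

def sample_ratio_p_value(n_a, n_b):
    """Two-sided p-value for H0: traffic is split 50/50 (normal approximation)."""
    n = n_a + n_b
    z = (n_a - n / 2) / math.sqrt(n * 0.25)
    return 2 * (1 - NormalDist().cdf(abs(z)))

users = [f"user-{i}" for i in range(10_000)]  # hypothetical user IDs
groups = [assign_variant(u) for u in users]
n_a, n_b = groups.count("A"), groups.count("B")
srm_p = sample_ratio_p_value(n_a, n_b)

print(f"A: {n_a}, B: {n_b}, sample-ratio p-value: {srm_p:.3f}")
```

A very small sample-ratio p-value suggests the assignment or logging is broken, in which case the test result should not be trusted regardless of its own p-value.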

Practical Example

Goal: Does new checkout button increase conversion?

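| Group | Users | Conversions | Rate |
|-------|-------|-------------|------|
| A (Control) | 10,000 | 420 | 4.20% |
| B (Variant) | 10,000 | 485 | 4.85% |

import numpy as np
from statsmodels.stats.proportion import proportions_ztest
count = np.array([485, 420])      # conversions: variant, control
nobs = np.array([10000, 10000])   # users per group
z_stat, p_val = proportions_ztest(count, nobs)  # two-sided by default
print(f"p-value: {p_val:.4f}")  # ≈ 0.027 → significant at α = 0.05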
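
For intuition, the same two-proportion z-test can be worked by hand with the standard library. This sketch assumes a two-sided test with a pooled standard error for the statistic and an unpooled one for the confidence interval, which is a common textbook convention.

```python
import math
from statistics import NormalDist

conv_a, n_a = 420, 10_000  # control
conv_b, n_b = 485, 10_000  # variant

rate_a, rate_b = conv_a / n_a, conv_b / n_b
diff = rate_b - rate_a

# Test statistic: under H0 both groups share one rate, so pool them
pooled = (conv_a + conv_b) / (n_a + n_b)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z_stat = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Confidence interval for the difference: use each group's own rate
se_unpooled = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
ci_low, ci_high = diff - 1.96 * se_unpooled, diff + 1.96 * se_unpooled
lift = diff / rate_a

print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # z = 2.21, p = 0.027
print(f"Absolute difference: {diff:.2%} (95% CI {ci_low:.2%} to {ci_high:.2%})")
print(f"Relative lift: {lift:.1%}")
```

The interval excludes zero but comes close to it, so the result is significant at α = 0.05 without being overwhelming.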

Resources:

  • Google A/B Testing Course (free)
  • Evan Miller’s Calculator (online)

Week 6: Correlation ≠ Causation

Common Pitfalls

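| Example | Correlation (illustrative) | Causation? |
|---------|----------------------------|------------|
| Ice cream sales ↑ → Shark attacks ↑ | 0.9 | No (both driven by summer weather) |
| Storks → Babies | 0.8 | No (both more common in rural areas) |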
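
The ice cream example can be reproduced with simulated data. In this sketch temperature is the hidden confounder: it drives both series, which never influence each other. All coefficients and noise levels are invented for illustration, and `pearson` and `residuals` are small hypothetical helpers.

```python
import math
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(ys, xs):
    """What is left of y after removing a least-squares line on x."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
temperature = [rng.uniform(10, 35) for _ in range(500)]        # the confounder
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]   # depends only on temperature
shark_attacks = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream, shark_attacks)
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(shark_attacks, temperature))

print(f"Raw correlation:             {raw_r:.2f}")
print(f"After removing temperature:  {partial_r:.2f}")
```

The raw correlation is strong, but once the effect of temperature is removed from both series the remaining correlation is close to zero.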

Tools to Infer Causation

| Method | Use |
|--------|-----|
| RCT | Gold standard |
| Propensity Score Matching | Observational |
| Difference-in-Differences | Policy changes |
| Instrumental Variables | Natural experiments |
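
As a taste of how one of these methods works, here is the arithmetic behind difference-in-differences. The numbers are hypothetical, and the estimate is only valid under the parallel-trends assumption, meaning both groups would have changed by the same amount without the policy.

```python
# Hypothetical average weekly sales before and after a policy change
treated_before, treated_after = 100.0, 130.0  # region that got the policy
control_before, control_after = 90.0, 105.0   # comparable region that did not

treated_change = treated_after - treated_before  # policy effect plus background trend
control_change = control_after - control_before  # background trend only

# Subtracting the control group's change removes the shared trend
did_estimate = treated_change - control_change
naive_estimate = treated_change  # before/after comparison that ignores the trend

print(f"Naive before/after estimate:   {naive_estimate:+.1f}")  # +30.0
print(f"Difference-in-differences:     {did_estimate:+.1f}")    # +15.0
```

The naive comparison credits the policy with growth that the control region also experienced, so it overstates the effect by the size of the shared trend.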


Week 7: Linear Algebra for ML

Why It Matters

| ML Concept | Linear Algebra |
|------------|----------------|
| Features | Vectors |
| Dataset | Matrix |
| Weights | Vector |
| Prediction | Dot product |
| PCA | Eigenvectors |
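
The first four rows of the table fit in a few lines of plain Python. This is a minimal sketch of a linear model's prediction step; the feature values, weights, and bias are made up, and in practice NumPy would replace the hand-written `dot` and `matvec` helpers.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    """Matrix-vector product: one dot product per row."""
    return [dot(row, vector) for row in matrix]

# Dataset as a matrix: each row is one example's feature vector
# (hypothetical features: square feet, bedrooms)
X = [
    [1200.0, 3.0],
    [800.0, 2.0],
    [1500.0, 4.0],
]
weights = [0.25, 10.0]  # one weight per feature
bias = 5.0

predictions = [y + bias for y in matvec(X, weights)]
print(predictions)  # [335.0, 225.0, 420.0]
```

Each prediction is a single dot product between a row of the dataset and the weight vector, which is why the whole batch reduces to one matrix-vector product.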

Key Operations

import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])
x = np.dot(A, b)        # Matrix-vector product → [17, 39]
eigvals, eigvecs = np.linalg.eig(A)  # Eigen-decomposition (PCA applies it to a covariance matrix)
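
To connect eigenvectors to PCA, the sketch below finds the first principal component of a tiny 2-D dataset: center the data, build the covariance matrix, and extract its dominant eigenvector. It uses power iteration in plain Python as a stand-in for a library eigen-solver; the data points and the `top_eigenpair` helper are hypothetical.

```python
import math
from statistics import fmean

# Hypothetical 2-D data lying close to the line y = x
points = [(2.0, 1.9), (0.5, 0.7), (3.1, 3.0), (1.2, 1.1), (4.0, 4.2), (2.6, 2.4)]

# 1. Center the data
mean_x = fmean(p[0] for p in points)
mean_y = fmean(p[1] for p in points)
centered = [(x - mean_x, y - mean_y) for x, y in points]

# 2. Sample covariance matrix (symmetric 2x2)
n = len(centered)
c_xx = sum(x * x for x, _ in centered) / (n - 1)
c_yy = sum(y * y for _, y in centered) / (n - 1)
c_xy = sum(x * y for x, y in centered) / (n - 1)
cov = [[c_xx, c_xy], [c_xy, c_yy]]

def apply(matrix, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [matrix[0][0] * v[0] + matrix[0][1] * v[1],
            matrix[1][0] * v[0] + matrix[1][1] * v[1]]

def top_eigenpair(matrix, steps=50):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    v = [1.0, 0.0]
    for _ in range(steps):
        w = apply(matrix, v)
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    mv = apply(matrix, v)
    eigenvalue = v[0] * mv[0] + v[1] * mv[1]  # Rayleigh quotient
    return eigenvalue, v

# 3. The dominant eigenvector is the first principal component
eigenvalue, first_component = top_eigenpair(cov)
explained = eigenvalue / (c_xx + c_yy)  # share of total variance captured

print(f"First component: [{first_component[0]:.2f}, {first_component[1]:.2f}]")
print(f"Variance explained: {explained:.1%}")
```

Because the points hug the diagonal, the first component points along it and captures nearly all of the variance.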


Week 8: Capstone – A/B Test Report

Deliverable: ab_test_report.pdf

# A/B Test: New Checkout Button

## Hypothesis
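H₀: Conversion rates are the same  
H₁: Conversion rates differ (two-sided test)

## Results
| Group | n | Conversions | Rate |
|-------|----|--------------|------|
| A     | 10,000 | 420 | 4.20% |
| B     | 10,000 | 485 | 4.85% |

- **Lift**: +15.5% relative (+0.65 percentage points)  
- **p-value**: 0.027  
- **95% CI for the difference**: [0.07, 1.23] percentage points  
- **Power**: ≈ 60% (post hoc, for the observed effect at this sample size)  
→ **Reject H₀** at α = 0.05

## Recommendation
Roll out new button → **+6,500 conversions/year** (assuming about 1,000,000 users/year)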

GitHub Repo: yourname/ab-test-capstone


Daily Schedule

| Time | Task |
|------|------|
| 9–10 AM | Watch video (StatQuest / 3B1B) |
| 10 AM–12 PM | Code + solve 10 problems |
| 1–3 PM | Read book chapter |
| 3–4 PM | Explain concept aloud |
| 4–5 PM | Apply to dataset |

Practice Problems (Solve 100+)

| Platform | Link / Focus |
|----------|--------------|
| StrataScratch | stratascratch.com |
| DataCamp | Stats Track |
| HackerRank | SQL + Stats |
| LeetCode | Medium SQL |

Assessment: Can You Explain?

| Question | Yes/No |
|----------|--------|
| Why is p < 0.05 not proof? | |
| Bayes: P(A\|B) vs P(B\|A) | |
| 95% CI interpretation | |
| t-test vs z-test | |
| Matrix multiplication in NN | |

All Yes → You passed Phase 2!


Free Resources Summary

| Topic | Link |
|-------|------|
| StatQuest | youtube.com/c/joshstarmer |
| 3Blue1Brown | youtube.com/c/3blue1brown |
| Khan Academy | khanacademy.org |
| Practical Stats Book | PDF |
| A/B Calculator | evanmiller.org/ab-testing |

Pro Tips

  1. Teach it → record yourself explaining p-values
  2. Use real data → analyze your own A/B test
  3. Build a cheat sheet → stats_cheat_sheet.pdf
  4. Interview prep → “Explain t-test in 2 mins”

Next: Phase 3 – Data Visualization

You understand the why → now show it.


Start Today:

  1. Watch StatQuest: Mean, Variance, Std Dev
  2. Open Jupyter:

import numpy as np
data = np.random.normal(100, 15, 1000)
margin = 1.96 * 15 / np.sqrt(len(data))  # uses the known σ = 15
print(f"Mean: {data.mean():.1f}, 95% CI for the mean: [{data.mean()-margin:.1f}, {data.mean()+margin:.1f}]")

Tag me when you finish your A/B report!
You now think like a Data Scientist.