Machine Learning Core

Goal: Build & Evaluate Models Like a Pro

Focus: Scikit-learn + Real Projects

Why?

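Most DS roles expect hands-on model building & evaluation

A strong AUC you can explain on a real project = a concrete interview talking point

Kaggle Top 20% = a credible portfolio signal for more senior interviews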

Week-by-Week Roadmap

Week	Focus	Hours
1–2	Regression (Linear + Logistic)	60
3–4	Classification (Trees, SVM, KNN)	60
5–6	Model Evaluation & Cross-Validation	60
7–8	Ensemble Methods (RF, XGBoost)	60
9–10	Hyperparameter Tuning & Pipelines	60
11–12	Capstone: 2 Kaggle Competitions	80

Tools Setup (Day 1)

pip install scikit-learn pandas numpy matplotlib seaborn xgboost optuna kaggle

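# config.py
# Import this module before importing the kaggle package so it can read the credentials.
# Placeholders only: never commit real keys to version control.
import os
os.environ['KAGGLE_USERNAME'] = 'yourname'
os.environ['KAGGLE_KEY'] = 'yourkey'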

Week 1–2: Regression Deep Dive

1. Linear Regression

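import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assumes X_train, X_test, y_train, y_test come from an earlier train/test split
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Take the square root explicitly: the squared=False argument was
# deprecated and then removed in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")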

Project: House Prices

Goal: RMSE < 25,000 → Top 20%

2. Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Column 1 of predict_proba holds the probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]

print(f"AUC: {roc_auc_score(y_test, y_prob):.4f}")

Project: Titanic

Goal: AUC > 0.85 → Top 10%

Resources:

Andrew Ng ML Course (Weeks 1–3) – coursera.org
Hands-On ML – Ch 2–4

Week 3–4: Classification Algorithms

Algorithm	Use Case	Code
Decision Tree	Interpretable	`DecisionTreeClassifier(max_depth=5)`
Random Forest	Robust	`RandomForestClassifier(n_estimators=100)`
SVM	Small, clean data	`SVC(kernel='rbf', probability=True)`
KNN	Simple baseline	`KNeighborsClassifier(n_neighbors=5)`

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

models = {
    'RF': RandomForestClassifier(n_estimators=200),
    'SVM': SVC(probability=True),
    'KNN': KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} AUC: {auc:.4f}")

Project: Customer Churn

Goal: F1 > 0.65

Resources:

Hands-On ML – Ch 5–6
Kaggle Intermediate ML – kaggle.com/learn/intermediate-machine-learning

Week 5–6: Model Evaluation Masterclass

Confusion Matrix

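from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# y_pred must hold class labels (model.predict), not probabilities
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()

# Per-class precision, recall and F1 in one table
print(classification_report(y_test, y_pred))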

Key Metrics

Metric	Formula	When to Use
Accuracy	(TP+TN)/(Total)	Balanced
Precision	TP/(TP+FP)	Minimize false positives
Recall	TP/(TP+FN)	Catch all positives
F1	2×(P×R)/(P+R)	Imbalanced
AUC-ROC	Area under ROC	Ranking quality
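The formulas in the table can be checked by hand against scikit-learn. Here is a minimal sketch, assuming a made-up set of ten labels and predictions (the arrays `y_true` and `y_hat` are illustrative and do not come from any project in this roadmap):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels and predictions, chosen only to keep the arithmetic easy
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# The formulas from the Key Metrics table
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"By hand: acc={accuracy:.2f} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

# The scikit-learn helpers should agree with the hand-computed values
print(f"sklearn: acc={accuracy_score(y_true, y_hat):.2f} "
      f"P={precision_score(y_true, y_hat):.2f} "
      f"R={recall_score(y_true, y_hat):.2f} "
      f"F1={f1_score(y_true, y_hat):.2f}")
```

With 3 true positives, 1 false positive, 1 false negative and 5 true negatives, accuracy is 0.80 while precision, recall and F1 are all 0.75.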

Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
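The snippet above assumes `model`, `X` and `y` already exist. As a self-contained sketch, the version below runs the same call on synthetic data from `make_classification`. The dataset, the logistic regression model and the names `X_demo`, `y_demo` and `clf` are assumptions for illustration only. Wrapping the scaler and the model in one pipeline means the scaler is refit on the training part of each fold, and `StratifiedKFold` keeps the class ratio similar across folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary problem (illustrative data only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Scaler + model in one estimator, so scaling is learned inside each fold
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified 5-fold split with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='roc_auc')

print("Per-fold AUC:", [round(float(s), 4) for s in cv_scores])
print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

The spread across folds is as informative as the mean: a large standard deviation suggests the score depends heavily on which rows land in the validation fold.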

Resources:

StatQuest videos – youtube.com/c/joshstarmer

Week 7–8: Ensemble Power (RF + XGBoost)

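import xgboost as xgb

# Recent XGBoost releases expect early_stopping_rounds in the constructor;
# passing it to fit() was deprecated and later removed
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='auc',
    early_stopping_rounds=50
)
# X_val, y_val: a validation split carved out of the training data (assumed to exist).
# Early stopping on the test set would leak it into model selection.
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          verbose=False)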

Project: Porto Seguro

Goal: Gini > 0.28 → Top 5%
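The Porto Seguro leaderboard reports a normalized Gini coefficient rather than AUC. For a binary target (and ignoring ties in the scores), the two are related by Gini = 2 × AUC − 1, so a Gini of 0.28 corresponds to an AUC of about 0.64. A minimal sketch of that conversion follows; the helper name `normalized_gini` and the toy arrays are assumptions for illustration, not part of any library.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def normalized_gini(y_true, y_score):
    """Gini coefficient derived from ROC AUC for a binary target."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


# Toy labels and scores (made up for this example)
y_true_demo = np.array([0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])

auc_demo = roc_auc_score(y_true_demo, y_score_demo)
gini_demo = normalized_gini(y_true_demo, y_score_demo)
print(f"AUC: {auc_demo:.4f}  Gini: {gini_demo:.4f}")

# Inverting the relation: the AUC implied by the Gini > 0.28 goal
target_auc = (0.28 + 1.0) / 2.0
print(f"Gini 0.28 corresponds to AUC {target_auc:.2f}")
```

Tracking AUC during cross-validation is therefore enough; the Gini value only needs to be computed when comparing against the leaderboard.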

Week 9–10: Pipelines & Hyperparameter Tuning

Scikit-learn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import xgboost as xgb

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    # handle_unknown='ignore' avoids errors when a CV fold or the test set
    # contains a category that was not seen during fitting
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', xgb.XGBClassifier())
])
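To see the preprocessing and the model work as a single estimator, here is a small runnable sketch. It assumes a tiny hand-made DataFrame (the rows are invented, not real Titanic data) and swaps `LogisticRegression` in for XGBoost so that it depends only on scikit-learn and pandas. The column names match the feature lists above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows for illustration only
df = pd.DataFrame({
    'age': [22, 38, 26, 35, 54, 2, 27, 14],
    'fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    'sex': ['male', 'female', 'female', 'female', 'male', 'male', 'female', 'female'],
    'embarked': ['S', 'C', 'S', 'S', 'S', 'S', 'S', 'C'],
    'survived': [0, 1, 1, 1, 0, 0, 1, 1],
})

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']

demo_pipeline = Pipeline([
    ('prep', ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ])),
    # Stand-in model: LogisticRegression instead of XGBClassifier
    ('model', LogisticRegression(max_iter=1000)),
])

# One fit call learns the scaling, the encoding and the model together
demo_pipeline.fit(df[numeric_features + categorical_features], df['survived'])

# 'Q' never appeared in training; handle_unknown='ignore' encodes it as all zeros
new_passenger = pd.DataFrame(
    {'age': [30], 'fare': [8.05], 'sex': ['male'], 'embarked': ['Q']}
)
encoded = demo_pipeline.named_steps['prep'].transform(new_passenger)
proba = demo_pipeline.predict_proba(new_passenger)[:, 1]

print("Encoded shape:", encoded.shape)
print(f"Predicted survival probability: {proba[0]:.3f}")
```

Because the raw DataFrame goes in and a probability comes out, the same object can be passed to `cross_val_score` or `GridSearchCV` without any separate preprocessing step.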

Hyperparameter Tuning

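# Grid Search
from sklearn.model_selection import GridSearchCV, cross_val_score
# The model__ prefix routes the parameter to the pipeline step named 'model'
param_grid = {'model__max_depth': [3, 5, 7]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)

# Optuna (often faster than an exhaustive grid)
import optuna
import xgboost as xgb
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        # Name the trial parameter after the XGBoost argument so that
        # study.best_params can be passed straight back to XGBClassifier
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3)
    }
    # This objective bypasses the pipeline, so X must already be numeric
    model = xgb.XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)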
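Once a search has finished, the useful outputs are the winning parameters, their cross-validated score and the refitted estimator. The sketch below shows how to read them from `GridSearchCV`. It assumes synthetic data and a `DecisionTreeClassifier` stand-in (so it runs with scikit-learn alone); the names `search_pipeline` and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search_pipeline = Pipeline([
    ('scale', StandardScaler()),
    # Stand-in for the XGBoost step; it also exposes a max_depth parameter
    ('model', DecisionTreeClassifier(random_state=0)),
])

search = GridSearchCV(
    search_pipeline,
    {'model__max_depth': [3, 5, 7]},
    cv=5,
    scoring='roc_auc',
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best CV AUC: {search.best_score_:.4f}")

# With the default refit=True, best_estimator_ is already retrained on all data
best_model = search.best_estimator_
print("Prediction for first row:", best_model.predict(X_demo[:1])[0])
```

An Optuna study exposes the equivalent information as `study.best_params` and `study.best_value`, but it does not refit a model for you.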

Week 11–12: Capstone – Kaggle Top 20% in 2 Comps

Project 1: House Prices

Feature Engineering: TotalSF, Age, HasPool
Model: XGBoost + Optuna
Target: RMSE < 0.12 (log scale)
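The "log scale" qualifier matters: the House Prices competition scores RMSE between the logarithm of the predicted price and the logarithm of the actual price, so errors are relative rather than in dollars. A minimal sketch of that metric follows. The helper name `rmse_log` and the three example prices are assumptions made up for illustration.

```python
import numpy as np


def rmse_log(y_true, y_pred):
    """RMSE between log(predicted) and log(actual) prices."""
    log_diff = np.log(np.asarray(y_pred, dtype=float)) - np.log(np.asarray(y_true, dtype=float))
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Made-up sale prices and predictions
actual = np.array([200_000, 150_000, 320_000])
predicted = np.array([210_000, 140_000, 300_000])
print(f"Log-scale RMSE: {rmse_log(actual, predicted):.4f}")

# Predicting every house exactly 10% too high gives log(1.1), whatever the price
uniform_error = rmse_log(actual, actual * 1.10)
print(f"Uniform +10% error: {uniform_error:.4f}")
```

A score of 0.12 therefore means predictions are typically off by roughly 12 to 13% of the sale price. In practice this usually means training on `np.log1p(SalePrice)` and converting predictions back with `np.expm1` before writing the submission.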

Project 2: Santander Customer Transaction

Anonymized features → PCA + XGBoost
Target: AUC > 0.90

Deliverables (GitHub: yourname/ml-core-capstone)

ml-core-capstone/
├── house_prices/
│   ├── notebook.ipynb
│   ├── submission.csv (RMSE: 0.118)
│   └── model.pkl
├── santander/
│   ├── notebook.ipynb
│   └── submission.csv (AUC: 0.902)
└── README.md

`README.md` (Hiring Manager Magnet)

# ML Core Capstone: Kaggle Top 20%

## House Prices (RMSE: 0.118 – Top 18%)
- Feature eng: TotalSF, Age, Neighborhood encoding
- XGBoost + Optuna (50 trials)
- Cross-validation: 5-fold

## Santander (AUC: 0.902 – Top 15%)
- PCA on 200 anon features
- Early stopping + class weights

**Tech**: Scikit-learn, XGBoost, Optuna, Pandas  
**Live**: [kaggle.com/yourname](https://www.kaggle.com/yourname)

Interview Prep: Can You Answer?

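Question	Your Answer
"Explain overfitting"	High train score, low test score → detect with CV, reduce with regularization, simpler models or more data
"AUC vs Accuracy"	AUC measures ranking across all thresholds; accuracy can mislead on imbalanced data
"Why XGBoost?"	Gradient boosting + built-in regularization, strong on tabular data
"Pipeline benefits"	Reproducible; preprocessing is fit inside each CV fold, which prevents leakage
"Optuna vs GridSearch"	Optuna samples adaptively (TPE by default) instead of trying every grid point, so it usually needs fewer trials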

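The overfitting answer is easier to give with a concrete picture in mind. Here is a sketch, assuming noisy synthetic data from `make_classification` (the dataset, the 20% label noise and the `results` dictionary are choices made for this example): an unrestricted decision tree memorizes the training set, while a depth-limited tree cannot.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random, so some noise is unlearnable
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train acc={results[depth][0]:.3f}, "
          f"test acc={results[depth][1]:.3f}")
```

The unrestricted tree reaches a training accuracy of 1.0 because it fits the flipped labels too; the gap between its train and test accuracy is the signature of overfitting that cross-validation is meant to expose.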
Assessment: Can You Do This?

Task	Yes/No
Build end-to-end pipeline	☐
Achieve AUC > 0.85 on Titanic	☐
Tune XGBoost with Optuna	☐
Explain confusion matrix	☐
Submit Kaggle (Top 20%)	☐

All Yes → You passed Phase 4!

Free Resources Summary

Resource	Link
Andrew Ng ML	coursera.org/learn/machine-learning
Hands-On ML Book	GitHub
Kaggle Learn	kaggle.com/learn
StatQuest	youtube.com/c/joshstarmer
Optuna Docs	optuna.org

Pro Tips

Always use pipelines → no data leakage
Log everything → MLflow (next phase)
Submit early, submit often → Kaggle leaderboard
Write blogs → "How I got Top 20% with XGBoost"

Next: Phase 5 – Advanced ML & MLOps

You can build models → now deploy them.

Start Now:

kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip

Tag me when you hit Kaggle Top 20%!
You’re now a real Machine Learning engineer.
