Machine Learning Core

Goal: Build & Evaluate Models Like a Pro

Focus: Scikit-learn + Real Projects

Why?

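  • Model building and evaluation are core requirements for most DS roles
  • A validated AUC > 0.9 on a real problem is a strong portfolio signal
  • A Kaggle Top 20% finish gives you concrete results to discuss in interviews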

Week-by-Week Roadmap

Week  | Focus                                | Hours
1–2   | Regression (Linear + Logistic)       | 60
3–4   | Classification (Trees, SVM, KNN)     | 60
5–6   | Model Evaluation & Cross-Validation  | 60
7–8   | Ensemble Methods (RF, XGBoost)       | 60
9–10  | Hyperparameter Tuning & Pipelines    | 60
11–12 | Capstone: 2 Kaggle Competitions      | 80

Tools Setup (Day 1)

pip install scikit-learn pandas numpy matplotlib seaborn xgboost optuna kaggle

# config.py
# Placeholder credentials: replace with your own and keep this file out of version control.
import os
os.environ['KAGGLE_USERNAME'] = 'yourname'
os.environ['KAGGLE_KEY'] = 'yourkey'

Week 1–2: Regression Deep Dive

1. Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Take the square root of MSE directly; the squared=False argument is gone in recent scikit-learn releases
print(f"RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.2f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
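
The snippet above assumes X_train, X_test, y_train and y_test already exist. A minimal end-to-end sketch, assuming synthetic data in place of a real dataset (the coefficients and noise level below are made up for illustration), might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.3f}")
```

The RMSE should land near the noise level (0.5 here), which is a handy sanity check before moving to real data.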

Project: House Prices

Goal: RMSE < 25,000 (Top 20%)


2. Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Probability of the positive class (column 1), which is what AUC needs
y_prob = model.predict_proba(X_test)[:, 1]

print(f"AUC: {roc_auc_score(y_test, y_prob):.4f}")

Project: Titanic

Goal: AUC > 0.85 (Top 10%)

Resources:

  • Andrew Ng ML Course (Weeks 1–3) – coursera.org
  • Hands-On ML – Ch 2–4

Week 3–4: Classification Algorithms

Algorithm     | Use Case          | Code
Decision Tree | Interpretable     | DecisionTreeClassifier(max_depth=5)
Random Forest | Robust            | RandomForestClassifier(n_estimators=100)
SVM           | Small, clean data | SVC(kernel='rbf', probability=True)
KNN           | Simple baseline   | KNeighborsClassifier(n_neighbors=5)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    'RF': RandomForestClassifier(n_estimators=200),
    'SVM': SVC(probability=True),
    'KNN': KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} AUC: {auc:.4f}")

Project: Customer Churn

Goal: F1 > 0.65
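
F1 depends on the cutoff used to turn probabilities into labels, and on imbalanced data the default 0.5 is often not the best choice. The sketch below scans a grid of thresholds; it assumes synthetic data with roughly 20% positives standing in for a churn dataset, and a logistic regression standing in for whichever model you trained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 20% positives
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# Score F1 at a grid of thresholds instead of only the default 0.5
thresholds = [round(0.05 * k, 2) for k in range(2, 19)]  # 0.10 ... 0.90
f1_by_threshold = {
    t: f1_score(y_test, (y_prob >= t).astype(int), zero_division=0)
    for t in thresholds
}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
best_f1 = f1_by_threshold[best_threshold]

print(f"F1 at 0.50: {f1_by_threshold[0.5]:.3f}")
print(f"Best F1: {best_f1:.3f} at threshold {best_threshold}")
```

In a real project, pick the threshold on a validation split rather than the test set, so the reported F1 stays an honest estimate.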

Week 5–6: Model Evaluation Masterclass

Confusion Matrix

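import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# y_pred must hold class labels (model.predict), not probabilities
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()

print(classification_report(y_test, y_pred))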

Key Metrics

Metric    | Formula         | When to Use
Accuracy  | (TP+TN)/Total   | Balanced
Precision | TP/(TP+FP)      | Minimize false positives
Recall    | TP/(TP+FN)      | Catch all positives
F1        | 2×(P×R)/(P+R)   | Imbalanced
AUC-ROC   | Area under ROC  | Ranking quality
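
To see why accuracy can mislead on imbalanced data, it helps to work the formulas through once by hand. The counts below are hypothetical (1,000 predictions, 60 true positives in the data), chosen only to illustrate the arithmetic.

```python
# Hypothetical confusion-matrix counts for 1,000 predictions
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1:        {f1:.3f}")         # 0.727
```

Accuracy looks excellent at 0.97, yet a third of the positives are missed, which is exactly what recall and F1 expose.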

Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
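
Passing cv=5 with a classifier gives stratified folds without shuffling. When you want shuffled, repeatable folds, or preprocessing that is re-fit inside each fold, an explicit splitter and a pipeline can be used instead. This is a sketch on scikit-learn's bundled breast cancer dataset, chosen here only because it needs no download.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fit on each training fold only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Explicit splitter: stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```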

Resources:

  • StatQuest videos – youtube.com/c/joshstarmer


Week 7–8: Ensemble Power (RF + XGBoost)

import xgboost as xgb

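model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='auc',
    early_stopping_rounds=50  # set on the constructor; XGBoost 2.x no longer accepts it in fit()
)
# Early stopping needs a validation split carved out of the training data
# (X_val, y_val are assumed here); stopping on the test set would leak it.
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          verbose=False)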

Project: Porto Seguro

Goal: Gini > 0.28 (Top 5%)
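
The Gini target maps directly onto AUC. Assuming the competition metric is the normalized Gini coefficient, the identity Gini = 2 × AUC − 1 holds for binary labels, so you can keep optimizing AUC and convert at the end. The helper name and the toy labels and scores below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score

def normalized_gini(y_true, y_score):
    """Normalized Gini for binary labels, via the identity Gini = 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1

# Toy labels and scores (made up for illustration)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
gini = normalized_gini(y_true, y_score)
print(f"AUC: {auc:.2f}, Gini: {gini:.2f}")

# The AUC that corresponds to the Gini > 0.28 goal
target_auc = (0.28 + 1) / 2
print(f"Gini 0.28 corresponds to AUC {target_auc:.2f}")
```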


Week 9–10: Pipelines & Hyperparameter Tuning

Scikit-learn Pipeline

import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(), categorical_features)
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', xgb.XGBClassifier())
])
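
To show the pipeline running end to end, here is a small sketch on toy rows. The rows are made up, the column names follow the snippet above, and LogisticRegression is swapped in for xgb.XGBClassifier() so the example depends only on scikit-learn and pandas. It also adds handle_unknown='ignore', which the original encoder does not set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows (made up for illustration)
train = pd.DataFrame({
    'age': [22, 38, 26, 35, 54, 2, 27, 14],
    'fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    'sex': ['male', 'female', 'female', 'female',
            'male', 'male', 'female', 'female'],
    'embarked': ['S', 'C', 'S', 'S', 'S', 'S', 'S', 'C'],
})
y_train = [0, 1, 1, 1, 0, 0, 1, 1]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'fare']),
    # Unseen categories are encoded as all zeros instead of raising an error
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'embarked']),
])
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression(max_iter=1000)),  # stand-in for xgb.XGBClassifier()
])
pipeline.fit(train, y_train)

# 'Q' never appeared in training; the encoder ignores it rather than failing
new_passenger = pd.DataFrame(
    {'age': [30], 'fare': [8.05], 'sex': ['male'], 'embarked': ['Q']})
proba = pipeline.predict_proba(new_passenger)
print(proba.shape)
```

Because scaling and encoding are fitted inside pipeline.fit, the same fitted transformations are applied automatically at prediction time.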

Hyperparameter Tuning

# Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {'model__max_depth': [3, 5, 7]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)

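# Optuna (guided search; often needs fewer trials than an exhaustive grid)
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        # Name the trial parameter after the XGBoost argument so that
        # study.best_params can be passed straight back to XGBClassifier
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3)
    }
    model = xgb.XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()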

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
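
Both searches leave their result on an object: GridSearchCV exposes best_params_, best_score_ and best_estimator_, while an Optuna study exposes best_params and best_value. The sketch below reads the grid search results; it assumes the bundled breast cancer dataset and swaps a decision tree in for the XGBoost step so it runs with scikit-learn alone.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ('prep', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=0)),  # stand-in for the XGBoost step
])

# '<step name>__<parameter>' routes each grid value to the right pipeline step
param_grid = {'model__max_depth': [3, 5, 7]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)

print(grid.best_params_)
print(f"Best CV AUC: {grid.best_score_:.4f}")

# With the default refit=True, this is the pipeline re-trained on all of X, y
best_model = grid.best_estimator_
```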

Week 11–12: Capstone – Kaggle Top 20% in 2 Competitions

Project 1: House Prices

  • Feature Engineering: TotalSF, Age, HasPool
  • Model: XGBoost + Optuna
  • Target: RMSE < 0.12 (log scale)
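
The three engineered features can be sketched in a few lines of pandas. The raw column names below are assumed to match the Kaggle House Prices CSV, and the three rows are made up for illustration. The log-transformed target is what puts the RMSE on the log scale mentioned in the target.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); column names assumed to match the competition's train.csv
df = pd.DataFrame({
    'TotalBsmtSF': [856, 1262, 0],
    '1stFlrSF': [856, 1262, 920],
    '2ndFlrSF': [854, 0, 866],
    'YearBuilt': [2003, 1976, 2001],
    'YrSold': [2008, 2007, 2008],
    'PoolArea': [0, 0, 512],
    'SalePrice': [208500, 181500, 223500],
})

# Total living area: basement plus both floors
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
# Age of the house at the time of sale
df['Age'] = df['YrSold'] - df['YearBuilt']
# Binary flag: 1 if the property has any pool area
df['HasPool'] = (df['PoolArea'] > 0).astype(int)
# Train on log(1 + price) so that RMSE is measured on the log scale
df['LogPrice'] = np.log1p(df['SalePrice'])

print(df[['TotalSF', 'Age', 'HasPool', 'LogPrice']])
```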

Project 2: Santander Customer Transaction

  • Anonymized features → PCA + XGBoost
  • Target: AUC > 0.90
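
A PCA step slots into a pipeline like any other transformer. The sketch below is an assumption-heavy stand-in: it uses scikit-learn's bundled breast cancer data (30 features) instead of the Santander data, GradientBoostingClassifier instead of XGBoost, and an arbitrary choice of 10 components.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),      # PCA is sensitive to feature scale
    ('pca', PCA(n_components=10)),    # 10 components is an arbitrary choice here
    ('model', GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
])
pipeline.fit(X_train, y_train)

auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
explained = pipeline.named_steps['pca'].explained_variance_ratio_.sum()

print(f"Test AUC: {auc:.4f}")
print(f"Variance kept by PCA: {explained:.1%}")
```

It is worth comparing against the same model without the PCA step, since dimensionality reduction does not always help tree-based models.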

Deliverables (GitHub: yourname/ml-core-capstone)

ml-core-capstone/
├── house_prices/
│   ├── notebook.ipynb
│   ├── submission.csv (RMSE: 0.118)
│   └── model.pkl
├── santander/
│   ├── notebook.ipynb
│   └── submission.csv (AUC: 0.902)
└── README.md

README.md (Hiring Manager Magnet)

# ML Core Capstone: Kaggle Top 20%

## House Prices (RMSE: 0.118 – Top 18%)
- Feature eng: TotalSF, Age, Neighborhood encoding
- XGBoost + Optuna (50 trials)
- Cross-validation: 5-fold

## Santander (AUC: 0.902 – Top 15%)
- PCA on 200 anon features
- Early stopping + class weights

**Tech**: Scikit-learn, XGBoost, Optuna, Pandas  
**Live**: [kaggle.com/yourname](https://www.kaggle.com/yourname)

Interview Prep: Can You Answer?

Question               | Your Answer
"Explain overfitting"  | High train acc, low test → use CV
"AUC vs Accuracy"      | AUC robust to imbalance
"Why XGBoost?"         | Gradient boosting + regularization
"Pipeline benefits"    | Reproducible, prevents leakage
"Optuna vs GridSearch" | Bayesian, faster convergence

Assessment: Can You Do This?

Task                          | Yes/No
Build end-to-end pipeline     |
Achieve AUC > 0.85 on Titanic |
Tune XGBoost with Optuna      |
Explain confusion matrix      |
Submit Kaggle (Top 20%)       |

All Yes → You passed Phase 4!


Free Resources Summary

Resource         | Link
Andrew Ng ML     | coursera.org/learn/machine-learning
Hands-On ML Book | GitHub
Kaggle Learn     | kaggle.com/learn
StatQuest        | youtube.com/c/joshstarmer
Optuna Docs      | optuna.org

Pro Tips

  1. Always use pipelines → no data leakage
  2. Log everything → MLflow (next phase)
  3. Submit early, submit often → Kaggle leaderboard
  4. Write blogs → "How I got Top 20% with XGBoost"

Next: Phase 5 – Advanced ML & MLOps

You can build models → now deploy them.


Start Now:

kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip

Tag me when you hit Kaggle Top 20%!
You’re now a real Machine Learning engineer.