Advanced ML & MLOps

Phase 5: Advanced ML & MLOps

Goal: Production-Ready Models

Why?

By widely cited industry estimates, around 80% of ML projects never deliver value in production. MLOps skills are what put you in the remaining 20%.

$150K+ salaries for roles like "MLOps Engineer"

2025 Trends: AutoML pipelines, federated learning, edge deployment

Week-by-Week Roadmap

Week	Focus	Hours
1–2	XGBoost / LightGBM Mastery	60
3–4	Feature Engineering & NLP Basics	60
5–6	Time Series Forecasting	60
7–8	Docker & Containerization	60
9–10	MLflow / DVC + FastAPI Deployment	60
11–12	Capstone: End-to-End Fraud System	80

Tools Setup (Day 1)

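pip install xgboost lightgbm feature-engine transformers torch datasets scikit-learn pandas numpy matplotlib seaborn optuna mlflow dvc fastapi uvicorn prophet sktime streamlit
# Docker itself is installed separately (Docker Desktop or Docker Engine); `pip install docker` only adds the Python SDK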

# config.py
import os
os.environ['MLFLOW_TRACKING_URI'] = 'http://localhost:5000'

Week 1–2: XGBoost / LightGBM – Kaggle Competition Level

XGBoost vs LightGBM (2025 Comparison)

Aspect	XGBoost	LightGBM
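Speed	Fast, especially with tree_method='hist'	Often several times faster on large data (histogram-based, leaf-wise growth)
Memory	Higher on large datasets	Usually lower (histogram binning)
Accuracy	Excellent, robust defaults	Comparable; leaf-wise trees can overfit small datasets
GPU Support	Yes, native CUDA (device='cuda')	Yes, with a GPU build: OpenCL (device='gpu') or CUDA (device='cuda')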

XGBoost Example

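import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# X, y: your feature matrix and binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Separate validation split so early stopping never sees the test set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    tree_method='hist',        # default since XGBoost 2.0
    device='cuda',             # GPU (XGBoost >= 2.0); use 'cpu' without a CUDA device
    eval_metric='auc',
    early_stopping_rounds=50,  # set on the estimator in recent XGBoost, not in fit()
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.4f}")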
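
To make the early-stopping setting concrete, here is a minimal pure-Python sketch of the rule it applies: stop once the validation score has failed to improve for a fixed number of rounds, and keep the best round. The function name `best_round_with_patience` and the toy scores are hypothetical, and XGBoost's own implementation differs in detail.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stopped_at) for a higher-is-better metric such as AUC.

    Training stops at the first round where `patience` rounds have passed
    without beating the best score seen so far.
    """
    best_round, best_score = 0, float("-inf")
    for round_idx, score in enumerate(val_scores):
        if score > best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx  # stopped early
    return best_round, len(val_scores) - 1  # ran every round


# Toy validation AUCs: improvement stalls after round 3
scores = [0.70, 0.74, 0.77, 0.78, 0.78, 0.77, 0.775, 0.76]
best, stopped = best_round_with_patience(scores, patience=3)
print(best, stopped)
```

With a small learning rate such as 0.05, a generous `n_estimators` plus early stopping lets the validation data decide how many trees are actually kept.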

LightGBM Example (Faster!)

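import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Validate on held-out rows, not on the training data itself
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
train_data = lgb.Dataset(X_fit, label=y_fit)
valid_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'device': 'gpu'  # OpenCL build; use 'cuda' for a CUDA build, 'cpu' otherwise
}
model = lgb.train(params, train_data, num_boost_round=1000, valid_sets=[valid_data],
                  callbacks=[lgb.early_stopping(stopping_rounds=50)])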

Project: Kaggle: Porto Seguro

Goal: Top 5% with LightGBM GPU (metric: normalized Gini; treat 0.30 as a stretch target, since top leaderboard scores sat just below it)
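
For a binary target, the normalized Gini coefficient used to score this competition equals 2 × AUC − 1, so you can track it locally from the AUC you already compute. The sketch below uses only the standard library; the helper names and the toy labels and scores are illustrative.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. O(n_pos * n_neg), fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def normalized_gini(y_true, y_score):
    """Normalized Gini for a binary target: 2 * AUC - 1."""
    return 2.0 * auc_from_scores(y_true, y_score) - 1.0


y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
print(f"AUC={auc_from_scores(y_true, y_score):.3f}  Gini={normalized_gini(y_true, y_score):.3f}")
```

On this scale, a Gini of 0.30 corresponds to an AUC of 0.65.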

Resources:

Machine Learning Mastery: Gradient Boosting Tutorial
Kaggle Kernels: LightGBM vs XGBoost
GPU Guide: 10x Speed Tutorial

Week 3–4: Feature Engineering & NLP Basics

Feature Engineering with Feature-Engine

from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from feature_engine.creation import MathFeatures  # replaces the deprecated MathematicalCombination
from sklearn.pipeline import Pipeline

# Pipeline ('cat_var', 'num1', 'num2' are placeholder column names)
pipe = Pipeline([
    ('imputer', MeanMedianImputer(imputation_method='median')),  # numeric columns only
    ('rare', RareLabelEncoder(tol=0.05, n_categories=5)),
    ('ohe', OneHotEncoder(top_categories=5, variables=['cat_var'])),
    ('combo', MathFeatures(variables=['num1', 'num2'], func=['sum']))
])
X_transformed = pipe.fit_transform(X)
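
To show what the rare-label step does conceptually, here is a simplified standard-library sketch: categories whose relative frequency falls below `tol` are merged into one "Rare" bucket before one-hot encoding. The function `group_rare_labels` and the toy column are hypothetical, and this is not Feature-Engine's implementation (which, among other things, also honors `n_categories`).

```python
from collections import Counter


def group_rare_labels(values, tol=0.05, rare_label="Rare"):
    """Replace categories whose relative frequency is below `tol` with `rare_label`."""
    counts = Counter(values)
    total = len(values)
    frequent = {cat for cat, n in counts.items() if n / total >= tol}
    return [v if v in frequent else rare_label for v in values]


# Toy column: 'C' and 'D' each appear in fewer than 5% of the 40 rows
cities = ["A"] * 25 + ["B"] * 13 + ["C"] + ["D"]
grouped = group_rare_labels(cities, tol=0.05)
print(Counter(grouped))
```

Grouping first keeps the one-hot step from creating a column for every one-off category.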

Project: Titanic + Feature-Engine → AUC > 0.90

NLP Basics with Hugging Face

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]")  # Subset, used in the fine-tuning project below

# Use a checkpoint already fine-tuned for sentiment: plain distilbert-base-uncased
# has an untrained classification head and would return near-random scores
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
results = classifier("This movie is amazing!")
print(results)  # e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
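
The `score` in that output is a softmax probability computed from the model's raw output logits. A minimal sketch of that last step for a two-label model, with made-up logits (the values below are illustrative, not real model output):

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes [NEGATIVE, POSITIVE]
labels = ["NEGATIVE", "POSITIVE"]
probs = softmax([-3.0, 4.0])
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": labels[best], "score": round(probs[best], 3)})
```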

Project: Sentiment Analysis on Tweets → Fine-tune DistilBERT

Resources:

Feature-Engine GitHub: Examples Repo
Hugging Face LLM Course: Free 2025 Update (Now covers LLMs + NLP foundations)

Week 5–6: Time Series Forecasting

Store Item Demand Kaggle

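import numpy as np
import pandas as pd
from prophet import Prophet
from sklearn.metrics import mean_squared_error

df = pd.read_csv('train.csv')  # Kaggle dataset: date, store, item, sales
df['date'] = pd.to_datetime(df['date'])

# Prophet fits one series at a time and requires columns named 'ds' and 'y'
series = (df[(df['store'] == 1) & (df['item'] == 1)]
          .rename(columns={'date': 'ds', 'sales': 'y'})[['ds', 'y']]
          .sort_values('ds'))

# Hold out the last 90 days so the forecast can be scored against real sales
train, test = series.iloc[:-90], series.iloc[-90:]

model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.fit(train)
forecast = model.predict(test[['ds']])

rmse = np.sqrt(mean_squared_error(test['y'], forecast['yhat']))
print(f"RMSE: {rmse:.2f}")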

Advanced: XGBoost for Multi-Series

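import numpy as np
from sktime.forecasting.compose import make_reduction
from xgboost import XGBRegressor

# y_train: a single sales series (pandas Series with a date index)
forecaster = make_reduction(XGBRegressor(), window_length=90, strategy="recursive")
forecaster.fit(y_train)
y_pred = forecaster.predict(fh=np.arange(1, 91))  # steps 1..90; fh=90 alone means only step 90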
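
Reduction works by turning the series into an ordinary supervised table: each row holds the last `window_length` values and the target is the value that follows. A simplified standard-library sketch of that transformation (the helper `make_lag_table` is hypothetical; sktime's implementation handles indexes, exogenous data, and multi-step strategies):

```python
def make_lag_table(series, window_length):
    """Build (X, y) where each row of X is `window_length` consecutive values
    and y is the value that immediately follows that window."""
    if window_length < 1 or window_length >= len(series):
        raise ValueError("window_length must be between 1 and len(series) - 1")
    X, y = [], []
    for start in range(len(series) - window_length):
        X.append(series[start:start + window_length])
        y.append(series[start + window_length])
    return X, y


sales = [10, 12, 13, 15, 14, 16]
X, y = make_lag_table(sales, window_length=3)
for features, target in zip(X, y):
    print(features, "->", target)
```

Any tabular regressor can then be fit on `X` and `y`. For multi-step forecasts, the recursive strategy feeds each prediction back in as the newest lag.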

Project: Store Item Demand → beat a seasonal-naive baseline on SMAPE (this competition's metric; WRMSSE belongs to the M5 competition)
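
The Kaggle Store Item Demand challenge is scored with SMAPE (symmetric mean absolute percentage error), so it helps to compute that locally alongside RMSE. A minimal standard-library sketch, using the convention that a term counts as 0 when actual and forecast are both 0. The numbers are toy values.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200).

    A term is counted as 0 when actual and forecast are both 0.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)


print(f"{smape([100, 50, 0], [110, 40, 0]):.2f}")
```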

Resources:

Kaggle Kernels: Time Series Tutorial
GitHub Repo: Full Solution

Week 7–8: Docker for Data Science

Dockerfile for ML Project

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build & Run

docker build -t ml-app .
docker run -p 8000:8000 ml-app

Multi-Container with Docker Compose:

# docker-compose.yml
services:
  app:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - db
  db:
    image: postgres:13
    environment:
      POSTGRES_DB: ml_db
      POSTGRES_PASSWORD: change-me  # required by the postgres image; placeholder value

Resources:

YouTube Tutorial: Krish Naik: Complete Docker for DS
Towards DS Guide: Docker Basics

Week 9–10: MLflow / DVC + FastAPI Deployment

MLflow for Experiment Tracking

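import mlflow
import mlflow.xgboost

# `model` and `auc` are the XGBoost classifier and its AUC from Week 1–2
with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("auc", auc)
    # Registering under a name lets the API load the model from the registry later
    mlflow.xgboost.log_model(model, "model", registered_model_name="xgboost_model")

# In a terminal (not in Python): mlflow ui   → serves the UI at http://localhost:5000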

DVC for Data/Model Versioning

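dvc init
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
dvc remote add -d storage s3://your-bucket/dvc-store  # placeholder name and bucket; any supported remote works
dvc push   # Upload data to the remote (S3, GCS, Azure, SSH, ...)
dvc repro  # Re-run the pipeline defined in dvc.yaml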
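
The idea behind `dvc add` is that Git commits only a small `.dvc` pointer file recording a content hash of the data (MD5 by default), so a changed dataset shows up as a changed hash in version control. A standard-library sketch of that idea, with a hypothetical helper name (this is an illustration, not DVC's code):

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("id,amount\n1,9.99\n")
    v1 = file_md5(data)
    data.write_text("id,amount\n1,9.99\n2,120.00\n")  # new row => new data version
    v2 = file_md5(data)

print(v1 != v2)  # the pointer changes when the content changes
```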

FastAPI for Model API

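from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
import mlflow.pyfunc

app = FastAPI()
# Assumes a registered model "xgboost_model" with a version in the Production stage;
# newer MLflow releases favor aliases, e.g. "models:/xgboost_model@champion" (example alias)
model = mlflow.pyfunc.load_model("models:/xgboost_model/Production")

class InputData(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: InputData):
    # Shape (1, n_features): one row per request
    pred = model.predict(np.asarray([data.features]))
    return {"prediction": float(pred[0])}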
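
Once the service is running, a small client is a quick way to exercise the endpoint. The sketch below uses only the standard library and assumes the API listens on http://localhost:8000 (as in the Docker section) and accepts the `InputData` body shown above. The function names are hypothetical.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:8000"):
    """Build a POST request whose JSON body matches the InputData schema."""
    body = json.dumps({"features": [float(x) for x in features]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(features, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without sending it
request = build_predict_request([0.1, 2.5, 3])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# With the API running: print(call_predict([0.1, 2.5, 3]))
```

Keeping request construction separate from sending makes the payload easy to check in a unit test before any server exists.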

Project: Deploy XGBoost to FastAPI + Docker

Resources:

MLflow + DVC Tutorial: Experiment Tracking
FastAPI ML Deployment: GeeksforGeeks Guide

Week 11–12: Capstone – End-to-End Fraud Detection System

Repo: yourname/fraud-mlops-capstone
Stack: LightGBM + Feature-Engine + Hugging Face NLP + Prophet TS + Docker + MLflow/DVC + FastAPI

Deliverables:

Pipeline: dvc.yaml for FE + Train
API: /predict endpoint (FastAPI)
Dashboard: Streamlit for monitoring, plus the MLflow UI for experiment history
Docker: Multi-container deploy
Kaggle Submission: Top 10% on Fraud Dataset

README Snippet:

# Fraud Detection MLOps System
- **AUC: 0.95** (LightGBM + NLP features)
- **Deployed**: Docker + FastAPI
- **Tracked**: MLflow experiments + DVC data
- **Live**: http://localhost:8000/docs

Interview Prep: Key Questions

Question	Answer
"XGBoost vs LightGBM?"	LightGBM faster for large data; XGBoost more robust
"Why DVC?"	Git for code, DVC for large data/models
"FastAPI advantages?"	Async, auto-docs, Pydantic validation
"MLOps pipeline?"	FE → Train (MLflow) → Deploy (Docker/FastAPI) → Monitor

Assessment: Can You Build?

Task	Yes/No
LightGBM GPU train <5min	☐
Feature-Engine pipeline	☐
Fine-tune DistilBERT	☐
Prophet forecast RMSE <10	☐
Dockerized FastAPI API	☐
MLflow + DVC repro	☐

All Yes → Production-Ready!

Free Resources Summary

Topic	Link
XGBoost/LightGBM	Machine Learning Mastery
Feature-Engine	GitHub Examples
Hugging Face NLP	LLM Course
Time Series Kaggle	Demand Forecasting
Docker Tutorial	Krish Naik YouTube
MLflow/DVC	Tracking Guide
FastAPI Deploy	GeeksforGeeks

Pro Tips

GPU Everywhere: GPU builds of LightGBM and XGBoost can cut training time substantially on large datasets
Version Everything: DVC for data, MLflow for models
Auto-Docs: FastAPI's /docs = instant portfolio
Kaggle Compete: Submit weekly → build resume

Next: Phase 6 – Big Data & Cloud

You deploy single models → now scale to petabytes.

Start Now:

dvc init && mlflow ui

Tag me on LinkedIn with your deployed API!
You're now an MLOps Engineer.