ML Interview Cheat Sheet (Print This!)

Related:

MATH ESSENTIALS

Dot Product:     a·b = Σaᵢbᵢ = |a||b|cos(θ)
Cosine Sim:      cos(θ) = (a·b)/(|a||b|)
L2 Norm:         ||x||₂ = √(Σxᵢ²)

Bayes:           P(A|B) = P(B|A)P(A)/P(B)
Variance:        Var(X) = E[X²] - (E[X])²
Chain Rule:      d/dx f(g(x)) = f'(g(x))·g'(x)

CLASSICAL ML

Linear Reg:      ŷ = Xw + b,  L = (1/n)Σ(y-ŷ)²
Logistic:        ŷ = σ(Xw+b) = 1/(1+e⁻ᶻ)
Cross-Entropy:   L = -Σ[y log(ŷ) + (1-y)log(1-ŷ)]

L1 (Lasso):      L + λΣ|w| → Sparse weights
L2 (Ridge):      L + λΣw²  → Small weights

Entropy:         H = -Σ pᵢ log₂(pᵢ)
Gini:            G = 1 - Σ pᵢ²
Info Gain:       IG = H(parent) - Σ(|Sᵥ|/|S|)H(child)

METRICS

Precision = TP/(TP+FP)    "Of predicted +, correct?"
Recall    = TP/(TP+FN)    "Of actual +, found?"
F1        = 2PR/(P+R)     Harmonic mean

AUC-ROC:  TPR vs FPR (use for balanced)
AUC-PR:   Precision vs Recall (use for imbalanced)

NEURAL NETWORKS

Forward:         z = Wx + b,  a = σ(z)
Backprop:        ∂L/∂W = ∂L/∂a × ∂a/∂z × ∂z/∂W

ReLU:            max(0, z),     ReLU' = 1 if z>0
Sigmoid:         1/(1+e⁻ᶻ),    σ' = σ(1-σ)
Softmax:         exp(zᵢ)/Σexp(zⱼ)

Xavier Init:     W ~ N(0, 2/(nᵢₙ+nₒᵤₜ))  [tanh/sigmoid]
He Init:         W ~ N(0, 2/nᵢₙ)          [ReLU]

BatchNorm:       x̂ = (x-μ)/√(σ²+ε), y = γx̂+β
Dropout:         Zero p% of activations during training

OPTIMIZATION

SGD:             θ = θ - η∇L

Momentum:        v = βv + ∇L
                 θ = θ - ηv

Adam:            m = β₁m + (1-β₁)∇L
                 v = β₂v + (1-β₂)(∇L)²
                 θ = θ - η·m̂/√(v̂+ε)

Default Adam: β₁=0.9, β₂=0.999, ε=1e-8

CNNs

Output Size:     (W - K + 2P)/S + 1
Conv Params:     K² × Cᵢₙ × Cₒᵤₜ + Cₒᵤₜ

ResNet Skip:     y = F(x) + x  (gradient highway)
1×1 Conv:        Channel mixing, reduce dimensions

RNNs

Vanilla:         hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ)

LSTM Gates:
  Forget:        fₜ = σ(Wf[hₜ₋₁, xₜ])
  Input:         iₜ = σ(Wi[hₜ₋₁, xₜ])
  Output:        oₜ = σ(Wo[hₜ₋₁, xₜ])
  Cell:          Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ

TRANSFORMERS

Attention:       softmax(QKᵀ/√dₖ)V

Multi-Head:      Concat(head₁...headₕ)Wᴼ
                 headᵢ = Attention(QWᵢᵠ, KWᵢᴷ, VWᵢⱽ)

Positional:      PE(pos,2i) = sin(pos/10000^(2i/d))

Why scale √dₖ:   Keeps variance stable, prevents
                 softmax saturation

NLP

Word2Vec:        Predict context from word (skip-gram)
                 or word from context (CBOW)

BERT:            Bidirectional encoder, MLM + NSP
GPT:             Causal decoder, next token prediction

Perplexity:      PPL = exp(-1/N × Σlog P(wᵢ))

RECOMMENDER SYSTEMS

Matrix Factor:   R ≈ PQᵀ (user × item)

Two-Tower:       
  User Tower:    user features → user_emb (128d)
  Item Tower:    item features → item_emb (128d)
  Score:         dot(user_emb, item_emb)

Cold Start:      Use content features, popular items

ML SYSTEM DESIGN FRAMEWORK

1. CLARIFY (5 min)
   - Business goal? Constraints? Data?

2. METRICS (5 min)
   - Offline: Precision, Recall, AUC, NDCG
   - Online: CTR, conversion, engagement

3. DATA & FEATURES (10 min)
   - Sources, engineering, feature store

4. MODEL (10 min)
   - Selection, architecture, training

5. SERVING (10 min)
   - Pipeline, latency, scaling

6. MONITORING (5 min)
   - Data drift, model degradation, A/B

RAPID FIRE ANSWERS

Bias vs Variance?
→ Bias: wrong assumptions (underfit)
→ Variance: sensitive to data (overfit)

L1 vs L2?
→ L1: sparse (feature selection)
→ L2: small weights (regularization)

Bagging vs Boosting?
→ Bagging: parallel, reduce variance (RF)
→ Boosting: sequential, reduce bias (XGB)

BatchNorm vs LayerNorm?
→ BN: across batch (CNNs)
→ LN: across features (Transformers)

BERT vs GPT?
→ BERT: bidirectional encoder (understanding)
→ GPT: causal decoder (generation)

Why ReLU?
→ No vanishing gradient for z>0
→ Computationally simple
→ Sparse activations

CODE SNIPPETS

# Softmax (numerically stable)
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Cross-entropy
def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat + 1e-9))

# Self-attention
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V

Keep this handy during study sessions!

Comments

Share your approach or ask questions

0 comments
?
|
Markdown supported
Sign in to post

Loading comments...