Math Foundations (Guide)

1.1 Linear Algebra

Core Concepts to Master

Concept | What to Know | Interview Relevance
Vectors | Dot product, norm, unit vectors | Embeddings, similarity
Matrices | Multiplication, transpose, inverse | Weight matrices in NN
Eigenvalues/Eigenvectors | Definition, computation, meaning | PCA, understanding transformations
Matrix Decomposition | SVD, eigendecomposition | Dimensionality reduction, recommendations
Linear Independence | Span, basis, rank | Feature selection

Key Formulas to Memorize

Dot Product:
a · b = Σ(aᵢ × bᵢ) = |a| × |b| × cos(θ)

Cosine Similarity:
cos(θ) = (a · b) / (|a| × |b|)

L2 Norm (Euclidean):
||x||₂ = √(Σxᵢ²)

L1 Norm (Manhattan):
||x||₁ = Σ|xᵢ|
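
The vector formulas above can be checked with a few lines of plain Python. This is a minimal sketch; the helper names (`dot`, `l2_norm`, `l1_norm`, `cosine_similarity`) and the sample vectors are illustrative choices, not part of the guide.

```python
import math

def dot(a, b):
    """Dot product: sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(x):
    """Euclidean length: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))

def l1_norm(x):
    """Manhattan length: sum of absolute values."""
    return sum(abs(v) for v in x)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

a = [3.0, 4.0]
b = [4.0, 3.0]
print(dot(a, b))                # 24.0
print(l2_norm(a))               # 5.0
print(l1_norm(a))               # 7.0
print(cosine_similarity(a, b))  # 0.96
```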

Matrix Multiplication:
(AB)ᵢⱼ = Σₖ Aᵢₖ × Bₖⱼ
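
The index formula translates directly into a triple loop. The sketch below uses nested lists and a hypothetical `matmul` helper; in practice a library such as NumPy would be used instead.

```python
def matmul(A, B):
    """Multiply an n x m matrix A by an m x p matrix B (lists of rows)."""
    n, m, p = len(A), len(B), len(B[0])
    # Inner dimensions must agree: every row of A needs m entries.
    assert all(len(row) == m for row in A), "inner dimensions must match"
    return [
        [sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
        for i in range(n)
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
print(matmul(B, A))  # [[23, 34], [31, 46]]
```

The two products differ, a reminder that matrix multiplication is not commutative in general.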

Eigenvalue Equation:
Av = λv
where λ is eigenvalue, v is eigenvector
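
One way to build intuition for Av = λv is power iteration, which repeatedly applies A and renormalizes until the vector stops changing direction. The sketch below is illustrative only: it assumes a matrix with a single dominant eigenvalue and a starting vector that is not orthogonal to its eigenvector, and the function names are made up for this example.

```python
import math

def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, num_iters=100):
    """Estimate the dominant eigenvalue and a unit eigenvector of A."""
    v = [1.0] + [0.0] * (len(A) - 1)  # assumed starting vector
    for _ in range(num_iters):
        w = mat_vec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = mat_vec(A, v)
    # Rayleigh quotient v.Av; v already has unit length.
    lam = sum(x * y for x, y in zip(v, Av))
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues are 3 and 1
lam, v = power_iteration(A)
print(lam, v)  # lam is approximately 3, v approximately [0.707, 0.707]
```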

Interview Questions

Q: What is the geometric interpretation of dot product? A: The dot product measures how much two vectors point in the same direction. It equals |a||b|cos(θ), where θ is the angle between them: positive when the angle is acute, 0 when the vectors are orthogonal, and negative when the angle is obtuse.

Q: Why do we use cosine similarity instead of Euclidean distance for embeddings? A: Cosine similarity measures the angle between vectors and ignores magnitude. This matters for text embeddings, where document length shouldn't affect similarity: a longer document isn't necessarily more similar. For unit-normalized embeddings the two measures give the same ranking, since ||a - b||² = 2 - 2cos(θ).
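
A small sketch makes the magnitude point concrete. The two vectors below are made-up stand-ins for a short document and a ten-times-longer document with the same direction; the helper name is an assumption for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]  # same direction, 10x the magnitude

cos_sim = cosine_similarity(short_doc, long_doc)
euclid = math.dist(short_doc, long_doc)  # Euclidean distance (Python 3.8+)
print(cos_sim)  # approximately 1.0: identical direction
print(euclid)   # roughly 33.7: large distance despite identical direction
```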

Q: Explain SVD and its applications in ML. A: SVD decomposes a matrix as A = UΣVᵀ, where U and V are orthogonal and Σ is diagonal with non-negative singular values, conventionally sorted in descending order. Applications:

  • Dimensionality reduction (keep top k singular values)
  • Matrix completion (recommendations)
  • Noise reduction
  • Latent semantic analysis in NLP
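
The "keep top k singular values" idea can be written out explicitly. The block below is a sketch in standard notation, where u_i and v_i denote the i-th columns of U and V and r is the rank of A; the error identity is the Eckart-Young result for the Frobenius norm.

```latex
A = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top} \quad (k < r),
\qquad
\lVert A - A_k \rVert_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^{2}}
```

A_k is the best rank-k approximation of A in this norm, so the discarded singular values measure exactly how much information is lost.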

Study Resources

  • 3Blue1Brown: Essence of Linear Algebra (YouTube playlist)
  • Khan Academy: Linear Algebra course
  • Book: "Linear Algebra Done Right" by Axler (advanced)

1.2 Probability & Statistics

Core Concepts to Master

Concept | What to Know | Interview Relevance
Probability Basics | Conditional probability, independence, Bayes | Naive Bayes, probabilistic models
Distributions | Normal, Bernoulli, Binomial, Poisson | Data modeling, assumptions
Expectation & Variance | E[X], Var(X), properties | Loss functions, bias-variance
Maximum Likelihood | MLE derivation, intuition | Training objectives
Bayesian Inference | Prior, posterior, likelihood | Bayesian methods

Key Formulas to Memorize

Bayes' Theorem:
P(A|B) = P(B|A) × P(A) / P(B)
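
A worked example helps here. The sketch below applies Bayes' theorem to a hypothetical diagnostic test; the numbers (1% prevalence, 95% sensitivity, 5% false positive rate) and the function name are invented for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B), with P(B) expanded by the law of total probability."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A = has the condition, B = test is positive (made-up numbers).
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior is small.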

Chain Rule:
P(A,B,C) = P(A) × P(B|A) × P(C|A,B)

Expectation:
E[X] = Σ xᵢ × P(xᵢ)     [discrete]
E[X] = ∫ x × f(x) dx    [continuous]

Variance:
Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
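
Both forms of the variance formula can be verified on a fair six-sided die. This is a small sketch using exact fractions; the die is just a convenient example distribution.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)  # each face is equally likely

# E[X] = sum of x * P(x)
mean = sum(x * prob for x in outcomes)

# Var(X) two ways: the definition and the E[X^2] - (E[X])^2 shortcut
var_definition = sum((x - mean) ** 2 * prob for x in outcomes)
var_shortcut = sum(x * x * prob for x in outcomes) - mean ** 2

print(mean)            # 7/2
print(var_definition)  # 35/12
print(var_shortcut)    # 35/12
```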

Normal Distribution:
f(x) = (1/√(2πσ²)) × exp(-(x-μ)²/(2σ²))

Bernoulli:
P(X=1) = p, P(X=0) = 1-p
E[X] = p, Var(X) = p(1-p)

Binomial:
P(X=k) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ
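
The binomial formula is easy to sanity-check numerically. The sketch below assumes Python 3.8+ for `math.comb`; `binomial_pmf` is a name chosen for this example.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf[5])                                 # 0.24609375
print(sum(pmf))                               # probabilities sum to 1
print(sum(k * q for k, q in enumerate(pmf)))  # mean equals n * p = 5
```

With n = 1 the same function reduces to the Bernoulli case: `binomial_pmf(1, 1, p)` is p and `binomial_pmf(0, 1, p)` is 1 - p.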

Maximum Likelihood:
θ_MLE = argmax_θ Π P(xᵢ|θ)
      = argmax_θ Σ log P(xᵢ|θ)

Interview Questions

Q: Derive the MLE for the mean of a Gaussian distribution. A:

Given: x₁, x₂, ..., xₙ i.i.d. ~ N(μ, σ²), with σ² held fixed

Likelihood: L(μ) = Π (1/√(2πσ²)) × exp(-(xᵢ-μ)²/(2σ²))

Log-likelihood: ℓ(μ) = -n/2 × log(2πσ²) - Σ(xᵢ-μ)²/(2σ²)

Take derivative and set to 0:
dℓ/dμ = Σ(xᵢ-μ)/σ² = 0
⇒ Σxᵢ - nμ = 0

Solve: μ_MLE = (1/n) × Σxᵢ = sample mean
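
The result can be checked numerically by evaluating the log-likelihood at the sample mean and at nearby values. This is a sketch with a made-up three-point dataset and σ fixed at 1.

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Gaussian log-likelihood of the data for a given mean mu."""
    n = len(data)
    return (-n / 2 * math.log(2 * math.pi * sigma**2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma**2))

data = [2.0, 3.0, 7.0]               # illustrative sample
sample_mean = sum(data) / len(data)  # 4.0

# Compare the sample mean against a few nearby candidate values.
candidates = [sample_mean + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
best = max(candidates, key=lambda mu: log_likelihood(mu, data))
print(sample_mean, best)  # 4.0 4.0
```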

Q: Explain the bias-variance tradeoff. A:

Expected Error = Bias² + Variance + Irreducible Error

Bias: Error from simplifying assumptions (underfitting)
- High bias: Model too simple, misses patterns
- Example: Linear model for non-linear data

Variance: Error from sensitivity to training data (overfitting)
- High variance: Model memorizes training data
- Example: Deep tree with no regularization

Tradeoff: Increasing model complexity typically decreases bias but increases variance
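
For squared-error loss the decomposition can be stated precisely. The block below is a sketch under the usual assumptions: y = f(x) + ε with E[ε] = 0 and Var(ε) = σ_ε², a fitted model f̂ trained on a random dataset, and expectations taken over both the training data and the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma_\varepsilon^2}_{\text{Irreducible error}}
```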

Q: When would you use Bayesian vs. Frequentist approaches? A:

  • Bayesian: When you have prior knowledge, small data, need uncertainty estimates
  • Frequentist: Large data, need computational efficiency, interpretable point estimates

Study Resources

  • StatQuest: Probability playlist (YouTube)
  • Khan Academy: Statistics and Probability
  • Book: "All of Statistics" by Wasserman

1.3 Calculus & Optimization

Core Concepts to Master

Concept | What to Know | Interview Relevance
Derivatives | Chain rule, partial derivatives | Backpropagation
Gradients | Gradient vector, directional derivative | Optimization
Convexity | Convex functions, local vs global minima | Loss landscape
Optimization | Gradient descent variants | Training neural networks
Lagrange Multipliers | Constrained optimization | SVM derivation

Key Formulas to Memorize

Chain Rule:
d/dx f(g(x)) = f'(g(x)) × g'(x)

Gradient:
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

Gradient Descent Update:
θ = θ - α × ∇L(θ)
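
As a minimal sketch, the update rule can be run on a toy loss L(θ) = (θ - 3)², whose minimum is at θ = 3. The loss, learning rate, and step count are assumptions picked for illustration.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, alpha=0.1, steps=100):
    """Repeatedly apply theta = theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

theta = gradient_descent(theta=0.0)
print(theta)  # very close to 3.0
```

Each step here shrinks the distance to the minimum by a factor of 1 - 2α = 0.8, so convergence is geometric for this loss.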

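Gradient Descent with Momentum (exponential moving average form):
v = β × v + (1-β) × ∇L(θ)
θ = θ - α × v
(The classical form uses v = β × v + ∇L(θ); the two differ only by a rescaling of α.)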
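
The momentum update can be sketched on the toy loss L(θ) = (θ - 3)². The code follows the moving-average form v = β × v + (1-β) × ∇L(θ); the loss and hyperparameters are illustrative assumptions.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def momentum_descent(theta, alpha=0.1, beta=0.9, steps=500):
    """Gradient descent with an exponential moving average of gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(theta)  # smoothed gradient
        theta = theta - alpha * v
    return theta

theta_momentum = momentum_descent(theta=0.0)
print(theta_momentum)  # very close to 3.0
```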

Adam Update:
g = ∇L(θ), t = step count (starting at 1), m and v initialized to 0
m = β₁ × m + (1-β₁) × g
v = β₂ × v + (1-β₂) × g²
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)
θ = θ - α × m̂ / (√v̂ + ε)
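
The Adam equations map line by line onto code. The sketch below runs them on the toy loss L(θ) = (θ - 3)²; the loss, the learning rate of 0.1, and the step count are assumptions for illustration, while β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are the commonly used defaults.

```python
import math

def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def adam(theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment (mean of g)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (mean of g^2)
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

first_step = adam(0.0, steps=1)
final_theta = adam(0.0)
print(first_step)   # approximately 0.1
print(final_theta)  # close to 3.0
```

On the first step the bias-corrected ratio m̂/√v̂ is g/|g|, so the step size is approximately α regardless of the gradient's scale.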

Convex Function:
f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)
for all λ ∈ [0,1]
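
The inequality can be spot-checked on sample points. This is only a sketch: passing on a finite sample does not prove convexity, but a single failure disproves it. The function names and sample points are made up for this example.

```python
def satisfies_convexity(f, x, y, lam, tol=1e-12):
    """Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) at one point."""
    return f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + tol

def holds_on_samples(f, points, lams):
    """True if the inequality holds for every sampled (x, y, lam)."""
    return all(
        satisfies_convexity(f, x, y, lam)
        for x in points for y in points for lam in lams
    )

def square(x):
    return x * x   # convex

def neg_square(x):
    return -x * x  # concave, so the inequality fails somewhere

points = [-2.0, -1.0, 0.0, 1.5, 3.0]
lams = [0.0, 0.25, 0.5, 0.75, 1.0]
print(holds_on_samples(square, points, lams))      # True
print(holds_on_samples(neg_square, points, lams))  # False
```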

Interview Questions

Q: Derive the gradient of the cross-entropy loss. A:

Cross-entropy: L = -Σ yᵢ × log(ŷᵢ)

For binary classification with sigmoid:
ŷ = σ(z) = 1/(1+e⁻ᶻ)
L = -[y×log(ŷ) + (1-y)×log(1-ŷ)]

Derivative of sigmoid:
dσ/dz = σ(z) × (1-σ(z)) = ŷ(1-ŷ)

Gradient:
∂L/∂z = -[y/ŷ × ŷ(1-ŷ) - (1-y)/(1-ŷ) × ŷ(1-ŷ)]
      = -[y(1-ŷ) - (1-y)ŷ]
      = -[y - yŷ - ŷ + yŷ]
      = ŷ - y

Beautiful result: gradient is just (prediction - target)!
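
A finite-difference check is a quick way to confirm the derivation. The sketch below compares ŷ - y against a central-difference estimate of ∂L/∂z; the test points and step size h are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(z, y):
    """Binary cross-entropy as a function of the logit z."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def analytic_grad(z, y):
    """Derived result: dL/dz = y_hat - y."""
    return sigmoid(z) - y

def numeric_grad(z, y, h=1e-6):
    """Central-difference approximation of dL/dz."""
    return (bce_loss(z + h, y) - bce_loss(z - h, y)) / (2 * h)

for z, y in [(0.7, 1.0), (-1.2, 0.0), (2.0, 1.0)]:
    print(z, y, analytic_grad(z, y), numeric_grad(z, y))
```

The two columns should agree to several decimal places at each test point.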

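Q: Why does Adam often work better than vanilla SGD? A:

  • Adaptive learning rates: Each parameter's step is scaled by a running estimate of its squared gradient
  • Momentum: A running average of gradients smooths noisy updates and helps move through flat regions and saddle points
  • Bias correction: Compensates for m and v starting at zero, which biases the early moment estimates toward zero
  • Works well with sparse gradients (NLP, recommendations)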

Q: What problems can occur with gradient descent? A:

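  • Vanishing gradients: Gradients shrink toward 0 across many layers, so early-layer weights barely update
  • Exploding gradients: Gradients grow very large, making training unstable
  • Saddle points: Gradient = 0 but the point is not a minimum
  • Local minima: Optimizer settles in a suboptimal solution
  • Oscillation: Learning rate too high, so updates overshoot the minimum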
