Module 03: CNNs, RNNs, Transformers
Outcomes
- Compare CNN, RNN/LSTM, and Transformer inductive biases.
- Explain attention and implement a minimal self-attention block.
- Choose architectures based on data modality and constraints.
CNNs
Watch:
- Stanford CS231n: CNNs for Visual Recognition
  - Lecture 5 (CNNs): https://www.youtube.com/watch?v=bNb2fEVKeEo
  - Lecture 9 (Architectures): https://www.youtube.com/watch?v=DAOcjicFr1Y
- StatQuest: Neural Network series
Key Architectures to Know:
- LeNet-5 (basic)
- AlexNet (ReLU, dropout)
- VGG (small filters)
- ResNet (skip connections) ← MOST IMPORTANT
- Inception (parallel branches)
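Since ResNet's skip connection is flagged as the most important idea in this list, here is a minimal sketch of a residual block in NumPy. It assumes a 1-D feature vector and an identity shortcut (no dimension change); the weight names are illustrative, not from any library.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """Minimal ResNet-style block: output = ReLU(F(x) + x).

    The identity shortcut gives gradients a direct path around F,
    which is what makes very deep networks trainable.
    """
    out = relu(x @ W1)    # first transform + nonlinearity
    out = out @ W2        # second transform (activation applied after the add)
    return relu(out + x)  # add the skip connection, then activate
```

The key interview point: without the `+ x` shortcut, stacking many such blocks makes gradients vanish; with it, the block only needs to learn the residual F(x).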
RNNs & LSTMs
Watch:
- StatQuest: RNN/LSTM
- Andrew Ng: Sequence Models (Coursera Course 5)
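The recurrence these lectures cover can be sketched in a few lines of NumPy: the hidden state is updated as h_t = tanh(x_t W_x + h_{t-1} W_h + b). This is a vanilla RNN only (no LSTM gates); the function and weight names are illustrative.

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b):
    """Vanilla RNN over a sequence X of shape (seq_len, d_in).

    h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b)
    Returns all hidden states, shape (seq_len, d_hidden).
    """
    d_hidden = W_h.shape[0]
    h = np.zeros(d_hidden)
    states = []
    for x_t in X:  # inherently sequential: step t depends on step t-1
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)
```

Note the loop: each step depends on the previous hidden state, so computation cannot be parallelized across time, and gradients must flow through every step. This is exactly the bottleneck the Transformer section addresses.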
Transformers (CRITICAL for Meta/Google)
Watch (in order):
- Illustrated Transformer (blog + video)
- StatQuest: Transformer
  - Attention: https://www.youtube.com/watch?v=PSs6nxngL6k
  - Transformer: https://www.youtube.com/watch?v=zxQyTK8quyY
- Andrej Karpathy: Let's build GPT
Must Implement:
```python
# Self-Attention from Scratch
import numpy as np

def self_attention(Q, K, V):
    """
    Q, K, V: (seq_len, d_k)
    """
    d_k = K.shape[-1]
    # Scaled attention scores
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    # Numerically stable softmax: subtract the row max before exponentiating
    scores = scores - scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    # Weighted sum of values
    output = attention_weights @ V  # (seq_len, d_k)
    return output, attention_weights

# Multi-Head Attention
def multi_head_attention(X, num_heads, d_model):
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # In a real model these projections are learned; random here for illustration
        W_q = np.random.randn(d_model, d_k)
        W_k = np.random.randn(d_model, d_k)
        W_v = np.random.randn(d_model, d_k)
        head, _ = self_attention(X @ W_q, X @ W_k, X @ W_v)
        heads.append(head)
    # Concatenate heads back to (seq_len, d_model)
    concat = np.concatenate(heads, axis=-1)
    # Final linear projection
    W_o = np.random.randn(d_model, d_model)
    output = concat @ W_o
    return output
```
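Karpathy's "Let's build GPT" uses the decoder-only variant of the block above: causal self-attention, where position i may only attend to positions j ≤ i. A sketch of the masking step, reusing the same stable-softmax pattern (function name is illustrative):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Self-attention with a causal (lower-triangular) mask, as in GPT-style
    decoders: future positions are masked out before the softmax."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Set scores for future positions (strict upper triangle) to -inf-like
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    # Numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    return weights @ V, weights
```

A good sanity check after implementing this: each row of `weights` sums to 1, and the strict upper triangle is (numerically) zero.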
Interview checkpoints
- Explain why attention handles long contexts better than RNNs (constant path length between any two positions, fully parallel training) despite its O(n²) memory/compute cost.
- Compare CNN receptive fields vs Transformer attention patterns.
- Describe common failure modes (overfitting, instability) and fixes.
Deep dive links
- 20-ML-Core/Guide/Deep Learning (Guide)
- 20-ML-Core/Guide/Specialized Topics (Guide)
- 20-ML-Core/Concepts Tracker#Deep Learning
- 20-ML-Core/Concepts Tracker#NLP
- 20-ML-Core/Concepts Tracker#RecSys
- 20-ML-Core/ML Cheat Sheet