Machine learning concepts — equations, architectures, and estimators — each explained in a few plain sentences.
100 concepts. Regenerated daily.
Write the formula for Pearson correlation coefficient
r = Σ((xi - x̄)(yi - ȳ)) / (√Σ(xi - x̄)² * √Σ(yi - ȳ)²)
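A minimal sketch of the formula above in plain Python. The function name `pearson_r` is illustrative, and the sketch assumes neither sequence is constant (otherwise the denominator is zero).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    den_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    den_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return num / (den_x * den_y)
```

Perfectly linear data such as `pearson_r([1, 2, 3], [2, 4, 6])` gives a value of 1 up to floating-point rounding.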
Shannon's channel capacity: C = B log₂(1 + S/N) bits per second
Shannon's formula: C = B log₂(1 + S/N) defines channel capacity in bits/s
What the Y combinator does: enables recursion in languages without named functions
The Y combinator is a fixed-point combinator, Y f = f (Y f), which lets an anonymous function call itself in lambda calculus and similar functional languages
Why the L1 unit ball is a diamond shape and the L2 unit ball is a circle
The L1 unit ball |x| + |y| ≤ 1 is bounded by straight lines that meet at corners on the axes (a diamond); the L2 unit ball x² + y² ≤ 1 is a circle
What tl.arange(0, BLOCK_SIZE) creates: a range of indices within the current block
`tl.arange(0, BLOCK_SIZE)` creates a block of consecutive indices from 0 to BLOCK_SIZE-1, typically added to the block's start offset to address its elements
What label smoothing does: replaces one-hot [0,0,1,0] with [0.025, 0.025, 0.925, 0.025]
Label smoothing regularizes a model by mixing each one-hot target with the uniform distribution: y' = (1-ε)y + ε/K (the example above uses ε = 0.1, K = 4)
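A small sketch of the smoothing rule y' = (1-ε)y + ε/K, assuming ε = 0.1 as the default; the function name `smooth_labels` is illustrative.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]
```

With four classes, the true class gets 0.9 + 0.025 = 0.925 and every other class gets 0.025, so the smoothed target still sums to 1.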
What CAP theorem states: you can have at most 2 of consistency, availability, partition tolerance
CAP theorem: Consistency, Availability, Partition Tolerance; only 2 can be fully achieved simultaneously
What score matching does: learns the gradient of the log-density without normalizing
Score matching fits the model's score ∇ₓ log p(x) to that of the data; the normalizing constant does not depend on x, so it drops out of the gradient
Write the equation for cross-entropy loss
H(y, p) = -Σ(y_i * log(p_i)) for all i
What bloom filters do: probabilistically check set membership with no false negatives
Bloom filters: Efficient set membership testing with zero false negatives
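A toy Bloom filter, as a sketch only. The class name, the default sizes, and the choice of salted SHA-256 as the hash family are assumptions made for illustration, not a tuned design.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a bit array plus several salted hashes."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))
```

Every added item sets all of its bits, so a later lookup for it can never return False; an absent item can still return True if other items happened to set the same bits.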
What DDPM stands for: Denoising Diffusion Probabilistic Model
DDPM: Denoising Diffusion Probabilistic Model for generative tasks
What weight tying does in language models: shares embedding and output projection matrices
Language models use tied weights to share embedding and output projection matrices, enhancing parameter efficiency
What bank conflicts are in shared memory — multiple threads accessing the same bank
Bank conflicts arise when multiple threads in a warp access different addresses in the same shared-memory bank, so the accesses are serialized
Why the curse of dimensionality makes nearest neighbor search unreliable
In high dimensions pairwise distances concentrate: the nearest and farthest neighbors become almost equally far away, so the "nearest" neighbor carries little information
What expected calibration error (ECE) measures: gap between confidence and accuracy
ECE quantifies the discrepancy between a model's predicted confidence and its actual accuracy
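One common way to estimate ECE is to bin predictions by confidence and average the per-bin gap, weighted by bin size. The sketch below assumes equal-width bins; the function name and the `num_bins` default are illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.8 on five examples and gets four right has a gap of zero in that bin; one that says 0.9 and gets half right has a gap of 0.4.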
Write the formula for Lagrangian L(x,λ) = f(x) - λg(x)
L(x,λ) = f(x) - λg(x), where λ is the Lagrange multiplier for the constraint g(x) = 0
What mixup does: trains on convex combinations of pairs: x̃=λx_i+(1-λ)x_j
Mixup trains on convex combinations of pairs of examples and their labels: x̃=λx_i+(1-λ)x_j, ỹ=λy_i+(1-λ)y_j, with λ in [0,1] typically drawn from a Beta(α, α) distribution
Greedy vs beam search decoding: greedy picks best token, beam maintains k candidates
Greedy decoding commits to the single highest-probability token at each step, while beam search keeps the k highest-scoring partial sequences
Write the attention score formula before softmax: e_ij = a(s_i, h_j)
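Additive attention score: e_ij = a(s_i, h_j) = v^T tanh(W_s s_i + W_h h_j + b), where v is a learned vector; the exponential is applied afterwards by the softmax, α_ij = exp(e_ij) / Σ_k exp(e_ik)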
What an instrumental variable does: isolates causal effect when you can't randomize
An instrumental variable affects the treatment, is independent of unobserved confounders, and influences the outcome only through the treatment, which lets it isolate the treatment's causal effect
What AdaGrad does: divides learning rate by sqrt of sum of squared gradients
AdaGrad adapts learning rates based on historical gradients, reducing for frequently updated features
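A sketch of one AdaGrad update on plain Python lists. The function name, the `lr` default, and the small `eps` term added for numerical safety are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; accum holds the running sum of squared gradients."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g  # accumulate squared gradient per parameter
        new_accum.append(a)
        # Effective learning rate shrinks as the accumulated sum grows.
        new_params.append(p - lr * g / (math.sqrt(a) + eps))
    return new_params, new_accum
```

Calling it twice with the same gradient shows the effect: the second step is smaller than the first because the accumulated sum has doubled.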
What 300-dim word2vec encodes: trained on word co-occurrence with skip-gram window
A 300-dim word2vec vector encodes a word's typical contexts: skip-gram training learns it by predicting the words that co-occur within a window around it
How to write a fused softmax kernel in Triton: load row, compute max, subtract, exp, sum, divide
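`row = tl.load(row_ptr + cols, mask=mask, other=-float('inf')); row = row - tl.max(row, axis=0); num = tl.exp(row); out = num / tl.sum(num, axis=0); tl.store(out_ptr + cols, out, mask=mask)` (illustrative names: `cols = tl.arange(0, BLOCK_SIZE)`, `mask = cols < n_cols`, one program per row)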
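The five steps named in the prompt (row max, subtract, exp, sum, divide) can be checked against a plain-Python reference. This is a CPU sketch of the math only, not a Triton kernel; the name `softmax_row` is illustrative.

```python
import math

def softmax_row(row):
    """Numerically stable softmax over one row of logits."""
    row_max = max(row)                           # 1. compute max
    exps = [math.exp(x - row_max) for x in row]  # 2-3. subtract, exp
    total = sum(exps)                            # 4. sum
    return [e / total for e in exps]             # 5. divide
```

Subtracting the row maximum first keeps `exp` from overflowing on large logits without changing the result.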
Write the formula for PageRank equation
PR(A) = (1-d) + d * Σ(PR(W)/L(W)) over all pages W that link to A, where L(W) is the number of outbound links on W and d is the damping factor
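A sketch that iterates the equation above until the ranks settle. The function name, the dictionary layout of `links`, the damping default of 0.85, and the fixed iteration count are all assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum rank / out-degree over every page W that links to `page`.
            incoming = sum(
                rank[w] / len(links[w]) for w in pages if page in links[w]
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank
```

A page with no inbound links ends up with rank 1-d, while pages that collect links from others rise above it.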
Why most transformer operations are memory-bound, not compute-bound
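Many transformer operations (elementwise ops, softmax, normalization, and small-batch decoding) do few FLOPs per byte moved, so runtime is limited by memory bandwidth rather than arithmetic throughput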
What LDPC codes are: low-density parity-check codes used in 5G and WiFi
LDPC codes: Low-density parity-check codes for 5G and WiFi error correction
What Chebyshev's inequality says: P(|X-μ| ≥ kσ) ≤ 1/k²
Chebyshev's inequality states that the probability of a random variable deviating from its mean by at least k standard deviations is less than or equal to 1/k²
Why temperature T in softmax(x/T) controls entropy: T→0 is argmax, T→∞ is uniform
As T approaches zero, softmax approaches argmax (a one-hot distribution), minimizing entropy; T→∞ yields the uniform distribution, maximizing entropy
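A short numerical check of how temperature moves entropy, as a sketch; both function names are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(x / T), computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For logits `[2.0, 1.0, 0.0]`, entropy is close to 0 at T = 0.05 and close to log 3 (the maximum for three outcomes) at T = 100.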
Why ALiBi allows length extrapolation better than learned position embeddings
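ALiBi adds a fixed, non-learned bias to attention scores that grows linearly with query-key distance; with no per-position embedding to learn, the same rule applies to sequences longer than those seen in training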
A p-value < 0.05 means: if H₀ is true, this result has <5% probability
A p-value < 0.05 indicates a less than 5% chance of observing data at least as extreme as this if the null hypothesis is true; it is not the probability that the null hypothesis is true
Float16 vs bfloat16: bfloat16 has same exponent range as float32, less precision but more stable
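bfloat16 keeps float32's 8 exponent bits (so the same range) with only 7 mantissa bits, while float16 has 5 exponent and 10 mantissa bits: less precision, but far fewer overflows and underflows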
How tiling works in matrix multiplication — loading blocks into shared memory
Tiling partitions the matrices into blocks: each thread block loads a tile of A and a tile of B into shared memory and reuses them for many multiply-adds, cutting global memory reads
Write the equation for sigmoid function σ(x) = 1/(1+e^-x)
σ(x) = 1 / (1 + e^-x)
What LoRA does — adds trainable low-rank matrices A and B where ΔW = BA
LoRA: Augments model weights with low-rank matrices A, B, ΔW = BA
What the rank-nullity theorem says: rank(A) + nullity(A) = n for an m×n matrix
Rank-nullity theorem: Rank(A) + Nullity(A) = Number of columns (n) in A
Why second-order methods (Newton's) converge faster but are expensive: O(n³) per step
Newton's method has quadratic convergence but requires cubic computational cost per iteration
What causal masking does — prevents attention to future tokens in the decoder
Causal masking in transformer models prevents attention to future tokens in the decoder, preserving autoregressive property
What the Cayley-Hamilton theorem says: every matrix satisfies its own characteristic equation
Cayley-Hamilton theorem: A square matrix satisfies its characteristic polynomial
What denoising score matching does: learns to denoise, which equals learning the score
Denoising score matching trains a model to undo added noise; the optimal denoising direction equals the score (gradient of the log-density) of the noise-perturbed data distribution
What continuous batching does — adds new requests to a running batch without waiting
Continuous batching admits new requests and retires finished ones at each generation step instead of waiting for the whole batch to finish, raising throughput
What database sharding does: splits data across machines by a partition key
Database sharding distributes data across multiple machines using a partition key for scalability and performance
What the context window limit means: maximum number of tokens the model can process at once
Context window limit restricts the model's input size to a fixed number of tokens for processing
Why non-convex loss landscapes are hard: many local minima and saddle points
Non-convex landscapes have numerous local minima and saddle points, complicating optimization
What multi-query attention (MQA) is — all Q heads share a single KV head
MQA: all query heads share a single key head and value head, shrinking the KV cache and the memory traffic during decoding
Why attention is O(n²) in sequence length: every token attends to every other token
Attention mechanism's complexity arises from pairwise token interactions, leading to quadratic time complexity
Time complexity of binary search: O(log n) — halves search space each step
Binary search reduces search space by half with each iteration, achieving O(log n) complexity
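A minimal iterative sketch, assuming the input list is already sorted in ascending order; returning -1 for a missing target is a convention chosen here.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # discard the left half
        else:
            hi = mid - 1   # discard the right half
    return -1
```

Each loop iteration discards half of the remaining range, which is where the O(log n) bound comes from.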
Why L1 distance is called Manhattan distance — grid-like paths
L1 distance mimics grid-like city blocks, hence "Manhattan" distance
Write the formula for normal distribution PDF
Normal distribution PDF: f(x) = 1/(σ√(2π)) * e^(-(x-μ)^2/(2σ^2))
Write the formula for KL divergence D_KL(P||Q)
D_KL(P||Q) = Σ P(x) log(P(x)/Q(x)) for all x in the support of P
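A sketch for discrete distributions given as aligned probability lists. The function name is illustrative, and returning infinity when Q assigns zero probability inside P's support follows the usual convention.

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in nats for two aligned discrete distributions."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue         # terms outside the support of P contribute 0
        if qx == 0:
            return math.inf  # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total
```

The divergence is zero when the two distributions are equal and is not symmetric: swapping the arguments generally gives a different value.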
What the characteristic function φ(t) = E[e^(itX)] does: Fourier transform of the PDF
Characteristic function φ(t) = E[e^(itX)] represents the Fourier transform of the probability density function (PDF)
Time complexity of Dijkstra's algorithm: O((V+E) log V) with a priority queue
Dijkstra's algorithm: O((V+E) log V) using a binary-heap priority queue; a Fibonacci heap improves this to O(E + V log V)
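A sketch using Python's `heapq` (a binary heap) as the priority queue. The adjacency-list layout of `graph` and the function name are assumptions; edge weights are assumed non-negative.

```python
import heapq

def dijkstra(graph, source):
    """graph maps node -> list of (neighbor, weight); returns shortest distances."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry: a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist
```

Instead of decreasing keys in place, this version pushes a new entry and skips stale ones on pop, a common simplification when the heap has no decrease-key operation.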
Time complexity of quicksort: O(n log n) average, O(n²) worst case
Quicksort's average-case time complexity: O(n log n), worst-case: O(n²)
Why proximal gradient descent is needed for L1 optimization
Proximal gradient descent handles non-differentiable L1 regularization, enabling sparse solutions
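The proximal operator of the L1 penalty is soft-thresholding, which is what sets small coefficients exactly to zero. A sketch, with illustrative function names and an assumed penalty weight `l1_strength`:

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def proximal_gradient_step(weights, grads, lr, l1_strength):
    """Gradient step on the smooth loss, then soft-threshold for the L1 term."""
    return [
        soft_threshold(w - lr * g, lr * l1_strength)
        for w, g in zip(weights, grads)
    ]
```

Any weight whose magnitude after the gradient step is at most `lr * l1_strength` becomes exactly 0.0, which a plain subgradient step would not guarantee.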
What Hoeffding's inequality bounds: tail probability of sum of bounded random variables
Hoeffding's inequality bounds the probability that a sum of independent bounded random variables deviates from its expected value by more than a given amount
How KV-cache reduces redundant computation in autoregressive generation
KV-cache stores the key and value vectors of already-processed tokens, so each generation step computes keys and values only for the new token instead of recomputing the whole prefix
What consistent hashing does: minimizes remapping when nodes join/leave
Consistent hashing minimizes data redistribution during nodes' addition or removal
What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic
Operator fusion combines adjacent operations into a single kernel so intermediate results stay in registers or on-chip memory instead of round-tripping through main memory
What calibration means: a model predicting 80% should be correct 80% of the time
Calibration: Model's predicted probabilities match actual outcomes' frequencies
What a message queue decouples: producer and consumer can operate at different speeds
Message queues decouple producers and consumers, allowing asynchronous processing
What maximum likelihood estimation does: find θ maximizing P(data|θ)
Chooses the θ under which the observed data is most probable: θ̂ = argmax_θ P(data|θ)
Why the determinant tells you about volume scaling under a linear transformation
|det(A)| is the factor by which the linear transformation A scales volumes; a negative sign means orientation is flipped, and det(A) = 0 means volume collapses to zero
What the Nyquist theorem says: sample at ≥ 2× the highest frequency to avoid aliasing
Nyquist theorem: Sample rate ≥ 2*highest frequency to prevent frequency aliasing
What cooperative groups enable in CUDA: flexible thread synchronization patterns
Cooperative groups let CUDA code define and synchronize groups of threads at granularities other than the thread block, from sub-warp tiles up to the whole grid
What a CUDA kernel is — a function that runs on thousands of GPU threads in parallel
CUDA kernel: Parallel function executed on GPU's thousands of threads simultaneously
Why SGD with momentum escapes local minima better than vanilla SGD
Momentum SGD accumulates velocity, helping to overcome shallow local minima
What a thread block is in CUDA — a group of threads that share shared memory
A CUDA thread block is a group of threads that run on the same multiprocessor, share that block's shared memory, and can synchronize with one another
Why L1 regularization produces sparse solutions — the diamond corners touch axes
The L1 constraint region is a diamond whose corners lie on the axes; the loss contours tend to touch it first at a corner, where some coefficients are exactly zero
What importance sampling does: reweights samples from proposal to estimate target expectation
Importance sampling reweights samples from a proposal distribution to approximate the expectation of a target distribution
What a sigma-algebra is: a collection of sets closed under complement and countable union
A sigma-algebra on a set Ω is a collection of subsets that contains Ω and is closed under complementation and countable unions
Write the policy gradient theorem equation
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s,a)]
What rejection sampling does: samples from target by accepting/rejecting proposals
Rejection sampling draws x from a proposal q and accepts it with probability p(x)/(M q(x)), where M bounds p/q; in practice x is accepted when a Uniform(0,1) draw falls below that ratio
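A sketch for a target density on [0, 1] with a Uniform(0, 1) proposal. The function name and the `bound` parameter (an upper bound on the target density) are assumptions for illustration.

```python
import random

def rejection_sample(target_pdf, bound, num_samples, rng):
    """Sample from target_pdf on [0, 1] using a Uniform(0, 1) proposal.

    bound must satisfy target_pdf(x) <= bound for all x in [0, 1].
    """
    samples = []
    while len(samples) < num_samples:
        x = rng.random()  # proposal draw
        u = rng.random()  # uniform draw used for the accept/reject test
        if u < target_pdf(x) / bound:
            samples.append(x)
    return samples
```

For example, `rejection_sample(lambda x: 2.0 * x, 2.0, 20000, random.Random(0))` draws from the density p(x) = 2x, whose mean is 2/3. A looser bound stays correct but rejects more proposals.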
What cutmix does: replaces a patch of one image with a patch from another
CutMix pastes a rectangular patch from one training image onto another and mixes the two labels in proportion to the patch area
Why memory coalescing matters — adjacent threads reading adjacent memory addresses
When adjacent threads in a warp access adjacent addresses, the GPU coalesces them into a few wide memory transactions instead of many separate ones, making better use of memory bandwidth
What IS (Inception Score) measures: diversity and quality of generated images
Inception Score quantifies the quality and diversity of generated images
Write the formula for covariance between X and Y
Cov(X, Y) = Σ((Xi - X̄)(Yi - Ȳ)) / (n - 1)
How the Transformer encoder differs from decoder — encoder is bidirectional, decoder is causal
Transformer encoder processes input bidirectionally, while decoder uses causal (left-to-right) masking
What the compute-optimal training ratio is: roughly 20 tokens per parameter
Optimal training ratio: Approximately 20 tokens/parameter
What LSM trees optimize: write-heavy workloads by buffering writes in memory
LSM trees optimize write-heavy workloads through in-memory buffering
Mutual information I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Mutual information measures dependence between variables X and Y
What the Yoneda lemma says: an object is determined by its relationships to all other objects
Yoneda lemma: the morphisms between an object and all other objects determine that object up to isomorphism
What sufficient statistics are — compress data without losing information about θ
Sufficient statistics for θ are those that capture all necessary information to estimate the parameter
What BPE tokenization does: iteratively merges the most frequent byte pairs
BPE tokenization merges the most frequent byte pairs iteratively to create subword units
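A sketch of a single BPE merge step on a list of symbols. Both function names are illustrative, and the example works on characters rather than raw bytes for readability.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Training repeats these two steps, recording each merged pair as a new vocabulary entry, until the vocabulary reaches the desired size.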
What a qubit is: a quantum bit that exists in superposition of |0⟩ and |1⟩
A qubit: a quantum bit whose state is a superposition α|0⟩ + β|1⟩ with |α|² + |β|² = 1
What AWQ does differently — activation-aware weight quantization preserves important weights
AWQ uses activation statistics to find the small fraction of weights that matter most and scales them before quantization so their precision is preserved
Write the formula for Mahalanobis distance
D^2 = (x - μ)^T Σ^(-1) (x - μ)
How is the coefficient of determination (R^2) calculated from a simple linear regression model?
R^2 = 1 - (SSR/SST), where SSR is the sum of squared residuals and SST is the total sum of squares
Reed-Solomon error correction: What is the mathematical formula representing the minimum number of redundant symbols required to correct a given number of symbol errors in a Reed-Solomon code?
Minimum redundant symbols = 2t, where t is the number of symbol errors to correct; equivalently n - k = 2t for codeword length n and data length k
How does the concept of homotopy type theory (HoTT) enable the unification of homotopy theory and higher category theory in mathematical foundations?
HoTT interprets types as spaces (homotopy types), terms as points, and identity types as paths, so homotopical and higher-categorical structure is built into the foundations themselves
How does the choice of norm affect the shape of the unit ball in a given vector space, specifically comparing the properties of L1 and L∞ norms?
L1 norms create diamond-shaped unit balls, while L∞ norms yield cube-shaped unit balls
What numpy.arange(0, num_elements) creates: an array of evenly spaced values within the specified range, used for indexing or iteration purposes
`numpy.arange(0, num_elements)` creates an array of the `num_elements` consecutive integers 0, 1, ..., num_elements-1
How does batch normalization contribute to training deep neural networks: by normalizing input features within each batch to have zero mean and unit variance to accelerate convergence and improve generalization?
Batch normalization stabilizes and accelerates training by normalizing each layer's activations to zero mean and unit variance over the mini-batch, then applying a learned scale and shift
According to the CAP theorem, which two properties must a distributed system choose between during a network partition?
Consistency and availability: once a partition occurs, the system either stays consistent by rejecting some requests or stays available at the risk of returning stale data
How does score matching use the Fisher divergence to learn the parameters of a probabilistic model without computing the normalizing constant?
Score matching estimates parameters by minimizing the Fisher divergence, the expected squared difference between the model score and the data score ∇ₓ log p(x)
What is the formula for calculating the mutual information between two discrete random variables X and Y?
I(X;Y) = ∑∑ P(x,y) log(P(x,y)/(P(x)P(y)))
How do Bloom filters utilize bit arrays to efficiently perform probabilistic set membership tests with minimal false positives?
Bloom filters use bit arrays to store hashed positions, allowing quick membership checks with controlled false positives
What does LSTM stand for in the context of neural networks: Long Short-Term Memory
LSTM: A type of recurrent neural network capable of learning long-term dependencies
How does attention mechanism in transformer models enhance language understanding and processing by dynamically weighting input tokens during sequence encoding?
Attention mechanisms assign dynamic weights to input tokens, enhancing contextual understanding and sequence processing in transformer models
How do lock-free data structures manage concurrent access to shared memory in a multithreaded environment?
Lock-free data structures use atomic operations to ensure concurrent access without traditional locking mechanisms
How does the curse of dimensionality affect the performance and accuracy of clustering algorithms in high-dimensional datasets?
In high dimensions points become sparse and pairwise distances grow increasingly similar, so distance-based clusters become hard to separate and clustering accuracy drops
How does the Root Mean Square Error (RMSE) quantify the difference between predicted values and observed values in regression analysis?
RMSE is the square root of the mean squared residual, giving a typical prediction-error magnitude in the same units as the target
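A minimal sketch of the computation, assuming two equal-length, non-empty sequences; the function name is illustrative.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the residuals are squared before averaging, a single large error raises RMSE more than several small errors of the same total size.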