Machine Learning

Machine learning concepts — equations, architectures, and estimators — each explained in a few plain sentences.

100 concepts. Regenerated daily.

Write the formula for Pearson correlation coefficient

r = Σ((xi - x̄)(yi - ȳ)) / (√Σ(xi - x̄)² * √Σ(yi - ȳ)²)
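A minimal sketch of the formula above in plain Python. The function name `pearson_r` is illustrative, and the inputs are assumed to be equal-length sequences with non-zero variance.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return num / (sx * sy)
```

Perfectly linear data gives r close to 1 or -1 depending on the sign of the slope.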

Shannon's channel capacity: C = B log₂(1 + S/N) bits per second

Shannon's formula: C = B log₂(1 + S/N) defines channel capacity in bits/s

What the Y combinator does: enables recursion in languages without named functions

The Y combinator enables recursive function definitions in lambda calculus and similar functional languages

Why the L1 unit ball is a diamond shape and the L2 unit ball is a circle

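The L1 unit ball is the set |x| + |y| ≤ 1, whose boundary is four straight line segments forming a diamond; the L2 unit ball is x² + y² ≤ 1, whose boundary is a circle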

What tl.arange(0, BLOCK_SIZE) creates: a range of indices within the current block

`tl.arange(0, BLOCK_SIZE)` generates a block of indices from 0 to BLOCK_SIZE-1 inside a Triton kernel, typically added to the block's starting offset

What label smoothing does: replaces one-hot [0,0,1,0] with [0.025, 0.025, 0.925, 0.025]

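Label smoothing regularizes a classifier by moving a small probability mass ε from the true class to a uniform distribution over all K classes: y' = (1 - ε)·y + ε/K (ε = 0.1 and K = 4 in the example above)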
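The numbers in the prompt can be reproduced with a short sketch, assuming the common uniform-smoothing form (1 - ε)·y + ε/K with ε = 0.1. The function name `smooth_labels` is illustrative.

```python
def smooth_labels(one_hot, eps=0.1):
    """Blend a one-hot target with a uniform distribution over its classes."""
    k = len(one_hot)
    # Each class keeps (1 - eps) of its original mass and gains eps / k.
    return [(1 - eps) * y + eps / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0])
```

The smoothed target still sums to 1, so it remains a valid probability distribution.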

What CAP theorem states: you can have at most 2 of consistency, availability, partition tolerance

CAP theorem: Consistency, Availability, Partition Tolerance; only 2 can be fully achieved simultaneously

What score matching does: learns the gradient of the log-density without normalizing

Score matching fits a model to the gradient of the log-density (the score), which does not depend on the normalizing constant

Write the equation for cross-entropy loss

H(y, p) = -Σ(y_i * log(p_i)) for all i
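A small sketch of this sum in Python, using the natural logarithm. The clipping constant `tiny` is an assumption added here to avoid log(0); it is not part of the formula.

```python
import math

def cross_entropy(y, p, tiny=1e-12):
    """Cross-entropy between a target distribution y and predictions p."""
    # Clip each p_i away from zero so the logarithm is always defined.
    return -sum(yi * math.log(max(pi, tiny)) for yi, pi in zip(y, p))
```

With a one-hot target the sum reduces to the negative log of the probability assigned to the true class.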

What bloom filters do: probabilistically check set membership with no false negatives

Bloom filters: space-efficient set membership testing with no false negatives, at the cost of occasional false positives
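A toy Bloom filter, sketched under a few assumptions: the class name, the bit-array size, and the trick of deriving several hash positions by salting SHA-256 are all illustrative choices, not a reference design.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))
```

Every added item sets all of its bits, so a later lookup of that item can never return False.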

What DDPM stands for: Denoising Diffusion Probabilistic Model

DDPM: Denoising Diffusion Probabilistic Model for generative tasks

What weight tying does in language models: shares embedding and output projection matrices

Language models use tied weights to share embedding and output projection matrices, enhancing parameter efficiency

What bank conflicts are in shared memory — multiple threads accessing the same bank

Bank conflicts arise when multiple threads in a warp access different addresses in the same shared-memory bank, forcing the accesses to be serialized

Why the curse of dimensionality makes nearest neighbor search unreliable

In high dimensions, distances between points concentrate, so the nearest and farthest neighbors are almost equally far away and "nearest" carries little information

What expected calibration error (ECE) measures: gap between confidence and accuracy

ECE quantifies the discrepancy between a model's predicted confidence and its actual accuracy

Write the formula for Lagrangian L(x,λ) = f(x) - λg(x)

L(x,λ) = f(x) - λg(x), where λ is the Lagrange multiplier for the constraint g(x) = 0

What mixup does: trains on convex combinations of pairs: x̃=λx_i+(1-λ)x_j

Mixup trains on convex combinations of pairs of examples and their labels: x̃=λx_i+(1-λ)x_j and ỹ=λy_i+(1-λ)y_j, with λ drawn from a Beta(α, α) distribution
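A sketch of one mixup step on plain lists. It assumes the labels are mixed with the same λ as the inputs and that λ is drawn from Beta(α, α), as is common practice; `alpha=0.2` and the function name are illustrative.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Return a convex combination of two examples and of their labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam
```

With one-hot labels, the mixed label still sums to 1 and records how much of each example went into the blend.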

Greedy vs beam search decoding: greedy picks best token, beam maintains k candidates

Greedy decoding keeps only the single most probable token at each step, while beam search keeps the k most probable partial sequences

Write the attention score formula before softmax: e_ij = a(s_i, h_j)

Additive (Bahdanau-style) attention score: e_ij = a(s_i, h_j) = vᵀ tanh(W_s s_i + W_h h_j + b); the exponential is applied only later, inside the softmax

What an instrumental variable does: isolates causal effect when you can't randomize

An instrumental variable affects the treatment but influences the outcome only through the treatment, so the variation it induces can be used to estimate the causal effect

What AdaGrad does: divides learning rate by sqrt of sum of squared gradients

AdaGrad divides each parameter's learning rate by the square root of its accumulated squared gradients, so frequently updated parameters take smaller steps
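One AdaGrad update, sketched on plain lists. The function name, `lr=0.1`, and the stabilizer `eps` are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Update params in place using per-parameter accumulated squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum
```

Repeating the same gradient produces smaller and smaller steps, because the accumulator only grows.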

What 300-dim word2vec encodes: trained on word co-occurrence with skip-gram window

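Each word is a 300-dimensional dense vector learned by predicting the words inside a skip-gram context window, so words that occur in similar contexts get similar vectors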

How to write a fused softmax kernel in Triton: load row, compute max, subtract, exp, sum, divide

`row = tl.load(row_ptrs, mask=mask, other=-float('inf')); row = row - tl.max(row, axis=0); num = tl.exp(row); out = num / tl.sum(num, axis=0); tl.store(out_ptrs, out, mask=mask)` (the names `row_ptrs`, `out_ptrs`, and `mask` are illustrative)

Write the formula for PageRank equation

Pr(A) = (1-d) + d * Σ(Pr(W)/L(W)) over all pages W that link to A, where L(W) is the number of outbound links on W and d is the damping factor
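The equation can be iterated to a fixed point. The sketch below assumes the unnormalized (1 - d) form shown above, a damping factor of 0.85, and a graph in which every page has at least one outbound link; the `links` dictionary format is an illustrative choice.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the rank shares of every page w that links to `page`.
            inbound = sum(pr[w] / len(links[w]) for w in pages if page in links[w])
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this small graph, C collects rank from both A and B, so it ends up ranked above B.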

Why most transformer operations are memory-bound, not compute-bound

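Most transformer operations (softmax, layer norm, residual adds, activations) do very few arithmetic operations per byte moved, so their runtime is set by memory bandwidth; large matrix multiplications are the compute-bound exception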

What LDPC codes are: low-density parity-check codes used in 5G and WiFi

LDPC codes: Low-density parity-check codes for 5G and WiFi error correction

What Chebyshev's inequality says: P(|X-μ| ≥ kσ) ≤ 1/k²

Chebyshev's inequality states that the probability of a random variable deviating from its mean by at least k standard deviations is less than or equal to 1/k²

Why temperature T in softmax(x/T) controls entropy: T→0 is argmax, T→∞ is uniform

As T→0, softmax approaches a one-hot argmax and entropy falls to its minimum; as T→∞, it approaches the uniform distribution and entropy reaches its maximum
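The effect of T on entropy can be checked numerically. This is a small sketch; the logits and the temperatures tried are arbitrary illustrative values.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]
h_cold = entropy(softmax(logits, temperature=0.05))
h_one = entropy(softmax(logits, temperature=1.0))
h_hot = entropy(softmax(logits, temperature=100.0))
```

Entropy rises with T: near zero for the cold setting, and close to log(3), the entropy of a uniform distribution over three classes, for the hot one.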

Why ALiBi allows length extrapolation better than learned position embeddings

ALiBi adds a fixed penalty to each attention score that grows linearly with query-key distance instead of using position embeddings, so it behaves consistently on sequences longer than those seen in training

A p-value < 0.05 means: if H₀ is true, this result has <5% probability

A p-value < 0.05 indicates a less than 5% chance of observing data as extreme as this if the null hypothesis is true

Float16 vs bfloat16: bfloat16 has same exponent range as float32, less precision but more stable

bfloat16 keeps float32's 8 exponent bits, so it covers the same range and overflows less often than float16, but its 7 mantissa bits give less precision than float16's 10

How tiling works in matrix multiplication — loading blocks into shared memory

Tiling partitions the matrices into blocks that fit in shared memory, so each block is loaded from global memory once and reused by many threads

Write the equation for sigmoid function σ(x) = 1/(1+e^-x)

σ(x) = 1 / (1 + e^-x)

What LoRA does — adds trainable low-rank matrices A and B where ΔW = BA

LoRA freezes the pretrained weights W and trains two low-rank matrices A and B, so the effective weights are W + ΔW with ΔW = BA

What the rank-nullity theorem says: rank(A) + nullity(A) = n for an m×n matrix

Rank-nullity theorem: Rank(A) + Nullity(A) = Number of columns (n) in A

Why second-order methods (Newton's) converge faster but are expensive: O(n³) per step

Newton's method converges quadratically near the optimum, but each step must solve an n×n linear system with the Hessian, which costs O(n³)

What causal masking does — prevents attention to future tokens in the decoder

Causal masking in transformer models prevents attention to future tokens in the decoder, preserving autoregressive property

What the Cayley-Hamilton theorem says: every matrix satisfies its own characteristic equation

Cayley-Hamilton theorem: A square matrix satisfies its characteristic polynomial

What denoising score matching does: learns to denoise, which equals learning the score

Denoising score matching trains a model to undo noise added to the data; the optimal denoising direction equals the score of the noise-perturbed data distribution

What continuous batching does — adds new requests to a running batch without waiting

Continuous batching schedules at the level of individual decoding steps: finished sequences leave the batch and waiting requests join immediately, which raises throughput

What database sharding does: splits data across machines by a partition key

Database sharding distributes data across multiple machines using a partition key for scalability and performance

What the context window limit means: maximum number of tokens the model can process at once

Context window limit restricts the model's input size to a fixed number of tokens for processing

Why non-convex loss landscapes are hard: many local minima and saddle points

Non-convex landscapes have numerous local minima and saddle points, complicating optimization

What multi-query attention (MQA) is — all Q heads share a single KV head

MQA: all query heads attend using one shared key head and one shared value head, which shrinks the KV cache and the memory traffic during decoding

Why attention is O(n²) in sequence length: every token attends to every other token

Attention mechanism's complexity arises from pairwise token interactions, leading to quadratic time complexity

Time complexity of binary search: O(log n) — halves search space each step

Binary search reduces search space by half with each iteration, achieving O(log n) complexity
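A sketch that also counts iterations, to make the halving visible. Returning the step count alongside the index is an illustrative addition.

```python
def binary_search(sorted_items, target):
    """Return (index, steps); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1  # discard the left half
        else:
            hi = mid - 1  # discard the right half
    return -1, steps
```

For a list of 1024 items the loop runs at most 11 times.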

Why L1 distance is called Manhattan distance — grid-like paths

L1 distance mimics grid-like city blocks, hence "Manhattan" distance

Write the formula for normal distribution PDF

Normal distribution PDF: f(x) = 1/(σ√(2π)) * e^(-(x-μ)^2/(2σ^2))

Write the formula for KL divergence D_KL(P||Q)

D_KL(P||Q) = Σ P(x) log(P(x)/Q(x)) for all x in the support of P
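A sketch for discrete distributions given as lists, using the natural logarithm. It assumes Q(x) > 0 wherever P(x) > 0; the function name is illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in nats, summed over the support of P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The divergence is zero when the two distributions are equal and is not symmetric in its arguments.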

What the characteristic function φ(t) = E[e^(itX)] does: Fourier transform of the PDF

Characteristic function φ(t) = E[e^(itX)] represents the Fourier transform of the probability density function (PDF)

Time complexity of Dijkstra's algorithm: O((V+E) log V) with a priority queue

Dijkstra's algorithm: O((V+E) log V) using a binary-heap priority queue; a Fibonacci heap improves this to O(E + V log V)
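A sketch using Python's `heapq` as the priority queue. It uses the common lazy-deletion variant, where stale heap entries are skipped rather than decreased in place; the adjacency-list format is an illustrative choice, and edge weights are assumed non-negative.

```python
import heapq

def dijkstra(graph, source):
    """graph maps each node to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Each edge pushes at most one heap entry, and each push or pop costs logarithmic time.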

Time complexity of quicksort: O(n log n) average, O(n²) worst case

Quicksort's average-case time complexity: O(n log n), worst-case: O(n²)

Why proximal gradient descent is needed for L1 optimization

Proximal gradient descent handles non-differentiable L1 regularization, enabling sparse solutions

What Hoeffding's inequality bounds: tail probability of sum of bounded random variables

Hoeffding's inequality bounds the probability that a sum of independent bounded random variables deviates from its expected value by more than a given amount

How KV-cache reduces redundant computation in autoregressive generation

The KV-cache stores the keys and values of all previous tokens, so each generation step computes keys and values only for the newest token instead of recomputing them for the whole sequence

What consistent hashing does: minimizes remapping when nodes join/leave

Consistent hashing minimizes how many keys must move when nodes are added or removed

What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic

Operator fusion combines adjacent operations into a single kernel, so intermediate results stay in registers or on-chip memory instead of being written to and re-read from main memory

What calibration means: a model predicting 80% should be correct 80% of the time

Calibration: Model's predicted probabilities match actual outcomes' frequencies

What a message queue decouples: producer and consumer can operate at different speeds

Message queues decouple producers and consumers, allowing asynchronous processing

What maximum likelihood estimation does: find θ maximizing P(data|θ)

Maximum likelihood estimation chooses the θ under which the observed data are most probable, that is, the θ that maximizes P(data|θ)

Why the determinant tells you about volume scaling under a linear transformation

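The absolute value of the determinant is the factor by which the transformation scales volumes; a negative sign means orientation is flipped, and zero means the space is collapsed to a lower dimension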

What the Nyquist theorem says: sample at ≥ 2× the highest frequency to avoid aliasing

Nyquist theorem: Sample rate ≥ 2*highest frequency to prevent frequency aliasing

What cooperative groups enable in CUDA: flexible thread synchronization patterns

Cooperative groups let CUDA code define groups of threads at different granularities (sub-warp tiles, whole thread blocks, the entire grid) and synchronize each group explicitly

What a CUDA kernel is — a function that runs on thousands of GPU threads in parallel

CUDA kernel: Parallel function executed on GPU's thousands of threads simultaneously

Why SGD with momentum escapes local minima better than vanilla SGD

Momentum SGD accumulates velocity, helping to overcome shallow local minima

What a thread block is in CUDA — a group of threads that share shared memory

A CUDA thread block is a group of threads that runs on one streaming multiprocessor, can synchronize with each other, and shares a region of fast on-chip shared memory

Why L1 regularization produces sparse solutions — the diamond corners touch axes

The L1 constraint region is a diamond whose corners lie on the coordinate axes; the loss contours tend to touch it first at a corner, where some coefficients are exactly zero

What importance sampling does: reweights samples from proposal to estimate target expectation

Importance sampling reweights samples from a proposal distribution to approximate the expectation of a target distribution

What a sigma-algebra is: a collection of sets closed under complement and countable union

A sigma-algebra on a set Ω is a collection of subsets of Ω that contains Ω and is closed under complementation and countable unions

Write the policy gradient theorem equation

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s,a)]

What rejection sampling does: samples from target by accepting/rejecting proposals

Rejection sampling draws x from a proposal q and accepts it with probability p(x)/(M·q(x)), where M satisfies p(x) ≤ M·q(x) everywhere; the accepted samples follow the target p

What cutmix does: replaces a patch of one image with a patch from another

CutMix pastes a rectangular patch from one training image onto another and mixes the two labels in proportion to the patch area

Why memory coalescing matters — adjacent threads reading adjacent memory addresses

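When adjacent threads in a warp read adjacent addresses, the GPU combines the reads into a few wide memory transactions; scattered accesses need many more transactions and waste bandwidth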

What IS (Inception Score) measures: diversity and quality of generated images

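Inception Score rewards generated images that a pretrained Inception classifier labels confidently (quality) while the predicted labels are spread across many classes (diversity)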

Write the formula for covariance between X and Y

Cov(X, Y) = Σ((Xi - X̄)(Yi - Ȳ)) / (n - 1)

How the Transformer encoder differs from decoder — encoder is bidirectional, decoder is causal

Transformer encoder processes input bidirectionally, while decoder uses causal (left-to-right) masking

What the compute-optimal training ratio is: roughly 20 tokens per parameter

Compute-optimal training ratio: approximately 20 training tokens per model parameter (the Chinchilla result)

What LSM trees optimize: write-heavy workloads by buffering writes in memory

LSM trees optimize write-heavy workloads through in-memory buffering

Mutual information I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Mutual information measures dependence between variables X and Y

What the Yoneda lemma says: an object is determined by its relationships to all other objects

Yoneda lemma: an object is determined up to isomorphism by the morphisms between it and all other objects

What sufficient statistics are — compress data without losing information about θ

A statistic is sufficient for θ if, once its value is known, the rest of the data carries no further information about θ

What BPE tokenization does: iteratively merges the most frequent byte pairs

BPE tokenization merges the most frequent byte pairs iteratively to create subword units
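A sketch of a single merge step. For readability it works on characters rather than raw bytes, and the helper names are illustrative; a real tokenizer repeats this step until it reaches a target vocabulary size.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the top one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace each occurrence of `pair` in seq with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

words = [list("low"), list("lower"), list("lowest")]
pair = most_frequent_pair(words)
words = [merge_pair(w, pair) for w in words]
```

After one step every word is one symbol shorter, because the chosen pair appears once in each of them.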

What a qubit is: a quantum bit that exists in superposition of |0⟩ and |1⟩

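A qubit: a quantum bit whose state is a superposition α|0⟩ + β|1⟩ with |α|² + |β|² = 1; measuring it yields 0 with probability |α|² and 1 with probability |β|²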

What AWQ does differently — activation-aware weight quantization preserves important weights

AWQ uses activation statistics to identify the small fraction of weights that matter most and scales them before quantization so that they lose less precision

Write the formula for Mahalanobis distance

D^2 = (x - μ)^T Σ^(-1) (x - μ)

How is the coefficient of determination (R^2) calculated from a simple linear regression model?

R^2 = 1 - (SSR/SST), where SSR is the sum of squared residuals and SST is the total sum of squares

Reed-Solomon error correction: What is the mathematical formula representing the minimum number of redundant symbols required to correct a given number of symbol errors in a Reed-Solomon code?

Minimum redundant symbols = 2t, where t is the number of symbol errors to correct; equivalently, n - k = 2t for codeword length n and data length k

How does the concept of homotopy type theory (HoTT) enable the unification of homotopy theory and higher category theory in mathematical foundations?

HoTT interprets types as spaces and equalities between terms as paths, so the iterated identity types of a type give it the structure of an ∞-groupoid; homotopical and higher-categorical structure can then be expressed in a single type theory

How does the choice of norm affect the shape of the unit ball in a given vector space, specifically comparing the properties of L1 and L∞ norms?

L1 norms create diamond-shaped unit balls, while L∞ norms yield cube-shaped unit balls

What numpy.arange(0, num_elements) creates: an array of evenly spaced values within the specified range, used for indexing or iteration purposes?

`numpy.arange(0, num_elements)` creates an array of the `num_elements` consecutive integers 0, 1, ..., num_elements-1

How does batch normalization contribute to training deep neural networks: by normalizing input features within each batch to have zero mean and unit variance to accelerate convergence and improve generalization?

Batch normalization stabilizes and accelerates training by normalizing each layer's activations to zero mean and unit variance over the mini-batch, then applying a learned scale and shift

Which trade-off does the CAP theorem impose on a distributed system during a network partition?

During a partition, the system must choose between consistency and availability; it cannot guarantee both

How does score matching learn the parameters of a probabilistic model without computing the normalizing constant?

Score matching minimizes the Fisher divergence, the expected squared difference between the model's score and the data's score; the normalizing constant drops out because the score is a gradient with respect to x

What is the formula for calculating the mutual information between two discrete random variables X and Y?

I(X;Y) = ∑∑ P(x,y) log(P(x,y)/(P(x)P(y)))
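A sketch of the double sum for a joint distribution stored as a nested list, using log base 2 so the result is in bits. The table layout (rows index x, columns index y) is an assumption for illustration.

```python
import math

def mutual_information(joint):
    """joint[i][j] = P(X = i, Y = j); returns I(X;Y) in bits."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi
```

Independent variables give 0 bits, while two perfectly correlated fair bits give 1 bit.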

How do Bloom filters utilize bit arrays to efficiently perform probabilistic set membership tests with minimal false positives?

Bloom filters use bit arrays to store hashed positions, allowing quick membership checks with controlled false positives

What does LSTM stand for in the context of neural networks: Long Short-Term Memory

LSTM: A type of recurrent neural network capable of learning long-term dependencies

How does attention mechanism in transformer models enhance language understanding and processing by dynamically weighting input tokens during sequence encoding?

Attention mechanisms assign dynamic weights to input tokens, enhancing contextual understanding and sequence processing in transformer models

How do lock-free data structures manage concurrent access to shared memory in a multithreaded environment?

Lock-free data structures use atomic operations to ensure concurrent access without traditional locking mechanisms

How does the curse of dimensionality affect the performance and accuracy of clustering algorithms in high-dimensional datasets?

In high dimensions the data become sparse and pairwise distances concentrate, so the distance contrast that clustering algorithms rely on weakens and cluster assignments become less reliable

How does the Root Mean Square Error (RMSE) quantify the difference between predicted values and observed values in regression analysis?

RMSE is the square root of the mean squared residual, RMSE = √((1/n) Σ(y_i - ŷ_i)²), so large errors are penalized more heavily and the result is in the same units as the target
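A minimal sketch of the computation, assuming two equal-length sequences of observed and predicted values. The function name is illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    # Square each residual, average, then take the square root.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```

Residuals of 3 and 4 give √12.5, which is larger than their mean absolute value of 3.5 because squaring weights the bigger error more.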