Greedy decoding selects one token, while beam search retains multiple candidates

Greedy vs beam search decoding: greedy picks the best token at each step, beam search maintains k candidates

Greedy decoding commits to the single highest-probability token at each step, while beam search keeps the k highest-scoring partial sequences and extends each of them
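To make the difference concrete, here is a minimal sketch on a toy next-token table. The `TOY_MODEL` probabilities and the function names are made up for illustration and do not come from any library; the point is only that committing to the locally best token can miss a sequence with a higher overall probability.

```python
import math

# Hypothetical toy model: maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_log_probs(prefix):
    """Return log-probabilities of each possible next token for a prefix."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}


def greedy_decode(steps):
    """At every step keep only the single most probable next token."""
    seq, score = [], 0.0
    for _ in range(steps):
        log_probs = next_log_probs(seq)
        tok = max(log_probs, key=log_probs.get)
        seq.append(tok)
        score += log_probs[tok]
    return seq, score


def beam_search(steps, k):
    """At every step expand all beams, then keep the k best partial sequences."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]


print(greedy_decode(2))   # greedy commits to "a" first
print(beam_search(2, 2))  # a beam of 2 also explores the "b" branch
```

With this table, greedy decoding ends on the sequence a, x (probability 0.33), while a beam of width 2 finds b, x (probability 0.36). Setting k to 1 reduces beam search to greedy decoding.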

Related concepts

What causal masking does — prevents attention to future tokens in the decoder

Causal masking in Transformer models prevents each decoder position from attending to future tokens, preserving the autoregressive property
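As a rough illustration, the sketch below builds a lower-triangular mask and applies it to a small matrix of attention scores before the softmax. It is plain Python rather than a real framework, and the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores):
    """Row-wise softmax over attention scores with future positions masked out."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Disallowed positions get -inf, so they receive exactly zero weight.
        row = [scores[i][j] if mask[i][j] else float("-inf") for j in range(n)]
        peak = max(row)  # finite, because the diagonal is always allowed
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform scores: each position spreads its attention over itself and the past.
for row in masked_softmax([[0.0] * 3 for _ in range(3)]):
    print(row)
```

The first position can only attend to itself, the second splits its weight between positions 0 and 1, and no row places any weight to the right of the diagonal.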

How the Transformer encoder differs from decoder — encoder is bidirectional, decoder is causal

The Transformer encoder processes the input bidirectionally, while the decoder uses causal (left-to-right) masking

Time complexity of binary search: O(log n) — halves search space each step

Binary search halves the search space of a sorted array with each iteration, achieving O(log n) time complexity
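A short sketch of the halving step, assuming the input list is already sorted in ascending order. The step counter is added only to make the logarithmic behavior visible and is not part of the usual algorithm.

```python
def binary_search(items, target):
    """Return (index, steps); index is -1 when target is absent.

    Assumes items is sorted in ascending order.
    """
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1   # discard the lower half
        else:
            hi = mid - 1   # discard the upper half
    return -1, steps


data = list(range(1024))
print(binary_search(data, 777))
```

For 1024 elements the loop runs at most 11 times, since each iteration discards about half of the remaining range.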

What BPE tokenization does: iteratively merges the most frequent byte pairs

BPE tokenization starts from individual bytes or characters and iteratively merges the most frequent adjacent pair to create subword units
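The merge loop can be sketched as follows. This is a simplified illustration on a word-frequency dictionary, not the implementation of any particular tokenizer library; the function names and the tiny corpus are assumptions made for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Run num_merges merge steps; return the merge list and the final vocabulary."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical corpus: each word is a tuple of symbols with a frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
print(learn_bpe(corpus, 2))
```

After two merges on this corpus, the frequent word "low" has become a single symbol, while the rarer "lower" is represented as "low" followed by "e" and "r".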

What the compute-optimal training ratio is: roughly 20 tokens per parameter

Optimal training ratio: approximately 20 training tokens per model parameter, the Chinchilla scaling-law estimate for a fixed compute budget
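As a back-of-the-envelope illustration of the ratio, the helper below simply multiplies a parameter count by 20. The function name is hypothetical, and the ratio is a rule of thumb rather than an exact constant.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio stated above


def compute_optimal_tokens(num_params, ratio=TOKENS_PER_PARAM):
    """Approximate number of training tokens for a model of the given size."""
    return num_params * ratio


# A 70-billion-parameter model would call for about 1.4 trillion tokens.
print(compute_optimal_tokens(70_000_000_000))
```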

What score matching does: learns the gradient of the log-density without normalizing

Score matching fits a model to the gradient of the log-density (the score) directly, so the normalizing constant never has to be computed
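The reason normalization drops out can be checked numerically: the normalizing constant only adds a term to the log-density that does not depend on x, so it vanishes under the gradient. The sketch below is not a score matching training loop; it only compares a finite-difference score for a normalized and an unnormalized standard Gaussian, with all function names chosen for illustration.

```python
import math


def score(log_density, x, eps=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + eps) - log_density(x - eps)) / (2 * eps)


def unnormalized_log_gauss(x):
    """Log of exp(-x^2 / 2), with the normalizing constant left out."""
    return -0.5 * x * x


def normalized_log_gauss(x):
    """Log-density of the standard normal, including the constant."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


x = 1.3
print(score(unnormalized_log_gauss, x))  # close to -x
print(score(normalized_log_gauss, x))    # same value up to rounding
```

Both calls give approximately -1.3, the score of a standard Gaussian at x = 1.3, even though only one of the two functions is a properly normalized density.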
