Greedy decoding selects one token, while beam search retains multiple candidates
Greedy decoding commits to the single most probable token at each step, while beam search keeps the k highest-scoring partial sequences and extends each of them
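
The difference is easiest to see on a tiny example. The sketch below uses a hypothetical hand-written probability table (TOY_PROBS) in place of a real model, so the tokens and numbers are illustrative assumptions only. It shows a case where the greedy choice at step one leads to a lower-probability sequence than the one beam search finds.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities.
# These numbers are invented for illustration, not taken from any real model.
TOY_PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_log_probs(prefix):
    """Return log-probabilities of each next token given the prefix."""
    return {tok: math.log(p) for tok, p in TOY_PROBS[tuple(prefix)].items()}

def greedy_decode(steps):
    """Pick the single most probable token at every step."""
    seq, score = [], 0.0
    for _ in range(steps):
        log_probs = next_log_probs(seq)
        tok = max(log_probs, key=log_probs.get)
        seq.append(tok)
        score += log_probs[tok]
    return seq, score

def beam_search(steps, beam_width):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Sort by cumulative log-probability and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_score = greedy_decode(2)
beam_seq, beam_score = beam_search(2, beam_width=2)
```

With this table, greedy decoding commits to "a" first and ends at probability 0.6 × 0.55 = 0.33, while a beam of width 2 keeps "b" alive and reaches 0.4 × 0.9 = 0.36. A beam of width 1 reduces to greedy decoding.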
What causal masking does — prevents attention to future tokens in the decoder
Causal masking in Transformer models prevents each decoder position from attending to future tokens, preserving the autoregressive property
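
A minimal sketch of the idea in plain Python, assuming a square matrix of raw attention scores. The function names are hypothetical, and real implementations apply the same mask to batched tensors. Disallowed positions are set to negative infinity before the softmax, so they receive zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        peak = max(masked)  # the diagonal is always allowed, so peak is finite
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform raw scores make the effect of the mask easy to read.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

Here the first position can only attend to itself, the second splits its weight evenly over the first two positions, and the third spreads it over all three.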
How the Transformer encoder differs from decoder — encoder is bidirectional, decoder is causal
The Transformer encoder processes its input bidirectionally, while the decoder uses causal (left-to-right) masking
Time complexity of binary search: O(log n) — halves search space each step
Binary search reduces search space by half with each iteration, achieving O(log n) complexity
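
A short sketch of iterative binary search over a sorted list. The step counter is an addition for illustration only, to make the logarithmic behavior visible. It is not part of the usual interface.

```python
def binary_search(items, target):
    """Return (index, steps); index is -1 if target is absent.

    Assumes items is sorted in ascending order.
    """
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1   # discard the lower half
        else:
            hi = mid - 1   # discard the upper half
    return -1, steps

data = list(range(1024))
index, steps = binary_search(data, 777)
```

For a list of 1024 elements the loop runs at most 11 times, since each iteration discards about half of the remaining range.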
What BPE tokenization does: iteratively merges the most frequent byte pairs
BPE tokenization merges the most frequent byte pairs iteratively to create subword units
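
The merge loop can be sketched as follows. For readability this version treats characters rather than raw bytes as the starting symbols and splits words on whitespace. Both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules from a whitespace-split corpus."""
    words = {tuple(w): c for w, c in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, vocab_words = learn_bpe("low low low lower lowest", num_merges=2)
```

After two merges on this toy corpus, the frequent word "low" has collapsed into a single subword unit, while "lower" and "lowest" are represented as "low" plus their remaining characters.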
What the compute-optimal training ratio is: roughly 20 tokens per parameter
The compute-optimal training ratio is approximately 20 training tokens per model parameter
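
As a worked example of the arithmetic, the sketch below turns a parameter count into a token budget. The ratio of 20 is the rough figure stated above. The FLOPs estimate uses the common approximation of 6 × parameters × tokens, which is an assumption added here and not something the text states.

```python
TOKENS_PER_PARAM = 20  # rough rule of thumb; not an exact constant

def compute_optimal_tokens(num_params, ratio=TOKENS_PER_PARAM):
    """Approximate training-token budget for a model of the given size."""
    return num_params * ratio

def approx_training_flops(num_params, num_tokens):
    """Rough training compute, assuming C ~ 6 * N * D (an added assumption)."""
    return 6 * num_params * num_tokens

params = 70_000_000_000                   # a 70B-parameter model
tokens = compute_optimal_tokens(params)   # 1.4 trillion tokens
flops = approx_training_flops(params, tokens)
```

Under this rule of thumb a 70B-parameter model calls for roughly 1.4 trillion training tokens.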
What score matching does: learns the gradient of the log-density without normalizing
Score matching fits a model to the gradient of the log-density (the score) without ever computing the normalizing constant
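
The reason normalization is unnecessary can be checked numerically. The sketch below does not train anything. It only compares the score of a normalized and an unnormalized one-dimensional Gaussian using finite differences, with hypothetical function names chosen for illustration. The normalizing constant is additive in log space, so it vanishes under differentiation.

```python
import math

def log_unnormalized(x, mu=0.0, sigma=1.0):
    """Log of exp(-(x - mu)^2 / (2 sigma^2)), with no normalizing constant."""
    return -((x - mu) ** 2) / (2 * sigma ** 2)

def log_normalized(x, mu=0.0, sigma=1.0):
    """Log-density of a proper Gaussian: the same shape minus log Z."""
    return log_unnormalized(x, mu, sigma) - math.log(sigma * math.sqrt(2 * math.pi))

def score(log_density, x, eps=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + eps) - log_density(x - eps)) / (2 * eps)

x = 1.5
score_unnormalized = score(log_unnormalized, x)
score_normalized = score(log_normalized, x)
```

Both estimates agree with the analytic score of a standard Gaussian, -(x - mu) / sigma^2, which is why a model of the score can be learned from an unnormalized density.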