
BPE tokenization merges the most frequent byte pairs iteratively to create subword units
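The merge loop can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the whitespace pre-splitting of the corpus are assumptions made for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from single UTF-8 bytes."""
    # Assumption: words are split on whitespace before byte-level encoding.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for word in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
```

After two merges on this toy corpus, every word begins with the learned subword `b"low"`.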
What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic
Operator fusion combines adjacent operations into a single kernel, so intermediate results stay in registers or cache instead of being written to memory and read back
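The effect can be illustrated in plain Python, with lists standing in for memory buffers. This is only an analogy for what a compiler does with kernels; the function names and the scale, bias, ReLU chain are assumptions chosen for the example.

```python
def unfused(x, scale, bias):
    """Three separate passes; each one materializes a full intermediate buffer."""
    scaled = [v * scale for v in x]           # pass 1: write len(x) values
    shifted = [v + bias for v in scaled]      # pass 2: read them back, write again
    return [max(v, 0.0) for v in shifted]     # pass 3: read again, write the result


def fused(x, scale, bias):
    """One pass; intermediates live only inside the loop body."""
    return [max(v * scale + bias, 0.0) for v in x]


data = [-2.0, -0.5, 0.0, 1.5, 3.0]
assert unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0)
```

Both versions compute the same values; the fused one simply avoids the two intermediate buffers and the extra reads and writes that come with them.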
Why attention is O(n²) in sequence length: every token attends to every other token
Self-attention computes a score for every pair of tokens, so the time and memory for the score matrix grow quadratically with sequence length
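A small sketch makes the pairwise cost concrete. It computes only the raw dot-product scores (no scaling, softmax, or value mixing), and the helper names and dummy vectors are assumptions for illustration.

```python
def attention_scores(queries, keys):
    """Dot product of every query with every key: an n x n score matrix."""
    return [
        [sum(q * k for q, k in zip(query, key)) for key in keys]
        for query in queries
    ]


def score_count(n, d=4):
    """Number of score entries produced for a sequence of n tokens."""
    vectors = [[1.0] * d for _ in range(n)]  # dummy token vectors
    scores = attention_scores(vectors, vectors)
    return sum(len(row) for row in scores)


# Doubling the sequence length quadruples the number of scores.
assert score_count(8) == 64
assert score_count(16) == 256
```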
What the compute-optimal training ratio is: roughly 20 tokens per parameter
Compute-optimal training ratio: roughly 20 training tokens per model parameter, following the Chinchilla scaling analysis
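As a quick worked example, the rule of thumb turns a parameter count into a token budget. The `6 * N * D` estimate of training FLOPs is a commonly used approximation that does not come from the text above, so treat it as an assumption; the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # rough ratio stated above, not an exact constant


def compute_optimal_tokens(num_params, ratio=TOKENS_PER_PARAM):
    """Approximate number of training tokens for a model of `num_params`."""
    return num_params * ratio


def approx_training_flops(num_params, num_tokens):
    """Assumed approximation: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens


params = 7_000_000_000
tokens = compute_optimal_tokens(params)       # 140_000_000_000 tokens
flops = approx_training_flops(params, tokens)
```

Under these assumptions, a 7-billion-parameter model would be trained on about 140 billion tokens.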
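Greedy vs beam search decoding: greedy picks the most probable token at each step, beam search maintains k candidate sequences
Greedy decoding commits to the single most probable token at each step, while beam search keeps the k highest-scoring partial sequences and extends each of them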
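The difference shows up on a toy model where the locally best first token leads to a worse overall sequence. `TOY_MODEL` below is a hypothetical lookup table of next-token probabilities keyed by the previous token, invented for this sketch.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}


def greedy_decode(model, start, steps):
    """Take the most probable next token at every step."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        dist = model[seq[-1]]
        token = max(dist, key=dist.get)
        logp += math.log(dist[token])
        seq.append(token)
    return seq, logp


def beam_decode(model, start, steps, k):
    """Keep the k best partial sequences by total log-probability."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in model[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]


greedy_seq, greedy_logp = greedy_decode(TOY_MODEL, "<s>", steps=2)
beam_seq, beam_logp = beam_decode(TOY_MODEL, "<s>", steps=2, k=2)
```

Here greedy returns `<s> a x` (probability 0.33), while a beam of width 2 finds `<s> b x` (probability 0.36), because it did not discard `b` after the first step.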
What the context window limit means: maximum number of tokens the model can process at once
The context window caps how many tokens the model can process at once; the prompt and the generated tokens together must fit within this fixed limit
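In practice the limit is usually handled with simple bookkeeping on token counts. The sketch below assumes one possible policy, dropping the oldest tokens when the input is too long; other policies exist, and both function names are illustrative.

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent `max_tokens` tokens (assumed truncation policy)."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


def remaining_budget(prompt_tokens, context_limit):
    """Tokens left for generation once the prompt is placed in the window."""
    return max(context_limit - len(prompt_tokens), 0)


history = list(range(10))              # stand-in for 10 token ids
window = fit_to_context(history, 4)    # keeps the last 4 ids
budget = remaining_budget(window, 8)   # 4 tokens left to generate
```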
What weight tying does in language models: shares embedding and output projection matrices
With weight tying, the input embedding and the output projection share one matrix, which removes a vocabulary-sized matrix from the parameter count
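A dependency-free sketch of the idea: one matrix serves both as the embedding table and as the output projection. The `TiedLM` class, its placeholder weight values, and `param_counts` are assumptions for illustration; real models use a tensor library and learned weights.

```python
class TiedLM:
    """Toy model whose embedding and output projection share one matrix."""

    def __init__(self, vocab_size, d_model):
        # Single (vocab_size x d_model) matrix; values are arbitrary placeholders.
        self.weight = [
            [0.01 * (i + j) for j in range(d_model)]
            for i in range(vocab_size)
        ]

    def embed(self, token_id):
        """Input side: look up the row for a token."""
        return self.weight[token_id]

    def logits(self, hidden):
        """Output side: score `hidden` against every row of the same matrix."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]


def param_counts(vocab_size, d_model):
    """Embedding plus output-projection parameters: (untied, tied)."""
    per_matrix = vocab_size * d_model
    return 2 * per_matrix, per_matrix


untied, tied = param_counts(vocab_size=50_000, d_model=512)
```

With a 50,000-token vocabulary and a model width of 512, tying saves 25.6 million parameters in this count.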