BPE tokenization merges the most frequent byte pairs iteratively to create subword units

What BPE tokenization does: iteratively merges the most frequent byte pairs

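Byte pair encoding (BPE) starts from individual bytes and repeatedly replaces the most frequent adjacent pair with a new merged symbol; the learned merges form a vocabulary of subword units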
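
The merge loop can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the function name `bpe_merges`, the stopping rule (stop when no pair occurs at least twice), and the lack of pre-tokenization are all assumptions made for the example.

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    """Learn up to num_merges merges over the UTF-8 bytes of text (toy sketch)."""
    # Start with one token per byte.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # assumed stopping rule: nothing left worth merging
            break
        merges.append((left, right))
        # Replace each occurrence of the chosen pair with one merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_merges("aaabdaaabac", 3)
print(merges[0])  # the first merge is the pair (b'a', b'a')
```

Joining the resulting tokens always reproduces the original bytes, since merging only regroups them.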

Related concepts

What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic

Operator fusion combines adjacent operations into a single kernel, so intermediate results stay in registers or cache instead of being written to memory and read back
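
A rough analogy in plain Python, assuming three elementwise ops (scale, add bias, ReLU). A real compiler fuses GPU or CPU kernels rather than list comprehensions; the function names here are hypothetical and only the memory-traffic pattern carries over.

```python
def unfused(xs, scale, bias):
    # Three separate passes: each op materializes a full intermediate list.
    tmp1 = [x * scale for x in xs]        # op 1 writes tmp1
    tmp2 = [t + bias for t in tmp1]       # op 2 reads tmp1, writes tmp2
    return [max(t, 0.0) for t in tmp2]    # op 3 reads tmp2

def fused(xs, scale, bias):
    # One pass: each element is read once and written once,
    # and the intermediates never leave the loop body.
    return [max(x * scale + bias, 0.0) for x in xs]

print(unfused([-1.0, 2.0], 2.0, 1.0))  # [0.0, 5.0]
print(fused([-1.0, 2.0], 2.0, 1.0))    # [0.0, 5.0]
```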

Why attention is O(n²) in sequence length: every token attends to every other token

Attention scores every query against every key, so a sequence of n tokens produces n² scores; time, and in a naive implementation memory, grows quadratically with sequence length
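
The quadratic cost is visible in the shape of the weight matrix. Below is a minimal single-head sketch in pure Python, assuming scaled dot-product scores and no masking; `attention_weights` is a name chosen for this example.

```python
import math

def attention_weights(queries, keys):
    """Return an n x n matrix of softmax attention weights (toy sketch)."""
    d = len(keys[0])
    weights = []
    for q in queries:                      # n queries ...
        row = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
               for k in keys]              # ... times n keys = n^2 scores
        m = max(row)                       # subtract the max for stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 tokens, d = 2
w = attention_weights(x, x)
print(len(w), len(w[0]))  # 4 4
```

Doubling the number of tokens quadruples the number of entries in `w`.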

What the compute-optimal training ratio is: roughly 20 tokens per parameter

For a fixed compute budget, training is roughly compute-optimal at about 20 training tokens per model parameter (the Chinchilla rule of thumb)
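
As a worked example of the rule of thumb, assuming the ratio is exactly 20 (in practice it is only approximate):

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio; treated as exact here for illustration

def compute_optimal_tokens(num_params: int) -> int:
    """Approximate number of training tokens for a model of the given size."""
    return TOKENS_PER_PARAM * num_params

print(compute_optimal_tokens(70_000_000_000))  # 1400000000000, i.e. 1.4 trillion tokens
```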

Greedy vs beam search decoding: greedy picks best token, beam maintains k candidates

Greedy decoding picks the single highest-probability token at each step, while beam search keeps the k highest-scoring partial sequences and extends each of them
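
The difference shows up when the locally best token leads to a worse sequence overall. This sketch uses a hypothetical toy model, `TABLE`, that maps the last token to a next-token distribution; a real model would condition on the whole sequence, and real beam search usually adds end-of-sequence handling and length normalization, both omitted here.

```python
import math

# Hypothetical toy model: next-token probabilities given the last token.
TABLE = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"C": 0.55, "D": 0.45},
    "B": {"E": 0.9, "F": 0.1},
}

def toy_step(seq):
    return TABLE[seq[-1]]

def greedy_decode(step_fn, start, steps):
    seq = [start]
    for _ in range(steps):
        probs = step_fn(seq)
        seq.append(max(probs, key=probs.get))  # best token right now
    return seq

def beam_search(step_fn, start, steps, k):
    beams = [(0.0, [start])]  # (log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        # Keep only the k highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:k]
    return beams[0][1]

print(greedy_decode(toy_step, "<s>", 2))   # ['<s>', 'A', 'C'], probability 0.33
print(beam_search(toy_step, "<s>", 2, 2))  # ['<s>', 'B', 'E'], probability 0.36
```

With k = 1 the beam search above reduces to greedy decoding.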

What the context window limit means: maximum number of tokens the model can process at once

The context window limit caps the total number of tokens, prompt plus generated output, that the model can process at once
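
One common way to stay under the limit is to trim the prompt before generation. The sketch below assumes a policy of keeping the most recent tokens and reserving room for the output; the function name and policy are illustrative choices, not a specific library's behavior.

```python
def fit_to_context(prompt_tokens, context_window, max_new_tokens):
    """Trim the prompt so prompt + generated tokens fit in the window."""
    budget = context_window - max_new_tokens  # tokens left for the prompt
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Assumed policy: drop the oldest tokens, keep the most recent ones.
    return prompt_tokens[-budget:]

print(fit_to_context(list(range(10)), context_window=8, max_new_tokens=3))
# [5, 6, 7, 8, 9]
```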

What weight tying does in language models: shares embedding and output projection matrices

Weight tying reuses the input embedding matrix as the output projection matrix, so the model stores one vocabulary-sized matrix instead of two
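
A minimal sketch of the idea, assuming a vocabulary of 3 tokens and hidden size 2. The class `TiedLM` is hypothetical and skips the transformer layers between embedding and output; it only shows that both directions read the same matrix.

```python
class TiedLM:
    def __init__(self, embedding):
        # A single V x d matrix serves as both input embedding
        # and output projection.
        self.embedding = embedding

    def embed(self, token_id):
        # Input side: look up the row for this token.
        return self.embedding[token_id]

    def logits(self, hidden):
        # Output side: dot the hidden state with every row of the same matrix.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.embedding]

model = TiedLM([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(model.logits(model.embed(2)))  # [1.0, 1.0, 2.0]
```

An untied model would hold two separate V x d matrices, so tying saves V x d parameters.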
