
The context window limit caps the number of tokens the model can process at once; anything beyond that fixed size must be truncated or dropped
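A minimal sketch of what the limit means in practice. It assumes whitespace splitting as a stand-in for a real tokenizer and a made-up limit of 8 tokens; `fit_to_context` is a hypothetical helper, not any library's API.

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent max_tokens tokens (left truncation)."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


# Whitespace splitting stands in for a real tokenizer (assumption).
prompt = "the quick brown fox jumps over the lazy dog near the river bank".split()

# The hypothetical limit of 8 drops the oldest tokens and keeps the newest.
window = fit_to_context(prompt, max_tokens=8)
print(len(prompt), len(window), window[0])
```

Dropping the oldest tokens is only one policy; summarizing or chunking the overflow are common alternatives.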
What the compute-optimal training ratio is: roughly 20 tokens per parameter
Compute-optimal training ratio: approximately 20 training tokens per model parameter (the Chinchilla scaling result)
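The ratio turns into simple arithmetic. The sketch below assumes a hypothetical 7-billion-parameter model and uses the common approximation of about 6 FLOPs per parameter per training token, which is not stated in the text above.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio from the text


def compute_optimal_tokens(n_params):
    """Training tokens suggested by the ~20 tokens/parameter rule."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params, n_tokens):
    """Rough training cost, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


params = 7_000_000_000                    # hypothetical model size
tokens = compute_optimal_tokens(params)   # 20 * 7e9 tokens
flops = training_flops(params, tokens)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```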
What BPE tokenization does: iteratively merges the most frequent byte pairs
BPE tokenization merges the most frequent byte pairs iteratively to create subword units
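The merge loop can be sketched in a few lines. This is a toy byte-level version under simplifying assumptions (no pre-tokenization, no special tokens, new ids numbered from 256); the function names are illustrative and do not come from any tokenizer library.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids, merges)
```

Each merge shortens the sequence and adds one subword unit to the vocabulary, which is the trade-off the number of merges controls.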
Why attention is O(n²) in sequence length: every token attends to every other token
Attention's cost comes from pairwise token interactions: n tokens produce n × n attention scores, so compute grows quadratically with sequence length
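To make the quadratic growth concrete, here is a pure-Python sketch of single-head scaled dot-product attention weights. It assumes toy deterministic vectors and omits the value projection; `attention_weights` and `make_tokens` are illustrative names.

```python
import math


def attention_weights(queries, keys):
    """Return the n x n matrix of softmax-normalized attention weights."""
    d = len(queries[0])
    weights = []
    for q in queries:  # n rows ...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]  # ... times n columns
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def make_tokens(n, d=4):
    """Deterministic toy vectors (illustrative values only)."""
    return [[math.sin(i + j) for j in range(d)] for i in range(n)]


for n in (4, 8, 16):
    x = make_tokens(n)
    w = attention_weights(x, x)
    # Doubling n quadruples the number of scores.
    print(n, len(w) * len(w[0]))
```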
What continuous batching does: adds new requests to a running batch without waiting
Continuous batching admits new requests into the running batch as soon as a slot frees up, rather than waiting for the whole batch to finish, which raises throughput and hardware utilization
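A small simulation shows where the gain comes from. It is a sketch under strong assumptions: every running request emits exactly one token per decode step, prefill cost is ignored, and the request lengths are made up. Both function names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Count decode steps when freed slots are refilled immediately."""
    waiting = deque(lengths)  # tokens still to generate, per request
    running = []
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token,
        # and finished requests leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_batch):
    """Count decode steps when each batch runs until its longest request ends."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))


lengths = [2, 8, 3, 2, 4, 1]  # made-up output lengths in tokens
print(static_batching_steps(lengths, max_batch=2),
      continuous_batching_steps(lengths, max_batch=2))
```

With static batching, short requests leave their slot idle until the longest request in the batch finishes; refilling the slot right away is what removes that idle time.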
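What 300-dim word2vec encodes: word meaning learned from co-occurrence within a skip-gram window
A 300-dimensional Word2Vec embedding represents each word as a vector trained on co-occurrence within a skip-gram context window, so words that appear in similar contexts end up with similar vectors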
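The skip-gram window determines which word pairs the model trains on. The sketch below only generates those (center, context) pairs; it assumes a fixed window of 2 and a toy sentence, and it leaves out the embedding training itself. `skipgram_pairs` is an illustrative name.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=2)
for center, context in pairs[:5]:
    print(center, "->", context)
```

During training, each pair nudges the center word's vector toward predicting its context word, which is how co-occurrence ends up encoded in the 300 dimensions.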
What weight tying does in language models: shares embedding and output projection matrices
Weight tying reuses the input embedding matrix as the output projection matrix, so the model stores one vocabulary-sized matrix instead of two
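A framework-free sketch of the idea, using plain lists instead of tensors. The class, the initial values, and the example sizes (vocabulary of 50,000, dimension 768) are all assumptions chosen for illustration.

```python
class TiedLM:
    """Toy model whose output projection is the embedding matrix itself."""

    def __init__(self, vocab_size, dim):
        # One matrix of shape (vocab_size, dim), used at both ends.
        self.embedding = [[0.01 * (i + j) for j in range(dim)]
                          for i in range(vocab_size)]
        self.output_projection = self.embedding  # same object, not a copy

    def embed(self, token_id):
        """Input side: look up the row for a token id."""
        return self.embedding[token_id]

    def logits(self, hidden):
        """Output side: dot the hidden state with every embedding row."""
        return [sum(w * h for w, h in zip(row, hidden))
                for row in self.output_projection]


def embedding_param_count(vocab_size, dim, tied):
    """Parameters spent on the embedding and output projection."""
    return vocab_size * dim * (1 if tied else 2)


model = TiedLM(vocab_size=5, dim=3)
print(len(model.logits(model.embed(2))))
print(embedding_param_count(50_000, 768, tied=True),
      embedding_param_count(50_000, 768, tied=False))
```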