RAG retrieves relevant documents before generating to reduce hallucination
Image: Bobulous, CC BY-SA 4.0, via Wikimedia Commons
Retrieval-augmented generation
RAG enables LLMs to access new information without retraining
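As a rough illustration of the retrieve-then-generate flow, the sketch below ranks a small in-memory document list against a query and builds a prompt from the best matches. The bag-of-words cosine scoring, the `DOCS` list, and the helper names `retrieve` and `build_prompt` are all assumptions made for this example; a real system would typically use learned embeddings and pass the prompt to an LLM.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a stand-in for a real embedding model.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Return the k documents most similar to the query.
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    # The retrieved text is placed in the prompt; the model weights are untouched.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

# Hypothetical knowledge base; appending a string here is how "new information" arrives.
DOCS = [
    "vLLM uses paged attention to manage the KV cache.",
    "Structured pruning removes entire filters or attention heads.",
    "Ring attention distributes long sequences across devices.",
]

print(build_prompt("What does structured pruning remove?", DOCS, k=1))
```

Updating what the model can draw on only requires editing `DOCS`, which is the sense in which no retraining is needed.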
Paged attention (vLLM)
Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous pages, which cuts memory fragmentation and lets more requests share a batch
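The paging idea can be sketched as a block table that maps each sequence's logical token positions to physical cache blocks handed out from a shared free list. This is a bookkeeping-only toy, not vLLM's implementation: the class `PagedKVCache` and its methods are hypothetical names, and no tensors are stored.

```python
class PagedKVCache:
    """Toy allocator: tracks which physical block holds each token's KV entry."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens cached

    def append_token(self, seq_id):
        # Allocate a new block only when the sequence's last block is full.
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id, position):
        # Translate a logical token position into (physical block, offset).
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        # Return a finished sequence's blocks to the pool for reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for seq in ["a", "b", "a", "a"]:  # two requests growing in interleaved order
    cache.append_token(seq)
print(cache.block_tables)  # sequence "a" ends up with non-adjacent blocks
```

Because blocks are allocated on demand, a sequence never reserves memory for tokens it has not generated yet, which is where the saved space for extra batched requests comes from.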
Structured pruning
Structured pruning removes entire filters or attention heads, not individual weights
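A small sketch of what removing whole filters looks like, assuming each filter is a flat list of weights and that filters are ranked by L1 norm. The L1 criterion is one common heuristic chosen for this example, not something the card specifies, and `prune_filters` is a hypothetical helper.

```python
def prune_filters(filters, keep_ratio=0.5):
    """Keep the filters with the largest L1 norm; drop the rest entirely."""
    n_keep = max(1, round(len(filters) * keep_ratio))
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original filter order
    return [filters[i] for i in kept], kept


# Four toy filters with two weights each.
filters = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
pruned, kept_indices = prune_filters(filters, keep_ratio=0.5)
print(kept_indices, pruned)
```

The result is a smaller dense layer, so it runs faster on ordinary hardware; zeroing individual weights would instead leave the shape unchanged and need sparse kernels to pay off.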
Ring attention
Ring attention distributes long sequences across multiple devices
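The sketch below simulates the ring on a single machine: each "device" keeps its own block of queries, the key/value blocks are passed around the ring one hop per step, and every device folds each incoming block into a running softmax. It is a simplified, unmasked, single-head illustration in plain Python; the function names and the list-of-blocks layout are assumptions, and real implementations overlap the communication with compute on actual accelerators.

```python
import math

def full_attention(q, keys, values):
    # Reference: softmax over all keys for one query vector.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / total for c in range(len(values[0]))]

def ring_attention(q_blocks, k_blocks, v_blocks):
    n = len(q_blocks)
    dim = len(v_blocks[0][0])
    # Running (max, denominator, weighted sum) per query, as in online softmax.
    state = [[{"m": -math.inf, "l": 0.0, "acc": [0.0] * dim} for _ in qb] for qb in q_blocks]
    kv = list(zip(k_blocks, v_blocks))  # kv[i] = block currently held by device i
    for _ in range(n):
        for dev in range(n):
            k_blk, v_blk = kv[dev]
            for q, st in zip(q_blocks[dev], state[dev]):
                for k, v in zip(k_blk, v_blk):
                    s = sum(a * b for a, b in zip(q, k))
                    m_new = max(st["m"], s)
                    scale = math.exp(st["m"] - m_new)  # rescale earlier partial sums
                    p = math.exp(s - m_new)
                    st["l"] = st["l"] * scale + p
                    st["acc"] = [a * scale + p * x for a, x in zip(st["acc"], v)]
                    st["m"] = m_new
        kv = kv[-1:] + kv[:-1]  # each device passes its K/V block to the next one
    return [[[a / st["l"] for a in st["acc"]] for st in dev] for dev in state]


q_blocks = [[[1.0, 0.0]], [[0.0, 1.0]]]
k_blocks = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [1.0, 1.0]]]
v_blocks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
print(ring_attention(q_blocks, k_blocks, v_blocks))
```

No device ever holds the full key/value sequence at once, yet after a full trip around the ring each query has attended to every key, so the output matches ordinary attention.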
Flashbulb memory
Flashbulb memories are vivid but not always accurate
Inception Score (IS)
Inception Score quantifies diversity and quality of generated images
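The score is usually written as IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's label distribution for one generated image and p(y) is the average of those distributions over all images. The sketch below computes it from already-predicted class probabilities; running the classifier itself (conventionally Inception v3) is assumed to happen elsewhere, and `inception_score` is a hypothetical helper name.

```python
import math

def inception_score(probs, eps=1e-12):
    """probs: list of per-image class-probability lists, each summing to 1."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y): average prediction over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)) for one image; eps guards against log(0).
        kl_sum += sum(
            pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
            for c, pc in enumerate(p)
        )
    return math.exp(kl_sum / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
confident_but_collapsed = [[1.0, 0.0], [1.0, 0.0]]
print(inception_score(confident_and_diverse))
print(inception_score(confident_but_collapsed))
```

Confident per-image predictions capture quality and a spread-out marginal captures diversity, so the score is high only when both hold; with two classes it ranges from 1 to 2.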