Instruction-level parallelism (ILP) achieves multiple operations per clock cycle
Image: Fensterblick., CC BY-SA 3.0, via Wikimedia Commons
Arithmetic intensity
Arithmetic intensity = FLOPs / bytes accessed
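As a rough illustration of the FLOPs-per-byte ratio, the sketch below compares two kernels. The helper names and the traffic model (each input read once, each output written once, 4-byte elements) are assumptions made for this example, not measurements of any real hardware.

```python
def arithmetic_intensity(flops: float, bytes_accessed: float) -> float:
    """Return floating-point operations per byte of memory traffic."""
    if bytes_accessed <= 0:
        raise ValueError("bytes_accessed must be positive")
    return flops / bytes_accessed


def vector_add_intensity(n: int, bytes_per_element: int = 4) -> float:
    # Assumed traffic model: read a and b, write c, so 3 * n elements move.
    flops = n  # one addition per element
    return arithmetic_intensity(flops, 3 * n * bytes_per_element)


def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    # Assumed ideal case: each n x n matrix (A, B, C) touches memory once.
    flops = 2 * n**3  # n^3 multiplies plus roughly n^3 additions
    return arithmetic_intensity(flops, 3 * n * n * bytes_per_element)


if __name__ == "__main__":
    print(f"vector add:  {vector_add_intensity(1_000_000):.3f} FLOPs/byte")
    print(f"1024 matmul: {matmul_intensity(1024):.1f} FLOPs/byte")
```

Under these assumptions, vector addition stays at 1/12 FLOP per byte regardless of size, while the matrix multiply ratio grows linearly with n.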
Instruction tuning
Fine-tunes a model on (instruction, response) pairs
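One way to picture an (instruction, response) pair as training data is sketched below. The prompt template and the `build_example` helper are hypothetical, chosen only for illustration; real fine-tuning pipelines use their own templates and tokenizers.

```python
# Hypothetical template; actual instruction-tuning datasets define their own.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_example(instruction: str, response: str) -> dict:
    """Turn one (instruction, response) pair into a training record."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    return {
        "prompt": prompt,           # what the model is conditioned on
        "target": response,         # what the model should learn to produce
        "text": prompt + response,  # full sequence fed to the trainer
    }


if __name__ == "__main__":
    pairs = [
        ("Add 2 and 3.", "5"),
        ("Name the capital of France.", "Paris"),
    ]
    for instruction, response in pairs:
        print(build_example(instruction, response)["text"])
        print("---")
```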
Von Neumann architecture
The CPU must fetch both instructions and data from the same memory
Memory hierarchy
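Memory hierarchy levels: registers → L1 → L2 → L3 → RAM → SSD → HDD (each level is larger but slower than the one before; the gap is a few × between cache levels and several orders of magnitude between RAM and storage)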
Tensor cores
Tensor cores perform a 4×4 matrix multiply-accumulate (D = A × B + C) per clock cycle, the building block of GEMM (General Matrix Multiply)
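The arithmetic behind a single 4×4 tensor core operation can be sketched in plain Python. This models only the math, assuming the multiply-accumulate form D = A × B + C; the function name `mma_4x4` is made up for this example, and real hardware performs the whole tile in parallel rather than in a loop.

```python
def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows."""
    n = 4
    for m in (a, b, c):
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all operands must be 4x4")
    # 4 * 4 * 4 = 64 multiply-adds, which the hardware issues together.
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    b = [[i * 4 + j for j in range(4)] for i in range(4)]
    ones = [[1] * 4 for _ in range(4)]
    for row in mma_4x4(identity, b, ones):
        print(row)
```

Larger GEMMs are built by tiling the matrices and accumulating many such small products into C.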
Single instruction, multiple data (SIMD)
SIMD applies a single instruction to multiple data elements simultaneously
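The idea can be modelled in Python by treating each group of `lanes` elements as one vector instruction. This is a conceptual sketch only: the `simd_add` helper and the lane width of 4 are assumptions, and the inner list comprehension stands in for work that real SIMD hardware does in a single step.

```python
def simd_add(a, b, lanes=4):
    """Add two equal-length lists, counting one 'instruction' per lane group."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    instructions = 0
    for start in range(0, len(a), lanes):
        # One simulated vector instruction handles up to `lanes` elements.
        out.extend(x + y for x, y in zip(a[start:start + lanes],
                                         b[start:start + lanes]))
        instructions += 1
    return out, instructions


if __name__ == "__main__":
    result, count = simd_add(list(range(8)), [10] * 8)
    print(result, "using", count, "vector instructions")
```

With 8 elements and 4 lanes, the model issues 2 vector instructions where a scalar loop would issue 8 additions.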