Use RNN/LSTM for sequential data where order matters (mostly replaced by transformers)
Image: N509FZ, CC BY-SA 4.0, via Wikimedia Commons
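To make "order matters" concrete, here is a minimal sketch of a scalar tanh recurrence. The weights `w_x` and `w_h` and the two input sequences are hypothetical values chosen for illustration, not a trained model, and a real RNN or LSTM would use weight matrices and gates.

```python
import math

def rnn_final_state(inputs, w_x=1.0, w_h=0.5):
    """Run a scalar tanh RNN over the inputs and return the last hidden state."""
    h = 0.0
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_x * x + w_h * h)
    return h

# Same values, different order (hypothetical toy sequences).
early_spike = [1.0, 0.0, 0.0]
late_spike = [0.0, 0.0, 1.0]

h_early = rnn_final_state(early_spike)
h_late = rnn_final_state(late_spike)

# An order-blind summary (the sum) cannot tell the two sequences apart.
print(sum(early_spike) == sum(late_spike))
print(f"final state, early spike: {h_early:.3f}; late spike: {h_late:.3f}")
```

The sums match, but the final hidden states differ because the early spike has decayed through two more recurrence steps.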
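Transformers use LayerNorm, not BatchNorm
LayerNorm normalizes each token across its own features, so it does not depend on batch size or sequence length; BatchNorm relies on statistics computed across the batch, which become unreliable with small batches and padded, variable-length sequences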
When to use a log-transform: when data is right-skewed or spans multiple orders of magnitude
Log-transform: Apply when data is right-skewed or spans multiple orders of magnitude; values must be positive (use log(1 + x) when zeros are present)
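A small sketch of the effect, using only the standard library. The `incomes` list is made-up data and `skewness` is a helper written for this example (population form of the third standardized moment), not a library function.

```python
import math
import statistics

def skewness(xs):
    """Third standardized moment (population form); > 0 means right-skewed."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return sum(((x - mean) / std) ** 3 for x in xs) / len(xs)

# Hypothetical right-skewed data spanning four orders of magnitude.
incomes = [1, 2, 3, 5, 8, 10, 20, 50, 100, 1000, 10000]

# log1p(x) = log(1 + x), which stays defined when x is 0.
logged = [math.log1p(x) for x in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(logged)
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

The transformed values are still somewhat right-skewed, but far less so, and the largest value no longer dominates the scale.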
Pre-LN transformers are easier to train
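Pre-LN transformers place LayerNorm inside the residual branch, leaving the skip connection as an unnormalized identity path; gradients flow through it more smoothly during backpropagation (Post-LN also has residual connections, but normalizes after the addition)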
Pre-LN vs Post-LN
Pre-LN: LayerNorm before each sublayer (attention or feed-forward), inside the residual branch; Post-LN: LayerNorm after the residual addition
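The two orderings can be sketched on a single feature vector. This is a simplified illustration: `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward layer, and `layer_norm` omits the learnable gain and bias that real implementations include.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward layer."""
    return [2.0 * v + 1.0 for v in x]

def pre_ln_block(x, sublayer):
    # Pre-LN: y = x + sublayer(LN(x)); the skip path carries x unchanged.
    branch = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]

def post_ln_block(x, sublayer):
    # Post-LN: y = LN(x + sublayer(x)); the skip path is normalized too.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

x = [1.0, 2.0, 4.0, 8.0]
pre = pre_ln_block(x, toy_sublayer)
post = post_ln_block(x, toy_sublayer)
print("pre-LN: ", [round(v, 3) for v in pre])
print("post-LN:", [round(v, 3) for v in post])
```

In the Pre-LN output the original `x` is still present as an additive term, while the Post-LN output is always renormalized to zero mean.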
Batch norm vs layer norm: BN across batch, LN across features
Batch norm (BN) normalizes each feature across the examples in a batch; layer norm (LN) normalizes each example across its own features, so LN is independent of batch composition and handles variable-length sequences
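The difference in normalization axis can be shown on a tiny batch stored as a list of rows (one row per example). This is a sketch without learnable parameters or running statistics; the `batch` values are made up for illustration.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    # LN: normalize each row (one example) across its features.
    return [normalize(row) for row in batch]

def batch_norm(batch):
    # BN: normalize each column (one feature) across the examples.
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# Hypothetical batch: 2 examples, 3 features each.
batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0]]

ln = layer_norm(batch)
bn = batch_norm(batch)

# LN of an example is the same whether or not other examples are present.
ln_alone = layer_norm([batch[0]])
print(ln_alone[0] == ln[0])
```

Running `batch_norm` on the first example alone would give a different result than running it on the full batch, which is the batch dependence that LN avoids.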
When to use a CNN: for data with local, grid-like structure such as images (2D) or time series (1D)
CNNs excel at recognizing local patterns in grid-structured data such as images or time series, because the same small filter is applied at every position
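A minimal sketch of the core operation for the 1D (time series) case. The `edge_kernel` filter is hand-set for illustration; in a real CNN the filter weights are learned, and the layer would also have a bias, multiple channels, and a nonlinearity.

```python
def conv1d(signal, kernel):
    """Valid-mode cross-correlation, the sliding-window op used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hypothetical hand-set filter that responds to an upward step.
edge_kernel = [-1.0, 1.0]

# The same step pattern at two different positions (made-up series).
series = [0, 0, 0, 5, 5, 5, 0, 0]
shifted = [0, 0, 0, 0, 0, 5, 5, 5]

response = conv1d(series, edge_kernel)
response_shifted = conv1d(shifted, edge_kernel)

# Position of the strongest response in each case.
peak = response.index(max(response))
peak_shifted = response_shifted.index(max(response_shifted))
print(peak, peak_shifted)
```

Because the filter is reused at every position, moving the pattern two steps later moves the peak response two steps later without needing any new weights.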