Language models use tied weights to share embedding and output projection matrices, enhancing parameter efficiency

What weight tying does in language models: shares embedding and output projection matrices

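Weight tying uses one matrix for both the input embedding and the output projection: the embedding table maps token IDs to vectors, and its transpose maps final hidden states to vocabulary logits. This removes one vocabulary-size × model-dimension matrix from the parameter count.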
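
A minimal sketch of the idea in plain Python, assuming toy sizes (`VOCAB_SIZE`, `D_MODEL`) and a randomly initialised matrix `E`; all of these names are illustrative and not taken from any particular model. The same matrix serves as the lookup table on the input side and as the projection on the output side.

```python
import random

VOCAB_SIZE, D_MODEL = 6, 4  # toy sizes, chosen only for illustration

random.seed(0)
# One shared matrix E with shape (VOCAB_SIZE, D_MODEL).
E = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]

def embed(token_id):
    """Input side: look up row `token_id` of E."""
    return E[token_id]

def output_logits(hidden):
    """Output side: reuse E, so logit[v] = dot(hidden, E[v]), i.e. hidden @ E^T."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

# Feed an embedding straight to the output layer (no transformer blocks here).
logits = output_logits(embed(2))

# Parameter count for embedding plus output projection.
tied_params = VOCAB_SIZE * D_MODEL        # one shared matrix
untied_params = 2 * VOCAB_SIZE * D_MODEL  # two separate matrices
```

With tying, the two layers cost one matrix instead of two, which matters most when the vocabulary is large relative to the rest of the model.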

Related concepts

How does the attention mechanism in transformer models improve language understanding by dynamically weighting input tokens during sequence encoding?

For each token, attention computes a softmax-normalized weight over every input token from query-key similarity, then mixes the value vectors using those weights, so each token's representation depends on its context.
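
The weighting can be sketched as scaled dot-product attention in plain Python. This is an illustrative single-head version with no learned projections; the toy input `x` and the helper names are assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: the same vectors act as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` sums to 1, and tokens whose keys align with the query receive the larger shares.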

Why ALiBi allows length extrapolation better than learned position embeddings

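ALiBi uses no position embeddings. It adds a fixed, non-learned penalty to each attention score that grows linearly with the distance between query and key. Because the penalty is defined for any distance, the model can run on sequences longer than those seen in training without retraining.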
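
A sketch of the linear bias in plain Python, assuming a causal decoder and a power-of-two number of heads, for which the head slopes form a geometric sequence starting at 2^(-8/n). The function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ... (power-of-two head counts)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores for one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future keys (causal masking folded in for convenience).
    """
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
short = alibi_bias(4, slopes[0])
# The same rule applies unchanged to a much longer sequence.
long_bias = alibi_bias(1000, slopes[0])
```

Nothing in `alibi_bias` depends on a maximum length, whereas a learned position-embedding table has no entry for positions beyond the ones it was trained on.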

What 300-dim word2vec encodes: trained on word co-occurrence with skip-gram window

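A 300-dimensional word2vec vector encodes a word's typical contexts: skip-gram trains each vector to predict the words inside a fixed window around it, so words that occur in similar contexts end up with similar vectors.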
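
The skip-gram window can be illustrated by generating the (center, context) training pairs. This sketch covers only pair generation, not the training of the vectors; the function name and example sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window pulls in broader, more topical context.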

What LoRA does — adds trainable low-rank matrices A and B where ΔW = BA

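LoRA freezes the pretrained weight matrix W and learns a low-rank update ΔW = BA, where B is d × r, A is r × k, and the rank r is much smaller than d and k. Only A and B are trained, and the effective weight is W + BA.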
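
A small numeric sketch of ΔW = BA in plain Python, assuming toy shapes (a 4 × 3 weight with rank 1) and hand-picked values; the optional scaling factor used in practice is omitted. B starts at zero so the adapted weight initially equals W.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (m x n) and Y (n x p)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def lora_weight(W, B, A):
    """Effective weight W + BA; W itself is never modified."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

D_OUT, D_IN, RANK = 4, 3, 1  # toy sizes for illustration

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]                    # frozen pretrained weight
A = [[0.5, -0.5, 1.0]]                   # RANK x D_IN, trainable
B = [[0.0] for _ in range(D_OUT)]        # D_OUT x RANK, zero at initialisation

at_init = lora_weight(W, B, A)           # equals W because B is zero

# Pretend training has updated B (values are made up).
B = [[1.0], [0.0], [0.0], [2.0]]
adapted = lora_weight(W, B, A)

full_params = D_OUT * D_IN               # parameters in a full update
lora_params = RANK * (D_OUT + D_IN)      # parameters in A and B
```

The saving is small at these toy sizes but grows quickly: for large d and k, r(d + k) is a tiny fraction of d × k.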

What AWQ does differently — activation-aware weight quantization preserves important weights

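AWQ uses activation statistics to identify the small fraction of weight channels that matter most, then scales those channels before quantization so they lose less precision. All weights still end up in the same low-bit format.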
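
A toy sketch of the scaling idea in plain Python, not the actual AWQ algorithm: the weights and the per-channel scales below are hard-coded assumptions, whereas AWQ derives its scales from calibration activations. Scaling a salient channel up before rounding and dividing it back afterwards shrinks that channel's rounding error.

```python
def quantize_row(weights, bits=3):
    """Symmetric round-to-nearest quantization with one step size per row."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]

def scaled_quantize(weights, channel_scales, bits=3):
    """Scale channels, quantize, then undo the scale.

    Dividing by the scale here stands in for dividing the matching
    activations by the same scale at inference time.
    """
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    quantized = quantize_row(scaled, bits)
    return [q / s for q, s in zip(quantized, channel_scales)]

weights = [0.30, -0.80, 0.11, 0.52]
# Assumption for the example: channel 2 sees the largest activations.
scales = [1.0, 1.0, 2.0, 1.0]

plain = quantize_row(weights)
aware = scaled_quantize(weights, scales)

err_plain = abs(plain[2] - weights[2])
err_aware = abs(aware[2] - weights[2])
```

In this example the small weight on channel 2 rounds to zero under plain quantization, but survives with a much smaller error once it is scaled up first.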

What causal masking does — prevents attention to future tokens in the decoder

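Causal masking sets the attention scores for future positions to negative infinity before the softmax, so each token in the decoder attends only to itself and earlier tokens. This preserves the autoregressive property: the prediction at position i cannot depend on tokens after i.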
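
A minimal sketch of the mask in plain Python, assuming uniform raw scores so the effect of masking is easy to see; the helper names are illustrative.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax in which disallowed positions get a score of -inf."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because the diagonal is always allowed
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Equal raw scores everywhere; only the mask shapes the result.
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]
```

The first token can only attend to itself, the second splits its attention over two positions, and future positions always receive exactly zero weight.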
