Multi-Head Attention

Runs several attention operations in parallel, each on its own learned projection of the queries, keys, and values, then concatenates the per-head outputs and applies a final linear projection. This lets a Transformer layer attend to multiple relational patterns simultaneously.
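
A minimal sketch of the idea in PyTorch (assumed here; the class name, `d_model`, and `num_heads` are illustrative, not from this note):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Learned projections; each head sees its own d_head-sized slice.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # projection after concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), self-attention case
        batch, seq_len, _ = x.shape

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, run in parallel across all heads.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)

# Toy usage: 8 heads over a 512-dim model.
mha = MultiHeadAttention(d_model=512, num_heads=8)
y = mha(torch.randn(2, 10, 512))  # y.shape == (2, 10, 512)
```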
