Keys, Queries, and Values: The celestial mechanics of attention
The attention mechanism is what makes Large Language Models like ChatGPT or DeepSeek so good with language. But how does it work? One way to see it is as a mechanism that uses similarity to figure out which parts of the text to pay more or less attention to. For this, we use word embeddings.
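As a rough illustration (not the exact notation from the video), here is a minimal NumPy sketch of similarity-based attention: each word's embedding is compared with every other word's embedding using a dot product, the scores are turned into weights with a softmax, and each embedding is updated as a weighted average of all the embeddings. The words and values are made up for the example.

```python
import numpy as np

# Toy 2D embeddings for three words (hypothetical values).
embeddings = np.array([
    [1.0, 0.0],   # "apple"
    [0.9, 0.2],   # "orange"
    [0.0, 1.0],   # "phone"
])

# Similarity of every word to every other word (dot products).
scores = embeddings @ embeddings.T

# Softmax turns each row of scores into attention weights.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each embedding moves toward the words it attends to most.
new_embeddings = weights @ embeddings
print(new_embeddings)
```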
I like to picture word embeddings as words flying around in the universe, like planets and stars. In this picture, the attention mechanism (the Keys, Queries, and Values matrices) defines the fabric of that universe and its laws of gravity, which resemble (yet in some ways differ greatly from) the laws of gravity that rule our own universe.
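And here is where the Keys, Queries, and Values matrices fit into the sketch above: instead of comparing the raw embeddings, we first project them with three matrices W_Q, W_K, W_V (learned during training; random here, purely for illustration). This is what lets the model reshape the "gravity" of the embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2  # embedding dimension (toy value)

embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.2],
    [0.0, 1.0],
])

# In a trained model these matrices are learned; here they are random.
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = embeddings @ W_Q   # queries: what each word is looking for
K = embeddings @ W_K   # keys: what each word offers to be matched on
V = embeddings @ W_V   # values: what each word actually contributes

# Scaled dot-product attention.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
output = weights @ V
print(output)
```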
Come join me in this celestial adventure in the universe of language!
See the other videos in this LLM series:
The attention mechanism in LLMs: https://www.youtube.com/watch?v=OxCpWwDCDFQ
The math behind attention mechanisms: https://www.youtube.com/watch?v=UPtG_38Oq8o
Transformer models: https://www.youtube.com/watch?v=qaWMOYf4ri8
Get the Grokking Machine Learning book!
https://manning.com/books/grokking-ma...
Discount code (40%): serranoyt
(Use the discount code at checkout)
01:55 Similarity
02:12 Embeddings
04:56 Attention
07:14 Dot product
09:29 Cosine similarity
11:10 The Keys and Queries matrices
14:19 Compressing and stretching dimensions
18:50 Combining dimensions
23:14 Asymmetric pull
40:57 Multi-head attention
45:14 The Value matrix
49:24 Summary