Transformer
We explain how self-attention works with a simple concrete example. Consider a short sequence of three words: “The cat sleeps”.
First, we convert each word into an embedding vector. For this toy example, we use very small 4-dimensional embeddings:
The = [1, 0, 1, 0]
cat = [0, 1, 1, 1]
sleeps = [1, 1, 0, 1]
In self-attention, each word needs to “attend” to all other words in the sequence. This happens in three steps: learned weight matrices ($W_q$, $W_k$, $W_v$) first transform the embeddings into queries, keys, and values; we then compute attention scores from the queries and keys; and finally we take a weighted sum of the values using those scores. Multiplying our word embeddings by these matrices gives:
import numpy as np
# Input token embeddings (3 tokens, each with 4 features)
S = np.array([
    [1, 0, 1, 0],  # "The"
    [0, 1, 1, 1],  # "cat"
    [1, 1, 0, 1],  # "sleeps"
])
# Initialize weights for Query, Key, and Value (each 4x4)
W_q = np.array([
    [0.2, 0.4, 0.6, 0.8],
    [0.1, 0.3, 0.5, 0.7],
    [0.9, 0.8, 0.7, 0.6],
    [0.5, 0.4, 0.3, 0.2],
])
W_k = np.array([
    [0.1, 0.3, 0.5, 0.7],
    [0.6, 0.4, 0.2, 0.1],
    [0.8, 0.9, 0.7, 0.6],
    [0.2, 0.1, 0.3, 0.4],
])
W_v = np.array([
    [0.3, 0.5, 0.7, 0.9],
    [0.6, 0.4, 0.2, 0.1],
    [0.8, 0.9, 0.7, 0.6],
    [0.5, 0.4, 0.3, 0.2],
])
# Compute Query, Key, and Value matrices
Q = S @ W_q
K = S @ W_k
V = S @ W_v
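As a quick sanity check (not part of the attention computation itself), we can print the projected matrices; each is 3×4, with one row per token. The rounding to two decimals is just for readability:
# Inspect the projections: one row per token, one column per feature
print("Q =\n", np.round(Q, 2))  # rows: [1.1 1.2 1.3 1.4], [1.5 1.5 1.5 1.5], [0.8 1.1 1.4 1.7]
print("K =\n", np.round(K, 2))  # rows: [0.9 1.2 1.2 1.3], [1.6 1.4 1.2 1.1], [0.9 0.8 1.0 1.2]
print("V =\n", np.round(V, 2))  # rows: [1.1 1.4 1.4 1.5], [1.9 1.7 1.2 0.9], [1.4 1.3 1.2 1.2]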
Next, we compute attention scores by multiplying $Q$ with $K^T$, scaling by $\sqrt{d}$ (here $d = 4$), and applying a row-wise softmax.
# Compute scaled dot-product attention
d = Q.shape[1] # Feature dimension
attention_scores = Q @ K.T / np.sqrt(d)
def softmax(x):
    """Compute softmax values for each row of scores in x."""
    # Subtracting the row max improves numerical stability without changing the result
    x = x - np.max(x, axis=1, keepdims=True)
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
attention_weights = softmax(attention_scores)
The attention weights are $$\text{softmax}\left(\frac{Q K^T}{\sqrt{4}}\right) = \begin{bmatrix} 0.324 & 0.467 & 0.209 \\ 0.305 & 0.515 & 0.180 \\ 0.346 & 0.432 & 0.222 \end{bmatrix}.$$ Each row shows how strongly one word attends to every word in the sentence. Reading the third row, “sleeps” pays most attention to “cat” (0.432), some attention to “The” (0.346), and less attention to itself (0.222). Finally, we use these weights to form a weighted sum of the values:
output = attention_weights @ V
The final output is $$\text{output} = \begin{bmatrix} 1.536 & 1.519 & 1.265 & 1.157 \\ 1.566 & 1.536 & 1.261 & 1.137 \\ 1.512 & 1.507 & 1.269 & 1.174 \end{bmatrix}.$$
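Putting the steps together, here is a minimal sketch of the whole computation wrapped in a single function. The name self_attention and the rounding are our own choices for this walkthrough (not a library API), and the function reuses the softmax defined above:
def self_attention(S, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, exactly as in the steps above."""
    Q, K, V = S @ W_q, S @ W_k, S @ W_v          # project embeddings to queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[1])       # scaled dot-product scores
    weights = softmax(scores)                    # row-wise attention weights
    return weights @ V                           # weighted sum of values

print(np.round(self_attention(S, W_q, W_k, W_v), 3))
# Reproduces the output matrix above, e.g. the first row is ~[1.536, 1.519, 1.265, 1.157]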