The attention mechanism is a cornerstone of transformer models, enabling them to excel in tasks like language translation, text summarization, and more. In this article, we’ll break down how attention works in transformers and provide Python code using NumPy to illustrate the key components. We'll also discuss the assumptions and simplifications made along the way to help you understand what's happening behind the scenes.
What is Attention?
At its core, attention allows the model to focus on different parts of the input sequence when processing each word. This is especially useful for understanding context and the relationships between words, because it lets the model assign an appropriate amount of importance to each word based on its relevance. Before attention, sequence models had to squeeze everything they had seen into a single representation and often lost track of the important bits.
Key Components of Attention
Input Representation
Self-Attention Calculation
Multi-Head Attention
Feedforward Neural Network
Let’s dive into each component with some Python code, keeping in mind the underlying simplifications!
1. Input Representation
First, we convert words into numerical representations (embeddings). These embeddings capture some of the semantic information of each word. For simplicity, let’s assume we have a small vocabulary where each word is represented by a fixed-length vector.
Assumptions and Simplifications
Small Vocabulary: We're using only four words with manually assigned embeddings. In real applications, models use large, pre-trained embeddings (e.g., Word2Vec or embeddings learned during training) that capture much richer information; a minimal lookup sketch appears after the code below.
One-Hot Encoding-Like Representation: Here, the vectors are quite simple and resemble one-hot encodings with minor variations. Actual word embeddings are dense vectors, often of much higher dimensionality.
import numpy as np

# Example embeddings for 4 words
word_embeddings = {
    'the': np.array([1, 0, 0]),
    'cat': np.array([0, 1, 0]),
    'sat': np.array([0, 0, 1]),
    'on': np.array([1, 1, 0])
}

# Convert words to embeddings
def get_embedding(word):
    return word_embeddings.get(word)

# Example usage
sentence = ['the', 'cat', 'sat', 'on']
embeddings = np.array([get_embedding(word) for word in sentence])
print("Embeddings:\n", embeddings)
2. Self-Attention Calculation
Next, we compute self-attention scores. Each word will pay attention to every other word in the sentence, including itself. This helps in capturing relationships between different words based on their context.
Assumptions and Simplifications
Dot Product for Similarity: We use a simple dot product to calculate similarity scores between words. In real transformers, this process involves separate weight matrices for queries, keys, and values, which allow the model to learn different aspects of similarity.
No Scaling: In practice, the dot product is scaled by the square root of the dimensionality of the key vectors to prevent extremely large values, which we omit here for simplicity. For comparison, a sketch of the scaled, projected version appears after the code below.
def softmax(x):
    # Subtract the row max for numerical stability, then normalize each row
    # so that every word's attention weights sum to 1
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def self_attention(embeddings):
    # Calculate dot product of embeddings (similarity scores)
    scores = np.dot(embeddings, embeddings.T)
    # Apply softmax row-wise to get attention weights
    attention_weights = softmax(scores)
    # Calculate the weighted sum of embeddings
    output = np.dot(attention_weights, embeddings)
    return output, attention_weights

# Calculate self-attention
output, attention_weights = self_attention(embeddings)
print("Attention Weights:\n", attention_weights)
print("Self-Attention Output:\n", output)
3. Multi-Head Attention
In practice, transformers use multiple heads to capture different relationships in the data. Each head can focus on different aspects of the input, which improves the model's ability to understand complex patterns.
Assumptions and Simplifications
Independent Heads with Same Calculations: In this example, we calculate multiple heads by simply repeating the self-attention process. In actual transformers, each head has its own learned parameters, allowing them to focus on different features of the input.
Concatenation: We concatenate the outputs from different heads, but in real implementations, these outputs are projected back into the original space using learned weight matrices. A sketch with per-head projections follows the example below.
def multi_head_attention(embeddings, num_heads=2):
    head_outputs = []
    for _ in range(num_heads):
        output, _ = self_attention(embeddings)
        head_outputs.append(output)
    # Concatenate outputs from all heads along the feature dimension
    return np.concatenate(head_outputs, axis=-1)

# Calculate multi-head attention
multi_head_output = multi_head_attention(embeddings)
print("Multi-Head Attention Output:\n", multi_head_output)
4. Feedforward Neural Network
After obtaining the multi-head output, we pass it through a feedforward neural network. This step helps in transforming the attended features into a richer representation.
Assumptions and Simplifications
Simple Network Structure: We use a basic feedforward network with one hidden layer. In practice, transformers use more sophisticated architectures, often with multiple layers and normalization steps (e.g., LayerNorm) to stabilize training; a sketch of a residual connection with layer normalization follows the example below.
Random Weights: The weights here are randomly generated, whereas real models learn these weights during training to optimize performance on specific tasks.
def feedforward_network(x):
    # Simple feedforward network with one hidden layer
    # Hidden layer weights
    W1 = np.random.rand(x.shape[1], x.shape[1] * 2)
    # Hidden layer biases
    b1 = np.random.rand(x.shape[1] * 2)
    # Output layer weights
    W2 = np.random.rand(x.shape[1] * 2, x.shape[1])
    # Output layer biases
    b2 = np.random.rand(x.shape[1])
    # ReLU activation
    hidden_layer = np.maximum(0, np.dot(x, W1) + b1)
    output_layer = np.dot(hidden_layer, W2) + b2
    return output_layer

# Pass through feedforward network
final_output = feedforward_network(multi_head_output)
print("Final Output:\n", final_output)
Conclusion
The attention mechanism allows transformers to understand context and relationships between words effectively. By implementing self-attention and multi-head attention using NumPy, we can see how these components work together to process language.
Assumptions Recap
We used fixed, simple embeddings rather than learned, high-dimensional vectors.
The self-attention calculation was simplified, omitting key-query-value transformations and scaling.
Multi-head attention was implemented without learned parameters for each head.
The feedforward network was basic, lacking many components used in real transformers.
This simplified version helps build intuition for the attention mechanism, but real transformer models are far more complex and powerful. Try exploring these concepts further on your own, or reach out to me on LinkedIn!
For more insights into AI and machine learning concepts like this one, subscribe to my Substack! 🚀