Demystifying the Transformer

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," represents a paradigm shift in artificial intelligence and natural language processing. It moved away from sequential processing (like RNNs) to a parallelized approach centered around a powerful concept called the **self-attention mechanism**. This innovation unlocked the ability to train much larger models and capture complex, long-range dependencies in data.

This interactive guide is designed to break down the Transformer into its core components. Use the navigation on the left to explore each piece of the architecture. Each section provides a clear explanation of its purpose, an interactive visualization to demonstrate its function, and real-world analogies to make the concepts more intuitive. The goal is to build a solid understanding of how these powerful models work from the ground up.

Key Innovations

  • Self-Attention: Allows every word in a sentence to weigh the importance of every other word, creating a rich contextual understanding.
  • Parallel Processing: Processes all input tokens at once, making training dramatically faster and more efficient than sequential models.
  • Positional Encodings: Injects information about word order, a crucial feature that attention mechanisms lack on their own.
  • Encoder-Decoder Stack: A modular design that allows for deep, hierarchical learning and is highly adaptable for various tasks.

High-Level Architecture

The Transformer is fundamentally an **encoder-decoder** model. The encoder's job is to process the entire input sentence (e.g., in English) and transform it into a sequence of rich numerical representations, one per token, that capture its meaning and context. The decoder then takes these representations and generates the output sentence (e.g., in French), one word at a time. Both the encoder and decoder are stacks of multiple identical layers (the original paper used 6 each), allowing the model to learn increasingly abstract features.

Explore the diagram below. The key information flow is from the input, through the encoder stack, then to the decoder stack, which also considers its own previous outputs to generate the next word. This structure is highly effective for sequence-to-sequence tasks like machine translation.

Diagram: Input Sentence → Encoder Block (Nx) → ... → Final Encoder; Output Sentence (so far) → Decoder Block (Nx) → ... → Final Decoder → Predicted Next Word.
The final representation from the Encoder is fed into every Decoder block.
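
Before diving into the individual pieces, here is a minimal Python sketch of this flow, assuming each `layer` is a callable standing in for the encoder and decoder blocks described in the rest of this guide. It is a structural outline, not a working model.

```python
def encode(src_embeddings, encoder_layers):
    """Run the input through the stack of Nx identical encoder blocks."""
    x = src_embeddings
    for layer in encoder_layers:
        x = layer(x)                 # each block: self-attention + feed-forward
    return x                         # contextual representation of the whole input

def decode(tgt_embeddings, memory, decoder_layers):
    """Run the output-so-far through the decoder stack, attending to the encoder output."""
    y = tgt_embeddings
    for layer in decoder_layers:
        y = layer(y, memory)         # each block: masked self-attention + cross-attention + FFN
    return y                         # fed to the final linear + softmax to predict the next word

# With identity stand-ins the data simply passes through unchanged:
# memory = encode(src_embeddings, [lambda x: x] * 6)
```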

Input Preparation: Tokens & Embeddings

Transformers don't work with raw text. The first step is to break the input sentence into smaller pieces called **tokens**. These can be words or sub-word units. Each token is then mapped to a numerical vector called an **embedding**. This vector, which is learned during training, captures the token's semantic meaning. For example, the vectors for "king" and "queen" would be closer to each other in the vector space than to "apple".

Interactive example: tokenizing the sentence "The quick brown fox".
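
As a concrete toy illustration, the sketch below maps the example sentence to ids and then to vectors. The vocabulary, ids, and 8-dimensional embedding table are made up for this example; in a real model the table is learned during training.

```python
import numpy as np

# Toy vocabulary and randomly initialized embedding table (illustrative only).
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "quick", "brown", "fox"]          # text -> tokens
token_ids = [vocab[t] for t in tokens]             # tokens -> ids
embeddings = embedding_table[token_ids]            # ids -> vectors
print(embeddings.shape)                            # (4, 8): one vector per token
```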

Real-World Analogy

Think of token embeddings like a dictionary that translates words not into definitions, but into coordinates on a map. Words with similar meanings are placed close together on this "meaning map." The model uses these coordinates to understand relationships between words based on their proximity.

Positional Encoding

Because the attention mechanism processes all words at once, it has no inherent sense of word order. "The dog chased the cat" and "The cat chased the dog" would look identical to it. To solve this, we inject **positional encodings**: each position gets a vector built from sine and cosine functions of different frequencies, which is added to that position's token embedding, giving the model a unique signal for where each token sits in the sequence.

Visualization of sine/cosine values for the first few dimensions across 20 positions.
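
The encodings themselves are straightforward to compute. A minimal NumPy sketch, following the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings (d_model assumed even)."""
    positions = np.arange(num_positions)[:, None]              # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)     # lower dims oscillate faster
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                               # even dims: sine
    pe[:, 1::2] = np.cos(angles)                               # odd dims: cosine
    return pe

pe = positional_encoding(20, 8)
# The encoding is simply added to the token embeddings:
# x = embeddings + pe[:embeddings.shape[0]]
```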

Real-World Analogies

  • Different Pitches: Imagine people saying the same word, but each person at a different position in a line speaks at a unique pitch. The word is the same (the token embedding), but the pitch (the positional encoding) tells you where they are in the line.
  • Binary Counting: Conceptually similar to how bits flip at different rates in binary counting. Some dimensions of the encoding change value at every position, while others change more slowly, creating a unique signature for each position.

Multi-Head Self-Attention

This is the core of the Transformer. For each word, **self-attention** allows it to look at all other words in the input and decide which ones are most relevant. It does this by creating three vectors for each word: a **Query (Q)**, a **Key (K)**, and a **Value (V)**. The Query is like "what I'm looking for." The Key is like "what I contain." The model matches the Query of one word with the Keys of all other words. The strength of this match determines how much of each word's Value gets passed along.

**Multi-head** attention simply means doing this process multiple times in parallel with different Q, K, and V projections. This allows the model to learn different types of relationships simultaneously (e.g., one "head" might focus on grammatical relationships, another on semantic ones).

Interactive visualization: click a word to see its attention scores.
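
A minimal single-head sketch in NumPy shows the mechanics: scores are dot products of Queries with Keys, scaled by the square root of the key dimension, softmaxed into weights, and used to blend the Values. The input and projection matrices here are random stand-ins for learned parameters; a multi-head version would repeat this with several independent projections and concatenate the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how well each Query matches each Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                      # weighted blend of Values + the weights

# Toy example: 4 tokens, d_model = 8, random stand-ins for the learned Wq/Wk/Wv.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.shape)  # (4, 4): each word's attention weights over every word
```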

Web Search Analogy (Q, K, V)

  • Query (Q): Your search term in Google. It's what you are actively looking for.
  • Key (K): The titles or keywords of all the websites. They advertise what content each site has.
  • Value (V): The actual content of the websites. You retrieve this content when your Query matches a site's Key.

Position-wise Feed-Forward Network

After the attention mechanism gathers contextual information from across the sequence, the output for each word's position is passed through a **Feed-Forward Network (FFN)**. This component processes each position's vector independently. It consists of two linear layers with a ReLU activation in between. A key feature is that it first expands the dimensionality of the vector (e.g., from 512 to 2048) and then contracts it back down. This provides additional processing depth and non-linearity for each token representation.

Diagram: Input (512-dim) → Expand (W1 + ReLU) → Contract (W2).
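
A minimal sketch of the FFN with random stand-ins for the learned weights, using the paper's 512 → 2048 → 512 sizes:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Applied to each position's vector independently: expand, ReLU, contract."""
    hidden = np.maximum(0.0, x @ W1 + b1)     # (seq_len, 2048): expansion + ReLU
    return hidden @ W2 + b2                   # (seq_len, 512): contraction

rng = np.random.default_rng(0)
W1, b1 = 0.02 * rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = 0.02 * rng.normal(size=(2048, 512)), np.zeros(512)

x = rng.normal(size=(4, 512))                 # 4 token positions after attention
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 512)
```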

Analogy: Individual Expert Processing

If attention is like a research team gathering diverse information about a topic, the FFN is like an individual expert taking that gathered info for a *single point* and deeply processing it. The expert expands the idea into a richer conceptual space to explore its nuances (expansion layer), then distills it back into a concise, refined conclusion (contraction layer).

Add & Norm Sub-layer

Each major sub-layer in the Transformer (like Multi-Head Attention and the FFN) is wrapped in an "Add & Norm" component. This consists of two crucial operations: a **Residual Connection** and **Layer Normalization**.

  • Residual Connection (Add): This is a "skip connection" that takes the input to the sub-layer and adds it to the output of the sub-layer. This helps prevent the "vanishing gradient" problem in deep networks, making it easier to train them. It essentially ensures that the model doesn't "forget" the original information as it passes through transformations.
  • Layer Normalization (Norm): This operation stabilizes training by rescaling each position's vector so its features have a consistent mean and variance. It keeps the numbers from growing too large or shrinking too small, which can destabilize learning.
Diagram: Input → Sub-layer (e.g., Attention) → Add (+) → Normalize → Output.
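
A minimal sketch of the wrapper, assuming the original paper's post-normalization order (the learned scale and bias parameters of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance across its features."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection ("Add") followed by layer normalization ("Norm")."""
    return layer_norm(x + sublayer(x))

# Usage pattern inside a block (sublayers shown elsewhere in this guide):
# x = add_and_norm(x, self_attention)
# x = add_and_norm(x, feed_forward)
```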

Decoder Mechanisms

The decoder's job is to generate the output sequence one token at a time. It has a structure similar to the encoder but with two key differences in its attention mechanisms.

1. Masked Multi-Head Self-Attention

The decoder first performs self-attention on the text it has generated so far. However, it uses a **mask** to prevent positions from attending to subsequent positions. This is crucial because when predicting the next word, the model should only know about the words that came before it, not the ones that will come after. It stops the model from "cheating" by looking ahead.
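
A minimal NumPy sketch of the mask: entries above the diagonal are set to negative infinity before the softmax, so they end up with zero weight and each position can only attend to itself and earlier positions.

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal; 0 on and below it."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention_weights(Q, K):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy check with identity Q and K: row i has non-zero weights only in columns 0..i.
print(np.round(masked_attention_weights(np.eye(4), np.eye(4)), 2))
```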

2. Encoder-Decoder (Cross) Attention

This is where the encoder and decoder connect. This layer takes the output from the encoder as its **Keys and Values**, but it uses the output from the decoder's masked attention layer as its **Query**. This allows the decoder, at each step of generation, to look back at the entire input sentence and focus on the parts that are most relevant for predicting the next word.

For example, when translating "The cat sat on the mat" to French, as the decoder generates "Le chat s'est assis sur le", the cross-attention mechanism might strongly focus on the word "mat" in the input sentence to correctly predict "tapis" as the next word.
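
A minimal sketch of cross-attention with toy shapes and random stand-ins for the learned projections: six source tokens from the encoder, three target tokens generated so far.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
memory = rng.normal(size=(6, d_model))           # encoder output: 6 source tokens
decoder_state = rng.normal(size=(3, d_model))    # decoder's masked-attention output: 3 target steps

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_state @ Wq                           # Queries come from the decoder
K, V = memory @ Wk, memory @ Wv                  # Keys and Values come from the encoder

scores = Q @ K.T / np.sqrt(d_model)              # (3, 6): each target step vs. each source token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V                            # source information gathered for each target step
print(weights.shape, context.shape)              # (3, 6) (3, 8)
```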

Final Output Layer

After passing through the final decoder block, the resulting vector needs to be turned into a predicted word. This is done in two final steps:

  • Linear Layer: A final linear layer acts as a classifier. It takes the vector from the decoder stack and expands its dimension to the size of the entire vocabulary (e.g., from 512 to 50,000+). The output is a raw score, or **logit**, for every possible word in the vocabulary.
  • Softmax Layer: The softmax function is applied to these logits. It converts the raw scores into a probability distribution, where all values are between 0 and 1 and sum to 1. The word with the highest probability is then chosen as the next word in the output sequence.

This process repeats, with the newly predicted word being fed back into the decoder as input for the next step, until the model predicts a special "end-of-sentence" token.

Diagram: Decoder Output (512-dim) → Linear → Logits (vocab-size dim) → Softmax → Probabilities (highest is chosen).
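
A minimal sketch of the two steps with tiny toy sizes (the paper uses a 512-dimensional vector and a vocabulary of tens of thousands of tokens); the projection matrix is a random stand-in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 12                      # tiny illustrative sizes

decoder_output = rng.normal(size=(d_model,))     # vector for the current position
W_out = rng.normal(size=(d_model, vocab_size))   # linear layer: d_model -> vocab_size

logits = decoder_output @ W_out                  # one raw score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: values in (0, 1) that sum to 1
next_token_id = int(np.argmax(probs))            # pick the highest-probability word
print(next_token_id, round(float(probs[next_token_id]), 3))
```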

Evolution: Encoder-Only & Decoder-Only

The original Transformer had both an encoder and a decoder, which is ideal for sequence-to-sequence tasks. However, researchers found that the individual components are powerful on their own, leading to two major families of models:

Encoder-Only (e.g., BERT)

These models use only the encoder stack. Since the encoder looks at the entire sentence at once (bidirectionally), they are excellent at tasks that require a deep understanding of the full context of a sentence. They are not used for generating text from scratch.

Diagram: Encoder Stack (Nx).

Use Cases:

  • Text Classification (Sentiment Analysis)
  • Question Answering
  • Named Entity Recognition

Decoder-Only (e.g., GPT)

These models use only the decoder stack. They are inherently generative. Since they are trained to predict the next word based only on previous words (auto-regressive), they are perfect for language modeling and text generation tasks.

Diagram: Decoder Stack (Nx).

Use Cases:

  • Text Generation / Creative Writing
  • Chatbots & Conversational AI
  • Summarization
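
To make the auto-regressive idea concrete, here is a hypothetical sketch of the generation loop such a model runs at inference time; `model` and `eos_id` are placeholders rather than a real API, and greedy argmax decoding is only one of several common strategies.

```python
import numpy as np

def generate(model, prompt_ids, eos_id, max_new_tokens=50):
    """Feed each predicted token back in as input until the end-of-sequence token appears."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)               # scores for the next token, given all previous ones
        next_id = int(np.argmax(logits))     # greedy decoding (sampling is also common)
        tokens.append(next_id)               # feed the prediction back in as input
        if next_id == eos_id:                # stop at the end-of-sequence token
            break
    return tokens
```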