Transformer Architecture

Explain the transformer architecture in an easy-to-understand way, describing exactly what happens during training and inference and going into concepts like attention, keys and values, in less than 1000 words.


The Transformer Architecture Explained

The Big Picture

A transformer is a neural network that processes sequences (like sentences) by figuring out which parts should pay attention to which other parts. Think of it like reading a sentence and automatically knowing which words relate to each other.

Core Components

Embeddings: Words are converted into vectors (lists of numbers). The word "cat" might become [0.2, -0.5, 0.8, ...]. Position information is added so the model knows word order.
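As a rough illustration, here is a minimal numpy sketch of this step; the vocabulary, dimensions, and sentence are made up, and a real model learns the embedding table rather than drawing it at random:

  import numpy as np

  d_model = 8                                    # size of each word vector (illustrative)
  vocab = {"the": 0, "cat": 1, "sat": 2}
  embedding_table = np.random.randn(len(vocab), d_model)    # learned in a real model

  def positional_encoding(seq_len, d_model):
      """Sinusoidal position vectors, one per position, added to the word vectors."""
      positions = np.arange(seq_len)[:, None]
      dims = np.arange(d_model)[None, :]
      angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
      return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

  tokens = ["the", "cat", "sat"]
  x = embedding_table[[vocab[t] for t in tokens]]           # (3, d_model) word vectors
  x = x + positional_encoding(len(tokens), d_model)         # inject word-order information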

Attention Mechanism: This is the transformer's superpower. It lets each word "look at" all other words and decide which ones are important for understanding it.

How Attention Works (The Key/Query/Value System)

Imagine you're in a library looking for books:

  • Query: What you're looking for ("I need books about cats")
  • Keys: The label on each book's spine
  • Values: The actual content inside the books

For each word, the transformer creates three vectors:

  1. Query vector: "What am I looking for?"
  2. Key vector: "What do I contain?"
  3. Value vector: "Here's my actual information"
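Concretely, each of these comes from multiplying the word's embedding by its own learned weight matrix. A small numpy sketch, with random stand-ins for the learned matrices and illustrative sizes:

  import numpy as np

  d_model, d_k = 8, 4                    # embedding size and query/key/value size (illustrative)
  x = np.random.randn(3, d_model)        # embeddings for a 3-word sentence

  W_q = np.random.randn(d_model, d_k)    # learned projection matrices
  W_k = np.random.randn(d_model, d_k)    # (random here; training tunes them)
  W_v = np.random.randn(d_model, d_k)

  Q = x @ W_q    # "What am I looking for?"        shape (3, d_k)
  K = x @ W_k    # "What do I contain?"            shape (3, d_k)
  V = x @ W_v    # "Here's my actual information"  shape (3, d_k)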

The attention process:

  1. Take a word's query and compare it (dot product) with every word's key
  2. This gives scores showing how relevant each word is
  3. Scale the scores down (divide by the square root of the key dimension) and apply softmax to turn them into weights that sum to 1
  4. Use these weights to take a weighted average of all the value vectors

Example: In "The cat sat on the mat," when processing "sat," the attention mechanism might look heavily at "cat" (the subject) and "mat" (where the sitting happened), ignoring "the."

Multi-Head Attention

Instead of a single attention mechanism, transformers run several in parallel (typically 8-16 "heads"). Each head can learn to track a different kind of relationship:

  • Head 1: subject-verb relationships
  • Head 2: adjective-noun pairs
  • Head 3: positional relationships

The heads' outputs are concatenated and passed through a final learned projection to form the layer's output.
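A sketch of how the heads might be wired together; the head count, sizes, and random weights are all illustrative:

  import numpy as np

  def attention(Q, K, V):
      scores = Q @ K.T / np.sqrt(K.shape[-1])
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights = weights / weights.sum(axis=-1, keepdims=True)
      return weights @ V

  def multi_head_attention(x, num_heads=2):
      seq_len, d_model = x.shape
      d_head = d_model // num_heads
      head_outputs = []
      for _ in range(num_heads):                       # each head has its own Q/K/V matrices
          W_q = np.random.randn(d_model, d_head)
          W_k = np.random.randn(d_model, d_head)
          W_v = np.random.randn(d_model, d_head)
          head_outputs.append(attention(x @ W_q, x @ W_k, x @ W_v))
      concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)
      W_o = np.random.randn(d_model, d_model)          # final learned projection
      return concat @ W_o

  x = np.random.randn(5, 8)              # 5 words, model size 8
  out = multi_head_attention(x)          # same shape as the input: (5, 8)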

The Full Architecture

Encoder (for understanding input):

  • Multi-head attention layer
  • Feed-forward neural network
  • Residual connections and layer normalization around each sub-layer
  • Stacked 6-12 times
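A minimal sketch of one such encoder layer, using single-head attention and random placeholder weights to keep it short:

  import numpy as np

  def layer_norm(x, eps=1e-5):
      mean = x.mean(axis=-1, keepdims=True)
      std = x.std(axis=-1, keepdims=True)
      return (x - mean) / (std + eps)

  def self_attention(x):
      d_model = x.shape[-1]
      W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
      Q, K, V = x @ W_q, x @ W_k, x @ W_v
      scores = Q @ K.T / np.sqrt(d_model)
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights = weights / weights.sum(axis=-1, keepdims=True)
      return weights @ V

  def feed_forward(x, d_ff=32):
      d_model = x.shape[-1]
      W1 = np.random.randn(d_model, d_ff)
      W2 = np.random.randn(d_ff, d_model)
      return np.maximum(0, x @ W1) @ W2               # two linear layers with a ReLU between

  def encoder_layer(x):
      x = layer_norm(x + self_attention(x))           # attention sub-layer + residual + norm
      x = layer_norm(x + feed_forward(x))             # feed-forward sub-layer + residual + norm
      return x

  x = np.random.randn(5, 16)             # 5 words, model size 16
  for _ in range(6):                     # "stacked 6-12 times"
      x = encoder_layer(x)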

Decoder (for generating output):

  • Masked multi-head attention (can't see future words)
  • Multi-head attention over encoder outputs
  • Feed-forward network
  • Also stacked 6-12 times

Training

Data: Pairs of inputs and desired outputs. For translation: English sentences paired with French translations.

Forward Pass:

  1. Input sentence → embeddings
  2. Flow through encoder layers, each word gathering information from others via attention
  3. Encoder output → decoder
  4. The decoder receives the correct output sentence shifted one position to the right and predicts every position in parallel
  5. Each decoder position uses attention to look at earlier output positions and at relevant encoder positions

Loss Calculation: Compare predicted words to actual target words using cross-entropy loss.

Backward Pass: Calculate gradients and update all the weight matrices (that create Q, K, V vectors, feed-forward networks, etc.) using backpropagation.

Teacher Forcing: During training, even if the decoder predicts the wrong word at position 3, we feed it the correct word when predicting position 4. This speeds up training.
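A toy sketch of one teacher-forced training step; the decoder itself is replaced by a random stand-in so that only the teacher forcing and the cross-entropy loss are shown, and every name and number here is made up:

  import numpy as np

  vocab_size, d_model = 10, 16
  rng = np.random.default_rng(0)
  W_out = rng.normal(size=(d_model, vocab_size))        # final projection to vocabulary logits

  def toy_decoder(num_positions):
      """Stand-in for the real decoder: one hidden vector per target position."""
      return rng.normal(size=(num_positions, d_model))

  def cross_entropy(logits, target_ids):
      logits = logits - logits.max(axis=-1, keepdims=True)
      probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
      return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids]))

  # Teacher forcing: the decoder input is the *correct* target shifted right,
  # regardless of what the model would have predicted at earlier positions.
  target_ids = np.array([4, 7, 2, 9])                   # correct output sentence as token IDs
  decoder_input = np.concatenate(([0], target_ids[:-1]))   # [START] token + shifted targets

  hidden = toy_decoder(len(decoder_input))              # all positions processed in parallel
  logits = hidden @ W_out                               # (4, vocab_size) predicted word scores
  loss = cross_entropy(logits, target_ids)              # compare predictions to the real words
  # Backpropagation would now update W_out, the Q/K/V matrices, and so on.
  print(f"cross-entropy loss: {loss:.3f}")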

Key Learning: The model learns weight matrices that produce useful Q, K, V vectors—ones where related words have high attention scores.

Inference (Using the Trained Model)

For tasks like translation:

  1. Encode the input sentence through the encoder once
  2. Start with a special [START] token
  3. Decoder predicts the first word using attention over encoder outputs
  4. Feed that predicted word back into the decoder
  5. Predict the second word
  6. Repeat until a [STOP] token is generated

For modern language models (GPT-style, decoder-only):

  1. Input your prompt
  2. Model predicts the next token using attention over all previous tokens
  3. Append that token to the sequence
  4. Predict the next token
  5. Continue until done
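A minimal sketch of that loop, with the transformer itself hidden behind a stand-in predict_next_token function (the token IDs and stop token are hypothetical):

  import numpy as np

  rng = np.random.default_rng(0)
  STOP_TOKEN = 0
  MAX_NEW_TOKENS = 20

  def predict_next_token(token_ids):
      """Stand-in for a real decoder-only transformer: in practice this runs the stacked
      masked-attention layers over all the tokens so far and picks the most likely next one."""
      return int(rng.integers(0, 10))

  tokens = [5, 8, 3]                            # the prompt, already tokenized
  for _ in range(MAX_NEW_TOKENS):
      next_token = predict_next_token(tokens)   # attends over everything generated so far
      tokens.append(next_token)                 # append and go again
      if next_token == STOP_TOKEN:
          break

  print(tokens)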

Masked Attention During Inference: When predicting position N, the model can only attend to positions 1 through N-1, never future positions (this was enforced during training too).
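The mask itself is just a triangular pattern applied to the attention scores before the softmax; a quick sketch with random placeholder scores:

  import numpy as np

  seq_len = 4
  scores = np.random.randn(seq_len, seq_len)             # raw query-key scores

  # Strictly-future positions get -infinity, so the softmax gives them zero weight
  mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
  scores = np.where(mask, -np.inf, scores)

  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights = weights / weights.sum(axis=-1, keepdims=True)
  print(np.round(weights, 2))    # upper triangle is all zeros: no attention to the future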

Why Transformers Work So Well

Parallelization: Unlike RNNs that process sequentially, all positions are processed simultaneously during training (though not during generation).

Long-range dependencies: Attention directly connects distant words, unlike RNNs where information degrades passing through many steps.

Flexibility: The same architecture works for translation, text generation, question answering, and more.

Key Takeaway

The transformer's genius is the attention mechanism: letting the model dynamically determine which parts of the input are relevant for each output position. Through training, it learns weight matrices that create query, key, and value vectors that capture meaningful linguistic relationships. During inference, it uses these learned patterns to process new inputs, generating outputs one piece at a time while attending to relevant context.
