
Transformer Architecture

Explain the transformer architecture in an easy-to-understand way, describing exactly what happens during training and inference and going into concepts like attention, keys and values, in less than 1000 words.

30.6s $0.018 1,245 tokens

The Transformer Architecture Explained

The Big Picture

Imagine you're translating a sentence from English to French. A transformer is a neural network that reads the entire input sentence at once, figures out how each word relates to others, then generates the output word by word.

Core Components

1. Embeddings: Words are converted into vectors (lists of numbers). The word "cat" might become [0.2, -0.5, 0.8, ...]. Position information is added so the model knows word order.
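
As a rough sketch of this step (a toy vocabulary, a random embedding table, and the sinusoidal position encoding from the original paper; none of these numbers come from a real trained model):

```python
# Minimal sketch: token embeddings plus sinusoidal positional encodings.
# The vocabulary and dimensions are toy values chosen for illustration.
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8                                        # embedding width
embed_table = np.random.randn(len(vocab), d_model) * 0.1

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

tokens = ["the", "cat", "sat"]
x = embed_table[[vocab[t] for t in tokens]]        # look up each word's vector
x = x + positional_encoding(len(tokens), d_model)  # inject word-order information
print(x.shape)                                     # (3, 8): one vector per word
```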

2. The Attention Mechanism: This is the transformer's secret sauce. Think of it like a spotlight that helps each word "look at" other relevant words in the sentence.

For the sentence "The animal didn't cross the street because it was too tired":

  • When processing "it," attention helps the model look back at "animal" (not "street")
  • It learns this automatically from data

How Attention Works (Keys, Queries, Values)

Picture a library system:

  • Queries: What you're searching for
  • Keys: Labels on the filing cabinets
  • Values: The actual content inside

For each word, the transformer creates three vectors:

  • Query: "What am I looking for?"
  • Key: "What do I contain?"
  • Value: "Here's my actual information"

The Process:

  1. Take a word's query vector and compare it to all other words' key vectors
  2. This comparison produces attention scores (how relevant each word is)
  3. Use these scores to create a weighted mix of all words' value vectors
  4. This mixed vector becomes the word's new representation, enriched with context
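
Here is a minimal sketch of these four steps for a single attention head, with random matrices standing in for the learned query/key/value projections and the usual scaling by the square root of the key dimension:

```python
# Minimal sketch of single-head scaled dot-product attention over a toy input.
# The projection matrices are random stand-ins for learned weights.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 3, 8, 8
x = np.random.randn(seq_len, d_model)          # one vector per word

W_q = np.random.randn(d_model, d_k)            # learned in a real model
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v            # build queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                # steps 1-2: compare queries to keys
weights = softmax(scores, axis=-1)             # attention scores sum to 1 per word
output = weights @ V                           # steps 3-4: weighted mix of values

print(weights.round(2))                        # how much each word attends to each other word
print(output.shape)                            # (3, 8): context-enriched representations
```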

Multi-Head Attention runs this process multiple times in parallel, letting the model attend to different aspects simultaneously (grammar, meaning, relationships, etc.).
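
A rough, self-contained sketch of the multi-head version: split the model dimension into several heads, run the same attention computation in each head, and concatenate the results (the sizes below are toy values):

```python
# Rough sketch of multi-head attention: split d_model into several heads,
# run attention in each head independently, then concatenate the results.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads
x = np.random.randn(seq_len, d_model)

W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) for _ in range(4))

def split_heads(m):                              # (seq, d_model) -> (heads, seq, d_head)
    return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # each head gets its own scores
heads = softmax(scores) @ V                           # (heads, seq, d_head)
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_o  # concatenate and mix
print(out.shape)                                      # (4, 16)
```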

The Architecture Stack

Encoder (for processing input):

  • Multiple identical layers stacked together
  • Each layer has:
    • Multi-head attention (words look at all input words)
    • Feed-forward neural network (processes each word independently)
    • Normalization and residual connections (for stable training)
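
One way to picture a single encoder layer is the short sketch below; the single-head self_attention() helper is a simplified stand-in for real multi-head attention, and the residual-then-normalize wiring follows the post-norm layout of the original paper:

```python
# Rough sketch of one encoder layer: self-attention, then a feed-forward
# network, each wrapped with a residual connection and layer normalization.
import numpy as np

seq_len, d_model, d_ff = 4, 16, 32
x = np.random.randn(seq_len, d_model)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(h, eps=1e-5):                  # normalize each word's vector
    return (h - h.mean(axis=-1, keepdims=True)) / (h.std(axis=-1, keepdims=True) + eps)

def self_attention(h):                        # single-head stand-in for multi-head attention
    W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

def feed_forward(h):                          # applied to each position independently
    W1 = np.random.randn(d_model, d_ff) * 0.1
    W2 = np.random.randn(d_ff, d_model) * 0.1
    return np.maximum(0, h @ W1) @ W2         # ReLU in between

h = layer_norm(x + self_attention(x))         # residual connection, then normalize
h = layer_norm(h + feed_forward(h))
print(h.shape)                                # (4, 16): same shape in, same shape out
```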

Decoder (for generating output):

  • Also multiple stacked layers
  • Each layer has:
    • Masked self-attention (each word only looks at previous words)
    • Cross-attention (looks at the encoder's output)
    • Feed-forward network
    • Normalization and residual connections
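
The masking itself is simple: before the softmax, the scores for "future" positions are pushed to a large negative number so their attention weights come out effectively zero. A small illustrative sketch:

```python
# Rough sketch of masked (causal) self-attention scores: each position may
# only attend to itself and earlier positions.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)           # raw query-key scores

mask = np.triu(np.ones((seq_len, seq_len)), k=1)      # 1s above the diagonal mark "future" words
masked_scores = np.where(mask == 1, -1e9, scores)     # block attention to future words

weights = softmax(masked_scores)
print(weights.round(2))                               # upper triangle is ~0: no peeking ahead
```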

Training

What happens:

  1. Feed in paired examples (e.g., English sentence + French translation)
  2. The encoder processes the English sentence
  3. The decoder receives the correct French translation, but masking prevents each position from seeing the words that come after it
  4. At each position, the model predicts the next word
  5. Compare predictions to actual words and calculate error
  6. Use backpropagation to adjust all the weights (the numbers in queries, keys, values, etc.)
  7. Repeat millions of times with different examples

Key insight: During training, the decoder always sees the correct previous words, even when its own predictions would have been wrong. This is called "teacher forcing."
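
As a rough sketch of one such training step, here is teacher forcing with PyTorch's built-in nn.Transformer; the vocabulary sizes, shapes, and hyperparameters are toy values, and random token ids stand in for a real parallel corpus:

```python
# Rough sketch of one teacher-forcing training step using torch.nn.Transformer.
# Vocabulary sizes, shapes, and hyperparameters are toy values; positional
# encodings are omitted for brevity.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1000, 64

src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_vocab = nn.Linear(D_MODEL, TGT_VOCAB)          # maps hidden states to word scores

params = (list(src_embed.parameters()) + list(tgt_embed.parameters())
          + list(model.parameters()) + list(to_vocab.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

# One batch of (source, target) token ids; real data would come from a corpus.
src = torch.randint(0, SRC_VOCAB, (8, 12))        # 8 sentences, source length 12
tgt = torch.randint(0, TGT_VOCAB, (8, 10))        # matching targets, length 10

# Teacher forcing: the decoder input is the correct target shifted by one,
# and the model is trained to predict the next word at every position.
decoder_in, labels = tgt[:, :-1], tgt[:, 1:]
causal_mask = nn.Transformer.generate_square_subsequent_mask(decoder_in.size(1))

hidden = model(src_embed(src), tgt_embed(decoder_in), tgt_mask=causal_mask)
logits = to_vocab(hidden)                         # (batch, tgt_len - 1, TGT_VOCAB)

loss = nn.functional.cross_entropy(logits.reshape(-1, TGT_VOCAB), labels.reshape(-1))
loss.backward()                                   # backpropagation adjusts all weights
optimizer.step()
optimizer.zero_grad()
```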

Inference (Actually Using It)

Step-by-step:

  1. Encoder processes the input sentence once (e.g., "Hello" in English)
  2. Decoder starts with a special START token
  3. The decoder produces a probability distribution over the first word and picks the most likely one (e.g., "Bonjour")
  4. Feed "Bonjour" back into the decoder
  5. Generate second word using START + "Bonjour" as context
  6. Keep going until model outputs an END token

Critical difference from training: During inference, the decoder only sees its own previous predictions, not the correct answers.
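
A rough sketch of this generation loop; next_token_logits() is a placeholder for a trained decoder (here it just returns random scores), and the START/END token ids are made up:

```python
# Rough sketch of greedy decoding: keep feeding the model its own output
# until it produces the END token.
import numpy as np

VOCAB, START, END, MAX_LEN = 1000, 1, 2, 20
rng = np.random.default_rng(0)

def next_token_logits(generated_so_far):
    # Placeholder: a real model would attend over the encoder output
    # and the tokens generated so far.
    return rng.normal(size=VOCAB)

generated = [START]
while len(generated) < MAX_LEN:
    next_id = int(np.argmax(next_token_logits(generated)))  # pick the most likely word
    generated.append(next_id)
    if next_id == END:                                       # stop when the model says "done"
        break

print(generated)
```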

Why Transformers Work So Well

Parallelization: Unlike older RNNs that process words sequentially, transformers process all words simultaneously. This makes training much faster on modern GPUs.

Long-range dependencies: Attention can directly connect any two words, no matter how far apart. Previous architectures struggled with this.

Flexibility: The same architecture works for translation, text generation, question answering, and even vision tasks with minimal changes.

Modern Variants

Encoder-only (like BERT): Just use the encoder stack. Great for understanding tasks like classification or question answering.

Decoder-only (like GPT): Just use the decoder stack. Excellent for generation tasks. These models are trained to predict the next word given all previous words.

Encoder-decoder (original transformer): Best for transformation tasks like translation where you have distinct input and output.
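
For the decoder-only case above, the training objective reduces to "predict the next token": the target sequence is just the input shifted by one position. A tiny sketch with made-up token ids:

```python
# Rough sketch of the decoder-only (GPT-style) objective: predict token i+1
# from tokens 0..i.  The ids below are made up for illustration.
import numpy as np

text_ids = np.array([5, 17, 42, 9, 3])   # a short sentence as token ids

inputs = text_ids[:-1]                   # what the model sees:    [5, 17, 42, 9]
targets = text_ids[1:]                   # what it must predict:   [17, 42, 9, 3]

for i in range(len(inputs)):
    context = inputs[:i + 1]             # causal masking: only earlier tokens are visible
    print(f"given {context.tolist()} -> predict {targets[i]}")
```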

Concrete Example

Input: "The cat sat"

  1. Embedding: Convert to vectors, add positions
  2. Encoder attention: "sat" attends to "cat" and "The" to understand it's the cat doing the sitting
  3. Encoder output: Rich representations containing context
  4. Decoder: Generates one word at a time
    • Looks at what it's generated so far
    • Attends to encoder output to understand source
    • Predicts next word
  5. Output: "Le chat s'est assis"

The beauty is that all these attention patterns—what words to focus on and when—are learned automatically from data, not hand-programmed.

Generated 26th Oct 2025