Step-by-Step GPT Model Coding
The GPT (Generative Pre-trained Transformer) architecture consists of embedding, layer normalization, masked multi-head attention, feedforward, linear projection, and softmax layers. These layers are organized into blocks that repeat N times, depending on the model depth.
Model Configuration
Before coding a GPT model, you need to define the core hyperparameters. The example configuration below shows values suitable for a small-scale model; a code sketch follows the list.
- Vocabulary size: 50,257 (GPT-2 tokenizer)
- Context length: 256 tokens
- Embedding dimension: 384
- Number of attention heads: 12
- Number of layers: 12
- Dropout rate: 0.1
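A minimal way to hold these values, assuming a PyTorch implementation, is a plain dictionary. The key names here are illustrative choices, not prescribed by the architecture:

```python
GPT_CONFIG = {
    "vocab_size": 50257,     # GPT-2 tokenizer vocabulary
    "context_length": 256,   # maximum number of tokens per input
    "emb_dim": 384,          # embedding dimension
    "n_heads": 12,           # attention heads per block
    "n_layers": 12,          # number of transformer blocks
    "drop_rate": 0.1,        # dropout probability
}
```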
Embeddings
Embeddings convert discrete tokens into dense vectors that preserve semantic relationships. Two types of embeddings are used: token embeddings (word representations) and positional embeddings (position information). These two are summed to produce the final embedding vector.
The token embedding layer maps each token ID to a fixed-size vector. The positional embedding has the same dimension and encodes the sequential position (from 0 to context_length-1). The model carries both content and position information simultaneously by summing these two vectors.
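As a rough PyTorch sketch (the dimensions follow the hypothetical configuration above), the two embedding tables can be combined like this:

```python
import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 256, 384

tok_emb = nn.Embedding(vocab_size, emb_dim)       # token ID -> content vector
pos_emb = nn.Embedding(context_length, emb_dim)   # position index -> position vector

token_ids = torch.randint(0, vocab_size, (1, 256))   # (batch, seq_len)
positions = torch.arange(token_ids.shape[1])          # 0 .. seq_len-1

# Sum carries both content and position; shape (1, 256, 384)
x = tok_emb(token_ids) + pos_emb(positions)
```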

Layer Normalization
Layer normalization stabilizes learning by normalizing activation distributions between layers. It mitigates vanishing and exploding gradients and enables faster convergence in deep networks.
Unlike traditional batch normalization, layer normalization normalizes each sample independently, making it independent of batch size and much more suitable for language models.
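A from-scratch sketch in PyTorch might look like the following; in practice the built-in `nn.LayerNorm` provides the same behavior:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable bias

    def forward(self, x):
        # Normalize each token vector independently over its feature dimension.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * x_norm + self.shift
```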
Feedforward Network (MLP)
The MLP component applies two linear transformations with a GELU activation function between them. It expands the dimension 4x first and then reduces it back, learning complex representations.
GELU (Gaussian Error Linear Unit), unlike ReLU, allows small non-zero values for negative inputs, providing better gradient flow near the zero region.
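A minimal PyTorch sketch of this expand-then-project structure, with the 4x expansion and GELU described above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, emb_dim, drop_rate=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # expand 4x
            nn.GELU(),                        # smooth activation
            nn.Linear(4 * emb_dim, emb_dim),  # project back to emb_dim
            nn.Dropout(drop_rate),
        )

    def forward(self, x):
        return self.net(x)
```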
Attention Mechanism
The attention mechanism is the heart of GPT. It is built up in three stages: self-attention, masked self-attention, and multi-head attention.
Self-Attention
Allows each token to understand its relationship with other tokens. A Query, Key, and Value triplet is computed; the dot products of Q and K are scaled by the square root of the key dimension and passed through a softmax to obtain attention weights, which are then used to compute a weighted sum of V.
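A single-head sketch in PyTorch, assuming learned linear projections named `W_q`, `W_k`, and `W_v`:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                       # x: (batch, seq_len, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1)        # pairwise Q·K dot products
        weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
        return weights @ v                      # weighted sum of values
```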

Masked Self-Attention
Prevents looking at future tokens. During training, the model must only use what it has seen so far when predicting the next token. This constraint is enforced using a lower triangular mask.
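The masking step itself is small; the sketch below shows how a lower triangular mask zeroes out the weights for future positions (the score tensor here is a random stand-in):

```python
import torch

seq_len = 4
# Lower-triangular mask: position i may attend to positions 0..i only.
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)                 # stand-in attention scores
scores = scores.masked_fill(mask == 0, float("-inf"))  # block future positions
weights = torch.softmax(scores, dim=-1)                # future positions get weight 0
```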

Multi-Head Attention
Performs parallel attention computations with different weight projections. Each "head" interprets the data in a different subspace, allowing the model to learn different types of relationships simultaneously. The outputs of all heads are concatenated and passed through a final projection layer.
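A sketch combining the heads, the causal mask, and the final projection, using the hypothetical configuration names from earlier:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, emb_dim, n_heads, context_length, drop_rate=0.1):
        super().__init__()
        assert emb_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = emb_dim // n_heads
        self.W_qkv = nn.Linear(emb_dim, 3 * emb_dim, bias=False)
        self.proj = nn.Linear(emb_dim, emb_dim)   # final output projection
        self.dropout = nn.Dropout(drop_rate)
        self.register_buffer(
            "mask", torch.tril(torch.ones(context_length, context_length))
        )

    def forward(self, x):                          # x: (batch, seq_len, emb_dim)
        b, t, d = x.shape
        q, k, v = self.W_qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, n_heads, seq_len, head_dim)
        q, k, v = (
            z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            for z in (q, k, v)
        )
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t] == 0, float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.proj(out)                      # concatenate heads, then project
```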

Transformer Block
A transformer block sequentially combines the following components:
- Pre-normalization (Layer Norm)
- Multi-head attention + residual connection
- Pre-normalization (Layer Norm)
- Feedforward (MLP) + residual connection
The pre-normalization approach makes training more stable than the original "Post-LN" Transformer and is preferred for large models.
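Reusing the `MultiHeadAttention` and `FeedForward` sketches from above (and PyTorch's built-in `nn.LayerNorm`), a pre-norm block reads:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, n_heads, context_length, drop_rate=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = MultiHeadAttention(emb_dim, n_heads, context_length, drop_rate)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = FeedForward(emb_dim, drop_rate)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # pre-norm attention + residual
        x = x + self.ff(self.norm2(x))    # pre-norm feedforward + residual
        return x
```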
Residual Connections
Residual connections add the layer's input directly to its output: output = layer(x) + x. This structure provides two critical advantages:
- Gradient flow: During backpropagation, gradients can flow directly by skipping layers, largely solving the vanishing gradient problem in deep networks.
- Ease of learning: The layer no longer needs to learn the full transformation, only the "residual" part, which simplifies optimization.
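The pattern is small enough to show in isolation; this wrapper (a hypothetical helper, not part of the blocks above) applies it to any sub-layer:

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)  # output = layer(x) + x
```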

The Complete GPT Model
The final model combines all components in this order (a minimal sketch follows the list):
- Token embedding + Positional embedding → summed embedding
- N Transformer blocks (sequential)
- Final Layer Normalization
- Linear projection → vocabulary-sized logits
- Softmax → probability distribution
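Putting the pieces together, assuming the `TransformerBlock` sketch and the hypothetical `GPT_CONFIG` dictionary from earlier. The model returns logits; the softmax step is applied implicitly by the cross-entropy loss during training and explicitly during sampling:

```python
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop = nn.Dropout(cfg["drop_rate"])
        self.blocks = nn.Sequential(*[
            TransformerBlock(cfg["emb_dim"], cfg["n_heads"],
                             cfg["context_length"], cfg["drop_rate"])
            for _ in range(cfg["n_layers"])
        ])
        self.final_norm = nn.LayerNorm(cfg["emb_dim"])
        self.lm_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        b, t = token_ids.shape
        positions = torch.arange(t, device=token_ids.device)
        x = self.drop(self.tok_emb(token_ids) + self.pos_emb(positions))
        x = self.blocks(x)             # N transformer blocks
        x = self.final_norm(x)         # final layer normalization
        return self.lm_head(x)         # logits over the vocabulary
```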
Cross-entropy loss is used during training; the goal is to correctly predict the next token. During inference, text is generated using strategies such as greedy decoding, top-k sampling, or nucleus (top-p) sampling.
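A rough sketch of both phases, assuming the `GPTModel` above and a batch of `inputs` and `targets` token IDs shifted by one position (all of these names are placeholders):

```python
import torch
import torch.nn.functional as F

# Training step: next-token prediction with cross-entropy loss.
logits = model(inputs)                      # (batch, seq_len, vocab_size)
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Inference: autoregressive generation with top-k sampling.
@torch.no_grad()
def generate(model, token_ids, max_new_tokens, context_length, top_k=50):
    for _ in range(max_new_tokens):
        logits = model(token_ids[:, -context_length:])      # crop to context window
        logits = logits[:, -1, :]                            # last position only
        top_vals, _ = torch.topk(logits, top_k)
        logits[logits < top_vals[:, [-1]]] = float("-inf")   # keep only top-k tokens
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample next token
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids
```

Greedy decoding would replace the sampling step with `torch.argmax`, and nucleus (top-p) sampling would filter by cumulative probability instead of a fixed k.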