GPT Training Process: Data Sources and Techniques
Coding a GPT model is not just about building the architecture. For the model to learn in a meaningful way, correct data preparation, loss function selection, and a properly structured training loop are essential. In this article, we'll cover all of these steps.
Generating Text from the Model
Before moving on to training, it's important to understand how the model generates text. At each step, the model computes a probability distribution over the next token and samples from this distribution to select the new token; this process continues in an autoregressive manner.
A simple generation function follows these steps: convert the text to token IDs, run them through the model to obtain logits, take the logits at the last position (a vector over the vocabulary), apply softmax to turn them into probabilities, and either pick the highest-probability token (greedy decoding) or sample from the distribution. The chosen token is appended to the context and the loop repeats, as sketched below.
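A minimal sketch of such a loop in PyTorch (greedy variant; the model interface and names such as `context_size` are illustrative assumptions, not the author's exact code):

```python
import torch

def generate_greedy(model, idx, max_new_tokens, context_size):
    # idx: (batch, n_tokens) tensor of token IDs for the current context
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]          # crop to the supported context length
        with torch.no_grad():
            logits = model(idx_cond)               # (batch, n_tokens, vocab_size)
        logits = logits[:, -1, :]                  # keep only the last position
        probs = torch.softmax(logits, dim=-1)      # distribution over the vocabulary
        idx_next = torch.argmax(probs, dim=-1, keepdim=True)  # greedy pick
        idx = torch.cat((idx, idx_next), dim=1)    # append and continue autoregressively
    return idx
```

For sampling instead of greedy decoding, `torch.argmax` can be replaced with `torch.multinomial(probs, num_samples=1)`.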
Dataset Preparation
Language model pre-training uses raw text rather than labeled data: the targets are derived from the text itself (each token's label is simply the next token), which is why GPT is called a "self-supervised" model. The data is typically split 90% for training and 10% for validation.
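A sketch of that split on the raw text, before tokenization (variable names are illustrative):

```python
# `text` holds the full raw corpus as a single string
train_ratio = 0.90
split_idx = int(train_ratio * len(text))
train_data = text[:split_idx]   # 90% for training
val_data = text[split_idx:]     # 10% for validation
```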
Sliding Window Approach
The raw text is tokenized once, then a fixed-length window is slid over the token sequence to create input-target pairs. The target sequence is the input sequence shifted exactly one position forward; in other words, the model always learns the same task: "predict the next token."

The advantage of this approach is efficiency: no text is wasted, and when the stride is smaller than the window length the windows overlap, so the same token can serve as input in one pair and as target in another. The example below shows the one-position shift concretely.
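A small concrete example of the shift (the GPT-2 BPE tokenizer via `tiktoken` is an assumption here; any tokenizer works the same way):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode("In the heart of the city stood an old library")

context_size = 4
x = token_ids[:context_size]        # input:  tokens 0..3
y = token_ids[1:context_size + 1]   # target: tokens 1..4, shifted by one
# Given x, the model must predict y: the next token at every position
```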
PyTorch Dataset and DataLoader
PyTorch's Dataset and DataLoader classes are used for data preparation. A custom GPTDataset class takes the tokenized text and produces input-target tensors using the sliding window method. The DataLoader then serves this dataset as mini-batches, controlled by its batch_size and shuffle parameters.
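A sketch of such a class under these assumptions (`token_ids` is the tokenized corpus; the `max_length` and `stride` values are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Slide a fixed-length window over the token sequence
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]  # shifted by one
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

train_loader = DataLoader(
    GPTDataset(token_ids, max_length=256, stride=128),
    batch_size=8, shuffle=True, drop_last=True,
)
```

`drop_last=True` discards a final under-sized batch, which keeps loss statistics comparable across steps.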
Loss Computation
The standard loss function in GPT training is cross-entropy. At each position, the model's predicted token probability distribution is compared to the actual next token.
PyTorch's torch.nn.functional.cross_entropy() function takes raw logits and target indices directly; it applies log-softmax internally, which is numerically more stable than computing softmax and the logarithm in separate steps.
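In practice the logits come out as a (batch, sequence, vocabulary) tensor and are flattened so that every position counts as one classification example (a sketch; `model`, `input_batch`, and `target_batch` are assumed to come from the DataLoader above):

```python
import torch.nn.functional as F

logits = model(input_batch)          # (batch, seq_len, vocab_size)
loss = F.cross_entropy(
    logits.flatten(0, 1),            # (batch * seq_len, vocab_size)
    target_batch.flatten(),          # (batch * seq_len,)
)
```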
Training Function
The main training loop is built from the following pieces.
Epoch Loop
All training data is iterated over in each epoch. Mini-batches are processed sequentially: forward pass, loss computation, backward pass, and weight update.
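One epoch of that loop might look like this (a minimal sketch; gradient clipping, learning-rate scheduling, and similar refinements are omitted):

```python
import torch.nn.functional as F

def train_one_epoch(model, train_loader, optimizer, device):
    model.train()
    for input_batch, target_batch in train_loader:
        input_batch = input_batch.to(device)
        target_batch = target_batch.to(device)
        optimizer.zero_grad()                        # clear gradients from the previous step
        logits = model(input_batch)                  # forward pass
        loss = F.cross_entropy(logits.flatten(0, 1),
                               target_batch.flatten())  # loss computation
        loss.backward()                              # backward pass
        optimizer.step()                             # weight update
```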
Optimizer: AdamW
AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update. The combination of adaptive learning rates and decoupled weight decay provides good generalization in large models like GPT.
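Setting it up is a single line (the learning rate and decay values here are illustrative, not the author's settings):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
```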
Evaluation and Sampling
At regular intervals (every N batches), loss is computed on both the training and validation sets, and the model's current state is observed by generating text from a prompt. This allows close monitoring of whether the model is genuinely learning.
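A sketch of such an evaluation step (losses are averaged over a handful of batches to keep it cheap; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def loader_loss(model, loader, device, num_batches=5):
    # Average cross-entropy over the first few batches of a loader
    total, count = 0.0, 0
    for input_batch, target_batch in loader:
        if count == num_batches:
            break
        input_batch, target_batch = input_batch.to(device), target_batch.to(device)
        logits = model(input_batch)
        total += F.cross_entropy(logits.flatten(0, 1),
                                 target_batch.flatten()).item()
        count += 1
    return total / max(count, 1)

def evaluate(model, train_loader, val_loader, device):
    model.eval()                     # disable dropout for a stable estimate
    with torch.no_grad():
        train_loss = loader_loss(model, train_loader, device)
        val_loss = loader_loss(model, val_loader, device)
    model.train()
    return train_loss, val_loss
```

After each evaluation, calling a generation function like `generate_greedy` with a fixed prompt gives a qualitative sense of progress alongside the numeric losses.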
Results
In the author's experiment, the model was trained on a corpus of 492,914 tokens for 5 epochs on an Nvidia RTX 4050 GPU, taking about 6 minutes.
Before training, the model produced random words; after training, it began forming coherent Turkish sentences. The training and validation loss curves decreased in parallel, indicating that the model was not memorizing the data (no overfitting).
This result shows that even a small GPT model can learn meaningful language patterns with the right data preparation and training loop.
