Maximum Likelihood (ML), formally known as Maximum Likelihood Estimation (MLE), is a foundational statistical technique used to estimate the unknown Parameters of a probability distribution or a statistical model, given a set of observed data. The principle of ML is to find the set of model parameters that maximizes the probability (likelihood) of observing the training data that was actually collected.
In essence, ML asks: “Given the data we have, what are the model settings that make this data most probable?”
Context: Relation to LLMs and Pre-training
Maximum Likelihood Estimation is the core mathematical principle and Objective Function that governs the massive Pre-training phase of virtually all modern Large Language Models (LLMs).
- The Core Task: Next-Token Prediction: LLMs (particularly decoder-only models like GPT) are trained via self-supervised learning to predict the next Token in a sequence. This prediction task is framed as a Maximum Likelihood problem: the model’s Weights are adjusted via Optimization (using Gradient Descent) to maximize the likelihood of predicting the correct next token (the one that actually occurred in the training text).
- Cross-Entropy Loss: The mathematical function used to implement ML for classification problems (like next-token prediction) is the Cross-Entropy Loss. By minimizing the Cross-Entropy Loss, the model is simultaneously maximizing the log-likelihood of the observed data. This function measures how far the model’s predicted probability distribution is from the target distribution, which for next-token prediction is one-hot: probability 1 on the correct next token and 0 for all others.
- Training and Generalization: Maximizing the likelihood of the entire Training Set forces the Transformer Architecture to learn the deep structure, grammar, and Semantics of human language. This learned probability distribution over possible next words is what gives the LLM its ability to generate coherent and contextually appropriate text output.
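The link between next-token prediction and Cross-Entropy Loss can be made concrete with a minimal sketch. The logits and vocabulary size below are made-up illustrative values, not taken from any real model: a softmax converts raw scores into a probability distribution, and the loss for a one-hot target reduces to the negative log-probability of the token that actually occurred.

```python
import math

# Assumed toy values: a model's logits over a 4-token vocabulary,
# and the index of the token that actually occurred in the text.
logits = [2.0, 0.5, -1.0, 0.1]
observed_token = 0

# Softmax turns logits into a probability distribution over tokens.
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Cross-entropy with a one-hot target reduces to the negative
# log-probability of the observed token; minimizing this loss
# maximizes the likelihood the model assigns to the data.
loss = -math.log(probs[observed_token])
print(loss)
```

Driving this loss toward zero pushes `probs[observed_token]` toward 1, which is exactly the Maximum Likelihood objective for that training example.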
Maximum Likelihood in the LLM Formula
In the context of language modeling, ML seeks to maximize the likelihood of a sequence of tokens $W = (w_1, w_2, \dots, w_n)$:
$$\text{Maximize } P(W | \theta) = P(w_1, w_2, \dots, w_n | \theta)$$
Using the chain rule of probability, this is broken down into a product of conditional probabilities:
$$\text{Maximize } \prod_{i=1}^{n} P(w_i | w_1, \dots, w_{i-1}, \theta)$$
Where:
- $\theta$ represents the entire set of Parameters (Weights and Biases) in the LLM.
- $P(w_i | w_1, \dots, w_{i-1}, \theta)$ is the probability the model assigns to the next token $w_i$, given all the preceding tokens.
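The chain-rule factorization above can be sketched with a toy model. The bigram table below is hypothetical (invented tokens and probabilities, and conditioning on only the single previous token rather than the full prefix an LLM uses), but the product structure is the same as in the formula:

```python
# Hypothetical bigram "model": P(next token | previous token).
# All tokens and probabilities are made up for illustration.
cond_prob = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.9, "ran": 0.1},
}

def sequence_probability(tokens):
    """Chain rule: P(w1, ..., wn) = prod_i P(w_i | preceding tokens)."""
    prob = 1.0
    prev = "<s>"  # start-of-sequence marker
    for tok in tokens:
        prob *= cond_prob[prev][tok]
        prev = tok
    return prob

p = sequence_probability(["the", "cat", "sat"])
print(p)  # product of 0.6 * 0.5 * 0.9
```

Maximum Likelihood training would adjust the conditional probabilities (in an LLM, via $\theta$) so that sequences actually present in the training data receive the highest possible product.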
To avoid numerical underflow (multiplying many small probabilities) and for computational simplicity, the log-likelihood is maximized instead of the likelihood product:
$$\text{Maximize } \sum_{i=1}^{n} \log P(w_i | w_1, \dots, w_{i-1}, \theta)$$
Maximizing the sum of log-probabilities is equivalent to minimizing the Negative Log Likelihood (NLL), which is precisely what the Cross-Entropy Loss achieves.
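The underflow motivation and the NLL equivalence can both be demonstrated numerically. The per-token probabilities below are assumed illustrative values:

```python
import math

# Assumed per-token probabilities a model might assign to the
# observed tokens of a short sequence.
token_probs = [0.9, 0.05, 0.7, 0.3]

likelihood = math.prod(token_probs)
log_likelihood = sum(math.log(p) for p in token_probs)

# Negative Log Likelihood: the quantity Cross-Entropy Loss minimizes.
nll = -log_likelihood

# Same information, two representations.
assert math.isclose(math.log(likelihood), log_likelihood)

# For long sequences, the raw product underflows to 0.0 in floating
# point, while the sum of logs remains perfectly representable.
many = [0.5] * 2000
print(math.prod(many))                  # underflows to 0.0
print(sum(math.log(p) for p in many))   # 2000 * log(0.5), no underflow
```

This is why training pipelines always work with log-probabilities: a document of thousands of tokens would otherwise have a likelihood far below the smallest representable float.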
Related Terms
- Cross-Entropy Loss: The specific Loss Function that implements Maximum Likelihood for language models.
- Objective Function: The general term for the function that ML seeks to optimize.
- Pre-training: The phase of LLM development whose training objective is driven by Maximum Likelihood Estimation.