Log-Likelihood is a mathematical transformation of the Likelihood Function, a key concept in statistics and machine learning used for Parameter estimation. The Likelihood Function calculates the probability of observing the entire Training Set given a specific set of model Weights. The Log-Likelihood is simply the natural logarithm of this likelihood.
In deep learning, the objective of Optimization algorithms like Gradient Descent is to find the parameters that maximize the Log-Likelihood of the data.
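As a minimal sketch of this idea, the example below estimates the bias of a coin by gradient ascent on the Bernoulli log-likelihood. The coin-flip data, the learning rate, and the use of PyTorch autograd are illustrative assumptions, not part of any specific model's training setup.

```python
import torch

# Observed coin flips (1 = heads, 0 = tails): 6 heads out of 8 flips.
flips = torch.tensor([1., 1., 0., 1., 0., 1., 1., 1.])

theta = torch.zeros(1, requires_grad=True)   # unconstrained parameter

for _ in range(500):
    p = torch.sigmoid(theta)                 # keep the probability in (0, 1)
    # Bernoulli log-likelihood of the observed flips under parameter p
    log_likelihood = (flips * torch.log(p) + (1 - flips) * torch.log(1 - p)).sum()
    log_likelihood.backward()
    with torch.no_grad():
        theta += 0.1 * theta.grad            # gradient *ascent* on the log-likelihood
        theta.grad.zero_()

print(torch.sigmoid(theta).item())           # converges toward 6/8 = 0.75
```

The recovered parameter approaches the empirical frequency of heads, which is exactly the maximum-likelihood estimate for a Bernoulli model.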
Context: Relation to LLMs and Maximum Likelihood
Maximizing the Log-Likelihood is the core training objective of all modern Large Language Models (LLMs) during their Pre-training phase, a principle known as Maximum Likelihood Estimation.
- Computational Necessity: The Likelihood function for a massive LLM is the product of millions or billions of small probabilities (one for each predicted Token in the Training Set). Multiplying many small numbers quickly results in a value so tiny that it causes numerical underflow (it becomes zero for practical computing purposes). Taking the logarithm converts this product into a sum:

$$\text{Likelihood} = \prod_i P_i \quad \rightarrow \quad \text{Log-Likelihood} = \sum_i \log(P_i)$$

This transformation ensures numerical stability, which is essential for training models with billions of parameters; the first sketch after this list demonstrates the underflow concretely.
- The Bridge to Loss Functions: In practice, instead of maximizing the Log-Likelihood, LLM training minimizes the Negative Log-Likelihood (NLL). This is a common convention because most Optimization algorithms are designed for minimization.

$$\text{Minimize NLL} = -\text{Log-Likelihood} = -\sum_i \log(P_i)$$

For Classification tasks like next-token prediction, this Negative Log-Likelihood is mathematically identical to the Cross-Entropy Loss, making Cross-Entropy the Loss Function used for LLM pre-training; the second sketch after this list checks the equivalence numerically.
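To make the underflow point concrete, here is a minimal pure-Python sketch. The per-token probabilities are made-up values: multiplying even a modest number of small probabilities collapses to zero in floating point, while the corresponding sum of logs stays finite.

```python
import math

# Hypothetical per-token probabilities a model assigns to the true tokens.
# Real pre-training corpora contain billions of tokens; 10,000 is already
# enough to trigger underflow in double precision.
probs = [0.1] * 10_000

likelihood = 1.0
for p in probs:
    likelihood *= p           # product of many small numbers
print(likelihood)             # 0.0 -> numerical underflow

log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)         # about -23026: finite and stable
```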
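The equivalence between the Negative Log-Likelihood and Cross-Entropy can also be checked numerically. The sketch below uses PyTorch as an illustrative choice; the random logits and targets stand in for a real model's next-token predictions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 100)            # 5 positions, vocabulary of 100 tokens
targets = torch.randint(0, 100, (5,))   # the "true" next tokens

# Negative log-likelihood computed by hand from the model's log-probabilities.
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(5), targets].mean()

# PyTorch's cross-entropy loss on the same logits and targets.
ce = F.cross_entropy(logits, targets)

print(torch.allclose(nll, ce))          # True: the two losses coincide
```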
Log-Likelihood in Evaluation
Log-Likelihood is also used as an intrinsic metric to evaluate the quality of a trained language model:
- Perplexity: The quality of a language model is often measured using Perplexity (PPL). Perplexity is a direct function of the Log-Likelihood:

$$\text{Perplexity} = \exp\!\left(-\frac{1}{N} \times \text{Log-Likelihood}\right)$$

where $N$ is the number of tokens and the Log-Likelihood uses the natural logarithm, matching the definition above. A lower perplexity score indicates a higher Log-Likelihood (meaning the model assigned a higher probability to the true sequence of words), signifying a better, more accurate language model; see the sketch after this list.
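As a minimal sketch of this relationship, perplexity can be computed directly from the natural-log Log-Likelihood; the per-token probabilities below are invented for illustration.

```python
import math

# Hypothetical probabilities a model assigns to the true tokens of a
# held-out sequence.
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]
N = len(token_probs)

log_likelihood = sum(math.log(p) for p in token_probs)   # natural log
perplexity = math.exp(-log_likelihood / N)

print(log_likelihood)   # about -8.40
print(perplexity)       # about 5.4 -- lower is better
```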
Related Terms
- Maximum Likelihood: The statistical principle that drives LLM training, implemented by maximizing the Log-Likelihood.
- Cross-Entropy Loss: The Loss Function that is equivalent to the Negative Log-Likelihood for classification.
- Pre-training: The initial stage of LLM development where the Log-Likelihood is maximized.