Label Smoothing is a regularization technique used during the Training of Classification models, particularly Large Language Models (LLMs). It modifies the target labels (the ground truth) used in the Loss Function (usually Cross-Entropy Loss).
Instead of using hard labels—where the correct class is assigned a probability of 1.0 and all other classes 0.0—Label Smoothing introduces soft labels. It slightly reduces the probability of the correct class and distributes the remainder of that probability mass across all other classes. This reduces the confidence gap the model needs to achieve, preventing the model from becoming overly certain and rigid.
Context: Relation to LLMs and Overfitting
Label Smoothing is one of the essential techniques, alongside Dropout and Layer Normalization, used to stabilize and improve the Generalization of modern Transformer Architecture models.
- Preventing Over-Confidence: When a model is trained using hard labels, it is strongly encouraged to output a final probability distribution that is extremely close to the true label (e.g., [1.0, 0.0, 0.0]). This extreme push toward maximizing confidence can lead to Overfitting by penalizing any deviation too harshly. Label Smoothing prevents this by turning the target from [1.0, 0.0, 0.0] into a "smoother" distribution of roughly [0.93, 0.03, 0.03].
- Impact on LLM Pre-training: For tasks like next-token prediction during LLM Pre-training, the model is trying to predict the exact next word out of a vocabulary of tens of thousands of potential words. Label Smoothing ensures that the model doesn’t allocate all its effort to predicting the exact one correct word with 100% confidence. Instead, it encourages the model to spread some probability mass to other semantically plausible or related words. This, in turn, improves the model’s ability to generalize to unseen text.
- Taming Logits: Label smoothing indirectly prevents the model’s output Logits (the raw scores) from becoming too large. Large logits can cause numerical instability and lead to poor calibration. By softening the target, the gradients that flow back through the network become less extreme, resulting in more stable Optimization.
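The effect described above can be sketched in a few lines of plain Python (an illustrative example; `cross_entropy` here is a hand-rolled helper, not a framework API):

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

hard = [1.0, 0.0, 0.0]
smooth = [0.9333, 0.0333, 0.0333]  # epsilon = 0.1, C = 3 (see the formula below)

overconfident = [0.999, 0.0005, 0.0005]  # near-saturated logits
calibrated = [0.90, 0.05, 0.05]          # softer, better-calibrated output

# With hard labels, pushing confidence toward 1.0 keeps driving the loss to 0,
# so the model is rewarded for ever-larger logits.
print(cross_entropy(hard, overconfident))    # very small

# With the smoothed target, the overconfident prediction is *penalized*
# relative to the calibrated one: saturation no longer minimizes the loss.
print(cross_entropy(smooth, overconfident))
print(cross_entropy(smooth, calibrated))     # smaller than the line above
```

In practice, frameworks expose this directly; PyTorch's `torch.nn.CrossEntropyLoss` accepts a `label_smoothing` argument that performs the same target softening internally.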
The Label Smoothing Formula
The smoothed label $y'_{k}$ for a class $k$ is calculated based on the original one-hot label $y_{k}$ and a smoothing factor $\epsilon$ (epsilon, a Hyperparameter, usually set to a small value like 0.1):
$$y'_{k} = (1 - \epsilon) y_{k} + \frac{\epsilon}{C}$$
Where:
- $y_{k}$ is the original hard label (1 for the correct class, 0 otherwise).
- $C$ is the total number of classes (the vocabulary size for an LLM).
Example (C=3, $\epsilon$=0.1, True Class 1):
- Original Label: $[1.0, 0.0, 0.0]$
- Smoothed Label:
- $y'_{1} = (1 - 0.1)(1) + \frac{0.1}{3} = 0.9 + 0.0333\ldots \approx \mathbf{0.9333}$
- $y'_{2} = (1 - 0.1)(0) + \frac{0.1}{3} \approx \mathbf{0.0333}$
- $y'_{3} = (1 - 0.1)(0) + \frac{0.1}{3} \approx \mathbf{0.0333}$
- Final Smoothed Target: $[\mathbf{0.9333}, \mathbf{0.0333}, \mathbf{0.0333}]$
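The worked example above can be reproduced directly from the formula (a minimal sketch; `smooth_labels` is an illustrative helper, not a library function):

```python
def smooth_labels(one_hot, epsilon):
    """Apply y'_k = (1 - epsilon) * y_k + epsilon / C to a one-hot label."""
    C = len(one_hot)
    return [(1 - epsilon) * y + epsilon / C for y in one_hot]

target = smooth_labels([1.0, 0.0, 0.0], epsilon=0.1)
print([round(v, 4) for v in target])  # [0.9333, 0.0333, 0.0333]
```

Note that the smoothed target still sums to 1.0, so it remains a valid probability distribution for Cross-Entropy Loss.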
Related Terms
- Overfitting: The model failure that Label Smoothing is designed to mitigate.
- Cross-Entropy Loss: The Loss Function whose input labels are modified by Label Smoothing.
- Regularization: The general class of techniques to which Label Smoothing belongs, aimed at improving model Generalization.