A Local Minimum (or local optimum) is a point on the landscape of a model’s Loss Function where the loss value is lower than at all nearby points, but is not the lowest possible loss value across the entire space.
During the Training process, an Optimization algorithm like Gradient Descent seeks to find the lowest point on this surface—the Global Minimum. If the algorithm stops at a local minimum, the model has converged to a sub-optimal solution, resulting in worse performance and Generalization than if it had reached the global minimum.
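As a minimal sketch of this behavior, consider a hypothetical one-dimensional loss f(x) = 0.25x⁴ − 2x² + 0.5x (an illustrative toy function, not from any real model), which has a global minimum near x ≈ −2.06 and a shallower local minimum near x ≈ 1.93. Plain Gradient Descent converges to whichever minimum sits at the bottom of the basin it starts in:

```python
def grad(x):
    # Derivative of the toy loss f(x) = 0.25*x**4 - 2*x**2 + 0.5*x, which has
    # a global minimum near x = -2.06 and a local minimum near x = 1.93.
    return x**3 - 4*x + 0.5

def gradient_descent(x, lr=0.05, steps=500):
    # Repeatedly step downhill along the local slope until the gradient vanishes.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Starting in the right-hand basin traps the run in the local minimum;
# starting in the left-hand basin reaches the global minimum.
print(gradient_descent(1.5))   # converges near x =  1.93 (local minimum)
print(gradient_descent(-1.5))  # converges near x = -2.06 (global minimum)
```

Nothing in the update rule distinguishes the two outcomes: in both runs the gradient shrinks to zero and the iterates stop moving, which is exactly the stagnation risk described below.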
Context: Relation to LLMs and Optimization
The existence of numerous local minima is a major challenge in training deep Large Language Models (LLMs), as the Loss Function landscape for models with billions of Parameters is extremely high-dimensional and complex.
- The Stagnation Risk: If the Gradient Descent algorithm lands in a local minimum, the surrounding gradient (slope) becomes zero or nearly zero. The algorithm interprets this as the “best” solution and stops updating the model’s Weights, leading to premature convergence and a poor final model.
- The Role of Modern Optimizers: Modern LLM training does not rely on plain Stochastic Gradient Descent (SGD), but on advanced optimizers (like Adam or its variants) whose mechanisms, while primarily designed to speed up and stabilize convergence, also help escape local minima:
- Momentum: By incorporating the history of previous updates, Momentum allows the optimizer to “roll through” small local valleys and plateaus, maintaining enough “inertia” to continue searching for lower regions.
- Noise/Randomness: The slight Noise introduced by using small Mini-Batches (rather than the entire dataset) often helps push the model out of shallow local minima, guiding it toward a better solution.
- Saddle Points vs. Local Minima: While local minima are theoretically concerning, research has shown that in high-dimensional deep learning landscapes, the more common challenge is saddle points. A saddle point has a zero gradient but is only a minimum in some dimensions and a maximum in others. Advanced optimizers are generally very effective at navigating past these saddle points.
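The Momentum effect described above can be sketched with the same hypothetical one-dimensional loss used earlier (the coefficients beta = 0.9 and lr = 0.05 are illustrative choices, not tuned values). Started from the right-hand side, plain gradient descent stops in the shallow local valley, while a heavy-ball momentum update carries enough inertia to roll through it and settle in the deeper global one:

```python
def grad(x):
    # Derivative of a toy loss f(x) = 0.25*x**4 - 2*x**2 + 0.5*x with a
    # shallow local minimum near x = 1.93 and a deeper global minimum near x = -2.06.
    return x**3 - 4*x + 0.5

def sgd(x, lr=0.05, steps=500):
    # Plain gradient descent: halts in the first zero-gradient valley it enters.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def sgd_momentum(x, lr=0.05, beta=0.9, steps=500):
    # Heavy-ball momentum: the velocity term accumulates past updates,
    # giving the iterate enough "inertia" to roll through the shallow valley.
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)
        x += v
    return x

print(sgd(3.0))           # stops in the local minimum near x =  1.93
print(sgd_momentum(3.0))  # rolls through to the global minimum near x = -2.06
```

The velocity term overshoots and oscillates before settling, which is the one-dimensional analogue of momentum "rolling through" small valleys and plateaus; the friction factor beta < 1 is what eventually damps the oscillation.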
Visualizing the Problem
Imagine a mountainous terrain where the Global Minimum is the deepest valley. A Local Minimum is a smaller, shallow valley on the side of a mountain.
- If a hiker (the model) uses only local topographical information (the gradient) to find the lowest point, they might descend into the shallow valley (local minimum) and stop, mistakenly believing they have reached the deepest valley (global minimum).
Related Terms
- Global Minimum: The point in the loss landscape that achieves the absolute lowest possible loss value.
- Gradient Descent: The Optimization algorithm that drives the model toward a minimum.
- Momentum: A technique used to overcome local minima and accelerate convergence.