Momentum is a technique used in the Optimization algorithms of deep learning, primarily to accelerate training and overcome obstacles like shallow local minima or plateaus in the Loss Function landscape. Inspired by the physical concept of momentum, the method incorporates a fraction of the previous update direction (the “velocity” of the parameters) into the current update step. This causes the model’s Weights to continue moving in the general direction of the minimum, even when the current gradient is small or points in a conflicting direction.
Context: Relation to LLMs and Training Stability
Momentum is a core mechanism used within modern optimizers (such as Adam, which combines momentum with adaptive learning rates) to efficiently and stably train massive Large Language Models (LLMs) based on the Transformer Architecture.
- Speeding up Convergence: When training LLMs on massive datasets with billions of Parameters, every efficiency gain matters. Momentum smooths out the gradient updates, allowing the optimizer to take larger effective steps toward a good minimum, thereby reducing the total training time needed for Pre-training.
- Dampening Noise: During training, particularly when using small batch sizes or noisy data, the gradients computed for Gradient Descent can fluctuate wildly. Momentum acts as a low-pass filter, averaging out the noise in the gradients and making the updates more consistent and stable. This is crucial for achieving high Generalization.
- Escaping Local Minima: The training landscape of deep Neural Networks is highly complex, filled with numerous local minima (points of low loss that are not the lowest overall loss). Momentum allows the optimizer to “roll through” or “over” minor local valleys by maintaining speed from previous steps, ultimately finding better minima where the model performs better.
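The low-pass-filtering effect described above can be shown numerically. The sketch below (an illustration, not taken from the text) feeds a momentum accumulator a stream of noisy synthetic gradients and checks that consecutive velocity vectors point in far more consistent directions than consecutive raw gradients do:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                          # momentum coefficient
true_grad = np.array([1.0, -2.0])    # hypothetical underlying "signal" direction

# Simulate noisy minibatch gradients: signal plus large Gaussian noise.
grads = true_grad + rng.normal(scale=5.0, size=(5000, 2))

# Accumulate velocity: v_t = gamma * v_{t-1} + g_t (learning rate omitted here).
velocities = np.zeros_like(grads)
v = np.zeros(2)
for t, g in enumerate(grads):
    v = gamma * v + g
    velocities[t] = v

def mean_cosine(seq):
    """Average cosine similarity between consecutive vectors in a sequence."""
    a, b = seq[:-1], seq[1:]
    dots = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return (dots / norms).mean()

raw_align = mean_cosine(grads)        # near zero: noise dominates
smooth_align = mean_cosine(velocities)  # close to 1: updates stay consistent
print(f"raw gradients: {raw_align:.3f}, velocities: {smooth_align:.3f}")
```

The velocity sequence tracks the shared signal direction while the per-step noise largely cancels, which is exactly the "consistent and stable updates" behavior described above.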
The Momentum Update Rule
In standard Stochastic Gradient Descent (SGD), the parameter update is simple:
$$\theta \leftarrow \theta - \eta \nabla J(\theta)$$
With momentum, the update is split into two steps:
- Calculate Velocity ($v_t$): This is an exponentially decaying accumulation of past gradients. $$v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)$$
- Update Parameters ($\theta_{t+1}$): The parameters are updated by subtracting the newly calculated velocity vector. $$\theta_{t+1} = \theta_t - v_t$$
Where:
- $\theta$ are the model Weights.
- $\eta$ is the Learning Rate.
- $\nabla J(\theta_t)$ is the current gradient of the Loss Function.
- $\gamma$ (momentum coefficient) is a Hyperparameter (typically set between 0.9 and 0.99) that dictates how much of the previous velocity ($v_{t-1}$) is retained. A higher $\gamma$ means the network has more “inertia.”
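The two-step rule above translates directly into code. The following sketch compares plain SGD with momentum SGD on an ill-conditioned quadratic loss (the loss function and hyperparameter values are illustrative choices, not from the text):

```python
import numpy as np

# Illustrative loss: J(theta) = 0.5 * theta^T A theta, with very different curvatures.
A = np.diag([1.0, 100.0])

def grad(theta):
    return A @ theta

def loss(theta):
    return 0.5 * theta @ A @ theta

eta, gamma, steps = 0.01, 0.9, 200

# Plain SGD: theta <- theta - eta * grad(theta)
theta_sgd = np.array([1.0, 1.0])
for _ in range(steps):
    theta_sgd = theta_sgd - eta * grad(theta_sgd)

# Momentum: v_t = gamma * v_{t-1} + eta * grad(theta);  theta <- theta - v_t
theta_mom = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(steps):
    v = gamma * v + eta * grad(theta_mom)
    theta_mom = theta_mom - v

print(f"loss, plain SGD: {loss(theta_sgd):.2e}")
print(f"loss, momentum:  {loss(theta_mom):.2e}")
```

Along the low-curvature direction, plain SGD crawls at a factor of $1 - \eta$ per step, while the accumulated velocity lets momentum reach a much lower loss in the same number of steps.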
Modern LLM training often uses optimizers like Adam or its variants, which cleverly integrate both momentum and adaptive learning rates (adjusting $\eta$ for each parameter) for state-of-the-art performance.
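As a rough sketch of how Adam combines the two ideas, the function below implements the standard Adam update (with its usual default hyperparameters); the toy quadratic loss in the usage example is an illustrative assumption:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum term (m) plus per-parameter adaptive scaling (v)."""
    m = beta1 * m + (1 - beta1) * g        # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2     # EMA of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction for the zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return theta, m, v

# Toy usage: minimize J(theta) = sum(theta^2) from an arbitrary start.
theta = np.array([1.0, -1.5])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):                   # t starts at 1 for bias correction
    g = 2 * theta                          # gradient of sum(theta^2)
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)

print(theta)  # close to the minimum at the origin
```

Note how the momentum accumulator $m$ plays the same role as $v_t$ in the update rule above, while $\sqrt{\hat{v}}$ rescales the step size per parameter.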
Related Terms
- Optimization: The overall process that momentum is designed to improve.
- Gradient Descent: The core algorithm that momentum modifies.
- Learning Rate: The size of the step in the gradient direction, which works in tandem with momentum.