The Learning Rate (LR) is one of the most important Hyperparameters in machine learning: it determines the step size taken during the Optimization process. Specifically, it dictates how much the model’s internal Weights are adjusted with respect to the gradient (slope) of the Loss Function.
In the Gradient Descent algorithm, the learning rate controls the magnitude of the movement down the loss landscape in the direction of steepest descent.
Context: Relation to LLMs and Optimization
The learning rate is absolutely critical for successfully Training large-scale Large Language Models (LLMs), as an incorrect value can prevent convergence or cause the model to diverge entirely.
- Formulaic Role: The update rule for a model’s weights ($W$) using Gradient Descent is defined as:

$$W_{\text{new}} = W_{\text{old}} - \eta \times \nabla J(W)$$

Where:
- $\mathbf{\eta}$ is the Learning Rate.
- $\mathbf{\nabla J(W)}$ is the gradient (the partial derivatives of the Loss Function with respect to the weights).
- Impact of LR Choice:
- High Learning Rate (Too Large): The model takes huge steps, often overshooting the minimum point in the loss landscape. This prevents the model from converging, causing the loss to oscillate wildly or even diverge (get worse) entirely.
- Low Learning Rate (Too Small): The model takes tiny steps. Training is more stable, but convergence is extremely slow, making the massive Pre-training of LLMs economically infeasible. It also increases the risk of the model getting trapped in a shallow Local Minimum.
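The update rule and the impact of the LR choice can be seen concretely in a minimal sketch. The example below assumes a toy one-dimensional loss $J(w) = w^2$ (so $\nabla J(w) = 2w$, with minimum at $w = 0$); the function and parameter names are illustrative only:

```python
def gradient_descent(w, eta, steps):
    """Repeatedly apply w_new = w_old - eta * dJ/dw and return the final weight."""
    for _ in range(steps):
        grad = 2 * w          # gradient of the toy loss J(w) = w^2
        w = w - eta * grad    # the Gradient Descent update rule, scaled by eta
    return w

# Well-chosen LR: each step shrinks w toward the minimum at 0.
w_good = gradient_descent(w=1.0, eta=0.1, steps=50)

# LR too large: each step overshoots the minimum and |w| grows -> divergence.
w_bad = gradient_descent(w=1.0, eta=1.1, steps=50)
```

With `eta=0.1` the weight contracts by a factor of $0.8$ per step and converges; with `eta=1.1` it is multiplied by $-1.2$ per step, oscillating across the minimum with growing magnitude, which is exactly the overshooting behavior described above.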
Learning Rate Scheduling (LLM Best Practice)
For modern LLMs, a single, fixed learning rate is rarely used in practice. Instead, a Learning Rate Scheduler is employed to dynamically adjust $\eta$ over the course of Training:
- Warmup: The LR starts at a very small value (near zero) and gradually increases over the first few thousand steps. This is critical for stabilizing the training of large Transformer Architecture models, which are highly sensitive to initial learning steps.
- Decay: Once the LR reaches its peak, it is slowly decreased (or “decayed”) over the remaining training steps. This allows the model to take large, efficient steps initially, and then take smaller, more precise steps later to carefully settle into the optimal region of the loss landscape.
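The warmup-then-decay pattern above can be sketched as a simple schedule function. This is a generic illustration, not any specific framework’s API; the peak value, warmup length, and cosine decay shape are assumptions chosen for the example:

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5,
               warmup_steps=2_000, total_steps=100_000):
    """Return the learning rate for a given training step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from near zero up to the peak value.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: follow a cosine curve from peak_lr down to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Early steps get a tiny LR (stabilizing the sensitive initial updates), the peak is reached at the end of warmup, and the LR then shrinks smoothly so the final steps settle precisely into the optimal region.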
This scheduling is a primary reason LLMs can effectively minimize the complex, high-dimensional loss functions that arise from objectives like Maximum Likelihood estimation.
Related Terms
- Gradient Descent: The Optimization algorithm that the learning rate controls.
- Hyperparameter: The classification of the learning rate as a value set before training begins.
- Local Minimum: A sub-optimal point in the loss landscape that a model with too small a learning rate might get stuck in.