Layer Normalization is a technique used in deep Neural Networks to normalize the inputs to the Activation Function within a layer. Unlike Batch Normalization (which normalizes across the samples in a Mini-Batch), Layer Normalization calculates the mean and variance of all the neurons in a single layer for a single training sample and uses those statistics to normalize the activation values.
This process ensures that the inputs to the subsequent layers remain in a stable, predictable range, which dramatically accelerates and stabilizes the Training process, especially in sequence models.
Context: Relation to LLMs and the Transformer Architecture
Layer Normalization is a critical component of the Transformer Architecture that underpins all modern Large Language Models (LLMs), such as BERT and GPT.
- Need for Normalization in Sequence Models: Layer Normalization was invented specifically to address the instability of sequence models like Recurrent Neural Networks (RNNs) and, later, Transformers. Batch Normalization is ill-suited for these models because the length of the input sequence varies (meaning the batch statistics are unstable) and it’s impractical to apply during the Inference phase where the batch size is often 1. Layer Normalization solves this by performing the computation independently of the batch size.
- Stabilizing the Transformer: In the original Transformer Architecture, Layer Normalization is applied at two specific points within every encoder and decoder block:
- After the Attention Mechanism (Self-Attention or Cross-Attention) submodule.
- After the final MLP (Multi-Layer Perceptron) submodule.
Many modern LLMs, including the GPT family, instead use a "Pre-LN" placement that applies Layer Normalization before each submodule, which further improves training stability.
- Tackling Internal Covariate Shift: Like other normalization techniques, Layer Normalization helps combat Internal Covariate Shift—the phenomenon where the distribution of activation inputs changes dramatically across layers and throughout training. By keeping these inputs stable, the model can use a much higher Learning Rate, leading to faster and more reliable Optimization using Gradient Descent.
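The batch-independence described above can be sketched with a minimal NumPy example (the function name, shapes, and epsilon value are illustrative, not taken from any particular framework). Because each sample is normalized with its own statistics, the result for a given sample is identical whether it is processed alone or inside a larger batch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one sample's activations) using that row's own statistics."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean
    var = x.var(axis=-1, keepdims=True)    # per-sample variance
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))            # batch of 4 samples, 8 activations each

full = layer_norm(batch)                   # normalize the whole batch at once
single = layer_norm(batch[:1])             # normalize sample 0 alone (batch size 1)

# Per-sample statistics mean sample 0 is normalized identically either way:
print(np.allclose(full[0], single[0]))     # True
```

This is exactly why Layer Normalization remains well-defined at inference time with a batch size of 1, where Batch Normalization's batch statistics would be meaningless.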
The Layer Normalization Process
For a given input $\mathbf{x}$ (a vector representing the activations of all neurons in a layer for a single input), Layer Normalization is applied as follows:
- Calculate Mean and Variance: The mean ($\mu$) and variance ($\sigma^2$) are calculated across all elements of the input vector $\mathbf{x}$.
- Normalize: The input vector $\mathbf{x}$ is normalized using the calculated mean and variance, where $\epsilon$ is a small constant to prevent division by zero: $$\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
- Scale and Shift: The normalized vector $\hat{\mathbf{x}}$ is then scaled and shifted using two learned Parameters: a scaling factor ($\gamma$) and a shifting factor ($\beta$): $$\mathbf{y} = \gamma \hat{\mathbf{x}} + \beta$$ These learned parameters restore the representational power of the network, allowing it to adapt the normalized data to the optimal range for the subsequent layer.
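The three steps above can be sketched as a short NumPy function (a minimal sketch for a single activation vector; the epsilon value follows common framework defaults, and the identity $\gamma$/$\beta$ are chosen purely for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Step 1: mean and variance across all elements of the activation vector
    mu = x.mean()
    var = x.var()
    # Step 2: normalize (eps guards against division by zero)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 3: scale and shift with the learned parameters gamma and beta
    return gamma * x_hat + beta

x = np.array([2.0, 4.0, 6.0, 8.0])          # activations of one layer, one sample
gamma, beta = np.ones(4), np.zeros(4)        # identity scale/shift for illustration
y = layer_norm(x, gamma, beta)
print(y.mean().round(6), y.var().round(4))   # ≈ 0.0 and ≈ 1.0
```

With the identity $\gamma$ and $\beta$, the output has (approximately) zero mean and unit variance; during training the network learns non-trivial values for these parameters, shifting the normalized activations to whatever range best serves the next layer.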
Related Terms
- Transformer Architecture: The core LLM design where Layer Normalization is essential.
- Activation Function: The function whose inputs are stabilized by the Layer Normalization process.
- Batch Normalization: The alternative normalization technique that normalizes each feature across a batch of samples rather than within a single sample, making it ill-suited to the variable-length sequences and batch-size-1 inference common in Transformers.