A mini-batch (often simply called a batch) is a small, randomly selected subset of the total Training Set used to compute the Gradient Descent update during a single step of Training a Neural Network. Processing data in mini-batches is the foundation of Stochastic Gradient Descent (SGD) and its variants, as it provides a practical compromise between computational efficiency and the quality of the gradient estimate.
The size of the mini-batch (the batch size) is a crucial Hyperparameter that significantly impacts training speed, stability, and memory consumption.
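The training loop implied above can be sketched in a few lines. This is a minimal illustration using NumPy on a toy linear-regression problem; the dataset, learning rate, and batch size are assumptions chosen for the example, not values from any particular system.

```python
import numpy as np

# Toy dataset (assumed for illustration): 1000 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)      # model parameters
batch_size = 32      # the batch-size hyperparameter
lr = 0.1             # learning rate

for epoch in range(20):
    perm = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]        # one random mini-batch
        Xb, yb = X[idx], y[idx]
        # Mean-squared-error gradient computed on the mini-batch only,
        # a noisy estimate of the full-dataset gradient.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad                      # one update per mini-batch
```

After training, `w` lands close to `true_w`: each update uses only 32 of the 1000 examples, yet the noisy per-batch gradients average out to the true descent direction.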
Context: Relation to LLMs and Computational Constraints
Mini-batches are essential for training Large Language Models (LLMs) because the enormous size of the models and the training data makes processing the entire dataset at once (known as Batch Gradient Descent) computationally infeasible and prohibitively memory-intensive.
- Computational Efficiency: When processing a mini-batch, the computation (large matrix multiplications) can be highly parallelized on the GPU and TPU hardware used for LLM training. This parallelization is far more efficient than processing individual data points (Stochastic Gradient Descent) or the entire dataset at once. This speed gain is fundamental to the ability to train billion-parameter Transformer Architecture models.
- Gradient Stability vs. Speed:
- Large Batch Size: Provides a more accurate estimate of the true gradient over the entire dataset, leading to more stable updates and convergence in fewer steps. However, large batches require massive memory and can sometimes generalize worse (known as the “generalization gap”).
- Small Batch Size: Introduces more Noise into the gradient estimate (because the sample is smaller), which can help the model escape sharp local minima and lead to better Generalization. However, training is less computationally efficient and may take more steps to converge.
- Effective Batch Size in LLMs: Due to memory constraints, the actual batch size that fits on a single device might be small. LLM training often uses a technique called Gradient Accumulation, where the gradients from several consecutive mini-batches are accumulated before a single Parameter update is made. This allows researchers to achieve a large “effective batch size” without requiring a single, monolithic, memory-consuming batch.
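Gradient Accumulation works because averaging the gradients of several micro-batches is mathematically identical to computing the gradient of one large batch. The sketch below, with assumed sizes and a toy mean-squared-error model, verifies that equivalence in NumPy:

```python
import numpy as np

# Toy data and model (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 4))
y = rng.normal(size=256)
w = np.zeros(4)
lr = 0.01

def grad(Xb, yb, w):
    # Mean-squared-error gradient on one (micro-)batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

micro_batch = 32    # what fits in device memory (assumed)
accum_steps = 4     # accumulate 4 micro-batches
# effective batch size = micro_batch * accum_steps = 128

accum = np.zeros_like(w)
for step in range(accum_steps):
    lo = step * micro_batch
    Xb, yb = X[lo:lo + micro_batch], y[lo:lo + micro_batch]
    accum += grad(Xb, yb, w)    # accumulate; do NOT update yet
# One parameter update after all micro-batches are accumulated.
w_accum = w - lr * accum / accum_steps

# The same update computed directly on the full 128-example batch:
w_big = w - lr * grad(X[:128], y[:128], w)
# w_accum and w_big are numerically identical.
```

Only one micro-batch of activations needs to be in memory at a time, which is what makes the large effective batch size affordable.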
Three Types of Gradient Descent
The choice of batch size defines the type of gradient descent used:
| Method | Batch Size | Frequency of Update | Trade-offs |
| --- | --- | --- | --- |
| Batch Gradient Descent | All training data | One update per Epoch | High computational cost; stable, accurate gradient. |
| Stochastic Gradient Descent (SGD) | 1 data point | One update per data point | High variance (noisy updates); poor parallelization. |
| Mini-Batch Gradient Descent | $N$ data points ($N > 1$ and $N < \text{Total Data Size}$) | One update per mini-batch | Best balance of stability, efficiency, and generalization. |
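The “Frequency of Update” column can be made concrete with a quick calculation. The dataset and batch sizes below are assumptions for illustration:

```python
import math

# Assumed sizes for illustration.
n_examples = 10_000
batch_size = 64

# Updates performed in one Epoch (one full pass over the data):
batch_gd_updates = 1                                     # whole dataset at once
sgd_updates = n_examples                                 # one per data point
mini_batch_updates = math.ceil(n_examples / batch_size)  # one per mini-batch

print(batch_gd_updates, sgd_updates, mini_batch_updates)  # 1 10000 157
```

Mini-batch gradient descent thus makes far more updates per epoch than batch gradient descent, while each update is far cheaper and better parallelized than the per-example updates of pure SGD.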
Related Terms
- Stochastic Gradient Descent (SGD): The Optimization algorithm family whose practical variants compute updates on mini-batches.
- Epoch: One complete pass through the entire Training Set, comprising multiple mini-batch updates.
- Hyperparameter: A variable set by the user rather than learned by the model; batch size is a canonical example.