Padding is a crucial Preprocessing technique in machine learning, particularly in Natural Language Processing (NLP) and deep learning. It involves adding a specific, non-meaningful placeholder value (a “pad token”) to sequences that are shorter than a predetermined maximum length. The purpose is to make all input sequences uniform in length, which is a requirement for efficient parallel processing and batching in neural network architectures, especially the Transformer Architecture.
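The idea can be sketched in a few lines of plain Python. The helper `pad_sequences` below is a hypothetical illustration (not a specific library API), assuming token-ID lists and a pad ID of 0:

```python
def pad_sequences(sequences, max_len, pad_id=0):
    """Right-pad each token-ID sequence with pad_id up to max_len."""
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Two sequences of different lengths become uniform after padding.
batch = [[101, 2054, 102], [101, 2054, 2003, 1037, 102]]
padded = pad_sequences(batch, max_len=5)
# padded[0] -> [101, 2054, 102, 0, 0]
# padded[1] -> [101, 2054, 2003, 1037, 102]  (already at max_len, unchanged)
```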
Context: Relation to LLMs and Search
Padding is a necessary evil that allows Large Language Models (LLMs) to handle the natural variability of human language while still leveraging the speed of vectorized hardware.
- Batch Processing: Modern LLMs are trained and run inference using batches—groups of multiple inputs processed simultaneously. To efficiently process a batch of sentences like:
- “What is RAG?” (3 tokens)
- “Explain the Transformer Architecture.” (5 tokens)
- “The dog ran.” (3 tokens)

the system must make them the same length. If the maximum length (max_len) for the batch is 5 tokens, the shorter sequences must be padded to 5.
- The Pad Token: The placeholder used for padding is a special, dedicated Token (e.g., [PAD] or token ID 0). It is essential that the model’s Attention Mechanism learns to ignore these tokens, to prevent the padding from introducing noise or skewing the calculation of Vector Embeddings and Prediction.
- Attention Mask: To ensure the model ignores pad tokens, an accompanying attention mask is created. This mask is a binary vector (1s for real tokens, 0s for pad tokens) that causes the Self-Attention Mechanism to assign effectively zero Weights to pad positions, typically by setting their attention scores to negative infinity before the softmax.
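The interaction of mask and attention can be sketched in pure Python. Both helpers here (`pad_with_mask`, `masked_softmax`) are hypothetical names for illustration; the key point is that pad positions receive a score of negative infinity, so the softmax gives them exactly zero weight:

```python
import math

def pad_with_mask(seq, max_len, pad_id=0):
    """Right-pad a token-ID sequence and build its binary attention mask."""
    n_pad = max_len - len(seq)
    return seq + [pad_id] * n_pad, [1] * len(seq) + [0] * n_pad

def masked_softmax(scores, mask):
    """Softmax over attention scores, with pad positions forced to zero weight."""
    # Pad positions get -inf before the softmax, so exp(-inf) contributes 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    mx = max(masked)
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens, mask = pad_with_mask([101, 2054, 102], max_len=5)
# mask -> [1, 1, 1, 0, 0]
weights = masked_softmax([0.9, 0.4, 0.2, 0.7, 0.7], mask)
# weights[3] and weights[4] are 0.0; the first three weights sum to 1.0
```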
Padding and Truncation
Padding is closely related to the concept of truncation, and both are handled based on the Context Window or the defined max_len:
| Operation | Purpose | Context Window Relationship |
| --- | --- | --- |
| Padding | Adds pad tokens to short sequences. | Used when sequence length < max_len (or Context Window). |
| Truncation | Cuts off excess tokens from long sequences. | Used when sequence length > max_len (or Context Window). |
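The two operations in the table are often combined into a single preprocessing step that fixes every sequence to exactly max_len. A minimal sketch (the helper name `pad_or_truncate` is illustrative, not a specific library API):

```python
def pad_or_truncate(seq, max_len, pad_id=0):
    """Fix a sequence to exactly max_len tokens: truncate if long, pad if short."""
    if len(seq) > max_len:
        return seq[:max_len]                       # truncation
    return seq + [pad_id] * (max_len - len(seq))   # padding

pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)  # -> [1, 2, 3, 4, 5]
pad_or_truncate([1, 2, 3], max_len=5)              # -> [1, 2, 3, 0, 0]
```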
In Retrieval-Augmented Generation (RAG), queries and retrieved passages are often concatenated to fill as much of the LLM’s Context Window as possible, maximizing the information available for generating the final Generative Snippet; the combined sequence is then padded (or truncated) to a fixed length for batched processing.
Padding Side
Padding can be applied to either the front (left) or the back (right) of a sequence:
- Right Padding: Placing pad tokens at the end of the sentence. This is the common default for many models (like BERT) because it aligns with how text is naturally read and generated.
- Left Padding: Placing pad tokens at the beginning of the sentence. This is often preferred for autoregressive (decoder-only) models (like GPT) during inference, as it keeps the real content at the end of the sequence, so the final position, from which the next token is predicted, holds a real token rather than padding.
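The two padding sides can be compared in a few lines. The `side` parameter here is an illustrative sketch (it mirrors the kind of option tokenizer libraries expose, but is not a specific API):

```python
def pad(seq, max_len, pad_id=0, side="right"):
    """Pad a token-ID sequence on the chosen side up to max_len."""
    pads = [pad_id] * (max_len - len(seq))
    return seq + pads if side == "right" else pads + seq

pad([7, 8, 9], 5, side="right")  # -> [7, 8, 9, 0, 0]  (e.g., BERT-style)
pad([7, 8, 9], 5, side="left")   # -> [0, 0, 7, 8, 9]  (decoder-only inference)
```

With left padding, the last list element is always a real token, which matters because autoregressive generation continues from that final position.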
Related Terms
- Tokenization: The step that converts words into the numerical inputs that are then padded.
- Context Window: The maximum length that determines how much padding or truncation is needed.
- Preprocessing: The general phase of the workflow where padding is applied.