Stride is a Hyperparameter in machine learning, most prominently associated with Convolutional Neural Networks (CNNs). It defines the size of the step the filter (or kernel) takes as it slides across the input data (such as an image or a sequence of Vector Embeddings). Together with the filter size, the stride determines how much successive applications of the filter overlap: a filter of size $F$ moved with a stride $S < F$ overlaps its previous position by $F - S$ units.
Context: Relation to LLMs and Search
While stride is less central to the Transformer Architecture that powers most modern Large Language Models (LLMs), the concept of stepping through a sequence remains highly relevant in data preprocessing and in older or specialized sequential models used in Generative Engine Optimization (GEO).
- Sequential Data Segmentation: When text is tokenized and prepared for an LLM, a stride is used to create overlapping windows of text. When a long document is broken into chunks to fit the LLM’s Context Window, a common technique is to advance each chunk by a stride smaller than the chunk length, so that consecutive chunks share some tokens. The overlap equals the chunk length minus the stride, although some tokenization libraries use the name stride for the overlap itself. This overlap keeps semantic relationships from being cut off abruptly at chunk boundaries, which helps Retrieval-Augmented Generation (RAG) retrieve coherent passages.
- 1D Convolutions in NLP: Older NLP models, or certain components within hybrid LLM architectures, use 1D CNNs to extract features from word sequences. In this context, the stride dictates how quickly the filter moves over the sequence of Word Embeddings. A larger stride reduces the sequence length and computational complexity at the expense of potentially missing fine-grained local patterns.
- Efficient Feature Mapping: In CNNs, a larger stride results in a smaller output feature map, reducing the number of calculations required for subsequent layers. A stride of 1 means the filter moves one unit at a time, resulting in maximum overlap and the largest output. A stride of 2 means the filter moves two units at a time, so it is applied at every other position and the output is roughly half as long.
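The overlapping windows described under Sequential Data Segmentation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `chunk_with_overlap` and the parameters `chunk_size` and `overlap` are assumptions made for the example, and the step between window starts is taken to be `chunk_size - overlap`.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks


# Hypothetical token ids standing in for a tokenized document.
tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Each window repeats the last two tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk. The final window may be shorter than `chunk_size` when the sequence length does not line up with the step.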
The Mechanics: Stride and Output Size
The stride value ($S$) affects the dimensions of the output ($O$) of a convolutional layer. For a 1D convolution over a sequence of length $W$ with filter size $F$ and padding $P$ applied to each side, the output length is:
$$O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1$$
- Example (1D Sequence): If a sequence of length $W=10$ is processed by a filter of size $F=3$ with no padding ($P=0$):
  - Stride $S=1$: $O = \lfloor \frac{10 - 3 + 0}{1} \rfloor + 1 = 8$. The filter is applied 8 times.
  - Stride $S=2$: $O = \lfloor \frac{10 - 3 + 0}{2} \rfloor + 1 = \lfloor 3.5 \rfloor + 1 = 4$. The filter is applied 4 times, skipping every other position.
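The arithmetic in this example can be checked with a short Python sketch. The helper names `conv1d_output_length` and `window_starts` are illustrative assumptions. The first evaluates the formula directly, and the second lists the positions where the filter is actually placed, so the two counts can be compared.

```python
def conv1d_output_length(w, f, s, p=0):
    """Output length O = floor((W - F + 2P) / S) + 1 for a 1D convolution."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    if f > w + 2 * p:
        raise ValueError("filter is longer than the padded input")
    return (w - f + 2 * p) // s + 1


def window_starts(w, f, s, p=0):
    """Start indices (in the padded sequence) where the filter is applied."""
    padded_length = w + 2 * p
    return list(range(0, padded_length - f + 1, s))


print(conv1d_output_length(10, 3, 1))  # 8
print(conv1d_output_length(10, 3, 2))  # 4
print(window_starts(10, 3, 2))         # [0, 2, 4, 6]
```

With $S=2$ the last window starts at index 6 and covers indices 6 to 8, so the final element of the sequence (index 9) is never visited. This is the effect of the floor in the formula.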
Related Terms
- Convolutional Neural Network (CNN): The primary architecture where the concept of stride is applied.
- Padding: A related hyperparameter that adds extra values (usually zeros) around the input, often used to keep the output the same size as the input or to make sure positions near the edges are still covered when a larger stride is used.
- Context Window: The size constraint that necessitates document segmentation and the use of stride for overlapping chunks.