An N-gram is a contiguous sequence of $N$ items from a given sample of text or speech. The items are typically tokens (words, characters, or phonemes). The $N$-gram model is a simple statistical method used in Natural Language Processing (NLP) to predict the next item in a sequence based only on the preceding $N-1$ items. Although they are a foundational concept, $N$-grams have been largely superseded by deep learning models like the Transformer Architecture, which can model much longer and more complex dependencies.
Context: Relation to LLMs and Traditional NLP
While modern Large Language Models (LLMs) do not rely on traditional $N$-gram counting, the concept is fundamental to understanding how models once handled sequential language prediction and remains relevant in specialized tasks.
- Statistical Language Modeling: Historically, before deep learning, language models were based on $N$-gram counts. To predict the next word, the model would simply calculate the probability of that word appearing, given the previous $N-1$ words in the Training Set.
- Example (Trigram $N=3$): If the sentence is “The cat sat on the…”, the model looks up the frequency of all words that followed “sat on the” in its training data to make the prediction.
- Limitations: The major flaw of $N$-grams is the “curse of dimensionality” and the inability to handle long-range dependencies. To model long contexts (e.g., $N=10$), the number of possible sequences becomes astronomically large, leading to data sparsity (most sequences never appear in the training data). Modern Transformers solve this by using the Attention Mechanism to look at the entire Context Window, regardless of the distance between tokens.
- Current Relevance in GEO:
- Keyword Matching: $N$-grams (especially bigrams and trigrams) are still used in basic keyword search indexing and matching, particularly for phrase-based search queries.
- Feature Engineering: $N$-grams can be used as features in simple machine learning models (like Naive Bayes) for tasks like spam filtering or authorship detection due to their speed.
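The lookup described above (count which words followed a given context, then pick the most frequent) can be sketched in a few lines. The toy corpus and function names below are illustrative, not from any particular library:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would be estimated from a large text collection.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# For a trigram model (N=3), count how often each word follows
# each two-word context.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the most frequent word observed after the context (w1, w2)."""
    followers = trigram_counts[(w1, w2)]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("sat", "on"))  # "the" — the only word seen after "sat on"
```

Given “sat on the”, the model would consult `trigram_counts[("on", "the")]`, which here contains both “mat” and “rug”, mirroring the frequency lookup in the trigram example above.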
Common N-gram Types
The value of $N$ determines the size of the sequence:
| N | Term | Sequence Length | Example (from “The quick brown fox”) |
|---|------|-----------------|--------------------------------------|
| 1 | Unigram | 1 word | “The”, “quick”, “brown”, “fox” |
| 2 | Bigram | 2 words | “The quick”, “quick brown”, “brown fox” |
| 3 | Trigram | 3 words | “The quick brown”, “quick brown fox” |
| N | N-gram | N words | “The quick brown fox” ($N=4$) |
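Extracting these sequences amounts to sliding a window of size $N$ over the token list. A minimal sketch (the `ngrams` helper is illustrative, not a library function):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, yielding each n-gram."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox".split()
print(ngrams(tokens, 2))  # [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 4))  # [('The', 'quick', 'brown', 'fox')]
```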
N-gram Model Calculation
The probability of a word $w_i$ given the previous $N-1$ words is calculated as:
$$P(w_i \mid w_{i-(N-1)}, \ldots, w_{i-1}) = \frac{\text{Count}(w_{i-(N-1)}, \ldots, w_{i-1}, w_i)}{\text{Count}(w_{i-(N-1)}, \ldots, w_{i-1})}$$
This ratio represents the number of times the full $N$-gram appeared divided by the number of times the $N-1$ preceding sequence appeared. Smoothing techniques (like Laplace or Kneser-Ney smoothing) are often necessary to handle sequences that were never seen in the training data (the “zero-frequency problem”).
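This ratio and the effect of smoothing can be made concrete with a bigram ($N=2$) sketch. The corpus is a toy example, and Laplace (add-one) smoothing is used because it is the simplest to show, though Kneser-Ney generally performs better in practice:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)

bigram_counts = Counter(zip(corpus, corpus[1:]))  # Count(w_{i-1}, w_i)
unigram_counts = Counter(corpus)                  # Count(w_{i-1})

def p_mle(prev, word):
    """Unsmoothed ratio: Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def p_laplace(prev, word):
    """Add-one smoothing: unseen bigrams get a small nonzero probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

print(p_mle("the", "cat"))      # 2/3: "the cat" appears twice, "the" three times
print(p_mle("the", "sat"))      # 0.0: the zero-frequency problem
print(p_laplace("the", "sat"))  # 1/9: nonzero after smoothing
```

Without smoothing, any sentence containing an unseen $N$-gram is assigned probability zero; add-one smoothing pretends every bigram was seen one extra time, which removes the zeros at the cost of shifting some probability mass away from observed sequences.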
Related Terms
- Tokenization: The process of creating the individual units (tokens) from which $N$-grams are formed.
- Transformer Architecture: The deep learning structure that replaced $N$-gram models as the primary method for sequential language modeling.
- Natural Language Processing (NLP): The broader field where $N$-grams are studied and applied.