Prediction is the core process in machine learning where a trained model uses new, unseen input data to forecast an outcome, estimate a numerical value, or determine the likelihood of a future event. In the context of Large Language Models (LLMs), prediction is fundamentally the task of determining the most probable next word or Token in a sequence, based on the statistical patterns learned during Pre-training.
Context: Relation to LLMs and Search
Prediction, often referred to as Inference in deep learning, is the operational phase of an LLM. It drives every output in a Generative Engine Optimization (GEO) system, from generating a Generative Snippet to ranking documents.
- LLM as a Predictor: An LLM based on the Transformer Architecture is, at its core, a sophisticated sequence predictor. When you ask an LLM a question, it predicts the sequence of words that, based on its training, most logically and coherently follows your input. This is achieved by iteratively calculating a Probability Distribution over its entire Vocabulary and then selecting (or sampling) the next token, repeating the process until an end-of-sequence token is predicted.
- Retrieval-Augmented Prediction: In a Retrieval-Augmented Generation (RAG) system, the LLM bases its prediction not only on its Prior Probability (general knowledge) but also on the fresh, factual evidence in the retrieved document chunks. This is a form of conditional prediction: the outcome is conditioned on the retrieved context.
- Non-Generative Prediction: Prediction is also used in non-generative GEO tasks, such as:
- Reranking: A Ranking Algorithm predicts a continuous Relevance score for a document, which is a Regression task.
- Classification: A model predicts the user’s intent (e.g., informational vs. transactional), which is a Text Classification task.
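Both non-generative cases above can be sketched in a few lines. The features, weights, and class names below are illustrative assumptions, not a real ranking model: a linear scorer stands in for regression-style reranking, and a softmax over two made-up intent classes stands in for classification.

```python
import math

def relevance_score(features, weights, bias=0.0):
    """Regression-style prediction: a single continuous relevance score."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def intent_probabilities(logits):
    """Classification-style prediction: softmax over intent classes."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical document features: [term overlap, freshness, click rate]
doc_features = [0.8, 0.3, 0.5]
weights = [2.0, 0.5, 1.0]
print(relevance_score(doc_features, weights))   # continuous score used for ranking

# Hypothetical logits for the classes ["informational", "transactional"]
print(intent_probabilities([2.0, 0.5]))          # probabilities summing to 1
```

The key distinction: the reranker emits one unbounded number per document (higher means more relevant), while the classifier emits a probability per class and the prediction is the argmax.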
The Prediction Process (Inference)
The prediction process for text generation involves the following stages:
- Input Conditioning: The prompt and context are converted into Vector Embeddings and passed through the model.
- Probability Calculation: The model uses its Weights to calculate a probability distribution for the next token, typically finalized using the Softmax Function.
- Decoding: A decoding strategy (such as Greedy Search, Beam Search, or temperature-based sampling) selects the next token from the probability distribution.
- Iteration: The newly selected token is appended to the input, and the entire process repeats until the desired length is reached or a stop token is encountered.
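The four stages above can be sketched as a toy generation loop. The vocabulary, the `toy_forward` logit function, and greedy decoding are all illustrative assumptions standing in for a real model's forward pass; only the structure (logits → softmax → select → append → repeat until a stop token) mirrors the process described.

```python
import math

# Tiny illustrative vocabulary; "<eos>" plays the role of the stop token.
VOCAB = ["the", "cat", "sat", "<eos>"]

def softmax(logits):
    """Stage 2: turn raw logits into a probability distribution."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_forward(tokens):
    """Stand-in for stage 1: a 'model' that returns one logit per vocabulary
    entry. Longer sequences push probability toward the stop token."""
    return [1.0, 0.5, 0.2, len(tokens) * 0.4]

def greedy_generate(prompt, max_len=6):
    tokens = list(prompt)
    while len(tokens) < max_len:
        probs = softmax(toy_forward(tokens))          # stage 2: probabilities
        next_token = VOCAB[probs.index(max(probs))]   # stage 3: greedy decode
        tokens.append(next_token)                     # stage 4: append & repeat
        if next_token == "<eos>":                     # stop-token check
            break
    return tokens

print(greedy_generate(["the", "cat"]))
```

In a real system the forward pass is a full Transformer evaluation and the distribution covers tens of thousands of tokens, but the loop has exactly this shape.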
Prediction vs. Inference
While often used interchangeably, Prediction is the high-level outcome (the forecast or output), and Inference is the low-level, technical process of running the model’s forward pass to arrive at that prediction, with a focus on speed and efficiency.
Related Terms
- Inference: The technical process of prediction.
- Probability Distribution: The raw output of the model that the final prediction is sampled from.
- Generative Snippet: The final, synthesized output text that is the result of the prediction process.