Regression is a category of Supervised Learning tasks in machine learning where the goal is to predict a continuous numerical value (a quantity) rather than a discrete class label. Regression models learn the relationship between a set of input features (independent variables) and a continuous target variable (dependent variable) to forecast outcomes like stock prices, temperature, house values, or, in the context of Large Language Models (LLMs), quality scores.
Context: Relation to LLMs and Search
While Text Classification is more commonly associated with LLMs, regression tasks are fundamental for evaluating quality, prioritizing information, and optimizing the alignment of generative models in Generative Engine Optimization (GEO).
- Reward Model Scoring: The Reward Model (RM) used in Reinforcement Learning from Human Feedback (RLHF) is often trained as a regression model. Instead of classifying a generated response as “Good” or “Bad,” the RM predicts a continuous preference score (e.g., from 0.0 to 1.0). This numerical score acts as the reward signal for the LLM to maximize during its final training stage.
- Information Quality Assessment: Regression is used in the Retrieval component of a Retrieval-Augmented Generation (RAG) system to rerank retrieved documents. A specialized reranking model takes a query and a document chunk and outputs a continuous relevance score (e.g., 0.95), a regression task whose output prioritizes the most relevant context for the LLM’s Context Window.
- GEO Utility: Regression models provide a continuous measure of quality, which is more granular than simple classification. This allows GEO specialists to track incremental improvements in model performance and information quality more effectively.
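The reranking idea above can be sketched with a toy scorer. This is not a production reranker (those are typically learned cross-encoders); it is a minimal stand-in that shows the regression-style interface, where a function maps a (query, document) pair to a continuous relevance score used for sorting.

```python
import math
import re
from collections import Counter

def relevance_score(query: str, doc: str) -> float:
    """Toy regression-style scorer: cosine similarity of word-count
    vectors, yielding a continuous relevance value in [0.0, 1.0].
    A learned reranker predicts this mapping instead of computing it."""
    q = Counter(re.findall(r"\w+", query.lower()))
    d = Counter(re.findall(r"\w+", doc.lower()))
    dot = sum(q[w] * d[w] for w in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

query = "capital of france"
docs = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
    "France borders Spain and Germany.",
]
# Rerank: sort documents by their continuous relevance score, descending.
ranked = sorted(docs, key=lambda doc: relevance_score(query, doc), reverse=True)
```

The continuous score is the point: unlike a binary relevant/not-relevant label, it imposes a total ordering on the candidates, which is what the reranking step needs.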
Regression vs. Classification
The key distinction lies in the output variable:
| Feature | Regression | Classification |
| --- | --- | --- |
| Output Type | Continuous numerical value (e.g., 45.7, -0.2, 10,000). | Discrete category or label (e.g., Spam, Not Spam; Positive, Negative). |
| LLM Application | Reward Model (RM) scoring, reranking relevance. | Sentiment Analysis, intent detection. |
| Error Metric | Mean Squared Error (MSE), Root Mean Squared Error (RMSE). | Cross-Entropy Loss, Accuracy. |
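The regression error metrics in the table are simple averages over prediction errors. A short sketch with illustrative numbers (the values here are made up for the example):

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences between
    predicted and true continuous values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.5]   # ground-truth continuous targets
y_pred = [2.5, 5.0, 3.5]   # model predictions

error = mse(y_true, y_pred)   # (0.25 + 0.0 + 1.0) / 3
rmse = math.sqrt(error)       # RMSE is just the square root of MSE
```

RMSE is often preferred for reporting because it is in the same units as the target variable, whereas MSE is in squared units.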
Types of Regression
While simple Linear Regression fits a straight line to the data, more complex models used in deep learning include:
- Neural Regression: A neural network (often with a single, linear output layer and no non-linear Activation Function like Sigmoid or Softmax) trained to predict the numerical target.
- Logistic Regression: Despite its name, Logistic Regression is a classification model because it uses the Sigmoid function to output a probability (a value between 0 and 1) for a binary outcome, not a continuous, unbounded quantity.
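Both points above can be illustrated with a minimal sketch: a one-feature model with a linear (identity) output, trained by gradient descent on MSE. The data and hyperparameters are invented for the example. Applying the sigmoid to the same linear output would turn the model into logistic regression, i.e., a classifier with a bounded probability output.

```python
import math

# Toy training data, roughly y = 2x (values invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w, b = 0.0, 0.0   # single weight and bias: a linear "regression head"
lr = 0.01
for _ in range(2000):
    # Forward pass: linear output, no activation function (regression).
    preds = [w * x + b for x in xs]
    # Gradients of MSE with respect to w and b.
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

def sigmoid(z):
    """Logistic Regression applies this to the linear output, squashing
    it into (0, 1), which is why it is a classifier, not a regressor."""
    return 1.0 / (1.0 + math.exp(-z))
```

After training, `w` lands near 2.0, and `w * x + b` extrapolates to unseen inputs as an unbounded continuous prediction, something the sigmoid-squashed output of a classifier cannot do.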
Related Terms
- Supervised Learning (SL): The general training category that encompasses both regression and classification.
- Loss Function: The objective function used to train a regression model (e.g., minimizing the difference between the predicted value and the Ground Truth value).
- Vector Embedding: A common form of input features for text-based regression models (the numerical representation of the text).