In Machine Learning (ML), a Label is the “answer” or ground truth associated with a single piece of input data. It is the target variable that the model is trained to predict.
Labels are central to Supervised Learning, where the model learns by mapping input features to these known output values.
Context: Relation to LLMs and Data Curation
Labels define the entire training process for any task where a Large Language Model (LLM) needs to learn a specific, desired output, particularly during the crucial Fine-Tuning and alignment stages.
1. Types of Labels
The nature of the label depends on the task the model is performing:
| ML Task | Label Type | Example | LLM/Search Application |
| --- | --- | --- | --- |
| Classification | Discrete, categorical value | Spam / Not Spam, Positive / Negative, Category A / Category B | Sentiment analysis, user intent classification |
| Regression | Continuous numerical value | A house price of $350,000, a temperature of 25°C | Predicting a search document’s Relevance score |
| Sequence-to-Sequence | Another sequence of text | The translated sentence, the summarized paragraph | Machine Translation (MT), summarization |
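The distinction between the first two label types can be sketched as simple (input, label) pairs; the data values below are purely illustrative:

```python
# Classification: the label is a discrete category.
classification_data = [
    ("Win a free prize now!!!", "spam"),
    ("Meeting moved to 3pm",    "not_spam"),
]

# Regression: the label is a continuous numerical value.
regression_data = [
    ({"sqft": 1400, "bedrooms": 3}, 350_000.0),  # house price in USD
    ({"sqft": 2100, "bedrooms": 4}, 510_000.0),
]

# In both cases, training separates inputs from their labels.
labels_cls = [label for _, label in classification_data]
labels_reg = [label for _, label in regression_data]
```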
2. Labels in LLM Training
- Self-Supervised Labels: During the initial, massive Pre-training of an LLM, the model creates its own labels from the raw text, making the process self-supervised.
- Causal Language Models (GPT): The label for each word is simply the next word in the sequence.
- Masked Language Modeling (MLM): The label is the original word that was replaced by the `[MASK]` token.
- Human-Annotated Labels (Fine-Tuning): After pre-training, LLMs are Fine-Tuned on high-quality, human-labeled data to specialize the model.
- Instruction Tuning: The human-written “desired answer” to a given “prompt” is the label.
- Reinforcement Learning from Human Feedback (RLHF): Humans provide a preference label (ranking one model response as better than another), which is used to train a Reward Model.
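The self-supervised case above can be sketched in a few lines: a causal language model derives its labels from the raw text itself by shifting the sequence one position, so the label for each token is simply the next token.

```python
# Raw text, already split into tokens (tokenization details omitted).
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Inputs are every token except the last; labels are the same sequence
# shifted one position to the left. No human annotation is needed.
inputs = tokens[:-1]
labels = tokens[1:]

pairs = list(zip(inputs, labels))
# e.g. the model sees "The" and is trained to predict "cat".
```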
3. Label Quality and GEO
The quality of the labels in the Training Set is one of the most important factors determining the final performance of a supervised model. Poorly or inconsistently labeled data (known as label noise) can cause the model to learn incorrect patterns, leading to poor Generalization and suboptimal results in applications like Generative Engine Optimization (GEO). Techniques like Label Smoothing are used to mitigate the negative effects of this noise.
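Label Smoothing can be sketched as follows: a one-hot label is softened so the model is never pushed toward full certainty on a possibly noisy label. The smoothing factor `epsilon` below is an illustrative choice, not a prescribed value:

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Spread a fraction epsilon of the probability mass uniformly
    across all k classes, keeping (1 - epsilon) on the labeled class."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]

# A hard one-hot label for class 1 of 3 ...
smoothed = smooth_labels([0.0, 1.0, 0.0])
# ... keeps most of the mass on class 1, with the rest spread uniformly.
```

Because the target is no longer exactly 0 or 1, a single mislabeled example pulls the model's weights less strongly toward the wrong answer.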
Related Terms
- Training Set: The collection of input features and their corresponding labels used to teach the model.
- Loss Function: The mathematical function that measures the difference between the model’s prediction and the true label.
- Ground Truth: The term for the accurate, verified label data.
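The Loss Function's role can be illustrated with cross-entropy, a common choice for classification: it penalizes predictions that put low probability on the true label. The probability values here are illustrative:

```python
import math

def cross_entropy(predicted_probs, true_label_index):
    """Negative log-probability the model assigned to the true label."""
    return -math.log(predicted_probs[true_label_index])

# A confident, correct prediction yields a low loss ...
low = cross_entropy([0.05, 0.90, 0.05], true_label_index=1)
# ... while the same prediction scored against a different true label
# (as if the example had been mislabeled) yields a much higher loss.
high = cross_entropy([0.05, 0.90, 0.05], true_label_index=0)
```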