Knowledge Distillation (KD) is a model compression technique in Machine Learning (ML) where the “knowledge” of a large, complex, and highly accurate model (Teacher Model) is transferred to a much smaller, faster, and more efficient model (Student Model).
The Student Model is trained to mimic the Teacher Model’s output behavior: it learns not only from the hard, correct Labels (the ground truth), but also from the soft targets—the full probability distribution (the confidence scores) produced by the Teacher. This allows the small model to retain most of the large model’s accuracy while dramatically improving Inference speed and resource efficiency.
Context: Relation to LLMs and Efficiency
Knowledge Distillation is critical for deploying high-performance Large Language Models (LLMs) in real-world applications, especially in Generative Engine Optimization (GEO) where latency and cost are major concerns.
- The Efficiency Problem: Large LLMs, which are based on the Transformer Architecture, are computationally expensive due to their vast number of Parameters. Running them in production for tasks like Neural Search or real-time generation is often too slow and costly.
- The KD Solution: Knowledge Distillation creates smaller, “distilled” versions of the LLM that can run much faster on less powerful hardware (e.g., edge devices or typical CPU servers). The resulting Student Model is often 10x to 100x smaller than the Teacher while retaining the bulk of the original performance.
- Example (BERT → DistilBERT): The widely known DistilBERT model is a Student Model distilled from the larger BERT (a key LLM encoder for search). DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its language understanding capabilities.
- Transferring Soft Targets: The key benefit of KD is the use of soft targets. If the Teacher Model assigns a low probability (a soft target) to an incorrect answer, that tiny, nuanced information is still valuable for the Student Model. By training on the full, smoothed distribution of the Teacher’s Logits, the Student learns the relative similarity and complexity of classes, not just the single correct answer.
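The smoothing of the Teacher’s distribution is usually controlled by a temperature applied inside the softmax (a standard practice in distillation, not something specific to any one library). A minimal sketch, with hypothetical teacher logits for three made-up classes, shows how a higher temperature exposes the “dark knowledge” in the low-probability classes:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature smooths the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes ["cat", "dog", "car"]
teacher_logits = [5.0, 3.0, -2.0]

hard = softmax(teacher_logits, temperature=1.0)  # peaked: almost all mass on "cat"
soft = softmax(teacher_logits, temperature=4.0)  # smoothed soft targets
# At T=4 the "dog" and "car" probabilities grow, so the Student can see
# the Teacher's judgment that "dog" is far more similar to "cat" than "car" is.
```

The Student is trained against the smoothed distribution (`soft`), which carries the inter-class similarity information that a one-hot label discards.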
The Knowledge Distillation Process
Training the Student Model involves a specialized Loss Function that is a combination of two terms:
- Distillation Loss (Soft Targets): Measures the difference between the Teacher’s output probabilities and the Student’s output probabilities (usually using Kullback-Leibler (KL) Divergence).
- Student Loss (Hard Labels): Measures the difference between the Student’s output and the true ground truth Labels (usually using Cross-Entropy Loss).
The training objective is to minimize the total loss, forcing the Student to reproduce the fine-grained knowledge and confidence scores of the superior Teacher Model.
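The two-term objective above can be sketched in a few lines of plain Python. The `alpha` weighting and the temperature value are hypothetical hyperparameters (the T² rescaling of the soft term follows the common formulation from Hinton et al.’s original distillation work); a real training loop would compute this over batches with an autodiff framework:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the Student's distribution q is from the Teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, true_index):
    """Standard cross-entropy against the one-hot ground-truth label."""
    return -math.log(probs[true_index])

def distillation_loss(teacher_logits, student_logits, true_index,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of the Distillation Loss (soft targets) and the
    Student Loss (hard labels). alpha and temperature are illustrative values."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # T**2 compensates for the gradient scaling introduced by the temperature.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    hard_loss = cross_entropy(softmax(student_logits), true_index)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

When the Student’s logits match the Teacher’s exactly, the KL term vanishes and only the hard-label cross-entropy remains; the further the Student drifts from the Teacher’s distribution, the larger the combined loss.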
Related Terms
- Inference: The operational phase whose speed is dramatically improved by using compressed Student Models.
- Transformer Architecture: The complex architecture of the Teacher Models that are often distilled.
- Kullback-Leibler (KL) Divergence: The common metric used to calculate the difference between the Teacher’s and Student’s output distributions.