K-Nearest Neighbors (kNN) is a simple, non-parametric, and instance-based algorithm used for both Classification and Regression. It is considered a lazy learning algorithm because it does not generalize from the training data during the Training phase. Instead, it stores all the training data and only performs computation during the Inference (prediction) phase.
When making a prediction for a new data point, kNN finds the $K$ closest data points (the neighbors) in the feature space, which for text and other unstructured data is typically a Vector Embedding space.
- For Classification, the new point is assigned the class most common among its $K$ neighbors (a majority vote).
- For Regression, the new point is assigned the average of its $K$ neighbors' values (a minimal sketch of both modes follows this list).
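To make the mechanics concrete, here is a minimal NumPy sketch of both modes, with a made-up toy dataset; it illustrates the brute-force idea rather than a production implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, mode="classification"):
    """Brute-force kNN: rank every training point by its Euclidean distance to the query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to each training point
    nearest = np.argsort(dists)[:k]                     # indices of the K closest neighbors
    neighbor_labels = y_train[nearest]
    if mode == "classification":
        # Majority vote among the K neighbors.
        values, counts = np.unique(neighbor_labels, return_counts=True)
        return values[np.argmax(counts)]
    # Regression: average of the K neighbors' target values.
    return neighbor_labels.mean()

# Toy data (invented for illustration): two features, two classes.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # -> 0 (two of the three neighbors are class 0)
```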
Context: Relation to LLMs and Neural Search
kNN is fundamentally related to modern Neural Search (Vector Search) because both techniques rely on finding the closest vectors in a high-dimensional Latent Space.
- The Conceptual Basis of Vector Search: Modern Generative Engine Optimization (GEO) uses retrieval systems that find documents based on the proximity of their Vector Embeddings to the query vector. This is essentially a massive, high-dimensional kNN search: the “neighbors” are the documents, the metric is typically Cosine Similarity (or Euclidean distance), and the best documents found are the $K$ nearest neighbors to the query in the semantic space (a minimal retrieval sketch appears after this list).
- kNN for LLM Augmentation: kNN can also be used to augment Large Language Models (LLMs) directly, without additional Fine-Tuning. In kNN-LMs, the model’s next-token prediction is interpolated with the result of a kNN search over a large datastore of (context, next-token) pairs built from training text. If the search finds highly similar contexts nearby in the Latent Space, the model shifts probability toward the tokens that followed them, often leading to more factual and context-aware generations (sketched below).
- Dimensionality and Computational Cost: While simple, brute-force kNN must compare the query against every stored point, so its cost grows linearly with the size of the datastore, and distances become less discriminative as the number of dimensions rises (the Curse of Dimensionality). Because modern LLM Vector Embeddings have hundreds or thousands of dimensions and corpora contain millions of documents, real-world Neural Search systems use faster Approximate Nearest Neighbors (ANN) indexes rather than exact brute-force kNN (see the ANN sketch below).
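To make the connection to Vector Search concrete, a brute-force Cosine Similarity retrieval step can be sketched in a few lines of NumPy; the document embeddings and query vector below are random placeholders standing in for real embedding-model output:

```python
import numpy as np

def top_k_cosine(doc_embeddings, query_embedding, k=3):
    """Exact (brute-force) kNN retrieval: rank documents by cosine similarity to the query."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = docs @ query                 # cosine similarity of every document to the query
    top = np.argsort(-scores)[:k]         # indices of the K most similar documents
    return top, scores[top]

# Placeholder embeddings: 5 documents in a 4-dimensional space (real embeddings have far more dimensions).
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 4))
query_vec = rng.normal(size=4)
indices, sims = top_k_cosine(doc_vecs, query_vec, k=3)
print(indices, sims)
```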
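The kNN-LM idea can be sketched similarly: a distribution built from the retrieved neighbors is interpolated with the model’s own next-token distribution. Everything below (the toy vocabulary, probabilities, neighbor distances, and the interpolation weight) is invented for illustration and is not the exact formulation of any specific implementation:

```python
import numpy as np

def knn_lm_next_token(p_model, neighbor_tokens, neighbor_dists, vocab_size, lam=0.25):
    """Mix the LM's next-token distribution with a distribution derived from retrieved neighbors."""
    # Convert neighbor distances into weights: closer neighbors get more probability mass.
    weights = np.exp(-np.asarray(neighbor_dists, dtype=float))
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for token, w in zip(neighbor_tokens, weights):
        p_knn[token] += w                 # each neighbor votes for the token that followed its context
    # Interpolate the non-parametric (kNN) and parametric (model) distributions.
    return lam * p_knn + (1.0 - lam) * p_model

# Toy vocabulary of 5 tokens; made-up model distribution and retrieved neighbors.
p_model = np.array([0.1, 0.5, 0.2, 0.1, 0.1])
p_final = knn_lm_next_token(p_model, neighbor_tokens=[2, 2, 4], neighbor_dists=[0.1, 0.2, 0.9], vocab_size=5)
print(p_final, p_final.sum())  # probability shifts toward token 2; the result still sums to 1
```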
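At scale, the exact scan above is usually replaced by an ANN index. One possible sketch, assuming the faiss library is installed (the index type and parameters here are illustrative, not a recommendation):

```python
import faiss
import numpy as np

dim = 128
corpus = np.random.random((10_000, dim)).astype("float32")   # placeholder document embeddings
queries = np.random.random((5, dim)).astype("float32")       # placeholder query embeddings

index = faiss.IndexHNSWFlat(dim, 32)       # graph-based approximate (HNSW) index
index.add(corpus)                          # build the index over the corpus
distances, ids = index.search(queries, 5)  # approximate 5 nearest neighbors per query
print(ids.shape)                           # (5, 5): top-5 neighbor ids for each query
```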
Key Components of kNN
- K (The Hyperparameter): The number of neighbors to consider. Choosing an optimal $K$ is crucial: a small $K$ makes the model sensitive to Noise (high variance), while a large $K$ over-smooths the decision boundary and can cause Underfitting (high Bias).
- Distance Metric: The mathematical formula used to define “nearest.” Common choices are Euclidean distance (straight-line distance, the classic default for kNN) and Cosine Similarity (which measures the angle between vectors, common for Vector Embeddings); the two can disagree, as the short example below shows.
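As a small illustration of how the two metrics can disagree, consider two vectors that point in the same direction but differ in magnitude (the numbers are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

euclidean = np.linalg.norm(a - b)                           # sensitive to magnitude
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # depends only on the angle

print(round(euclidean, 2))  # 3.74: "far apart" by straight-line distance
print(round(cosine, 2))     # 1.0: perfectly aligned by cosine similarity
```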
Related Terms
- Neural Search: The modern, high-dimensional application of the kNN concept to text and data retrieval.
- Vector Embedding: The data representation format on which kNN and vector search operate.
- Distance Metric: The function (like Euclidean or Cosine) used to measure proximity between neighbors.