Logits are the raw, unnormalized prediction scores produced by the final linear layer of a Neural Network (such as a Large Language Model (LLM)) before they are converted into probabilities. In a Classification context there is one logit per class; for an LLM predicting the next token, there is one logit per token in the model's vocabulary.
Because they are raw scores, logits can be any real number (positive or negative). They represent the model’s confidence in each possible class based on the input, but they are not directly interpretable as probabilities until they are passed through a softmax function.
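As a minimal sketch of this point, the example below (the logit values are made up for illustration) shows raw scores of either sign being converted into a valid probability distribution by softmax:

```python
import math

# Hypothetical raw logits for a 4-class problem: any real number is allowed.
logits = [2.0, -1.0, 0.5, -3.0]

def softmax(scores):
    """Convert raw logits into a probability distribution."""
    # Subtracting the max before exponentiating improves numerical
    # stability and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Every entry is now positive and the entries sum to 1, so they can be
# read as the model's probability for each class.
```

Note that the ordering is preserved: the largest logit always maps to the largest probability, which is why greedy decoding can equivalently pick the argmax of the logits.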
Context: Relation to LLMs and Next-Token Prediction
In the Transformer Architecture that underpins most modern LLMs, the logits are the critical final output used for both Training and Inference (generation).
- The Final Step of Generation: When an LLM is tasked with predicting the next Token (word or sub-word), the model’s internal layers process the input and output a final hidden-state vector for the current position. This vector is then multiplied by the model’s unembedding matrix (a matrix mapping the hidden dimension to the vocabulary size) to produce the vector of logits, one score per token in the vocabulary.
- Transition to Probability: The logit vector is then passed through the Softmax Function to create the final probability distribution $P$ over the vocabulary: $$P_i = \frac{e^{\text{logit}_i}}{\sum_{j} e^{\text{logit}_j}}$$ This ensures that all probabilities are positive and sum to 1. The token corresponding to the highest probability is typically selected (a process called greedy decoding), although sampling techniques may be used to introduce creativity.
- Training with Logits: During Training, the Loss Function (typically Cross-Entropy Loss) operates directly on the logits and the true labels (the next word in the original text). This is done because calculating the loss directly from the logits is often more numerically stable than calculating it from the final softmax probabilities.
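The pipeline above can be sketched end to end. The dimensions and weights below are toy values chosen for illustration; the point is the shape of the computation (unembedding projection to logits, then a cross-entropy loss computed directly from the logits via the log-sum-exp trick rather than by exponentiating first):

```python
import math

# Toy dimensions (assumptions for illustration): hidden size 3, vocab size 4.
hidden = [0.2, -0.4, 0.9]        # final hidden state for one position
W_unembed = [                    # vocab_size x hidden_size unembedding matrix
    [ 1.0,  0.0,  0.5],
    [-0.5,  1.0,  0.0],
    [ 0.0, -1.0,  1.0],
    [ 0.5,  0.5, -0.5],
]

# Logits = unembedding matrix applied to the hidden state:
# one raw score per token in the vocabulary.
logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_unembed]

def cross_entropy_from_logits(logits, target):
    """Numerically stable cross-entropy: -log softmax(logits)[target].

    Computed via the log-sum-exp trick on the logits directly, which
    avoids the overflow/underflow risk of exponentiating large logits
    and then taking a log of the resulting probability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]

# Loss when the true next token is vocabulary index 2.
loss = cross_entropy_from_logits(logits, target=2)
```

This is the same quantity as taking `-log` of the softmax probability of the target token, but computed in a form that stays stable for large-magnitude logits.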
Logits and Sampling Control
In the LLM Inference phase, logits are the key control point for influencing the model’s output quality and creativity:
- Temperature Parameter: The Temperature Hyperparameter divides the logits before the softmax function is run. A higher temperature shrinks the differences between the logits, making the probability distribution flatter and increasing the chances of the model sampling a less likely (more creative) token; a lower temperature widens the gaps and makes the distribution sharper and more deterministic.
- Top-K/Top-P Sampling: These techniques restrict the set of logits considered for sampling. For example, Top-K only allows sampling from the $K$ tokens with the highest logit scores, effectively pruning the low-confidence choices.
By manipulating the logits in these ways, users and developers can tune the balance between the model’s deterministic accuracy and its creative variability.
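Both controls can be sketched in a few lines. The logit values below are hypothetical, and `top_k_sample` is an illustrative helper name, not a standard library function:

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 1.0, -2.0]  # hypothetical next-token logits

# Temperature: divide the logits by T before softmax.
# T > 1 shrinks the gaps between logits -> flatter distribution (more creative);
# T < 1 widens the gaps -> sharper distribution (more deterministic).
def apply_temperature(logits, T):
    return [l / T for l in logits]

sharp = softmax(apply_temperature(logits, 0.5))  # concentrates on the top token
flat  = softmax(apply_temperature(logits, 2.0))  # spreads mass more evenly

# Top-K: keep only the K tokens with the highest logits, renormalize
# their probabilities, and sample from that restricted set.
def top_k_sample(logits, k, rng=random):
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in ranked])
    return rng.choices(ranked, weights=probs, k=1)[0]
```

With `k=2`, for example, only the two highest-logit tokens can ever be sampled, no matter how the remaining probability mass was distributed.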
Related Terms
- Cross-Entropy Loss: The loss function that processes logits during training.
- Temperature: The Hyperparameter applied to logits to control output randomness.
- Inference: The operational phase where logits are converted into probabilities for text generation.