AppearMore by Taptwice Media
Masked Language Modeling (MLM)

Masked Language Modeling (MLM) is a self-supervised Pre-training technique used to train encoder-based Large Language Models (LLMs), most notably BERT. In MLM, a small percentage of the Tokens in the input text are randomly replaced with a special “[MASK]” token. The model is then trained to predict the original identity of the masked tokens from the surrounding context.

MLM forces the model to learn deep, bidirectional linguistic understanding, where the prediction is informed by words both preceding and following the masked word. This is a critical distinction from traditional, unidirectional (left-to-right) language modeling.


Context: Relation to LLMs and Natural Language Understanding (NLU)

MLM is the primary reason why encoder models like BERT excel at Natural Language Understanding (NLU) tasks and power modern search systems.

  • Bidirectional Context: Unlike decoder-only LLMs (like GPT), which are primarily trained for Natural Language Generation (NLG) by predicting the next token sequentially, MLM trains the model to read the entire sentence at once. This ability to integrate information from the future context (words to the right) is crucial for tasks requiring deep semantic understanding, such as:
    • Disambiguation: Correctly determining the meaning of a word based on the full sentence context.
    • Coreference Resolution: Linking pronouns (e.g., “he,” “it”) back to the correct named entities.
  • Neural Search and Relevance: The encoder models used in Neural Search (e.g., the ranking models used by major search engines) are typically pre-trained with MLM or close variants of it. This gives them the ability to encode the complex Semantics of both a document and a query into accurate Vector Embeddings, improving Relevance in retrieval.
  • Self-Supervised Learning: MLM is an elegant example of self-supervised learning. The training data is created automatically from the existing corpus by masking tokens; no human labeling is required, allowing the model to be trained on the vast, unfiltered scale of the internet.
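The self-supervised setup described above can be sketched in a few lines: the labels are derived mechanically from the corpus itself, with no annotation step. The function name, token strings, and masking rate below are illustrative choices, not any particular library's API.

```python
import random

def make_mlm_example(tokens, mask_rate=0.15, seed=0):
    """Create a (masked input, labels) pair from raw tokens.

    Labels come for free: the original token at each masked
    position is the prediction target; no human labeling needed.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not scored
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = "[MASK]"  # hide the token from the model
            labels[i] = tok       # the corpus supplies the answer
    return masked, labels

inputs, labels = make_mlm_example(
    "the cat sat on the mat".split(), mask_rate=0.5, seed=1)
```

In a real pipeline the tokens would be integer IDs from a tokenizer vocabulary, but the principle is identical: the raw text is both the input and the supervision signal.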

MLM Implementation

The MLM process involves three key steps:

  1. Masking: Typically, 15% of the input tokens are selected for masking.
  2. Noise Mitigation: To prevent the model from simply learning to ignore the [MASK] token, the chosen tokens are handled in a specific way (usually an 80-10-10 split):
    • 80% are replaced with the [MASK] token.
    • 10% are replaced with a random token from the vocabulary.
    • 10% are left unchanged.
  3. Prediction: The output layer of the model attempts to predict the original word for all masked positions using a softmax over the entire vocabulary. The Loss Function (usually Cross-Entropy Loss) is only calculated over the masked positions.
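The three steps above can be combined into a minimal masking routine in the style of BERT's original recipe. The toy vocabulary and the sentinel label for "do not score this position" are illustrative conventions here, not a specific library's API (though deep-learning frameworks commonly use an ignore-index of -100 for the same purpose).

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "[MASK]"]
IGNORE = -100  # sentinel: loss is not computed at this position

def mask_tokens(tokens, vocab=VOCAB, mask_rate=0.15, rng=None):
    """BERT-style masking: select ~mask_rate of positions, then
    apply the 80/10/10 split. Loss is later computed only where
    the label is not IGNORE (i.e., only at selected positions)."""
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                       # not selected, never scored
        labels[i] = tok                    # model must recover original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with random token
        # else: 10% left unchanged, but still predicted
    return inputs, labels
```

Restricting the Cross-Entropy Loss to the selected positions (the non-IGNORE labels) is what makes step 3 cheap: the model produces a distribution over the vocabulary at every position, but only the masked positions contribute to the gradient.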

MLM and the Transformer Architecture

MLM is only possible because the Attention Mechanism in the Transformer Architecture can process all tokens simultaneously. This allows the model to look forward and backward in the sentence, a capability that older Recurrent Neural Network (RNN) architectures struggled with.
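The difference can be made concrete with attention masks: a causal (decoder-style) mask hides future positions, while the full mask used by MLM encoders lets every position attend to every other. This is a toy sketch of the mask shapes only, not any particular framework's API.

```python
def causal_mask(n):
    """Decoder-style: position i may attend only to positions j <= i,
    so a left-to-right model never sees words to its right."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n):
    """Encoder-style (MLM): every position attends to the whole
    sentence, including context to the right of a masked word."""
    return [[1] * n for _ in range(n)]
```

For a masked word in mid-sentence, the full mask is what lets the prediction draw on both the preceding and the following words, which the causal mask forbids by construction.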

