Author: Nicolas Boizard
When building AI models that understand and process text—like virtual assistants or search engines—how we pretrain them is crucial. One common approach is to use encoder models trained with a technique called Masked Language Modeling (MLM). But recent research is shaking things up, showing that another strategy, Causal Language Modeling (CLM)—usually used in decoder models like ChatGPT—might be just as powerful, or even better, in some cases.
This blog breaks down the differences between encoder and decoder architectures, explains what MLM and CLM are, and summarizes what our latest paper reveals about how these strategies can be combined for better performance.
1. Encoder vs Decoder: What’s the Difference?
In natural language processing (NLP), there are two main types of architectures:
- Encoders: These models take in a full sentence and learn a rich internal representation of its meaning. They’re great at understanding text and are typically used for tasks like classification and information retrieval.
- Decoders: These models generate text, one word at a time, based on the previous words. They are used for generative tasks like writing stories or answering questions. But because they “think out loud” word by word, they also learn good representations of language in the process.
🧠 Pros and Cons
| Architecture | Pros | Cons |
|---|---|---|
| Encoder | Strong at understanding and reasoning over full input | Needs special objectives like MLM |
| Decoder | Naturally suited for generation; scalable | Not traditionally used for "understanding" tasks |
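To make this difference concrete, here is a minimal sketch (in PyTorch, purely for illustration) of the two attention patterns: an encoder lets every token attend to every other token, while a decoder applies a causal mask so each token only sees itself and earlier tokens.

```python
import torch

seq_len = 5

# Encoder-style attention: every position is visible to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# Decoder-style attention: lower-triangular, position t sees only tokens 0..t.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```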
2. What Are MLM and CLM?
Let’s simplify:
🔍 Masked Language Modeling (MLM)
MLM teaches a model to predict missing words in a sentence. For example:
“The cat sat on the [MASK].”
The model learns to guess “mat”.
This forces the model to understand the sentence structure and meaning.
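As a quick illustration (not the training setup from our paper), an off-the-shelf MLM-pretrained encoder such as `bert-base-uncased` can fill in the blank from the example above using the Hugging Face `fill-mask` pipeline:

```python
from transformers import pipeline

# Load an encoder pretrained with MLM and ask it to fill in the masked word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Print the top-3 predicted tokens and their probabilities.
for pred in fill_mask("The cat sat on the [MASK].", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```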
🔁 Causal Language Modeling (CLM)
CLM teaches a model to predict the next word, given the previous ones. For example:
“The cat sat on the” → model predicts: “mat”.
This is how models like GPT-3 are trained. It feels more “natural” but is traditionally used for generation, not understanding.
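Again purely as an illustration, a decoder pretrained with CLM (here the small `gpt2` checkpoint) continues the same prompt token by token:

```python
from transformers import pipeline

# Load a decoder pretrained with CLM and greedily continue the prompt.
generate = pipeline("text-generation", model="gpt2")

out = generate("The cat sat on the", max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```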
3. What Our Research Found: CLM + MLM Beats MLM Alone
Traditionally, encoder models are trained only with MLM. But is that the best approach?
We tested this by training over 30 models, using equal data and compute, across a wide range of tasks. We compared:
- Pretraining only with MLM
- Pretraining only with CLM
- Hybrid approaches: mixing CLM and MLM in different proportions (see the sketch after this list)
- Two-stage setups: starting with CLM, then switching to MLM
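To make the hybrid idea concrete, here is a minimal PyTorch sketch of a mixed objective. The `model(ids, causal=...)` interface, the weighting scheme, and the masking details are illustrative assumptions, not our actual training code:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(model, batch, mlm_weight=0.5, mask_id=103, mask_prob=0.15):
    """Weighted mix of a causal (next-token) loss and a masked-token loss."""
    ids = batch["input_ids"]                          # (batch, seq_len)

    # CLM term: predict token t+1 from tokens <= t under a causal mask.
    clm_logits = model(ids, causal=True)              # (batch, seq_len, vocab)
    clm_loss = F.cross_entropy(
        clm_logits[:, :-1].reshape(-1, clm_logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )

    # MLM term: corrupt ~15% of tokens and predict only those positions.
    corrupted = ids.clone()
    is_masked = torch.rand_like(ids, dtype=torch.float) < mask_prob
    corrupted[is_masked] = mask_id
    mlm_logits = model(corrupted, causal=False)
    labels = ids.clone()
    labels[~is_masked] = -100                         # ignore unmasked positions
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

    return (1 - mlm_weight) * clm_loss + mlm_weight * mlm_loss
```

Setting `mlm_weight` to 0.5 or 0.75 corresponds to the 50-50 and 25-75 mixes discussed below.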
🔬 Key Findings:
- Hybrid objectives outperform pure MLM: Whether it’s 50-50 or 25-75, combining CLM and MLM leads to better results in downstream tasks.
- CLM-first training is powerful: Pretraining with CLM first, then switching to MLM (a strategy known as continued pretraining) often leads to better final performance, even better than training with MLM from scratch.
- Better encoders from decoders: Surprisingly, starting from a decoder model trained with CLM and then adapting it with MLM creates a better encoder than one trained only with MLM.
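This last finding is easiest to picture as a weight-reuse recipe: pretrain with a causal mask, then drop the mask and keep training with MLM on the same weights. Below is a hedged, toy-scale sketch of that idea; the `TinyTransformer` class and both (elided) training loops are hypothetical stand-ins, not our released code:

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Toy transformer that can run with either causal or bidirectional attention."""
    def __init__(self, vocab_size=1000, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.causal = True  # start in decoder mode

    def forward(self, ids):
        attn_mask = None
        if self.causal:
            n = ids.size(1)
            # True above the diagonal = future positions are hidden.
            attn_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        return self.lm_head(self.blocks(self.embed(ids), mask=attn_mask))

model = TinyTransformer()
# Stage 1: CLM pretraining with model.causal = True (next-token prediction).
# ... causal-LM training loop ...

# Stage 2: reuse the same weights as a bidirectional encoder and continue
# pretraining with an MLM objective ([MASK]-style corruption).
model.causal = False
# ... masked-LM continued-pretraining loop ...
```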
These results suggest that the traditional approach of pretraining encoders with MLM alone may be suboptimal, and that mixing in CLM can lead to stronger encoders at the same or lower training cost.
4. Open Source and What’s Next
We’re releasing everything to the community to foster transparency and reproducibility: the pretrained models, the code, and the paper.
This opens the door to more experiments, especially in areas like Vision-Language Models, many of which rely on decoder-based architectures. The blending of CLM and MLM might enhance their ability to learn rich and useful representations.
🚀 Conclusion
Pretraining with MLM alone is not enough. Our work shows that incorporating CLM—especially at the beginning of training—can lead to significantly better results across tasks. This challenges the traditional view of how encoder models should be trained and points to more flexible and efficient strategies.
We’re excited to see how the community builds on this. Try the models, read the paper, and explore the code—we’re open-sourcing it all.