Unveiling the Potential of Synthetic Text Data: Evaluating LLM-Generated Benchmarks for NLP Tasks

Written by: Gaurav Maheshwari, Dmitry Ivanov, Theo Johannet, Kevin El Haddad

In the ever-evolving field of Natural Language Processing (NLP), one challenge stands out: the need for high-quality, annotated data. In industries like healthcare 🏥 and banking 💳, privacy regulations and the high cost of manual annotation make acquiring such datasets daunting.

Large Language Models (LLMs) 🌟 have emerged as game-changers, excelling in zero-shot and few-shot learning. But with dozens of LLMs available, each with unique strengths, a new dilemma arises: how can practitioners choose the best LLM for a task when test data is limited or unavailable?

Can synthetic data predict the zero-shot performance of LLMs across different NLP tasks?

Our study explores the use of synthetic data—generated by LLMs—as a benchmark for evaluating model performance. If synthetic data can reliably act as a proxy for real-world benchmarks, practitioners could identify the most suitable LLM for their needs—even without access to annotated test data. 

But how reliable are these datasets? Can they truly mirror real-world data and serve as dependable benchmarks? To investigate, we conducted experiments using six datasets across three key NLP tasks:

  • Intent Detection 🤔
  • Text Similarity 🔗
  • Named Entity Recognition (NER) 🔤

To assess the representativeness of synthetic data, we examined two critical factors:

  1. Absolute Performance: Does a model’s performance on synthetic data align with its performance on real data?
  2. Relative Performance: Do model rankings remain consistent between synthetic and real datasets?
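As a concrete illustration of these two checks, here is a minimal Python sketch, assuming each candidate model has already been scored on both the real and the synthetic benchmark. The scores and the specific statistics (mean absolute gap, Spearman rank correlation) are illustrative choices, not necessarily the exact metrics used in the study.

```python
# Minimal sketch: compare per-model scores on a real test set and on an
# LLM-generated synthetic set, checking (1) absolute agreement via the
# mean score gap and (2) relative agreement via Spearman rank correlation.
from scipy.stats import spearmanr

# Hypothetical accuracy scores, for illustration only.
real_scores      = {"model_a": 0.91, "model_b": 0.84, "model_c": 0.78}
synthetic_scores = {"model_a": 0.88, "model_b": 0.86, "model_c": 0.75}

models = sorted(real_scores)
real  = [real_scores[m] for m in models]
synth = [synthetic_scores[m] for m in models]

# Absolute performance: how far synthetic scores drift from real ones.
mean_gap = sum(abs(r - s) for r, s in zip(real, synth)) / len(models)

# Relative performance: do both benchmarks rank the models the same way?
rank_corr, _ = spearmanr(real, synth)

print(f"mean absolute gap: {mean_gap:.3f}")
print(f"Spearman rank correlation: {rank_corr:.3f}")
```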

In addition, we introduced a novel metric, the bias factor, to quantify the bias that can arise when the same LLM is used both to generate a benchmark and to be evaluated on it.
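The precise formulation of the bias factor is given in the paper; the sketch below only illustrates the underlying idea, under the assumption that self-preference can be read as the gap between a model's score on its own synthetic benchmark and its average score on benchmarks generated by other LLMs. All names and numbers are hypothetical.

```python
# Rough illustration only -- not the paper's exact bias-factor definition.
# Idea: a model is "biased" toward its own synthetic data if it scores
# noticeably higher on the benchmark it generated than on benchmarks
# generated by other LLMs.

# scores[generator][evaluated_model] = score of evaluated_model on the
# synthetic benchmark produced by generator (hypothetical numbers).
scores = {
    "llm_a": {"llm_a": 0.90, "llm_b": 0.82},
    "llm_b": {"llm_a": 0.85, "llm_b": 0.88},
}

def self_preference(model: str) -> float:
    """Score on own synthetic data minus mean score on others' data."""
    own = scores[model][model]
    others = [scores[g][model] for g in scores if g != model]
    return own - sum(others) / len(others)

for m in scores:
    print(m, f"self-preference: {self_preference(m):+.3f}")
```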

Key Findings: Insights from the Study 💡

Our experiments revealed fascinating insights into the utility of synthetic datasets:

  1. 📈 Task Complexity Matters
    • Synthetic data is highly effective as a benchmark for simpler tasks like intent detection, offering reliable performance predictions.
    • For complex tasks like NER, its reliability diminishes, as synthetic examples struggle to capture the variety and nuance of real-world data.
  2. 🤝 No Single LLM Dominates
    • Different LLMs excel in generating synthetic data for different tasks.
    • Combining synthetic data from multiple LLMs results in more robust benchmarks, mitigating individual model limitations.
  3. 🎯 Bias in Smaller Models
    • Smaller LLMs tend to favor their own synthetic datasets, performing better on the data they generate themselves.
    • Larger LLMs, on the other hand, show minimal bias, making them more dependable for both data generation and task evaluation.

What This Means for Practitioners and Researchers 🌍

Our findings highlight both the promise and limitations of synthetic data as a benchmarking tool:

  • For simpler tasks, synthetic datasets offer a cost-effective, privacy-conscious alternative to real-world data and can serve as a reliable benchmark.
  • For complex scenarios, practitioners must exercise caution and consider generating synthetic data using multiple LLMs to improve reliability (a minimal sketch of this idea follows below).
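One simple way to put this advice into practice is to pool synthetic examples produced by several LLMs into a single evaluation set. The snippet below is a hypothetical sketch, assuming the per-generator datasets already exist; the generator names, examples, and sampling strategy are placeholders, not the study's actual pipeline.

```python
import random

# Hypothetical sketch: build a combined synthetic benchmark by sampling
# evenly from datasets generated by several different LLMs.
# Each entry is (text, label); the generator names are placeholders.
synthetic_sets = {
    "llm_a": [("Book me a flight to Paris", "book_flight")],
    "llm_b": [("What's my account balance?", "check_balance")],
    "llm_c": [("Cancel my last transfer", "cancel_transfer")],
}

def combine_benchmarks(sets, per_generator=100, seed=0):
    """Sample the same number of examples from each generator's data."""
    rng = random.Random(seed)
    combined = []
    for name, examples in sets.items():
        k = min(per_generator, len(examples))
        combined.extend(rng.sample(examples, k))
    rng.shuffle(combined)
    return combined

benchmark = combine_benchmarks(synthetic_sets, per_generator=1)
print(len(benchmark), "examples in the combined benchmark")
```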
