
Unveiling the Potential of Synthetic Text Data: Evaluating LLM-Generated Benchmarks for NLP Tasks

Written by: Gaurav Maheshwari, Dmitry Ivanov, Theo Johannet, Kevin El Haddad

In the ever-evolving field of Natural Language Processing (NLP), one challenge stands out: the need for high-quality, annotated data. In industries like healthcare 🏥 and banking 💳, privacy regulations and the high cost of manual annotation make acquiring such datasets daunting.

Large Language Models (LLMs) 🌟 have emerged as game-changers, excelling in zero-shot and few-shot learning. But with dozens of LLMs available, each with unique strengths, a new dilemma arises: how can practitioners choose the best LLM for a task when test data is limited or unavailable?

Can synthetic data predict the zero-shot performance of LLMs across different NLP tasks?

Our study explores the use of synthetic data—generated by LLMs—as a benchmark for evaluating model performance. If synthetic data can reliably act as a proxy for real-world benchmarks, practitioners could identify the most suitable LLM for their needs—even without access to annotated test data. 
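
To make this concrete, here is a minimal sketch of how such a synthetic benchmark could be produced for intent detection. The generate() helper, the prompt wording, and the intent labels are illustrative assumptions rather than the exact setup of our study; generate() is simply a placeholder for whichever LLM client you use.

```python
import json

# Hypothetical intent labels for a synthetic intent-detection benchmark.
INTENTS = ["book_flight", "cancel_booking", "check_balance", "transfer_money"]

PROMPT_TEMPLATE = (
    "Write one short user utterance whose intent is '{intent}'. "
    'Return only JSON of the form {{"text": "...", "label": "{intent}"}}.'
)


def generate(prompt: str) -> str:
    """Placeholder for a call to the LLM of your choice."""
    raise NotImplementedError("Wire this up to your LLM client.")


def build_synthetic_benchmark(examples_per_intent: int = 50) -> list[dict]:
    """Create labelled synthetic examples by prompting the generator LLM."""
    benchmark = []
    for intent in INTENTS:
        for _ in range(examples_per_intent):
            raw = generate(PROMPT_TEMPLATE.format(intent=intent))
            benchmark.append(json.loads(raw))  # expects {"text": ..., "label": ...}
    return benchmark
```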

But how reliable are these datasets? Can they truly mirror real-world data and serve as dependable benchmarks? To investigate, we conducted experiments using six datasets across three key NLP tasks:

  • Intent Detection 🤔
  • Text Similarity 🔗
  • Named Entity Recognition (NER) 🔤

To assess the representativeness of synthetic data, we examined two critical factors:

  1. Absolute Performance: Does a model’s performance on synthetic data align with its performance on real data?
  2. Relative Performance: Do model rankings remain consistent between synthetic and real datasets? (A minimal sketch of both checks follows this list.)
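
To make these two checks concrete, here is a minimal sketch, assuming we already have accuracy scores for the same set of models on a real and a synthetic benchmark. The numbers below are placeholders, not results from the study.

```python
from scipy.stats import spearmanr

# Placeholder scores for the same models on each benchmark (illustrative only).
real_scores = {"model_a": 0.91, "model_b": 0.85, "model_c": 0.78}
synthetic_scores = {"model_a": 0.88, "model_b": 0.86, "model_c": 0.74}

models = sorted(real_scores)
real = [real_scores[m] for m in models]
synth = [synthetic_scores[m] for m in models]

# 1. Absolute performance: how far apart are the scores themselves?
mean_gap = sum(abs(r - s) for r, s in zip(real, synth)) / len(models)

# 2. Relative performance: do both benchmarks rank the models the same way?
rho, p_value = spearmanr(real, synth)

print(f"Mean absolute score gap: {mean_gap:.3f}")
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```

A rank correlation close to 1 means the synthetic benchmark preserves the real ranking, even if the absolute scores drift.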

In addition, we introduced a novel metric, the bias factor, to quantify the bias that can arise when the same LLM is used both to generate the benchmark and as the model being evaluated on it.
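
Purely as an illustration (the exact formulation in the paper may differ), one simple way to express such self-preference is to compare a model's score on the benchmark it generated itself with its average score on benchmarks generated by the other LLMs. The scores below are placeholders.

```python
def self_bias(scores: dict[str, dict[str, float]], model: str) -> float:
    """Illustrative self-preference ratio (not the paper's exact bias factor).

    scores[generator][evaluated] is the evaluated model's score on the
    benchmark produced by generator. A value above 1.0 means the model does
    better on its own synthetic data than on data from other generators.
    """
    own = scores[model][model]
    others = [scores[g][model] for g in scores if g != model]
    return own / (sum(others) / len(others))


# Placeholder scores, not results from the study.
scores = {
    "small_llm": {"small_llm": 0.82, "large_llm": 0.88},
    "large_llm": {"small_llm": 0.74, "large_llm": 0.87},
}
print(self_bias(scores, "small_llm"))  # > 1.0: favours its own data
print(self_bias(scores, "large_llm"))  # ~ 1.0: little self-preference
```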

Key Findings: Insights from the Study 💡

Our experiments revealed fascinating insights into the utility of synthetic datasets:

  1. 📈 Task Complexity Matters
    • Synthetic data is highly effective as a benchmark for simpler tasks like intent detection, offering reliable performance predictions.
    • For more complex tasks like NER, its reliability diminishes, as synthetic examples fail to capture the intricacies of real-world scenarios.
  2. 🤝 No Single LLM Dominates
    • Different LLMs excel in generating synthetic data for different tasks.
    • Combining synthetic data from multiple LLMs results in more robust benchmarks, mitigating individual model limitations.
  3. 🎯 Bias in Smaller Models
    • Smaller LLMs tend to favor their own synthetic datasets, performing better on the data they generate themselves.
    • Larger LLMs, on the other hand, show minimal bias, making them more dependable for both data generation and task evaluation.

What This Means for Practitioners and Researchers 🌍

Our findings highlight both the promise and limitations of synthetic data as a benchmarking tool:

  • For simpler tasks, synthetic datasets offer a cost-effective, privacy-conscious alternative to real-world data and can serve as a reliable benchmark on their own.
  • For more complex scenarios, practitioners should exercise caution and consider pooling synthetic data generated by multiple LLMs to improve reliability (see the sketch after this list).
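
A minimal sketch of that pooling step, assuming every generator's output follows the same {"text", "label"} example format used earlier; each example is tagged with its generator so per-generator breakdowns (such as the bias check above) remain possible.

```python
import random


def pool_benchmarks(per_generator: dict[str, list[dict]], seed: int = 0) -> list[dict]:
    """Merge synthetic benchmarks from several generator LLMs into one test set."""
    pooled = [
        dict(example, generator=name)  # tag each example with its generator
        for name, examples in per_generator.items()
        for example in examples
    ]
    random.Random(seed).shuffle(pooled)  # avoid ordering artefacts by generator
    return pooled
```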
