Written by: Gaurav Maheshwari, Dmitry Ivanov, Theo Johannet, Kevin El Haddad
In the ever-evolving field of Natural Language Processing (NLP), one challenge stands out: the need for high-quality, annotated data. In industries like healthcare 🏥 and banking 💳, privacy regulations and the high cost of manual annotation make acquiring such datasets daunting.
Large Language Models (LLMs) 🌟 have emerged as game-changers, excelling in zero-shot and few-shot learning. But with dozens of LLMs available, each with unique strengths, a new dilemma arises: how can practitioners choose the best LLM for a task when test data is limited or unavailable?
Can synthetic data predict the zero-shot performance of LLMs across different NLP tasks?
Our study explores the use of synthetic data—generated by LLMs—as a benchmark for evaluating model performance. If synthetic data can reliably act as a proxy for real-world benchmarks, practitioners could identify the most suitable LLM for their needs—even without access to annotated test data.
But how reliable are these datasets? Can they truly mirror real-world data and serve as dependable benchmarks? To investigate, we conducted experiments using six datasets across three key NLP tasks:
- Intent Detection 🤔
- Text Similarity 🔗
- Named Entity Recognition (NER) 🔤
To assess the representativeness of synthetic data, we examined two critical factors:
- Absolute Performance: Does a model’s performance on synthetic data align with its performance on real data?
- Relative Performance: Do model rankings remain consistent between synthetic and real datasets?
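To make these two checks concrete, here is a minimal sketch (in Python, with made-up scores rather than results from our study) of how a practitioner might compare candidate LLMs on a synthetic and a real benchmark: the mean absolute score gap captures the first check, and a Spearman rank correlation captures the second.

```python
from scipy.stats import spearmanr

# Hypothetical accuracy scores for four candidate LLMs (illustrative only,
# not numbers from the study): each model is evaluated once on a synthetic
# benchmark and once on the corresponding real, human-annotated test set.
synthetic_scores = {"model_a": 0.91, "model_b": 0.85, "model_c": 0.78, "model_d": 0.72}
real_scores = {"model_a": 0.88, "model_b": 0.84, "model_c": 0.74, "model_d": 0.70}

models = sorted(synthetic_scores)

# Absolute performance: how far each model's synthetic score drifts from its real score.
gaps = [abs(synthetic_scores[m] - real_scores[m]) for m in models]
print("Mean absolute gap:", sum(gaps) / len(gaps))

# Relative performance: do the two benchmarks rank the models the same way?
rho, _ = spearmanr([synthetic_scores[m] for m in models],
                   [real_scores[m] for m in models])
print("Spearman rank correlation:", rho)
```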
In addition, we introduced a novel metric, the bias factor, to quantify the potential bias introduced when the same LLM is used both to generate the benchmark and to perform the task being evaluated.
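The paper gives the exact formulation; as a rough illustration, one way to capture this kind of self-preference is to compare a model's score on the benchmark it generated itself against its average score on benchmarks generated by the other LLMs. The sketch below follows that assumption and is not the paper's definition.

```python
def bias_factor(scores: dict[str, dict[str, float]], model: str) -> float:
    """Rough self-preference measure (an illustrative simplification, not the
    paper's exact definition).

    `scores[generator][evaluated]` is the score of `evaluated` on the
    benchmark generated by `generator`. A value above 1.0 suggests the model
    performs better on its own synthetic data than on data from other LLMs.
    """
    own_score = scores[model][model]
    other_scores = [scores[gen][model] for gen in scores if gen != model]
    return own_score / (sum(other_scores) / len(other_scores))
```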
Key Findings: Insights from the Study 💡
Our experiments revealed fascinating insights into the utility of synthetic datasets:
- 📈 Task Complexity Matters
  - Synthetic data is highly effective as a benchmark for simpler tasks like intent detection, offering reliable performance predictions.
  - For complex tasks like NER, its reliability diminishes, as generated examples do not fully capture the intricacies of real-world data.
- 🤝 No Single LLM Dominates
  - Different LLMs excel at generating synthetic data for different tasks.
  - Combining synthetic data from multiple LLMs yields more robust benchmarks, mitigating the limitations of any individual model (see the pooling sketch after this list).
- 🎯 Bias in Smaller Models
  - Smaller LLMs tend to favor their own synthetic datasets, performing better on the data they generate themselves.
  - Larger LLMs, on the other hand, show minimal bias, making them more dependable for both data generation and task evaluation.
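To illustrate the pooling idea from the findings above (the sampling scheme and field names here are assumptions for the sketch, not a prescribed recipe), a combined benchmark can draw an equal number of examples from each generator's synthetic dataset so that no single LLM's artifacts dominate the evaluation:

```python
import random

def build_pooled_benchmark(synthetic_sets: dict[str, list[dict]],
                           per_generator: int,
                           seed: int = 42) -> list[dict]:
    """Combine synthetic examples from several generator LLMs into a single
    benchmark, sampling the same number from each generator so that no one
    model's quirks dominate."""
    rng = random.Random(seed)
    pooled = []
    for generator, examples in synthetic_sets.items():
        for example in rng.sample(examples, min(per_generator, len(examples))):
            pooled.append({**example, "generator": generator})  # keep provenance
    rng.shuffle(pooled)
    return pooled
```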
What This Means for Practitioners and Researchers 🌍
Our findings highlight both the promise and limitations of synthetic data as a benchmarking tool:
- For simpler tasks, synthetic datasets offer a cost-effective, privacy-conscious alternative to real-world data and can serve as a reliable benchmark.
- For complex scenarios, practitioners must exercise caution and consider generating synthetic data using multiple LLMs to improve reliability.