
Curate–Train–Refine: A Closed-Loop Agentic Framework for Zero Shot Classification

In the rapidly evolving landscape of Customer Experience (CX), businesses face a critical infrastructure dilemma. Large Language Models (LLMs) have revolutionized the field with their ability to understand complex requests “out of the box” (zero-shot), but they come with significant baggage: high latency, massive compute costs, and the operational complexity of managing giant models. On the flip side, compact models like BERT or SetFit offer the speed and cost efficiency required for real-time applications, yet they traditionally fail without thousands of human-labeled training examples.

This tradeoff is further complicated by strict data sovereignty and privacy requirements, particularly for European enterprises. Sending sensitive customer interactions to external LLM providers for real-time classification is often a compliance non-starter. Conversely, training a local model traditionally requires accessing that same sensitive data for manual labeling, creating a privacy bottleneck. Organizations are frequently left choosing between a smart model they can’t secure and a secure model they can’t easily train.

At Diabolocom Research, we propose a framework that eliminates this compromise. In our latest paper, “Curate–Train–Refine: A Closed-Loop Agentic Framework for Zero Shot Classification,” we introduce an agentic approach that leverages the intelligence of LLMs without the privacy risks of live deployment. By using an LLM offline as a “data curator” rather than an inference engine, we can synthesize high-quality training data to build specialized, compact models. This method delivers the zero-shot intelligence of a giant model with the speed, sovereignty, and efficiency of a lightweight local classifier.

The Approach: A Self-Correcting Agentic Loop

To bridge the gap between heavy LLMs and efficient local models, we developed an Agentic Closed-Loop Framework. In traditional synthetic data approaches, you prompt an LLM once, generate a static dataset, and hope for the best. Our method differs fundamentally: we treat data curation as an active, iterative process in which an AI Agent continuously monitors and improves training quality based on real feedback. The system doesn’t just “generate” data; it reasons about why the model makes errors and how to correct them.

How the Agentic System Works

We operationalize this through a cyclic process we call Curate–Train–Refine. More specifically, we initialize the agent with the classifier model as one of its tools. Here is the step-by-step breakdown:

1. Initialization: Setting the Standard

Before training begins, the Agent analyzes the label names and descriptions to generate a provisional validation set. Since we are operating in a “zero-shot” environment with no human data, the agent creates its own diverse, balanced “Gold Standard” to measure progress against.
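
As a rough illustration of this initialization step, the sketch below shows how an agent could bootstrap such a validation set from nothing but label names and descriptions. The label set, prompt wording, model name, and the generate_validation_set helper are assumptions made for the example, not the exact implementation from the paper.

```python
# Illustrative sketch: bootstrapping a provisional "gold standard"
# validation set from label names and descriptions alone.
# The labels, prompt, and model below are placeholder assumptions.
import json

from openai import OpenAI  # any chat-completion client would work

client = OpenAI()

LABELS = {
    "billing": "questions about invoices, charges, or refunds",
    "technical_support": "problems using the product or service",
    "account": "login, password, or profile management requests",
}

def generate_validation_set(n_per_label: int = 20) -> list[dict]:
    """Ask the LLM for diverse, balanced examples of each label."""
    examples = []
    for label, description in LABELS.items():
        prompt = (
            f"Write {n_per_label} short, diverse customer messages that a "
            f"support team would classify as '{label}' ({description}). "
            "Vary tone, length, and phrasing. Return a JSON list of strings."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any capable LLM works
            messages=[{"role": "user", "content": prompt}],
        )
        # Assumes the model returns bare JSON, as requested in the prompt.
        for text in json.loads(response.choices[0].message.content):
            examples.append({"text": text, "label": label})
    return examples

validation_set = generate_validation_set()
```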

2. The Feedback Cycle

Once the standard is set, the agent enters a continuous loop of improvement:

  • Generate (Curate): It synthesizes an initial batch of training examples focusing on linguistic diversity to ensure the model sees various ways a user might phrase a request.
  • Train & Evaluate: The target classifier (e.g., SetFit) is trained on this batch and evaluated against the validation set. Crucially, we don’t just look at the accuracy score: we also capture detailed diagnostics, including the specific examples the classifier got wrong (see the sketch after this list).
  • Analyze (The Agentic Core): The Agent then reads the “error logs” like a human data scientist would. It identifies patterns in the failures:
    • Is the model struggling with negation?
    • Is it confusing two specific labels?
    • Did it fail on long, complex sentences?
  • Refine: Based on this analysis, the Agent synthesizes targeted hard examples to address these specific weaknesses. It deliberately creates data that forces the model to learn the boundaries it previously missed.
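
To make the Train & Evaluate step concrete, here is a hedged sketch using the SetFit library. The backbone checkpoint, hyperparameters, and the shape of the diagnostics are illustrative assumptions; the framework in the paper likely records richer error metadata.

```python
# Hedged sketch of the Train & Evaluate step: fit a compact SetFit
# classifier on the curated batch, then collect the concrete validation
# examples it gets wrong so the agent can reason about the failures.
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

def train_and_diagnose(train_examples: list[dict], validation_set: list[dict]):
    # Backbone is an assumption; any ~110M-parameter sentence encoder fits
    # the compact-model setting described in the article.
    model = SetFitModel.from_pretrained(
        "sentence-transformers/paraphrase-mpnet-base-v2"
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(batch_size=16, num_epochs=1),
        train_dataset=Dataset.from_list(train_examples),  # [{"text", "label"}]
        eval_dataset=Dataset.from_list(validation_set),
        column_mapping={"text": "text", "label": "label"},
    )
    trainer.train()

    # Capture detailed diagnostics, not just a single accuracy number.
    texts = [ex["text"] for ex in validation_set]
    gold = [ex["label"] for ex in validation_set]
    preds = model.predict(texts)
    errors = [
        {"text": t, "gold": g, "predicted": str(p)}
        for t, g, p in zip(texts, gold, preds)
        if str(p) != g
    ]
    accuracy = 1.0 - len(errors) / len(gold)
    return model, accuracy, errors
```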

This cycle repeats automatically. In each iteration, the training data becomes increasingly tailored to the model’s blind spots. Once the agent detects that performance has plateaued, the loop terminates. This results in a highly specialized training set that has been dynamically “fitted” to the model architecture. This allows a small, efficient model to achieve performance levels that typically require massive human-labeled datasets.
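
Putting the pieces together, the outer loop could look like the following sketch, reusing LABELS, validation_set, and train_and_diagnose from the snippets above. Here curate_batch and refine_batch are hypothetical helpers standing in for the LLM calls the agent makes (the first produces a diverse seed batch, the second is conditioned on the error log), and the plateau test is a simple placeholder for whatever stopping criterion the agent actually applies.

```python
# Hedged sketch of the Curate-Train-Refine outer loop. curate_batch()
# and refine_batch() are hypothetical helpers wrapping LLM calls; the
# stopping rule is a simple illustrative plateau check.
def curate_train_refine(max_rounds: int = 10, min_gain: float = 0.005):
    train_examples = curate_batch(LABELS)  # Curate: initial diverse batch
    previous_accuracy, best_model = 0.0, None

    for round_idx in range(max_rounds):
        # Train & Evaluate: fit the compact classifier, gather diagnostics.
        model, accuracy, errors = train_and_diagnose(train_examples, validation_set)
        print(f"round {round_idx}: accuracy={accuracy:.3f}, {len(errors)} errors")

        # Terminate once the gain over the previous round has plateaued.
        if round_idx > 0 and accuracy - previous_accuracy < min_gain:
            break
        previous_accuracy, best_model = accuracy, model

        # Analyze & Refine: ask the agent for targeted hard examples that
        # address the observed failures, and fold them into the training set.
        train_examples += refine_batch(errors)

    return best_model
```

In the actual framework the analysis step is richer than this control flow suggests: the agent reasons over error patterns such as negation or label confusion before deciding what to generate next, but the overall shape of the loop is the same.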

The Results: High Performance with Lower Complexity

We evaluated our Curate–Train–Refine framework across four diverse benchmarks, ranging from fine-grained sentiment analysis to complex topic classification. Our goal was to determine if a compact model, trained via an agentic loop, could compete with larger pre-trained models and standard baselines. The findings demonstrate that intelligent data curation can effectively replace the need for massive model parameters or extensive human labeling.

1. The Impact of the Agentic Loop

To understand the value of our iterative process, we compared three distinct approaches:

  • Baseline: A standard classifier trained on available data.
  • +Prompt: A classifier trained on a single batch of static, LLM-generated data.
  • +Manager: Our proposed method, where the agent iteratively analyzes errors and refines the training data.

The results show that while static prompting improves performance over the baseline, the Manager (our agentic loop) consistently delivers the highest accuracy. This advantage is particularly evident on complex tasks. For instance, on the AG News benchmark, which requires distinguishing between nuanced topics like World vs. Business, our method achieved 82.6% accuracy, significantly outperforming the static prompting approach (73.0%). This confirms that for non-trivial tasks, an active feedback loop is essential to capture complex label semantics that a one-off generation often misses.

2. Zero-Shot and Few-Shot Efficiency

We also compared our compact model (110M parameters) against state-of-the-art baselines in both zero-shot and few-shot settings.

  • Zero-Shot Performance: In scenarios with no human-labeled data, our method outperformed GliClass 3.0 on 3 out of 4 datasets. Notably, we achieved these results using a model that is approximately 27% smaller (110M vs. 151M parameters), translating directly to lower inference latency in production.
  • Comparing with Human Data: Perhaps most significantly, our “zero-shot” agent, which utilized no real human examples, often matched or exceeded the performance of standard models trained on real human data (up to 8 examples per class). This indicates that high-quality, synthetic supervision can be a viable and effective alternative to collecting small batches of human-labeled data.

Conclusion

We have introduced a closed-loop framework where an LLM acts as a data curator rather than a predictor, iteratively refining training data to build specialized, compact models. This approach successfully decouples high-level intelligence from deployment, allowing organizations to achieve state-of-the-art accuracy and data privacy without the latency or cost of large models.

Interested in the details?
