LiveMCP-101: What General-Purpose Models Still Can’t Do
A Benchmark Built for Reality
LiveMCP-101, developed by researchers at Duke University and Zoom, includes 101 real-world requests that require coordinated use of multiple tools — web search, file manipulation, math operations, data analysis — through the Model Context Protocol (MCP). Every request underwent a rigorous process of iterative LLM revisions and thorough human reviews to guarantee both realism and complexity. Tasks are categorized into Easy, Medium, and Hard, and require an average of 5.4 tool calls (nearly double those in earlier benchmarks), highlighting the magnitude of the challenge.
Rather than scoring raw outputs, the benchmark evaluates agents against reference execution plans, measuring in real time how closely each model's tool-call sequence adheres to the planned steps in a dynamic environment.
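To make the evaluation idea concrete, here is a minimal Python sketch of plan-adherence scoring: it measures how many steps of a reference plan an agent's tool-call trace recovers, in order. This is an illustration only, not LiveMCP-101's actual evaluator, and every name in it (ToolCall, plan_adherence) is hypothetical.

```python
# Minimal, illustrative sketch of plan-adherence scoring -- NOT LiveMCP-101's
# actual evaluator. It measures how many steps of a reference execution plan
# an agent's tool-call trace recovers, in order (longest common subsequence).
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str     # e.g. "web_search", "file_read", "math_eval"
    args: tuple   # simplified: hashable argument tuple


def plan_adherence(reference: list[ToolCall], trace: list[ToolCall]) -> float:
    """Fraction of reference steps matched, in order, within the agent's trace."""
    m, n = len(reference), len(trace)
    if m == 0:
        return 1.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if reference[i] == trace[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m


# Example: the agent skips a planned file read and repeats a search.
reference = [ToolCall("web_search", ("q1",)),
             ToolCall("file_read", ("report.csv",)),
             ToolCall("math_eval", ("mean",))]
trace = [ToolCall("web_search", ("q1",)),
         ToolCall("web_search", ("q1",)),
         ToolCall("math_eval", ("mean",))]
print(plan_adherence(reference, trace))  # -> 0.666..., 2 of 3 planned steps followed
```

Under this kind of scoring, skipping a planned step or calling tools in the wrong order lowers adherence even when the final answer looks plausible, which is exactly the failure surface the benchmark is designed to expose.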
Results That Hit a Ceiling

The strongest models today score below 60% across LiveMCP-101 tasks. Performance drops sharply when models must orchestrate multiple tools, maintain context across long sequences, or plan multi-step strategies. The analysis identifies recurring failure modes: loss of context as tasks unfold, incomplete or disorganized reasoning, and poor choices about which tools to call — or in what order.
The study also highlights inefficiencies in token usage, with unnecessary repetition and digressions that degrade performance and inflate costs. Notably, most tasks require 3 to 7 tool calls, with some reaching 15, placing significant pressure on a model’s ability to plan and execute a coherent sequence.
What These Findings Tell Us
The takeaway is clear: general-purpose models lack deep contextual grounding in specialized domains. Their broad training brings flexibility but limits depth. Improving reliability means training models on contextualized datasets that incorporate domain-specific vocabulary, realistic scenarios, and operational constraints.
Two levers stand out as essential: specialized training, which exposes the model to domain-relevant data, and strict scope control, which limits what the model can and cannot do. Together, these approaches dramatically improve precision, long-range coherence, and predictable behavior.
When Generalization Falls Short: What Real-World ChatGPT Usage Shows
What People Actually Do with ChatGPT
A FlowingData analysis published in September 2025 shows that ChatGPT usage falls into three categories: Asking (requesting information or advice), Doing (operational tasks like writing or planning), and Expressing (personal expression). About 49% of messages fall under Asking, 40% under Doing, and 11% under Expressing. A third of all Doing-related use cases are work-related.
Another study analyzing 1.1 million ChatGPT messages (OpenAI, Duke, Harvard) reveals that professional usage is declining: work-related messages dropped from 47% in June 2024 to 27% in June 2025. MarketWatch notes that nearly 80% of overall usage focuses on practical advice, information retrieval, and writing.
What This Says About General-Purpose Models
These numbers paint a consistent picture: users rely on ChatGPT primarily for non-specialized needs — general advice, quick information, or simple drafting. Only a small fraction of messages involve coding (4%) or complex professional tasks, confirming that general-purpose models work well for general-purpose needs, but struggle when technical depth or business logic comes into play.
According to The Washington Post, ChatGPT is often viewed as a general assistant rather than a specialized professional tool. The gap between perceived performance (smooth, quick responses) and actual performance in demanding situations continues to grow.
Moving Toward Better-Scoped, Better-Trained Models
These insights reinforce two strategic levers: specialized training grounded in contextualized, domain-specific data, and well-defined scope limitations that reduce error rates by narrowing the model's expected capabilities. Together, they create models that are far more reliable in real-world enterprise settings. Shifting from generalization to specialization is essential to operationalizing AI at scale.
Specialized AI: A Driver of Business Transformation
From Generalization to Specialization
While general-purpose models handle simple tasks well, their performance quickly drops in complex, industry-specific workflows. Diabolocom takes the opposite approach: start with the business, then build the AI.
Our models are trained on real contact center interactions, in multiple languages, annotated to reflect the structure, vocabulary, and intent patterns found in customer conversations. This domain-specific training captures the logic of live exchanges and the operational constraints of customer service — areas where generic LLMs lack grounding.
The architecture merges a robust base model with specialized modules for tasks such as transcription, intent classification, or action detection. Business-grade orchestration ensures the system remains coherent end-to-end, delivering responses that are reliable, contextualized, and tightly controlled.
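As a rough illustration of this pattern, and not Diabolocom's actual implementation, the sketch below shows an orchestrator that routes each task to a registered specialized module and refuses anything outside its scope; every module and task name here is a hypothetical stand-in.

```python
# Illustrative sketch of the "base model + specialized modules + orchestration"
# pattern described above. This is NOT Diabolocom's implementation; every name
# (Orchestrator, the task labels, the lambda modules) is a hypothetical stand-in.
from typing import Callable


class Orchestrator:
    """Routes each step of a customer interaction to a dedicated module,
    so each component only handles the task it is scoped and trained for."""

    def __init__(self) -> None:
        self.modules: dict[str, Callable[[str], str]] = {}

    def register(self, task: str, module: Callable[[str], str]) -> None:
        self.modules[task] = module

    def run(self, task: str, payload: str) -> str:
        if task not in self.modules:
            # Strict scope control: refuse out-of-scope tasks instead of
            # falling back to open-ended generation.
            raise ValueError(f"Task '{task}' is out of scope for this deployment")
        return self.modules[task](payload)


# Hypothetical specialized modules (stand-ins for fine-tuned models or services).
orchestrator = Orchestrator()
orchestrator.register("transcription", lambda audio_ref: f"transcript of {audio_ref}")
orchestrator.register("intent", lambda transcript: "billing_question")
orchestrator.register("actions", lambda transcript: "open_refund_ticket")

transcript = orchestrator.run("transcription", "call_0142.wav")
print(orchestrator.run("intent", transcript))    # -> "billing_question"
print(orchestrator.run("actions", transcript))   # -> "open_refund_ticket"
```

The design choice this sketch illustrates is the same one stated above: scope is enforced at the orchestration layer, so the system behaves predictably end-to-end rather than relying on a single general-purpose model to decide what it should and should not attempt.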
A New Approach to Customer Interactions

The next wave of CX applications, such as Diabolocom AI, takes a different approach: anticipating, and therefore enhancing, the customer experience throughout the communication journey. Diabolocom AI integrates interaction analysis, performance evaluation, and real-time agent assistance—three complementary components that support teams in understanding, measuring, and improving each customer interaction.
Through automated transcription and categorization, conversations become structured data that reveal key topics and satisfaction signals. Performance analytics then help quantify the quality of engagements and guide team training. Finally, real-time assistance supports agents during live calls by suggesting responses or automating routine tasks.
This integrated approach enables Diabolocom to deliver AI that makes customer interactions smoother, clearer, and more personalized.
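To make "conversations become structured data" concrete, here is a hypothetical sketch of what a per-conversation record could look like once transcription and categorization have run; the field names are illustrative assumptions, not Diabolocom's actual schema.

```python
# Hypothetical sketch of a structured conversation record produced by
# transcription and categorization; field names are illustrative only,
# not Diabolocom's actual schema.
from dataclasses import dataclass, field


@dataclass
class ConversationRecord:
    call_id: str
    transcript: str                                     # automated transcription
    topics: list[str] = field(default_factory=list)     # categorized key topics
    satisfaction_signal: float = 0.0                    # e.g. -1.0 (negative) to 1.0 (positive)
    suggested_actions: list[str] = field(default_factory=list)


record = ConversationRecord(
    call_id="call_0142",
    transcript="Hello, I was charged twice this month...",
    topics=["billing", "duplicate_charge"],
    satisfaction_signal=-0.4,
    suggested_actions=["open_refund_ticket"],
)
```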
Small Language Models Are More Specialized
Model size is not the best indicator of performance. What matters most is the relevance of the model’s training and its ability to understand a specific business context.
Because it is trained on real-world data and backed by rigorous orchestration, a specialized model delivers both accuracy and speed. Its narrower scope reduces latency — a critical advantage in contact centers, where every millisecond matters.
Low latency unlocks true real-time assistance: the AI helps the agent during the call, proposes responses, reformulates messages, or initiates after-call documentation — all while feeding long-term analytics.
Specialized models bring deep contextual understanding, predictable behavior, and immediate responsiveness — essential qualities in environments where every interaction counts.
Beyond LLMs, SLMs Are the Preferred Deployment for CX
LiveMCP-101 — with top models scoring below 60% — highlights a structural weakness in general-purpose LLMs when facing demanding, multi-step scenarios. Meanwhile, ChatGPT usage data (800 million weekly users, 2 to 2.5 billion daily requests, declining professional usage) shows that these systems are primarily used for general tasks, signaling that future developments will likely continue in that direction.
In this landscape, performance is no longer about how large a model is. It’s about how well it is trained for a specific purpose. Specialized training and domain-level fine-tuning are becoming the decisive levers for turning powerful models into reliable, scalable enterprise tools. Diabolocom’s approach illustrates this shift: the strongest path forward isn’t for AI to do everything — but to do the right things exceptionally well within a clearly defined domain.
Discover our specialized AI for CX