Surprisingly enough, it seems some AI agents aren’t quite up to scratch on some basic business tests


Salesforce research finds single-turn tasks see only 58% success, while multi-turn effectiveness drops to 35%
Reasoning models like gemini-2.5-pro tend to outperform lighter models
CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark, CRMArena-Pro, which uses synthetic enterprise data to assess LLM agent performance in different CRM scenarios.

It found LLM agents achieved around 58% success on tasks which can be completed in a single step, with tasks that require multiple interactions dropping in effectiveness to just 35% – barely more than one in three.

Although models like gemini-2.5-pro achieved over 83% success in workflow execution, the Salesforce researchers still highlighted some concerns with AI agents, suggesting they might not quite be up to scratch after all.

Are AI agents actually that good?

The paper, entitled ‘Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions’, explained that LLM agents displayed near-zero inherent confidentiality awareness, noting that their performance in handling sensitive information is only improved with explicit prompting (which often came at the expense of task success).

The researchers also criticized previous and existing benchmarks for failing to capture multi-turn interactions, address B2B scenarios or confidentiality, and reflect realistic data environments. CRMArena-Pro is built on synthetic data validated by CRM experts, covering B2B and B2C settings.

In terms of analysis results, reasoning models like gemini-2.5-pro and o1 outperformed lighter models most of the time – Salesforce’s researchers concluded that models that seek more clarifications generally perform better, especially in multi-turn tasks.

For example, the nine models tested (three each from OpenAI, Google and Meta) averaged a score of 35.1%, while gemini-2.5-pro scored 54.5%.

“These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios, positioning CRMArena-Pro as a challenging testbed for guiding future advancements in developing more sophisticated, reliable, and confidentiality-aware LLM agents for professional use,” the researchers concluded.

Looking ahead, Salesforce CEO Marc Benioff views AI agents as a high-margin opportunity, with major corporate clients, including governments, betting on AI agents for boosted efficiency and further cost savings.
