SalesBench - The AI-Powered Sales Evaluation Platform

AI agents have gained significant traction among developers because they can quickly generate and iterate on code. This has driven major gains in engineering efficiency and enabled smaller, smarter teams to move faster than incumbents. But engineering is typically only half of a business: you still have to sell the product to generate revenue.

Introducing SalesBench, a benchmark of AI agents' adaptive social intelligence in extended, goal-oriented interactions.

How do agents navigate complex human dynamics over extended conversations? We answer this by having agents conduct cold-call life insurance sales to diverse simulated buyers. The agents must build rapport, handle objections, and adapt their strategy across long conversational horizons to successfully close deals.

We've evaluated some of the top foundation models to see exactly how persuasive models can really be.

LEADERBOARD

Comprehensive evaluation of AI agents across extended goal-oriented sales interactions.

OVERALL LEADERBOARD

Ranked by overall performance across all difficulty levels and metrics

Rank	Model	Provider	Calls Completed	Closed Deals	Close Rate %	Profit ($)	Avg Time to Close (min)	Inference Cost ($)
1	Claude Sonnet 4	Anthropic	7	6	85.7%	$1,900	20.8	$8.2
2	Claude Opus 4	Anthropic	7	6	85.7%	$1,900	22.5	$10.4
3	o3	OpenAI	7	4	57.1%	$2,200	28.3	$12.5
4	Claude-3.5-Sonnet	Anthropic	6	3	50%	$1,200	25.8	$4.2
5	GPT-4o	OpenAI	6	1	16.7%	$250	35.2	$3.8
6	GPT-4.1	OpenAI	7	0	0%	$0	-	$5.1
7	Grok-4	xAI	5	0	0%	$0	-	$4.5
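
The Close Rate column follows from the two columns before it: closed deals divided by calls completed. A minimal check in Python, with rows copied from the table above; the helper name close_rate is ours and not part of SalesBench:

```python
# Rows copied from the leaderboard: (model, calls completed, closed deals).
ROWS = [
    ("Claude Sonnet 4", 7, 6),
    ("Claude Opus 4", 7, 6),
    ("o3", 7, 4),
    ("Claude-3.5-Sonnet", 6, 3),
    ("GPT-4o", 6, 1),
    ("GPT-4.1", 7, 0),
    ("Grok-4", 5, 0),
]


def close_rate(calls: int, closed: int) -> float:
    """Closed deals as a percentage of completed calls, rounded to one decimal."""
    return round(100 * closed / calls, 1) if calls else 0.0


for model, calls, closed in ROWS:
    print(f"{model}: {close_rate(calls, closed)}%")
```

With only five to seven calls per model, a single extra closed deal moves the rate by roughly 14 to 20 percentage points, so small gaps between models should be read with care.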

The Eval

SalesBench is a simulated environment that tests how well AI models can navigate complex social interactions in a goal-oriented context: conducting life insurance sales calls. The AI agent must build trust, identify needs, handle objections, and adapt its approach to the buyer's personality. We break this into individually manageable tasks that, over extended conversations, reveal an AI's ability to maintain coherent social intelligence and to manage context over long time horizons.

We simulate one day in which the agents try to close as many life insurance sales as possible. The buyers are 100 custom personas, each generated by combining a randomized assortment of circumstances.
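
One way to picture "combining a randomized assortment of circumstances" is to draw one value per attribute for each persona. The sketch below is only an illustration: the attribute names and values are hypothetical, since the actual SalesBench circumstances are not listed here.

```python
import random

# Hypothetical attribute pools; the real SalesBench circumstances are not
# given in this write-up.
CIRCUMSTANCES = {
    "age_band": ["25-34", "35-44", "45-54", "55-64"],
    "family": ["single", "married", "married with children"],
    "budget": ["tight", "moderate", "comfortable"],
    "temperament": ["friendly", "skeptical", "busy", "hostile"],
}


def generate_personas(n: int, seed: int = 0) -> list[dict[str, str]]:
    """Build n personas by picking one random value per attribute."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [
        {"id": f"lead-{i:03d}",
         **{key: rng.choice(values) for key, values in CIRCUMSTANCES.items()}}
        for i in range(n)
    ]


personas = generate_personas(100)
print(len(personas), personas[0])
```

Seeding the generator keeps the buyer pool identical across models, which is what makes their results comparable.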

How it works:

At a high level, a Sales-Operator Agent orchestrates the entire sales process.

AVAILABLE TOOLS

The Sales-Operator Agent has access to the following toolkit for managing leads and conversations:

list_crm

Fetch a list of all potential customers

update_crm

Update lead metadata in the CRM

get_lead

Get a specific lead from the CRM

call_lead

Send a seller agent to call a customer (returns call notes)

sleep

Sleep for a number of minutes

wait_until_next_day

Wait until the beginning of the next day

write_to_memory

Write insights to memory for future reference

read_from_memory

Search memory using natural language queries
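
The toolkit above can be sketched as a small registry with a dispatcher. The tool names and descriptions are copied from the list; the registry layout, the dispatch helper, and the lead_id argument are illustrative assumptions, because the tools' real parameters are not documented here.

```python
from typing import Any, Callable

# Names and descriptions copied from the tool list above. The registry layout
# and the dispatch() helper are illustrative assumptions, not SalesBench code.
TOOLS: dict[str, str] = {
    "list_crm": "Fetch a list of all potential customers",
    "update_crm": "Update lead metadata in the CRM",
    "get_lead": "Get a specific lead from the CRM",
    "call_lead": "Send a seller agent to call a customer (returns call notes)",
    "sleep": "Sleep for a number of minutes",
    "wait_until_next_day": "Wait until the beginning of the next day",
    "write_to_memory": "Write insights to memory for future reference",
    "read_from_memory": "Search memory using natural language queries",
}


def dispatch(name: str, handlers: dict[str, Callable[..., Any]], **kwargs: Any) -> Any:
    """Route a tool call by name, rejecting anything outside the toolkit."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return handlers[name](**kwargs)


# Stub handler with a hypothetical argument name (lead_id).
handlers = {"get_lead": lambda lead_id: {"id": lead_id, "status": "new"}}
print(dispatch("get_lead", handlers, lead_id="lead-001"))
```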

We've added memory tools to help maintain context quality. Earlier approaches to context maintenance relied on the system deciding, via summarization, what to keep in context. We instead use a sliding-window context combined with memory tools, similar to Vending Bench. The agent keeps the most recent 30,000 tokens in context before each inference and is responsible for deciding what to save to memory using its tools. Memories are stored as embeddings, and memory reads are ranked by cosine similarity.
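
A minimal sketch of this design, under stated assumptions: tokens are counted by whitespace splitting and the embedding is a toy bag-of-words vector, both standing in for the model's tokenizer and a real embedding model. Only the 30,000-token window and the cosine-similarity ranking come from the description above.

```python
import math
from collections import Counter

MAX_CONTEXT_TOKENS = 30_000  # window size stated in the text


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace split (assumption, not the real tokenizer)."""
    return len(text.split())


def sliding_window(messages: list[str], budget: int = MAX_CONTEXT_TOKENS) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return kept[::-1]  # restore chronological order


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding, standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Memory:
    """Stores notes with their embeddings and returns the closest matches."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def write(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def read(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = Memory()
memory.write("lead 12 prefers morning calls")
memory.write("lead 7 objected to price")
print(memory.read("price objection from lead"))
```

Anything that falls out of the window is lost unless the agent wrote it to memory first, which is why the decision about what to save is left to the agent.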

The call_lead tool starts a subagent that conducts the sales call with the prospective buyer agent. We delegate the call to a subagent so that the full transcript stays out of the Sales-Operator's context. The subagent passes a summary of the important information from the call up to the Sales-Operator, which can then choose to save it to memory, update the CRM, and so on.
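
The delegation can be sketched as a function that keeps the transcript private and returns only compact notes. Everything below is a hypothetical stand-in: the CallNotes fields, the ACCEPT and REJECT signals, and the stub seller and buyer functions replace what would be LLM calls in the real system.

```python
from dataclasses import dataclass
from typing import Callable

Turn = tuple[str, str]  # (speaker, text)


@dataclass
class CallNotes:
    lead_id: str
    outcome: str  # hypothetical labels: "closed", "refused", "open"
    turns: int
    summary: str


def call_lead(lead_id: str,
              seller: Callable[[list[Turn]], str],
              buyer: Callable[[list[Turn]], str],
              max_turns: int = 20) -> CallNotes:
    """Run the call on a private transcript and return only compact notes."""
    transcript: list[Turn] = []  # never handed back to the operator
    outcome = "open"
    for _ in range(max_turns):
        transcript.append(("seller", seller(transcript)))
        reply = buyer(transcript)
        transcript.append(("buyer", reply))
        if reply in ("ACCEPT", "REJECT"):  # hypothetical end-of-call signals
            outcome = "closed" if reply == "ACCEPT" else "refused"
            break
    turns = len(transcript) // 2
    return CallNotes(lead_id, outcome, turns, f"{outcome} after {turns} seller turns")


# Stub agents standing in for LLM calls.
notes = call_lead(
    "lead-001",
    seller=lambda transcript: "pitch",
    buyer=lambda transcript: "ACCEPT" if len(transcript) >= 5 else "tell me more",
)
print(notes)
```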

We simulate time by assigning a fixed number of minutes to each tool call. The exception is sales calls, whose duration is the total number of agent response loops multiplied by 30 seconds. The agent can also wait until the next working day, or sleep until a certain time if it sees fit.
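
The clock can be sketched as follows. The 30 seconds per response loop comes from the description above; the per-tool minute values and the 9:00 start of day are assumptions made for illustration, and this sketch ignores weekends.

```python
from datetime import datetime, timedelta

# Hypothetical per-tool durations in minutes; the real values are not listed.
TOOL_MINUTES = {
    "list_crm": 1,
    "get_lead": 1,
    "update_crm": 2,
    "write_to_memory": 1,
    "read_from_memory": 1,
}
SECONDS_PER_LOOP = 30  # stated in the text for sales calls
DAY_START_HOUR = 9     # assumed start of the working day


class SimClock:
    """Advances simulated time as the agent uses tools."""

    def __init__(self, start: datetime) -> None:
        self.now = start

    def charge_tool(self, name: str) -> None:
        self.now += timedelta(minutes=TOOL_MINUTES[name])

    def charge_call(self, response_loops: int) -> None:
        self.now += timedelta(seconds=SECONDS_PER_LOOP * response_loops)

    def sleep(self, minutes: int) -> None:
        self.now += timedelta(minutes=minutes)

    def wait_until_next_day(self) -> None:
        tomorrow = self.now + timedelta(days=1)
        self.now = tomorrow.replace(hour=DAY_START_HOUR, minute=0, second=0, microsecond=0)


clock = SimClock(datetime(2025, 1, 6, 9, 0))
clock.charge_tool("list_crm")  # +1 minute
clock.charge_call(40)          # 40 loops * 30 s = 20 minutes
print(clock.now)
```

Under this rule a call of 40 response loops costs 20 simulated minutes, so longer conversations directly reduce the number of leads an agent can reach in a day.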

SAMPLE CALLS

Example: Model Successfully Closing a Deal

Sample Sales Call

Example: Model Doing a Follow-Up Call

Sample Sales Call

Example: Model Handling an Angry Deal Refusal

Sample Sales Call

Conclusion

SalesBench reveals both the promise and limitations of AI in sales. While Claude’s 86% close rate shows AI can build relationships and close deals, every model’s failure with difficult customers highlights the gap between current AI and human social intelligence.

The stark differences between Claude’s patient relationship-building and o3’s aggressive tactics suggest that AI models develop distinct “sales personalities” that dramatically impact their success. As businesses increasingly explore AI for customer interactions, understanding these differences becomes crucial.

Can AI actually sell? Yes—but only to customers who want to buy. The art of converting skeptics, handling complex objections, and building long-term relationships remains uniquely human, at least for now.

Reference:
By Kyle Jeong, Sameel Arif, & Hamza Mostafa

Heavily Inspired by Vending Bench https://andonlabs.com/evals/vending-bench
