LEADERBOARD
Comprehensive evaluation of AI agents across extended goal-oriented sales interactions.
Rank | Model | Provider | Calls Completed | Closed Deals | Close Rate % | Profit ($) | Avg Time to Close (min) | Inference Cost ($) |
---|---|---|---|---|---|---|---|---|
1 | Claude Sonnet 4 | Anthropic | 7 | 6 | 85.7% | $1,900 | 20.8 | $8.2 |
2 | Claude Opus 4 | Anthropic | 7 | 6 | 85.7% | $1,900 | 22.5 | $10.4 |
3 | o3 | OpenAI | 7 | 4 | 57.1% | $2,200 | 28.3 | $12.5 |
4 | Claude-3.5-Sonnet | Anthropic | 6 | 3 | 50.0% | $1,200 | 25.8 | $4.2 |
5 | GPT-4o | OpenAI | 6 | 1 | 16.7% | $250 | 35.2 | $3.8 |
6 | GPT-4.1 | OpenAI | 7 | 0 | 0.0% | $0 | - | $5.1 |
7 | Grok-4 | xAI | 5 | 0 | 0.0% | $0 | - | $4.5 |
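The Close Rate column in the table is consistent with closed deals divided by calls completed, rounded to one decimal place. The short check below reproduces it from the two count columns; the formula is inferred from the table, not stated by the authors, and the names `rows` and `close_rate` are illustrative.

```python
# (model, calls completed, closed deals) copied from the leaderboard table.
rows = [
    ("Claude Sonnet 4", 7, 6),
    ("Claude Opus 4", 7, 6),
    ("o3", 7, 4),
    ("Claude-3.5-Sonnet", 6, 3),
    ("GPT-4o", 6, 1),
    ("GPT-4.1", 7, 0),
    ("Grok-4", 5, 0),
]


def close_rate(calls_completed: int, closed_deals: int) -> float:
    """Assumed definition: percentage of completed calls that ended in a closed deal."""
    if calls_completed == 0:
        return 0.0
    return round(100 * closed_deals / calls_completed, 1)


for model, calls, closed in rows:
    print(f"{model}: {close_rate(calls, closed)}%")
```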
The Eval
SalesBench is a simulated environment that tests how well AI models can navigate complex social interactions in a goal-oriented context: conducting life insurance sales calls. The AI agent must build trust, identify needs, handle objections, and adapt its approach to the buyer's personality. We break this into individually manageable tasks that, over extended conversations, reveal an AI's ability to maintain coherent social intelligence and to manage context over long time horizons.
We simulate one day in which each Agent tries to close as many life insurance sales as possible. The buyers are drawn from 100 custom personas, each generated by combining a randomized assortment of circumstances. See more here.
How it works:
At a high level, a Sales-Operator Agent orchestrates the entire sales process using the following tools:
Tool | Description |
---|---|
list_crm | Fetch a list of all potential customers |
update_crm | Update lead metadata in the CRM |
get_lead | Get a specific lead from the CRM |
call_lead | Send a seller agent to call a customer (returns call notes) |
sleep | Sleep for a number of minutes |
wait_until_next_day | Wait until the beginning of the next day |
write_to_memory | Write insights to memory for future reference |
read_from_memory | Search memory using natural language queries |
We've added memory tools to help maintain context quality. Previous approaches to context maintenance relied on the system deciding, via summarization, what to keep in context and what to drop. We instead use a sliding-window context combined with memory tools, similar to Vending Bench. The Agent keeps the most recent 30,000 tokens in context before inference and is responsible for deciding what to save to memory using its tools. Memories are stored as embeddings, and memory reads are ranked by cosine similarity.
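A minimal sketch of the sliding-window-plus-memory idea follows. It is not the SalesBench implementation: the bag-of-words `embed` function stands in for a real embedding model, whitespace splitting stands in for a real tokenizer, and the names `MemoryStore` and `sliding_window` are illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Stores each memory with its embedding; reads rank by cosine similarity."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def write(self, text):
        self.items.append((text, embed(text)))

    def read(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def sliding_window(messages, max_tokens=30_000, count=lambda m: len(m.split())):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, total = [], 0
    for msg in reversed(messages):
        n = count(msg)
        if total + n > max_tokens:
            break
        kept.append(msg)
        total += n
    return list(reversed(kept))
```

Anything that falls out of the window is lost unless the Agent wrote it to memory first, which is why the decision of what to save is left to the Agent.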
The call_lead tool starts a subagent that conducts the sales call with the prospective buyer agent. We delegate the call to a subagent to keep the Sales-Operator's context compact. The subagent passes a summary of the important information from the call up to the Sales-Operator, which can then choose to save it to memory, update the CRM, and so on.
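The delegation pattern can be sketched as below. This is an assumption-laden illustration, not the benchmark's code: the `CallNotes` fields, the outcome labels, the `max_loops` cap, and the `seller`/`buyer` callables (stand-ins for the two model-driven agents) are all hypothetical. The point it shows is that the full transcript stays inside the subagent and only a short summary is returned to the operator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CallNotes:
    lead_id: str
    outcome: str          # hypothetical labels, e.g. "closed" or "no_decision"
    summary: str          # the only conversational content the operator sees
    response_loops: int   # number of seller turns in the call


# Stand-ins for the seller subagent and the buyer agent.
Seller = Callable[[List[str]], str]
Buyer = Callable[[List[str]], Tuple[str, Optional[str]]]


def call_lead(lead_id: str, seller: Seller, buyer: Buyer, max_loops: int = 50) -> CallNotes:
    """Run a seller/buyer conversation and return notes, not the transcript."""
    transcript: List[str] = []
    outcome = "no_decision"
    loops = 0
    while loops < max_loops:
        loops += 1
        transcript.append("seller: " + seller(transcript))
        reply, decision = buyer(transcript)
        transcript.append("buyer: " + reply)
        if decision is not None:  # the buyer ended the call with a decision
            outcome = decision
            break
    last = transcript[-1] if transcript else "(no conversation)"
    summary = f"{loops} seller turns; outcome={outcome}; last line: {last}"
    return CallNotes(lead_id, outcome, summary, loops)
```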
We simulate time by assigning a fixed duration in minutes to each tool call. Sales calls are the exception: their duration is the total number of Agent response loops multiplied by 30 seconds. The Agent can also wait until the next working day, or sleep for a number of minutes if it sees fit.
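The time accounting can be sketched as a small simulated clock. Only the 30-seconds-per-loop rule comes from the text; the per-tool minute values in `TOOL_MINUTES`, the 9:00 start of the working day, and the `SimClock` class itself are placeholders chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical fixed costs in minutes; the source does not list the real values.
TOOL_MINUTES = {
    "list_crm": 1,
    "get_lead": 1,
    "update_crm": 2,
    "write_to_memory": 1,
    "read_from_memory": 1,
}
SECONDS_PER_LOOP = 30      # stated in the text
DAY_START_MINUTE = 9 * 60  # assumption: the working day starts at 09:00


@dataclass
class SimClock:
    day: int = 1
    minute: float = DAY_START_MINUTE  # minutes since midnight

    def charge_tool(self, name: str) -> None:
        """Advance the clock by the fixed cost of a non-call tool."""
        self.minute += TOOL_MINUTES[name]

    def charge_call(self, response_loops: int) -> float:
        """Advance by a sales call: response loops x 30 seconds, in minutes."""
        minutes = response_loops * SECONDS_PER_LOOP / 60
        self.minute += minutes
        return minutes

    def sleep(self, minutes: float) -> None:
        self.minute += minutes

    def wait_until_next_day(self) -> None:
        self.day += 1
        self.minute = DAY_START_MINUTE
```

Under this rule, a call with 40 Agent response loops would take 20 simulated minutes.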
SAMPLE CALLS
Example: Model Successfully Closing a Deal
Sample Sales Call
Example: Model Doing a Follow-Up Call
Sample Sales Call
Example: Model Handling an Angry Deal Refusal
Sample Sales Call

SalesBench reveals both the promise and limitations of AI in sales. While the 85.7% close rate of Claude Sonnet 4 and Claude Opus 4 shows that AI can build relationships and close deals, every model's failure with difficult customers highlights the gap between current AI and human social intelligence.
The stark differences between Claude’s patient relationship-building and o3’s aggressive tactics suggest that AI models develop distinct “sales personalities” that dramatically impact their success. As businesses increasingly explore AI for customer interactions, understanding these differences becomes crucial.
Can AI actually sell? Yes, but only to customers who want to buy. The art of converting skeptics, handling complex objections, and building long-term relationships remains uniquely human, at least for now.
Reference:
By Kyle Jeong, Sameel Arif, & Hamza Mostafa
Heavily inspired by Vending Bench: https://andonlabs.com/evals/vending-bench