Current large language model (LLM) benchmarks focus mostly on "academic" tasks such as knowledge Q&A, code generation, and mathematical reasoning. But when AI agents are deployed in real workplace scenarios, these tests rarely capture the complexity they actually face.
Customer service email handling is a typical example. This work requires:
- Understanding the urgency and business type of emails
- Distinguishing between real security alerts and phishing emails
- Responding to customers in an appropriate tone
- Routing issues to the correct team
- Maintaining context coherence across multiple related emails
These tasks look simple, but they involve multi-step decision-making, context understanding, and nontrivial state management. More importantly, mistakes are costly: flagging an important email as spam can drive a customer away, while missing a phishing email can create a security incident.
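To make the decision structure concrete, here is a minimal rule-based triage sketch. All names here (`Email`, `triage`, the category and team labels, the `@example.com` trusted domain) are hypothetical illustrations, not part of any real benchmark; a real agent would replace the keyword rules with an LLM call, and this toy version only shows the shape of the classify-then-route decision, including why phishing detection must consider the sender, not just the message text:

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    SECURITY_ALERT = "security_alert"
    PHISHING = "phishing"
    BILLING = "billing"
    GENERAL = "general"


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def triage(email: Email) -> tuple[Category, str]:
    """Toy triage: classify an email and pick a destination team.

    Keyword rules stand in for the model's judgment; they illustrate
    the decision structure, not a workable classifier.
    """
    text = (email.subject + " " + email.body).lower()
    # Phishing vs. genuine alert: a credential request from outside
    # the trusted domain is treated as suspect.
    if "verify your password" in text and not email.sender.endswith("@example.com"):
        return Category.PHISHING, "security-review"
    if "security alert" in text:
        return Category.SECURITY_ALERT, "security-team"
    if "invoice" in text or "refund" in text:
        return Category.BILLING, "billing-team"
    return Category.GENERAL, "support-team"
```

For example, `triage(Email("noreply@phish.biz", "Urgent", "verify your password now"))` routes to security review as suspected phishing, while the same wording from the trusted domain would not. Even this toy version shows where the hard cases live: the routing decision depends on combining content with sender metadata, which is exactly the kind of judgment academic benchmarks rarely test.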