We've noticed something odd with an AI agent that's handling customer interactions across several markets. Overall metrics look fine, but when we break things down by customer segment, the quality varies a lot. Some groups consistently get accurate, helpful responses, while others experience more misunderstandings, incorrect tool usage, or incomplete answers. Making changes to improve one segment sometimes causes performance to slip for another, so it's becoming difficult to tell whether a new version is actually better overall. How are teams evaluating and optimizing AI agents when performance needs to stay consistent across multiple user groups instead of just looking good on aggregate metrics?
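To make the problem concrete, here is a minimal sketch of the kind of per-segment comparison in question. It is not our actual pipeline: the segment names, the 0/1 outcome data, the `compare_versions` helper, and the `max_regression` threshold are all hypothetical, and a real setup would need larger samples and some form of significance testing.

```python
def rate(outcomes):
    """Fraction of successful interactions in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def compare_versions(baseline, candidate, max_regression=0.02):
    """Compare two agent versions per segment instead of only in aggregate.

    baseline, candidate: dict mapping segment name -> list of 0/1 outcomes.
    Assumes both dicts contain the same segments (hypothetical setup).
    max_regression: largest per-segment drop in success rate we tolerate.
    """
    base = {seg: rate(vals) for seg, vals in baseline.items()}
    cand = {seg: rate(candidate[seg]) for seg in baseline}

    # Per-segment change; negative means the candidate got worse.
    deltas = {seg: cand[seg] - base[seg] for seg in base}
    regressed = sorted(s for s, d in deltas.items() if d < -max_regression)

    # Pooled (aggregate) success rate across all interactions.
    pooled_base = rate([o for vals in baseline.values() for o in vals])
    pooled_cand = rate([o for seg in baseline for o in candidate[seg]])

    return {
        "aggregate_delta": pooled_cand - pooled_base,
        "worst_segment_rate": min(cand.values()),
        "deltas": deltas,
        "regressed": regressed,
        "passes": not regressed,  # gate on segments, not on the aggregate
    }


# Made-up data: the candidate improves market_a but hurts market_b.
baseline = {
    "market_a": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "market_b": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
}
candidate = {
    "market_a": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "market_b": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
}

report = compare_versions(baseline, candidate)
print(report["passes"], report["regressed"])  # False ['market_b']
```

In this toy data the aggregate success rate goes up while `market_b` drops by about ten points, which is exactly the situation where an aggregate-only metric would call the new version an improvement.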