AI Agents · SalesAi · 2025 · Intern

A simple way to answer the hard question: is this agent good?

Creating and using an agent only pays off if you can tell whether it's working. I designed an evaluation experience that turns scattered signals into a clear, honest read.

teams running evals · TBC
issues caught pre-release
time to diagnose

Role · Product & UX design — research, information design, prototyping

01

The problem

Quality was a gut feeling. There was no shared, trustworthy way to judge whether an agent was performing well or where it failed.

There was also no way to tell whether a change made things better or worse.

Before — quality as guesswork (recreated mock)
Fig. 01 — Before: no shared read on quality
02

Understanding why

I worked to define what 'good' even means for an agent, and which signals teams would actually trust.

The challenge was honesty: showing real weaknesses clearly without drowning people in metrics.

Defining what 'good' means
Fig. 02 — The signals teams would actually trust
03

The solution

An evaluation view that pairs a clear headline read with the specific examples behind it.

A team can see the score, understand why, and jump straight to what to fix.

After — evaluation results (recreated, dummy data)
Fig. 03 — After: evaluation in the workflow
04

Outcome

Placeholder for shareable results — e.g. broader adoption of evaluation and faster diagnosis.

Quality moved from a gut feeling to something teams could point at.

Where I'd take it next

Continuous evaluation — quality tracked over time so regressions surface on their own, turning evaluation from an event into a safety net.

Reflection

An honest 'here's where it's weak' earns more trust than a dashboard full of green.

Privacy note — screens are recreated with dummy data and details simplified; the real product evolves with business decisions and user needs.
