Evaluations
The Evaluations page, found at Admin → Organisation → PebbleObserve → Evaluations, is where you set up systematic quality measurement for PebbleAI responses: moving beyond “the response feels okay” to “the response scored 4.2/5 on the LLM-as-judge eval”.
Why evaluations matter
PebbleChat answers thousands of questions a day. Without evaluations, you’re flying blind:
- You don’t know whether response quality is improving or regressing over time
- You can’t safely change models, prompts, or routing without risking quality
- You can’t prove to stakeholders that AI-generated content meets their bar
- You can’t catch the slow drift where a model gradually starts producing worse answers
- You can’t tell whether one user’s complaint is a one-off or the tip of a systemic issue
Evaluations turn vibes into measurable scores you can track, alert on, and improve.
What evaluations PebbleAI supports
PebbleAI’s evaluation system is powered by Langfuse and supports several evaluation methods:
| Method | What it does | Best for |
|---|---|---|
| LLM-as-judge | A separate AI model scores each response against criteria you define | Subjective qualities like helpfulness, tone, or factual accuracy when ground truth isn’t available |
| User feedback | Aggregates explicit thumbs/ratings from users | Real-world satisfaction; honest signal but sparse |
| Manual labelling | Human reviewers score sample responses | Highest accuracy; lowest scale |
| Custom scorers | Code-based functions you write to score responses | Objective metrics — length, tool usage, format compliance, JSON validity |
| Reference comparison | Compare against a known-correct answer in a dataset | Regression testing — “did changing the prompt break anything?” |
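A custom scorer is simply a function that maps a response to a number. As a minimal sketch (the function names here are illustrative, not part of the PebbleObserve or Langfuse API), two code-based scorers for format compliance might look like:

```python
import json

def score_json_validity(response: str) -> float:
    """Return 1.0 if the response parses as JSON, else 0.0."""
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def score_length(response: str, max_chars: int = 2000) -> float:
    """Return 1.0 if the response fits the length budget, else 0.0."""
    return 1.0 if len(response) <= max_chars else 0.0
```

Objective scorers like these are deterministic and cheap, so unlike LLM-as-judge they can run on every single response rather than a sample.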
Where to actually configure
The PebbleObserve admin Evaluations page is the entry point — it lists running and completed evaluations and lets you trigger new ones. The full configuration interface (defining scorers, datasets, schedules, dashboards) lives in the Langfuse-powered evaluation product that ships with PebbleObserve.
Specifically:
- Evaluation Overview — the upstream Langfuse evaluation documentation, kept in sync with this install
- Evaluation Methods — detailed guides for each method
- Experiments — A/B testing prompt versions or model choices on the same inputs
The Langfuse documentation is the authoritative reference for how to configure each evaluation method. This admin page is the organisational view over those evaluations once they’re running.
Common evaluation patterns
Production trace evaluation
Run an evaluation continuously against live PebbleChat traffic. Every nth response gets scored automatically; results feed into a dashboard you can monitor.
Use cases:
- Tone monitoring — score every customer-facing response for politeness
- Factuality monitoring — score responses against a corpus of known-correct facts
- Format compliance — score whether responses follow the rules in your ambient context
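The “every nth response” sampling above can be sketched as a deterministic, hash-based sampler. This is an illustration only — trace sampling is actually configured inside the Langfuse-powered evaluation product, and `trace_id` here is just a stand-in identifier:

```python
import hashlib

def should_score(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically sample a fraction of traces for evaluation.

    Hashing the trace id (rather than calling random()) means a re-run
    selects the same traces, which keeps evaluation runs reproducible.
    """
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    # Map the first 8 hex chars to a float in [0, 1).
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return bucket < sample_rate
```

A 5% rate on thousands of responses per day still yields a large daily sample while keeping judge-model costs bounded.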
Dataset evaluation
Curate a dataset of representative inputs (questions, prompts, scenarios) and run the entire dataset against PebbleChat on a schedule (or on every prompt change).
Use cases:
- Regression testing — same dataset, run weekly, see if scores change
- Prompt version comparison — A/B test two ambient context drafts on the same dataset
- Model comparison — score how different models handle your representative inputs
- Pre-launch validation — before rolling out a new flow or agent, run the dataset against it
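At its core, a dataset evaluation is a loop: send each curated input through the assistant, score the output, and aggregate. A minimal sketch, where `ask` and `score` are hypothetical stand-ins for your PebbleChat integration and your scorer:

```python
from statistics import mean
from typing import Callable

def run_dataset_eval(
    dataset: list[str],
    ask: Callable[[str], str],           # hypothetical: sends one input to the assistant
    score: Callable[[str, str], float],  # hypothetical: scores (input, output), e.g. 1-5
) -> float:
    """Run every dataset item through the assistant and return the mean score."""
    scores = [score(item, ask(item)) for item in dataset]
    return mean(scores)
```

Because the loop is driven by a fixed dataset, the same call works unchanged for regression testing, prompt comparison, and pre-launch validation — only `ask` or the scheduled trigger changes.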
User-feedback aggregation
Collect explicit user feedback from PebbleChat and aggregate it into evaluation scores.
Note: PebbleChat does not currently have per-message thumbs up/down. The Submit Feedback modal in the help menu is the current feedback channel; it lands in the engineering team’s Jira queue rather than aggregating into evaluation scores. Aggregated user feedback is typically captured via custom flows or integration with the Langfuse user feedback API.
Step-by-step: setting up your first evaluation
This is a high-level outline; the detailed configuration is in the Langfuse Evaluation docs.
- Decide what to measure — start with one specific quality (politeness, factual accuracy, format compliance)
- Pick a method — LLM-as-judge is usually the easiest first choice
- Define the scorer — write a prompt that tells the judge model how to score responses (e.g. “Rate this response for politeness on a 1-5 scale, with 1 being rude and 5 being warmly professional. Reply with just the number.”)
- Pick a sample size — 100 responses per day is a good starting point; you can increase from there
- Run the evaluation — point it at your production traces or a curated dataset
- Watch the dashboard — let it run for a week and look at the distribution
- Iterate — adjust the scorer prompt if scores cluster too much or seem inconsistent
Step-by-step: using evaluations to validate a prompt change
The most powerful use case for evaluations:
- Curate a small dataset (50-200 inputs) representative of what your users ask
- Run the dataset through PebbleChat with the current ambient context — note the average score
- Change the ambient context in Configuration → Ambient Context
- Re-run the dataset with the new context
- Compare — if the score went up, ship the change; if it went down, revert and try again
- Repeat for every meaningful prompt change
This converts “I think the new prompt is better” into “the new prompt scored 4.1 vs 3.8 on our standard evaluation set”.
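The ship/revert decision in the steps above benefits from a minimum-difference guard: on a 50-200 item dataset, a tiny score movement is often noise rather than signal. A simple sketch of that comparison (the threshold value is an assumption to tune, not a recommendation from the product):

```python
from statistics import mean

def compare_runs(baseline: list[float], candidate: list[float],
                 min_delta: float = 0.1) -> str:
    """Compare two evaluation runs over the same dataset.

    min_delta guards against shipping (or reverting) on noise:
    only act when the mean score moves by more than the threshold.
    """
    delta = mean(candidate) - mean(baseline)
    if delta > min_delta:
        return "ship"
    if delta < -min_delta:
        return "revert"
    return "inconclusive"
```

An “inconclusive” result is a prompt to enlarge the dataset or tighten the scorer, not a licence to ship on gut feel.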
What this admin page shows
The Evaluations page typically shows:
- A list of running evaluations with their last-run timestamp and aggregate score
- Recent evaluation runs with click-through to detail
- A summary dashboard showing the score trend over time
For the deep configuration interface, click through to the underlying Langfuse evaluation product — the link from the page header takes you there.
Tips
- Start small. One evaluation, one quality, one dataset. Don’t try to score everything at once.
- Make scorer prompts specific. Vague scoring criteria lead to noisy scores. “Rate factual accuracy 1-5” is too vague; “Rate factual accuracy 1-5 considering whether named entities, dates, and numerical claims are correct” is much better.
- Use a stronger judge model than the model being judged. Use Claude Opus to judge Haiku, not the other way around.
- Re-evaluate when you change anything significant — a new prompt, a new model, a new ambient context, a new flow. Evaluations are how you catch regressions you wouldn’t otherwise see.
- Don’t chase the score. A perfect score isn’t the goal — not regressing is. Use the score as a regression signal, not a target.
Related
- Langfuse Evaluation Overview — the authoritative reference for how to configure evaluations
- Langfuse Evaluation Methods — detailed guides for LLM-as-judge, manual labelling, etc.
- Langfuse Experiments — A/B testing prompts and models
- Usage — measures cost and activity, not quality (complementary to evaluations)
- Logs — lets you investigate individual evaluation runs