Evaluations
The Evaluations page, found at Admin → Organisation → PebbleObserve → Evaluations, is where you set up systematic quality measurement for PebbleAI responses: moving beyond “the response feels okay” to “the response scored 4.2/5 on the LLM-as-judge eval”.
Why evaluations matter
PebbleChat answers thousands of questions a day. Without evaluations, you’re flying blind:
- You don’t know whether response quality is improving or regressing over time
- You can’t safely change models, prompts, or routing without risking quality
- You can’t prove to stakeholders that AI-generated content meets their bar
- You can’t catch the slow drift where a model gradually starts producing worse answers
- You can’t tell whether one user’s complaint is a one-off or the tip of a systemic issue
Evaluations turn vibes into measurable scores you can track, alert on, and improve.
What evaluations PebbleAI supports
PebbleAI’s evaluation system is powered by Langfuse and supports several evaluation methods:
| Method | What it does | Best for |
|---|---|---|
| LLM-as-judge | A separate AI model scores each response against criteria you define | Subjective qualities like helpfulness, tone, or factual accuracy when ground truth isn’t available |
| User feedback | Aggregates explicit thumbs/ratings from users | Real-world satisfaction; honest signal but sparse |
| Manual labelling | Human reviewers score sample responses | Highest accuracy; lowest scale |
| Custom scorers | Code-based functions you write to score responses | Objective metrics — length, tool usage, format compliance, JSON validity |
| Reference comparison | Compare against a known-correct answer in a dataset | Regression testing — “did changing the prompt break anything?” |
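A custom scorer is simply a function that maps a response to a number. As a minimal sketch (the function names here are illustrative, not part of the PebbleObserve or Langfuse API), two code-based scorers for format compliance might look like:

```python
import json

def score_json_validity(response: str) -> float:
    """Return 1.0 if the response parses as JSON, else 0.0."""
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def score_length(response: str, max_chars: int = 2000) -> float:
    """Return 1.0 if the response fits the length budget, else 0.0."""
    return 1.0 if len(response) <= max_chars else 0.0
```

Objective scorers like these are deterministic and cheap, so unlike LLM-as-judge they can run on every single response rather than a sample.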
Where to actually configure
The PebbleObserve admin Evaluations page is the entry point — it lists running and completed evaluations and lets you trigger new ones. The full configuration interface (defining scorers, datasets, schedules, dashboards) lives in the Langfuse-powered evaluation product that ships with PebbleObserve.
Specifically:
- Evaluation Overview — the upstream Langfuse evaluation documentation, kept in sync with this install
- Evaluation Methods — detailed guides for each method
- Experiments — A/B testing prompt versions or model choices on the same inputs
The Langfuse documentation is the authoritative reference for how to configure each evaluation method. This admin page is the organisational view over those evaluations once they’re running.
Common evaluation patterns
Production trace evaluation
Run an evaluation continuously against live PebbleChat traffic. Every nth response gets scored automatically; results feed into a dashboard you can monitor.
Use cases:
- Tone monitoring — score every customer-facing response for politeness
- Factuality monitoring — score responses against a corpus of known-correct facts
- Format compliance — score whether responses follow the rules in your ambient context
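The “every nth response” sampling above can be sketched as a deterministic, hash-based sampler. This is an illustration only — trace sampling is actually configured inside the Langfuse-powered evaluation product, and `trace_id` here is just a stand-in identifier:

```python
import hashlib

def should_score(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically sample a fraction of traces for evaluation.

    Hashing the trace id (rather than calling random()) means a re-run
    selects the same traces, which keeps evaluation runs reproducible.
    """
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    # Map the first 8 hex chars to a float in [0, 1).
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return bucket < sample_rate
```

A 5% rate on thousands of responses per day still yields a large daily sample while keeping judge-model costs bounded.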
Dataset evaluation
Curate a dataset of representative inputs (questions, prompts, scenarios) and run the entire dataset against PebbleChat on a schedule (or on every prompt change).
Use cases:
- Regression testing — same dataset, run weekly, see if scores change
- Prompt version comparison — A/B test two ambient context drafts on the same dataset
- Model comparison — score how different models handle your representative inputs
- Pre-launch validation — before rolling out a new flow or agent, run the dataset against it
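At its core, a dataset evaluation is a loop: send each curated input through the assistant, score the output, and aggregate. A minimal sketch, where `ask` and `score` are hypothetical stand-ins for your PebbleChat integration and your scorer:

```python
from statistics import mean
from typing import Callable

def run_dataset_eval(
    dataset: list[str],
    ask: Callable[[str], str],           # hypothetical: sends one input to the assistant
    score: Callable[[str, str], float],  # hypothetical: scores (input, output), e.g. 1-5
) -> float:
    """Run every dataset item through the assistant and return the mean score."""
    scores = [score(item, ask(item)) for item in dataset]
    return mean(scores)
```

Because the loop is driven by a fixed dataset, the same call works unchanged for regression testing, prompt comparison, and pre-launch validation — only `ask` or the scheduled trigger changes.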
User-feedback aggregation
Collect explicit user feedback from PebbleChat and aggregate it into evaluation scores.
Note: PebbleChat does not currently have per-message thumbs up/down. The Submit Feedback modal in the help menu is the current feedback channel; it lands in the engineering team’s Jira queue rather than aggregating into evaluation scores. Aggregated user feedback is typically captured via custom flows or integration with the Langfuse user feedback API.
Step-by-step: setting up your first evaluation
This is a high-level outline; the detailed configuration is in the Langfuse Evaluation docs.
- Decide what to measure — start with one specific quality (politeness, factual accuracy, format compliance)
- Pick a method — LLM-as-judge is usually the easiest first choice
- Define the scorer — write a prompt that tells the judge model how to score responses (e.g. “Rate this response for politeness on a 1-5 scale, with 1 being rude and 5 being warmly professional. Reply with just the number.”)
- Pick a sample size — 100 responses per day is a good starting point; you can increase from there
- Run the evaluation — point it at your production traces or a curated dataset
- Watch the dashboard — let it run for a week and look at the distribution
- Iterate — adjust the scorer prompt if scores cluster too much or seem inconsistent
Step-by-step: using evaluations to validate a prompt change
The most powerful use case for evaluations:
- Curate a small dataset (50-200 inputs) representative of what your users ask
- Run the dataset through PebbleChat with the current ambient context — note the average score
- Change the ambient context in Configuration → Ambient Context
- Re-run the dataset with the new context
- Compare — if the score went up, ship the change; if it went down, revert and try again
- Repeat for every meaningful prompt change
This converts “I think the new prompt is better” into “the new prompt scored 4.1 vs 3.8 on our standard evaluation set”.
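The ship/revert decision in the steps above benefits from a minimum-difference guard: on a 50-200 item dataset, a tiny score movement is often noise rather than signal. A simple sketch of that comparison (the threshold value is an assumption to tune, not a recommendation from the product):

```python
from statistics import mean

def compare_runs(baseline: list[float], candidate: list[float],
                 min_delta: float = 0.1) -> str:
    """Compare two evaluation runs over the same dataset.

    min_delta guards against shipping (or reverting) on noise:
    only act when the mean score moves by more than the threshold.
    """
    delta = mean(candidate) - mean(baseline)
    if delta > min_delta:
        return "ship"
    if delta < -min_delta:
        return "revert"
    return "inconclusive"
```

An “inconclusive” result is a prompt to enlarge the dataset or tighten the scorer, not a licence to ship on gut feel.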
What this admin page shows
The Evaluations page typically shows:
- A list of running evaluations with their last-run timestamp and aggregate score
- Recent evaluation runs with click-through to detail
- A summary dashboard showing the score trend over time
For the deep configuration interface, click through to the underlying Langfuse evaluation product — the link from the page header takes you there.
Tips
- Start small. One evaluation, one quality, one dataset. Don’t try to score everything at once.
- Make scorer prompts specific. Vague scoring criteria lead to noisy scores. “Rate factual accuracy 1-5” is too vague; “Rate factual accuracy 1-5 considering whether named entities, dates, and numerical claims are correct” is much better.
- Use a stronger judge model than the model being judged. Use Claude Opus to judge Haiku, not the other way around.
- Re-evaluate when you change anything significant — a new prompt, a new model, a new ambient context, a new flow. Evaluations are how you catch regressions you wouldn’t otherwise see.
- Don’t chase the score. A perfect score isn’t the goal — not regressing is. Use the score as a regression signal, not a target.
Related
- Langfuse Evaluation Overview — the authoritative reference for how to configure evaluations
- Langfuse Evaluation Methods — detailed guides for LLM-as-judge, manual labelling, etc.
- Langfuse Experiments — A/B testing prompts and models
- Usage — measures cost and activity, not quality (complementary to evaluations)
- Logs — lets you investigate individual evaluation runs