Evaluations

The Evaluations page under Admin → Organisation → PebbleObserve is where you set up systematic quality measurement for PebbleAI responses — moving beyond “the response feels okay” to “the response scored 4.2/5 on the LLM-as-judge eval”.

Find it at Admin → Organisation → PebbleObserve → Evaluations.

Why evaluations matter

PebbleChat answers thousands of questions a day. Without evaluations, you’re flying blind:

  • You don’t know whether response quality is improving or regressing over time
  • You can’t safely change models, prompts, or routing without risking quality
  • You can’t prove to stakeholders that AI-generated content meets their bar
  • You can’t catch the slow drift where a model gradually starts producing worse answers
  • You can’t tell whether one user’s complaint is a one-off or the tip of a systemic issue

Evaluations turn vibes into measurable scores you can track, alert on, and improve.

What evaluations PebbleAI supports

PebbleAI’s evaluation system is powered by Langfuse and supports several evaluation methods:

| Method | What it does | Best for |
| --- | --- | --- |
| LLM-as-judge | A separate AI model scores each response against criteria you define | Subjective qualities like helpfulness, tone, factual accuracy when ground truth isn’t available |
| User feedback | Aggregates explicit thumbs/ratings from users | Real-world satisfaction; honest signal but sparse |
| Manual labelling | Human reviewers score sample responses | Highest accuracy; lowest scale |
| Custom scorers | Code-based functions you write to score responses | Objective metrics — length, tool usage, format compliance, JSON validity |
| Reference comparison | Compare against a known-correct answer in a dataset | Regression testing — “did changing the prompt break anything?” |
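As a concrete illustration of a custom scorer, here is a minimal code-based check of the kind the table describes: length, format, and JSON validity. The function name, checks, and thresholds are hypothetical, not part of PebbleAI or Langfuse.

```python
import json

def score_format_compliance(response_text: str, max_chars: int = 1200) -> float:
    """Hypothetical custom scorer: returns a 0.0-1.0 score from objective checks."""
    checks = []
    checks.append(len(response_text) <= max_chars)  # stays within a length budget
    checks.append(not response_text.lstrip().startswith("I'm sorry"))  # no reflexive apology
    # If the response claims to contain JSON, that JSON must actually parse
    if "```json" in response_text:
        try:
            payload = response_text.split("```json", 1)[1].split("```", 1)[0]
            json.loads(payload)
            checks.append(True)
        except (ValueError, IndexError):
            checks.append(False)
    return sum(checks) / len(checks)
```

Because the checks are objective, this kind of scorer is cheap to run on every response, unlike an LLM-as-judge call.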

Where to actually configure

The PebbleObserve admin Evaluations page is the entry point — it lists running and completed evaluations and lets you trigger new ones. The full configuration interface (defining scorers, datasets, schedules, dashboards) lives in the Langfuse-powered evaluation product that ships with PebbleObserve.

The Langfuse documentation is the authoritative reference for configuring each evaluation method. This admin page is the organisational view over those evaluations once they’re running.

Common evaluation patterns

Production trace evaluation

Run an evaluation continuously against live PebbleChat traffic. Every nth response gets scored automatically; results feed into a dashboard you can monitor.

Use cases:

  • Tone monitoring — score every customer-facing response for politeness
  • Factuality monitoring — score responses against a corpus of known-correct facts
  • Format compliance — score whether responses follow the rules in your ambient context
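The “every nth response” sampling can be done deterministically by hashing the trace id, so a given trace is consistently in or out of the sample no matter which worker sees it. This is a sketch of the idea; the function name and rate are illustrative, not a PebbleObserve API or default.

```python
import hashlib

SAMPLE_RATE = 0.01  # score roughly 1 in 100 responses (illustrative, not a product default)

def should_evaluate(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the trace id into one of 10,000 buckets
    and evaluate only the traces that land below the rate cutoff."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Deterministic sampling also means a re-run of the evaluation scores the same subset, which keeps trend lines comparable across runs.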

Dataset evaluation

Curate a dataset of representative inputs (questions, prompts, scenarios) and run the entire dataset against PebbleChat on a schedule (or on every prompt change).

Use cases:

  • Regression testing — same dataset, run weekly, see if scores change
  • Prompt version comparison — A/B test two ambient context drafts on the same dataset
  • Model comparison — score how different models handle your representative inputs
  • Pre-launch validation — before rolling out a new flow or agent, run the dataset against it
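A dataset run can be sketched as a small harness where `generate` (the system under test) and `score` (the scorer) are injected callables, so the same dataset can exercise different prompts or models. All names here are hypothetical, not PebbleAI APIs.

```python
def run_dataset(dataset: list[dict], generate, score) -> list[float]:
    """Run each dataset item through the system under test and score the output.

    dataset  - items shaped like {"input": ..., "expected": ...} (illustrative shape)
    generate - callable that produces a response for an input
    score    - callable that scores a response, optionally against the expected answer
    """
    results = []
    for item in dataset:
        response = generate(item["input"])
        results.append(score(response, item.get("expected")))
    return results
```

Keeping the dataset fixed while swapping `generate` is exactly the prompt-version and model-comparison pattern from the list above.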

User-feedback aggregation

Collect explicit user feedback from PebbleChat and aggregate it into evaluation scores.

Note: PebbleChat does not currently have per-message thumbs up/down. The Submit Feedback modal in the help menu is the current feedback channel; it lands in the engineering team’s Jira queue rather than aggregating into evaluation scores. Aggregated user feedback is typically captured via custom flows or integration with the Langfuse user feedback API.
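If you do wire up a custom feedback channel, folding raw thumbs events into an evaluation-style score is simple. The event shape below is a hypothetical illustration, not the Langfuse user feedback API.

```python
def aggregate_feedback(events: list[dict]) -> dict:
    """Fold raw thumbs events (hypothetical shape: {"rating": "up" | "down"})
    into a single evaluation-style score in the range [0, 1]."""
    ups = sum(1 for e in events if e.get("rating") == "up")
    downs = sum(1 for e in events if e.get("rating") == "down")
    total = ups + downs
    return {
        "n": total,
        "score": ups / total if total else None,  # None means no signal yet
    }
```

Reporting `n` alongside the score matters because feedback is sparse: a 1.0 from three events is a much weaker signal than a 0.9 from three hundred.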

Step-by-step: setting up your first evaluation

This is a high-level outline; the detailed configuration is in the Langfuse Evaluation docs.

  1. Decide what to measure — start with one specific quality (politeness, factual accuracy, format compliance)
  2. Pick a method — LLM-as-judge is usually the easiest first choice
  3. Define the scorer — write a prompt that tells the judge model how to score responses (e.g. “Rate this response for politeness on a 1-5 scale, with 1 being rude and 5 being warmly professional. Reply with just the number.”)
  4. Pick a sample size — 100 responses per day is a good starting point; you can increase from there
  5. Run the evaluation — point it at your production traces or a curated dataset
  6. Watch the dashboard — let it run for a week and look at the distribution
  7. Iterate — adjust the scorer prompt if scores cluster too much or seem inconsistent

Step-by-step: using evaluations to validate a prompt change

The most powerful use case for evaluations:

  1. Curate a small dataset (50-200 inputs) representative of what your users ask
  2. Run the dataset through PebbleChat with the current ambient context — note the average score
  3. Change the ambient context in Configuration → Ambient Context
  4. Re-run the dataset with the new context
  5. Compare — if the score went up, ship the change; if it went down, revert and try again
  6. Repeat for every meaningful prompt change

This converts “I think the new prompt is better” into “the new prompt scored 4.1 vs 3.8 on our standard evaluation set”.
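Step 5’s ship-or-revert decision can be expressed as a tiny comparison over the two runs’ score lists. The threshold here is illustrative, not a product default, and a real comparison on a 50-item dataset should also consider variance, not just the mean.

```python
from statistics import mean

def verdict(old_scores: list[float], new_scores: list[float],
            min_delta: float = 0.05) -> str:
    """Ship if the mean score improved by more than min_delta,
    revert if it dropped by more than min_delta, otherwise call it a wash."""
    delta = mean(new_scores) - mean(old_scores)
    if delta > min_delta:
        return "ship"
    if delta < -min_delta:
        return "revert"
    return "no clear difference"
```

For example, a 3.8 baseline against a 4.1 candidate returns "ship", matching the framing in the paragraph above.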

What this admin page shows

The Evaluations page typically shows:

  • A list of running evaluations with their last-run timestamp and aggregate score
  • Recent evaluation runs with click-through to detail
  • A summary dashboard showing the score trend over time

For the deep configuration interface, click through to the underlying Langfuse evaluation product — the link from the page header takes you there.

Tips

  • Start small. One evaluation, one quality, one dataset. Don’t try to score everything at once.
  • Make scorer prompts specific. Vague scoring criteria lead to noisy scores. “Rate factual accuracy 1-5” is too vague; “Rate factual accuracy 1-5 considering whether named entities, dates, and numerical claims are correct” is much better.
  • Use a stronger judge model than the model being judged. Use Claude Opus to judge Haiku, not the other way around.
  • Re-evaluate when you change anything significant — a new prompt, a new model, a new ambient context, a new flow. Evaluations are how you catch regressions you wouldn’t otherwise see.
  • Don’t chase the score. A perfect score isn’t the goal — not regressing is. Use the score as a regression signal, not a target.