PebbleObserve - AI Usage
PebbleObserve is how your organisation answers the questions that matter once AI is in production: What is this costing us? Is it working? What are people asking? Where is it failing?
Every PebbleChat conversation, every PebbleFlows agent, and every API call is automatically traced — no code changes, no instrumentation, nothing for users to turn on. PebbleObserve gives you the tools to look inside those traces and turn them into decisions.
What PebbleObserve Lets You Do
- See every AI interaction — Full traces for PebbleChat sessions, PebbleFlows agent runs, and tool calls, including which models were used, how long each step took, what was said, and what was returned.
- Track cost and usage — Dollar-accurate cost attribution per user, per workspace, per model, per capability, and per time window (see the sketch after this list).
- Measure quality — Run evaluations (LLM-as-a-judge, manual labelling, or custom scorers) against production traces or curated datasets.
- Manage prompts as a team — Version, roll out, and A/B test prompts centrally so improvements don’t require a code deploy.
- Catch issues early — Alerts on cost spikes, error rates, latency regressions, or failed evaluations.
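If you want those cost and usage numbers programmatically rather than in the dashboard, the API & Data Platform page covers the details. As a rough sketch, assuming PebbleObserve exposes the Langfuse-compatible daily-metrics endpoint, a per-user cost query could look like this; the host, keys, user ID, and dates are placeholders:

```python
# Sketch: pull daily cost for one user over a time window, assuming
# PebbleObserve exposes the Langfuse-compatible /api/public/metrics/daily
# endpoint. Host, keys, and user ID below are placeholders, not real values.
import requests

PEBBLEOBSERVE_HOST = "https://observe.example.com"   # placeholder host
PUBLIC_KEY = "pk-..."                                 # placeholder key
SECRET_KEY = "sk-..."                                 # placeholder key

resp = requests.get(
    f"{PEBBLEOBSERVE_HOST}/api/public/metrics/daily",
    auth=(PUBLIC_KEY, SECRET_KEY),                    # basic auth with the key pair
    params={
        "userId": "jane.doe@example.com",             # attribute spend to one user
        "fromTimestamp": "2025-01-01T00:00:00Z",
        "toTimestamp": "2025-02-01T00:00:00Z",
    },
    timeout=30,
)
resp.raise_for_status()

# Each entry is one day: trace count plus total cost across all models.
# Field names assume the Langfuse daily-metrics response shape.
for day in resp.json()["data"]:
    print(day["date"], day["countTraces"], day["totalCost"])
```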
The Three Pillars of PebbleObserve
1. Observability
Every call to every model is captured as a trace — a tree of nested operations showing exactly what happened. Traces group into sessions so you can follow a single user’s PebbleChat conversation end-to-end, across tool calls and agent hops. See Observability for the full feature list.
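To work with traces and sessions outside the UI, you can query them through the API & Data Platform. The sketch below assumes the Langfuse v2 Python SDK can be pointed at PebbleObserve; every identifier in it is a placeholder:

```python
# Sketch: list every trace recorded for one PebbleChat session, assuming the
# Langfuse v2 Python SDK works against PebbleObserve. All identifiers are
# placeholders.
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-...",                      # placeholder key
    secret_key="sk-...",                      # placeholder key
    host="https://observe.example.com",       # placeholder PebbleObserve host
)

# A session groups all traces from a single PebbleChat conversation.
traces = langfuse.fetch_traces(session_id="chat-session-123", limit=50).data

for trace in traces:
    # id and name are stable fields; richer attributes vary by SDK version.
    print(trace.id, trace.name)
```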
2. Prompt Management
Prompts live in a versioned registry. You can edit and roll out a new version without redeploying anything — PebbleChat and PebbleFlows pick up the new prompt on the next request. You can A/B test versions, test them in the LLM Playground, and tie prompt versions back to the traces they produced. See Prompt Management.
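At runtime a caller resolves the current prompt version from the registry instead of hard-coding it. The sketch below assumes the Langfuse-style prompt API; the prompt name, label, and template variables are invented for illustration:

```python
# Sketch: resolve a prompt from the registry at request time, assuming the
# Langfuse-style prompt API. Prompt name, label, and variables are made up
# for illustration.
from langfuse import Langfuse

langfuse = Langfuse()  # reads keys/host from the standard Langfuse environment variables

# Fetch whichever version currently carries the "production" label.
prompt = langfuse.get_prompt("support-triage", label="production")

# Fill in the prompt's template variables for this request.
compiled = prompt.compile(
    product_name="PebbleChat",
    user_question="How do I reset my key?",
)

print(prompt.version)   # the registry version that was served
print(compiled)         # the final text sent to the model
```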
3. Evaluation
Measuring LLM output quality is fundamentally different to measuring traditional software. PebbleObserve supports LLM-as-a-judge scoring, user feedback aggregation, manual labelling, custom scorers, curated evaluation datasets, and experiments that compare prompt versions or model choices on the same inputs. See Evaluation.
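A custom scorer, whether an LLM judge or a simple heuristic, ultimately attaches a score to a trace. The sketch below assumes the Langfuse v2 score API and uses a toy heuristic in place of a real judge; the trace ID is a placeholder:

```python
# Sketch: attach a custom quality score to a production trace, assuming the
# Langfuse v2 score API. The trace ID and the scoring heuristic are toy
# examples standing in for an LLM-as-a-judge call.
from langfuse import Langfuse

langfuse = Langfuse()  # keys/host from environment variables

def conciseness_score(answer: str) -> float:
    """Toy scorer: reward answers under 800 characters."""
    return 1.0 if len(answer) <= 800 else 0.0

answer_text = "...the model output pulled from the trace..."

langfuse.score(
    trace_id="trace-abc-123",             # placeholder trace ID
    name="conciseness",                   # score name shown in PebbleObserve
    value=conciseness_score(answer_text),
    comment="Heuristic scorer, not an LLM judge",
)
```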
How PebbleObserve Works With PebbleChat and PebbleFlows
| Surface | What you see in PebbleObserve |
|---|---|
| PebbleChat conversation | A session containing the user prompt, model response, reasoning, tool calls, asset-discovery hits, token/cost figures and any inline feedback the user left |
| PebbleFlows agent run | A trace with one span per node in the flow — model calls, tool calls, document-store queries, branching decisions, memory reads/writes — each with timing and cost |
| Tool / MCP call | A nested span on the parent trace showing the tool name, request payload, response, and latency |
| Evaluation run | An experiment trace comparing prompt or model variants against the same inputs, with per-case scores |
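To inspect that span structure programmatically, you can fetch a single trace and walk its observations. This sketch again assumes the Langfuse-compatible public API; the host, keys, trace ID, and response field names are assumptions:

```python
# Sketch: fetch one PebbleFlows agent-run trace and list its spans, assuming
# PebbleObserve serves the Langfuse-compatible /api/public/traces/{id}
# endpoint. Host, keys, and trace ID are placeholders.
import requests

PEBBLEOBSERVE_HOST = "https://observe.example.com"   # placeholder host

resp = requests.get(
    f"{PEBBLEOBSERVE_HOST}/api/public/traces/trace-abc-123",
    auth=("pk-...", "sk-..."),                        # placeholder key pair
    timeout=30,
)
resp.raise_for_status()

# Each observation is one node in the flow: a model call, tool call,
# document-store query, and so on. Field names assume the Langfuse response.
for obs in resp.json().get("observations", []):
    print(obs.get("type"), obs.get("name"))
```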
Role-Based Access
- Users see their own usage under Collaboration & Profile → Profile menu → Usage (personal spend and activity).
- Organisation admins see organisation-wide usage, logs, and evaluations under Organisation Admin → PebbleObserve.
- Platform admins see cross-organisation aggregated data.
Where to Go Next
- Langfuse Overview — An introduction to the upstream Langfuse project (content synchronised from the open-source project)
- Observability → Get Started — Start tracing your first application
- Prompt Management → Get Started — Take control of your prompts
- Evaluation → Overview — Set up your first evaluation
- Metrics — The built-in metrics and how they’re calculated
- API & Data Platform — Query traces, scores, and metrics programmatically
- Administration — RBAC, projects, organisations, and data retention