Voice Input
PebbleChat supports real-time voice conversation with your organisation’s AI. You speak, PebbleChat listens, the AI produces a response that appears as a normal chat message, and a brief verbal summary is spoken back while the full deliverable sits in the chat panel for you to read.
Unlike a “voice mode” that replaces the chat, PebbleAI’s voice is inline in the existing conversation — every voice turn becomes a chat message you can scroll back to, export, and share. The voice layer is a hands-free I/O bridge, not a separate agent.
Voice features must be enabled by your organisation admin in Admin → Configuration → Voice Settings. If you don’t see the microphone button in the composer, ask your admin to enable it.
The two-aspect response pattern
When voice is active, every response PebbleChat produces has two aspects:
- `display_text` — the full written response. Reports, tables, lists, code, everything. Lands in the chat panel exactly as it would for a typed message.
- `voice_text` — a brief, conversational summary. Spoken out loud via the TTS engine. Never a verbatim read-out.
This is the single most important thing to understand about voice in PebbleAI. Ask for a 10-page report and you will see a 10-page report in the chat, but you will hear something like “I’ve produced a 10-page report for you — the summary is on screen with three key findings. The biggest risk is around timeline compression.” The report is visual; the voice is a pointer.
Two voice preferences control how much of the response is spoken:
| Preference | Target length | When to use |
|---|---|---|
| Brief (default) | 1–2 sentences, < 30 words | You’re looking at the screen; voice is a pointer to what’s there |
| Detailed | Up to ~6 sentences | You’re driving, walking, or otherwise can’t see the screen; voice fulfils for you |
Code blocks, tables, URLs, and long lists are abstracted in the spoken version even in detailed mode — nobody wants to hear `curl -X POST https://api.example.com/v1/users/...` read out character by character.
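The two aspects can be pictured as a single response object with two fields. Here is a minimal Python sketch; the class name, helper, and word budgets are all invented for illustration and are not PebbleAI's real schema:

```python
from dataclasses import dataclass

# Illustrative word caps matching the two voice preferences above.
VOICE_BUDGETS = {"brief": 30, "detailed": 120}

@dataclass
class VoiceTurnResponse:
    display_text: str  # full written deliverable, rendered in the chat panel
    voice_text: str    # short conversational summary, sent to the TTS engine

def within_budget(resp: VoiceTurnResponse, preference: str = "brief") -> bool:
    """Check the spoken aspect against the word cap for the chosen preference."""
    return len(resp.voice_text.split()) <= VOICE_BUDGETS[preference]

resp = VoiceTurnResponse(
    display_text="Q3 Risk Report\n(ten pages of findings, tables and code...)",
    voice_text="The full report is on screen; the biggest risk is timeline compression.",
)
```

Here `within_budget(resp)` holds: the spoken line is twelve words, well inside the brief cap, while the on-screen deliverable can be as long as it needs to be.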
Starting a voice conversation
The microphone button sits in the composer, to the left of the text input. Two modes are supported:
Press-to-talk
Hold the microphone button to record, release to send. Best for short, precise requests where you want full control over start and stop. The microphone glow indicates audio is being captured, and you’ll see live transcription appear in the composer as you speak.
Hands-free (Voice Activity Detection)
PebbleChat uses voice activity detection to decide when you’ve finished speaking. Tap the microphone once to start a hands-free session; the system listens, waits a short moment of silence to confirm you’re done, then sends the turn. Tap again to end the session.
Hands-free is best for longer, more natural conversations — research sessions, driving, brainstorming, or anything multi-turn.
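At its core, hands-free endpointing is just watching for sustained silence at the tail of the audio. A toy sketch, assuming per-frame RMS energies; both threshold values are invented:

```python
def turn_complete(energies, silence_threshold=0.01, hangover_frames=25):
    """Return True once the buffer ends with enough consecutive quiet frames.

    energies: per-frame RMS values (floats). A frame below silence_threshold
    counts as silence; the turn ends after hangover_frames silent frames in
    a row at the tail of the buffer. All numbers here are illustrative.
    """
    quiet = 0
    for energy in reversed(energies):
        if energy >= silence_threshold:
            break
        quiet += 1
        if quiet >= hangover_frames:
            return True
    return False

# Speech followed by a long pause ends the turn...
assert turn_complete([0.4] * 20 + [0.0] * 30)
# ...but a brief mid-sentence pause does not.
assert not turn_complete([0.4] * 20 + [0.0] * 10)
```

The hangover window is why the system "waits a short moment of silence" rather than cutting you off at the first pause.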
Full capability parity with typing
Everything you can do by typing, you can do by voice. Voice is not a degraded mode — it’s a different I/O channel for the same underlying agent:
- Tools and MCP connections — voice can trigger Microsoft 365, Jira, or any other MCP tool
- Skills — auto-discovery works identically; a skill that matches your voice request will run
- Document stores — RAG against your org’s knowledge bases works the same
- Web search — if enabled in the composer, voice requests can trigger live search
- Background processing — voice-initiated conversations can run in the background while you go and do something else
- Activity Stream — shows the research, tool use, and reasoning behind voice responses just like typed ones
The voice layer forwards your full session context (web search toggle, MCP connections, thinking profile) to the main chat agent. Nothing is lost by speaking instead of typing.
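The forwarding itself amounts to copying the session state onto the turn before it reaches the agent. A sketch with entirely hypothetical field names:

```python
def build_voice_turn(transcript: str, session: dict) -> dict:
    """Wrap a transcribed utterance with the same session context a typed
    message would carry (field names are illustrative, not the real schema)."""
    return {
        "message": transcript,
        "web_search": session.get("web_search", False),
        "mcp_connections": session.get("mcp_connections", []),
        "thinking_profile": session.get("thinking_profile", "default"),
        "source": "voice",  # the only field a typed turn wouldn't carry
    }

turn = build_voice_turn(
    "Summarise the open Jira tickets",
    {"web_search": True, "mcp_connections": ["jira"]},
)
```

Apart from the `source` marker, the agent sees exactly what it would see for a typed message — which is why capability parity falls out for free.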
Interrupting the agent
If PebbleChat is speaking and you want it to stop — because you noticed it misheard you, or the answer is going in the wrong direction — just start talking. Your speech takes priority: playback pauses almost immediately, the agent commits only what it had already said, and your new turn replaces the rest.
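Barge-in can be modelled as pausing playback and committing only the sentences that were already spoken. A toy sketch (class and method names invented):

```python
class SpokenPlayback:
    """Toy model of interruptible TTS playback."""

    def __init__(self, sentences):
        self.sentences = sentences
        self.spoken = 0        # how many sentences have been played
        self.playing = True

    def tick(self):
        """Play the next sentence, if any."""
        if self.playing and self.spoken < len(self.sentences):
            self.spoken += 1

    def barge_in(self):
        """User started talking: pause at once, commit only what was said."""
        self.playing = False
        return self.sentences[: self.spoken]

pb = SpokenPlayback(["First point.", "Second point.", "Third point."])
pb.tick()
pb.tick()
committed = pb.barge_in()  # only the two sentences already spoken survive
```

The unspoken remainder is simply dropped, so the transcript never claims the agent said something you cut off.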
The latency bridge
The first thing you’ll notice about voice today is that complex responses take a few seconds before the agent starts speaking — because the full chat pipeline (STT → LLM → tool calls → LLM → TTS) has to complete enough of its work to produce `voice_text`. To keep you from sitting in silence, PebbleChat plays a latency bridge:
- As soon as your speech is transcribed, the voice agent speaks a short filler phrase: “I’m running multi-steps and tools, please hold whilst I process this for you.”
- At the same time, the frontend plays a subtle keyboard-typing sound effect
- The filler and the sound effect continue while the main agent works
- When `voice_text` is ready, the filler stops and the real spoken response begins
This bridge is the only TTS output in PebbleAI that isn’t generated by the main chat agent. It’s classified as “plumbing, not reasoning” and is deliberately simple — the voice layer adds no intelligence, just transport.
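The bridge reduces to a simple ordering rule over audio events. A sketch in that spirit — the latency threshold, event names, and the idea that a fast response skips the filler are all assumptions, not confirmed behaviour:

```python
def bridge_sequence(agent_latency_s: float, filler_threshold_s: float = 1.0):
    """Return the audio events for one voice turn, in playback order.

    If the main pipeline produces voice_text quickly, no bridge is needed;
    otherwise the filler phrase and typing sound cover the wait. The
    threshold value is illustrative.
    """
    events = []
    if agent_latency_s > filler_threshold_s:
        events.append("filler_phrase")  # "I'm running multi-steps and tools..."
        events.append("typing_sfx")     # subtle keyboard sound in the frontend
    events.append("voice_text")         # the real spoken response
    return events
```

For a tool-heavy request (`bridge_sequence(5.0)`) the listener hears the filler and typing sound before the answer; a quick reply plays `voice_text` alone.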
Sovereignty and provider choice
Voice has strong sovereignty requirements because speech is sensitive data. PebbleAI’s voice pipeline is designed so admins control exactly which speech-to-text and text-to-speech providers run — including sovereign options hosted in your own region.
Admins pick the STT model, TTS model, and TTS voice in Admin → Configuration → Voice Settings. The platform catalogue includes both global providers (OpenAI) and region-hosted sovereign options (Amazon Polly Neural in ap-southeast-2 for Australian installs). Unified speech-to-speech models are on the roadmap and will replace the current cascaded pipeline once available in your region — reducing latency end-to-end.
Admin configuration
Organisation admins configure voice in Admin → Configuration → Voice Settings:
- STT model and credential — pick from the platform catalogue.
- TTS model and credential — pick from the catalogue.
- TTS voice — dropdown of voices for the selected TTS model (e.g. Olivia, Matthew, Joanna for Polly; alloy, echo, fable, onyx, nova, shimmer for OpenAI).
Every voice call is routed through PebbleRouter alongside your chat model calls, with the same auditability and usage tracking.
Troubleshooting
“The mic button does nothing”
- Check that your browser has microphone permission for your PebbleAI install (`demo.pebblecloud.io` or equivalent)
- Confirm voice is enabled for your organisation at Admin → Configuration → Voice Settings — the admin must have picked both STT and TTS models
- Reload the page — the LiveKit session is established at page load and a stale connection can block audio
- Check your OS sound settings — PebbleAI can’t force audio through a muted system
“Transcription is inaccurate”
- Speak at a steady pace — streaming STT handles natural speech well but falls over on rapid-fire delivery
- Reduce background noise — VAD can struggle in loud environments
- Ask your admin which STT model is configured — different models are tuned for different accents and domains
“Playback is 1.5× too fast”
- This was a sample-rate mismatch bug with Amazon Polly that has since been resolved — make sure your PebbleAI install is up to date
- If it persists, report via the Submit Feedback modal in the help menu
“There’s a long silence before the agent speaks”
- The filler bridge should kick in shortly after your speech ends — if it doesn’t, transcription may be taking too long
- Complex questions with many tool calls genuinely take several seconds before the agent can respond
Tips
- Let voice point, not read. The brief mode is tuned so voice directs your attention to what’s on screen. Trust it — don’t ask for verbose voice unless you genuinely can’t see the screen.
- Use hands-free for long research. Press-to-talk for short, precise requests.
- Watch the transcription. It appears in the composer as you speak — if you see it getting something wrong, pause and say it differently.
- Combine with background chat. Speak a long research task, enable background processing, walk away. Come back when the notification fires.
- Report STT errors. If the system consistently mishears a word that matters to your work (a product name, a technical term), submit feedback — admins can tune the STT profile or switch to a model that handles your domain better.
Related
- Chat Settings — the per-chat gear icon where you configure background processing (works in voice too)
- Advanced Features → Activity Stream — see exactly what a voice-initiated chat actually did
- Advanced Features → Background Chat — let voice-initiated research finish while you do something else
- Admin → Configuration → Voice Settings — where admins configure STT and TTS