Case Study · Prosody Intelligence

Give LLMs Ears

A bidirectional acoustic intelligence layer that reads what speech contains that transcripts delete — and uses it to make AI-generated voice actually sound human.

Domain: Audio · NLP · TTS
Stack: Whisper · Parselmouth · ElevenLabs · Flask
Status: Live UI · Full pipeline operational
Built: February 2026

Transcripts lie by omission

"I'm fine." can mean genuine calm, quiet fury, exhausted resignation, or bitter sarcasm. The words are identical. The meaning is not. Every AI system that reads text alone is working with the same amputated two-word input.

When you transcribe speech, you get words and timestamps. What you lose is everything that makes human communication actually work: the 200ms pause before a difficult answer, the pitch drop at the end of a sentence that signals finality rather than a question, the energy drop that marks genuine resignation versus performed calm.

LLMs are extraordinarily good at understanding context — but only the context they can see. Feed them plain text and they are reasoning with one channel of a multi-channel signal. They are not reading the room. They are reading a summary of the room written by someone who deleted the subtext.

On the generation side, the problem is symmetric. Text-to-speech systems receive instructions to say words. They have no mechanism to understand that a line tagged "sarcastic" should be delivered slower, with low stability and deliberate restraint — versus "excited," which wants speed and pitch variance. The result is AI voice that consistently sounds like AI: technically accurate, emotionally flat.


Two pipelines. One architecture.

Prosody Intelligence is built as two bidirectional pipelines that share a common insight: acoustic features and semantic content are separate channels that should be processed separately and then fused.

Forward Pipeline — Audio → Understanding
Audio Input → [Whisper (L1A): word timestamps] + [Parselmouth/Praat (L1B): pitch · energy · rate] → Alignment (L2): annotated transcript → LLM Analysis (L3): prosody-informed insight

Reverse Pipeline — Text → Expressive Audio
Text Input → LLM Emotion Detection: per-line tagging → Parameter Mapping: stability · style · speed → ElevenLabs TTS: multi-voice routing → Expressive Audio

What the audio actually said

The forward pipeline runs two extraction processes in parallel on every audio input, then aligns them at the segment level.

Whisper Transcription

OpenAI Whisper at word granularity, capturing precise timestamps for every word in the recording. This is the semantic layer — what was said.
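The verbose JSON Whisper returns already carries the timing the alignment step needs. A minimal sketch of flattening it into word-level tuples; the response dict here is a hand-built sample in the verbose_json shape, not a live API call:

```python
def words_from_verbose_json(resp):
    """Flatten a Whisper verbose_json response into (word, start, end) tuples."""
    return [(w["word"], w["start"], w["end"]) for w in resp.get("words", [])]

# Hand-built sample in the verbose_json shape, not a live API response.
sample = {
    "text": "I'm fine.",
    "words": [
        {"word": "I'm",   "start": 0.00, "end": 0.22},
        {"word": "fine.", "start": 0.42, "end": 0.80},
    ],
}

words = words_from_verbose_json(sample)
# Note the 0.20s gap between "I'm" and "fine." in the timestamps:
# exactly the kind of signal the prosody layer later interprets.
```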

Parselmouth Prosody Extraction

Praat's autocorrelation pitch tracking, intensity extraction, and syllable-rate estimation at 10ms resolution. This is the paralinguistic layer — how it was said. Praat is used over alternatives like librosa specifically because its pitch tracking is designed for human speech.
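The autocorrelation principle behind Praat's tracker can be illustrated in a few lines. This toy estimator is not Praat's algorithm (no windowing, no interpolation, no octave-error handling); it only shows the core idea of picking the lag at which the signal best matches a delayed copy of itself:

```python
import math

def estimate_f0(samples, sr, fmin=75.0, fmax=600.0):
    """Toy autocorrelation pitch estimate over the 75-600 Hz search range."""
    lag_min = max(1, int(sr / fmax))                # shortest candidate period
    lag_max = min(int(sr / fmin), len(samples) - 1)  # longest candidate period
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max):
        # Correlate the signal with itself shifted by `lag` samples
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_score, best_lag = score, lag
    return sr / best_lag if best_lag else None  # None for silence / no periodicity

# 50 ms of a 440 Hz tone at 16 kHz
sr = 16000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(800)]
f0 = estimate_f0(tone, sr)
```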

Temporal Alignment

Each transcript segment is annotated with its measured prosodic features: average F0 pitch, pitch direction (rising/flat/falling), energy normalized to 0–1, speaking rate, and pause duration before the segment. Silence gaps over 1 second are flagged as structurally meaningful.
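A sketch of per-segment annotation under these rules. The field names, the 10 Hz "flat" band, and the first-third vs last-third comparison for pitch direction are illustrative assumptions, not the project's exact implementation:

```python
def annotate_segment(f0_track, energy, t_start, prev_end, flat_band=10.0):
    """Derive per-segment prosodic labels from an F0 track (Hz per frame,
    0 = unvoiced), a pre-normalized energy value, and segment boundaries."""
    voiced = [f for f in f0_track if f > 0]
    avg_f0 = sum(voiced) / len(voiced) if voiced else 0.0
    # Pitch direction: compare mean F0 of the first vs last third of the segment
    third = max(1, len(voiced) // 3)
    delta = (sum(voiced[-third:]) - sum(voiced[:third])) / third if voiced else 0.0
    direction = "rising" if delta > flat_band else "falling" if delta < -flat_band else "flat"
    pause = t_start - prev_end
    return {
        "avg_f0": round(avg_f0, 1),
        "direction": direction,
        "energy": energy,              # assumed already normalized to 0-1 upstream
        "pause_before": round(pause, 2),
        "long_pause": pause > 1.0,     # gaps over 1 s flagged as structurally meaningful
    }

seg = annotate_segment([200, 205, 210, 215, 220, 225], 0.8, t_start=3.5, prev_end=2.2)
```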

LLM Analysis

The annotated transcript — words plus prosodic metadata per segment — is passed to an LLM for analysis. The LLM now has access to the acoustic context that was deleted in plain transcription, enabling interpretation rather than mere transcription.

The output is a structured JSON containing the full annotated transcript, a prosody visualization (3-panel: waveform, pitch contour, energy/rate), and optionally a speaker-diarized read based on adaptive pitch threshold detection.
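The adaptive two-speaker split reduces to a median threshold over voiced-segment F0. A sketch of the idea; unvoiced segments are labeled "?" here as an assumed convention:

```python
def diarize_by_pitch(segment_f0s):
    """Two-speaker split: threshold at the median F0 of voiced segments.
    Segments with F0 above the median go to speaker B, the rest to A."""
    voiced = sorted(f for f in segment_f0s if f and f > 0)
    if not voiced:
        return ["?"] * len(segment_f0s)
    mid = len(voiced) // 2
    median = voiced[mid] if len(voiced) % 2 else (voiced[mid - 1] + voiced[mid]) / 2
    return ["?" if not f else ("B" if f > median else "A") for f in segment_f0s]
```

For example, two low-pitched and two high-pitched segments split cleanly into A and B around their shared median.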


The Proof Test

The system includes a built-in A/B validation method: the same audio processed twice — once with text only, once with full prosodic annotation — producing two analyses from the same LLM for direct comparison.

Pass A · Text Only

The LLM receives only the transcript. It reasons from words and their sequence — the same input any standard speech-to-text pipeline delivers. Interpretations are plausible but constrained to semantic content.

Pass B · Text + Prosody

The LLM receives the annotated transcript with pitch, energy, rate, pause data, and direction per segment. Interpretations access subtext: hesitation before an answer, emphasis that contradicts stated meaning, energy collapse marking genuine distress versus performed calm.

The delta between the two passes is the measurable value of acoustic context. It is available directly in the web UI under the "Proof Test" tab — upload audio, get both analyses side by side.
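Constructing the two passes is mostly prompt assembly: the same segments, serialized once without and once with their acoustic metadata. A sketch with assumed segment field names and annotation format:

```python
def build_proof_prompts(segments):
    """Return (pass_a, pass_b): text-only vs prosody-annotated renderings
    of the same transcript, ready to send to the same LLM."""
    pass_a = " ".join(s["text"] for s in segments)
    pass_b = "\n".join(
        f'{s["text"]}  '
        f'[f0={s["f0"]}Hz, {s["direction"]}, energy={s["energy"]:.2f}, pause={s["pause"]:.2f}s]'
        for s in segments
    )
    return pass_a, pass_b

segments = [
    {"text": "I'm fine.", "f0": 142, "direction": "falling", "energy": 0.21, "pause": 1.40},
]
pass_a, pass_b = build_proof_prompts(segments)
# Pass A sees only the words; Pass B also sees the long pause and energy collapse.
```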


Making AI voice actually sound human

The reverse pipeline inverts the problem. Given text — a script, a document, a council session transcript — it produces expressive, emotionally accurate audio by treating prosody as a target rather than a source.

The core insight is that emotion maps to measurable acoustic parameters. ElevenLabs exposes three primary levers: stability (how much the voice varies), style (expressiveness intensity), and speed. Most systems set these once for a voice and leave them fixed. Prosody Intelligence sets them per line, per emotion, informed by how human speech actually behaves acoustically.

Emotion      Stability  Style  Speed  Rationale
sarcastic    0.25       0.90   0.90   Low stability lets contempt land; deliberate, dry
dramatic     0.30       1.00   0.75   Slow + max style = theatrical weight
contempt     0.70       0.85   0.80   Cold superiority — stable, slow, dripping
urgent       0.40       0.80   1.25   Speed drives the urgency; pressured, clipped
resigned     0.80       0.15   0.70   Near-monotone, slowest — flat, drained, defeated
analytical   0.75       0.20   0.95   Clinical, precise — high stability, minimal style
tender       0.70       0.35   0.75   Warm, hushed — stable but no broadcast energy
comedic      0.15       1.00   1.05   Maximum expressiveness — timing is everything

14 emotions are mapped in total, spanning the full range from collapsed resignation to unhinged comedic energy. The mapping is grounded in what each acoustic profile actually requires — not intuition about what sounds emotional.
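In code, the mapping collapses to a lookup. The triples below are taken from the table; the neutral fallback and the exact settings-dict shape are illustrative assumptions:

```python
# (stability, style, speed) per emotion, values from the mapping table
EMOTION_PARAMS = {
    "sarcastic":  (0.25, 0.90, 0.90),
    "dramatic":   (0.30, 1.00, 0.75),
    "contempt":   (0.70, 0.85, 0.80),
    "urgent":     (0.40, 0.80, 1.25),
    "resigned":   (0.80, 0.15, 0.70),
    "analytical": (0.75, 0.20, 0.95),
    "tender":     (0.70, 0.35, 0.75),
    "comedic":    (0.15, 1.00, 1.05),
}

def tts_settings(emotion, default=(0.50, 0.50, 1.00)):
    """Per-line voice settings; unknown tags fall back to a neutral
    default (an assumption, not the project's documented behavior)."""
    stability, style, speed = EMOTION_PARAMS.get(emotion, default)
    return {"stability": stability, "style": style, "speed": speed,
            "similarity_boost": 0.75, "use_speaker_boost": True}
```

A line tagged "urgent" is then synthesized at 1.25x speed with moderate stability, while "resigned" drops to 0.70x and near-maximum stability.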


The Session Director

The reverse pipeline extends to full multi-speaker productions via the Session Director — a pipeline that ingests a multi-agent transcript, parses it into a structured shooting script, routes each line to the appropriate voice, and composites the final audio.

Each AI model in the RRI Swarm has an assigned ElevenLabs voice selected to match its character in the ecosystem:

CLAUDE → George (Warm Storyteller): dry snobbery, librarian energy
GROK → Callum (Husky Trickster): unhinged confidence, chaos energy
GEMINI → Adam (Dominant, Firm): theatrical gravitas, dramatic weight
GPT → Brian (Deep, Resonant): muffled desperation from the Penalty Box
KYRA → Kyra (Cloned): the raccoon pulling the strings
NARRATOR → The Narrator: fourth wall, stage directions

An LLM Script Supervisor parses the raw transcript, separating dialogue from stage directions, assigning speakers and emotions, then routing each line through the emotion parameter map to the appropriate voice. Segments are stitched with configurable crossfade (default 100ms) to eliminate the choppy inter-segment artifacts that plague naive TTS concatenation.
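The crossfade that smooths segment joins is conceptually simple. A pure-Python sketch of the linear version operating on raw sample lists; pydub works on AudioSegment objects and handles gain in dB, so this is the idea rather than its actual implementation:

```python
def crossfade_concat(a, b, overlap):
    """Concatenate two sample lists, linearly fading `a` out and `b` in
    over the final `overlap` samples of the join."""
    overlap = min(overlap, len(a), len(b))
    head, tail = a[:len(a) - overlap], b[overlap:]
    mixed = [a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
             for i in range(overlap)]
    return head + mixed + tail

# Fade a constant-amplitude segment into silence over a 10-sample overlap
out = crossfade_concat([1.0] * 100, [0.0] * 100, 10)
```

The output is shorter than the two inputs combined by exactly the overlap length, which is why the 100ms default is made adaptive for very short segments.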


Stack and specifications

Forward Pipeline

Transcription: OpenAI Whisper-1, verbose JSON, word + segment timestamps
Prosody extraction: Parselmouth (Python wrapper for Praat) — pitch autocorrelation F0 75–600Hz, intensity extraction, 10ms time-step resolution
Features extracted: Avg F0, pitch variance, pitch direction (rising/flat/falling), energy (0–1 normalized), speaking rate (syl/sec estimate), pause duration
Speaker diarization: Adaptive pitch threshold (median of voiced segments), two-speaker split
Visualization: 3-panel matplotlib (waveform + transcript overlay, pitch contour with direction arrows, energy/rate bars)
LLM layer: GPT-4o with full prosodic annotations in context

Reverse Pipeline

Emotion detection: GPT-4o, per-line tagging from 14-emotion vocabulary, context-aware (preceding/following lines considered)
TTS engine: ElevenLabs eleven_multilingual_v2, per-segment voice parameter injection
Parameters mapped: stability, style, speed (similarity_boost 0.75, speaker boost enabled across all voices)
Stitching: pydub crossfade, default 100ms, adaptive to segment length
Multi-voice routing: Speaker tag detection → voice ID lookup → tag-stripped TTS text
Output: Individual segments (MP3) + combined full audio (192k MP3)
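The multi-voice routing step (speaker tag detection → voice lookup → tag-stripped TTS text) can be sketched as follows. The `SPEAKER [emotion]: text` line format and the display-name voice mapping are illustrative assumptions; real ElevenLabs voice IDs would replace the names:

```python
import re

# Voice assignments from the roster above; display names stand in for real voice IDs.
VOICES = {"CLAUDE": "George", "GROK": "Callum", "GEMINI": "Adam",
          "GPT": "Brian", "KYRA": "Kyra", "NARRATOR": "The Narrator"}

LINE_RE = re.compile(r"^(?P<speaker>[A-Z]+)\s*\[(?P<emotion>\w+)\]:\s*(?P<text>.+)$")

def route_line(line):
    """Parse one script line into (voice, emotion, tag-stripped TTS text)."""
    m = LINE_RE.match(line.strip())
    if not m:  # untagged lines (stage directions) default to the Narrator
        return VOICES["NARRATOR"], "neutral", line.strip()
    return VOICES.get(m["speaker"], VOICES["NARRATOR"]), m["emotion"], m["text"]
```

Only the stripped text reaches the TTS engine; the speaker and emotion tags become the voice ID and parameter lookup.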

Interface

Web UI: Flask + Flask-CORS, localhost:5050
Tabs: Analyze Audio · Proof Test · Reverse Pipeline
Input formats: .m4a .mp3 .wav .ogg .webm (auto-converted to WAV via ffmpeg for Praat)
Session Director: CLI + programmatic; ingests .docx or .txt transcripts, outputs shooting script JSON + full audio

What this changes

The standard pipeline for any AI system that processes speech is: audio → transcription → text → LLM. The prosodic layer is discarded as a side effect of that first step, and no current production system recovers it.

Prosody Intelligence demonstrates that the acoustic layer is recoverable, alignable with semantic content at the segment level, and actionable — both for improving AI comprehension of human speech and for improving the expressiveness of AI-generated speech.

The same architecture that gives LLMs ears on the input side teaches them to speak on the output side. Forward and reverse pipelines are two expressions of the same principle: acoustic features and semantic content are separate channels that both carry meaning.

Application domains where acoustic context changes the interpretation include clinical interviews (where vocal markers precede verbal disclosure of distress), legal depositions (where hesitation and energy patterns indicate evasion), call center analysis (where sentiment in voice diverges from sentiment in transcript), and any AI voice product where flat, robotic delivery undermines trust.

The system is built, documented, running, and available for demonstration.

← Back to Research