Voice Focus 2.0 + Native LiveKit Integration: Production-ready voice agents are here

Written by

Fabian Seipel

CEO and Co-founder

Case Study

Mar 11, 2026

Millions of voice interactions happen between humans and machines every day, but the gap between what's said and what's heard is wider than most teams realize. As agent deployment accelerates, one problem keeps surfacing: real-world audio is messy and most voice AI pipelines aren't built for it.

Today we're launching two updates that directly address this.

Quail Voice Focus 2.0: the latest iteration of our speaker isolation model, delivering cleaner, more reliable speech separation in real-world conditions.
A native LiveKit integration: bringing the full ai-coustics stack into one of the most widely used real-time infrastructure layers for voice agents in production.

Together, they move us closer to a world where audio intelligence is built into every voice agent stack by default.

The problem starts before the LLM

Here's something counterintuitive: voice agents don't usually fail because they can't hear; they fail because they hear too much. They get deployed into homes, offices and clinical environments, all with competing audio. Interfering speech is one of the most common failure modes: it triggers voice activity detection (VAD), bleeds into the transcript and injects the wrong context into the language model.

The instinctive fix is to use noise cancellation tools, but perceptual enhancers introduce two fundamental problems. First, they enhance all voices in the signal instead of isolating the primary speaker. Second, they are optimized for how audio sounds to human listeners rather than how speech-to-text systems interpret it. By stripping phonetic detail, these tools trade transcription accuracy for the illusion of cleaner audio.

Quail Voice Focus 2.0

Quail Voice Focus 2.0 is designed to solve this problem at the source by isolating the foreground speaker before audio reaches ASR. It's inspired by the cocktail party effect, the innate human ability to focus on a single voice in a noisy environment. We bring that capability to machines, but adjust it to ensure that only the main speaker is in focus.

Foreground isolation is a selection problem under acoustic ambiguity. The system has to determine which speech belongs to the primary interaction and which doesn't, modeling spatial distance, room acoustics, device coloration, temporal overlap and echo behavior simultaneously. The foreground speaker must remain intact while interfering speech, echo and background activity are attenuated without introducing artifacts.

Beyond isolation, it handles denoising, dereverberation and codec artifact repair in a single pass, making it robust across the full range of conditions voice agents face in production.

Additionally, the Quail Voice Focus model comes with natively integrated Voice Activity Detection (VAD).

Illustration comparing traditional noise cancellation with Quail Voice Focus 2.0. In a crowded environment with multiple people speaking, the enhanced system isolates the primary speaker in real time, demonstrating how speaker isolation improves voice AI agent audio reliability in noisy real-world settings.

Quail Voice Focus 2.0 in examples

Raw audio

Sorry, I'll, I'll get you there as soon as I can.
Sorry, it's pretty noisy in the taxi.
Yeah, maybe Wednesday?
What about Wednesday?
Um, around 12:00?
Might as well walk, right?
Yes, yes, I will have a lunch break, so it would fit perfectly.

Krisp BVC

Sorry, uh, luggage is there from the.
Sorry, it's pretty noisy in the taxi.
Yeah, maybe Wednesday?
What about Wednesday?
Um, around 12:00?
Yes, yes, I will have a lunch break, so it would fit perfectly.

Quail Voice Focus 2.0

Sorry, it's pretty noisy in the taxi.
Yeah, maybe Wednesday?
What about Wednesday?
Um, around 12:00?
Yes, yes, I will have a lunch break, so it would fit perfectly.

How Quail Voice Focus 2.0 was trained

Rather than relying on generic augmentation, the model is trained on semi-synthetic audio built around real voice-agent failure modes: near-field speech, competing speakers, device playback, environmental noise and echo-aligned signals in simulated rooms. In other words, the conditions that actually break production pipelines. You can read about the process in more detail in our technical deep-dive blog.

To evaluate performance, we're also open-sourcing Dawn Chorus, a dataset of 450 real-world recordings designed to test foreground speaker isolation, with clean foreground references and transcripts for each example.

Dataset interface on Hugging Face displaying a table with multiple short speech recordings with mix and speech playback controls, transcripts, speaker IDs, language metadata, and conversation type fields.

Benchmark highlights

We benchmarked Quail Voice Focus 2.0 across major commercial STT providers, including AssemblyAI, Cartesia, Deepgram, Gladia, Mistral, Soniox, and Speechmatics, using the Dawn Chorus dataset.

Benchmark chart comparing word error rate (WER) across several speech-to-text models before and after audio enhancement, illustrating how audio preprocessing with ai-coustics speaker isolation strengthens voice AI agent audio reliability. The y axis shows WER in percent on audio samples from the Dawn Chorus English v1 16kHz dataset. Quail VF 2.0 gets consistently lower WER than raw audio across STT providers. Insertions fall for every provider, deletions fall for every provider except Deepgram, and substitutions rise for every provider. Values per provider, raw to Voice Focus:

AssemblyAI Universal-2 Live: WER 27.2% to 24.5%; deletions 6.6% to 4.9%; insertions 5.3% to 0.9%; substitutions 15.3% to 18.8%
Cartesia Ink Whisper Live: WER 28.3% to 22.3%; deletions 8.0% to 7.5%; insertions 11.2% to 1.6%; substitutions 9.1% to 13.3%
Deepgram Nova 3 Live: WER 22.2% to 18.3%; deletions 6.9% to 7.1%; insertions 8.0% to 2.4%; substitutions 7.4% to 8.8%
Gladia Live: WER 30.9% to 20.8%; deletions 7.8% to 7.1%; insertions 15.4% to 2.6%; substitutions 7.7% to 11.1%
Mistral Voxtral Mini Live: WER 21.9% to 19.6%; deletions 5.0% to 4.7%; insertions 4.6% to 1.2%; substitutions 12.3% to 13.7%
Soniox STT Async v4 Live: WER 27.1% to 15.4%; deletions 6.7% to 6.1%; insertions 14.4% to 2.3%; substitutions 6.0% to 6.9%
Speechmatics Live: WER 26.1% to 14.9%; deletions 5.7% to 4.4%; insertions 14.5% to 1.5%; substitutions 5.9% to 9.1%
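
The chart splits WER into deletions, insertions and substitutions. As a rough illustration of how such a split can be computed, here is a minimal sketch that aligns a transcript against a reference with word-level edit distance and counts each error type. It is not the evaluation code behind the benchmark: the `normalize` rule is an assumption, and using two of the foreground speaker's lines from the taxi example as the reference is done purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class WerBreakdown:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, keep letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def wer_breakdown(reference: str, hypothesis: str) -> WerBreakdown:
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Each cell holds (total_errors, substitutions, deletions, insertions)
    # for the best alignment of ref[:i] with hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            t, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            delete = (t + 1, s, d + 1, n)    # reference word missing
            t, s, d, n = cur[j - 1]
            insert = (t + 1, s, d, n + 1)    # extra word in hypothesis
            cur.append(min(diag, delete, insert, key=lambda c: c[0]))
        prev = cur
    _, subs, dels, ins = prev[-1]
    return WerBreakdown(subs, dels, ins, len(ref))


# Illustrative reference: foreground speaker only.
reference = "Sorry, it's pretty noisy in the taxi. Yeah, maybe Wednesday?"
# Illustrative hypothesis: same lines preceded by a background voice.
raw_hypothesis = (
    "Sorry, I'll, I'll get you there as soon as I can. "
    "Sorry, it's pretty noisy in the taxi. Yeah, maybe Wednesday?"
)
result = wer_breakdown(reference, raw_hypothesis)
print(result, f"WER={result.wer:.0%}")
```

In this toy case the interfering sentence shows up purely as insertions, which is the error type the chart shows dropping most after enhancement.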

The impact is consistent across providers:

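10 to 43% relative reduction in Word Error Rate under real-world conditions
Insertions drop sharply, by 3.4 to 13.0 percentage points, as background voices are suppressed before reaching ASR
Deletions stay within 2 percentage points of raw audio and substitutions rise by 0.9 to 4.2 points, so the primary speaker stays largely intact while WER falls for every provider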
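
For readers who want to check the headline range, the short sketch below recomputes the relative WER reduction per provider from the raw and Voice Focus WER values given in the chart description. The dictionary and function names are ours, chosen for illustration.

```python
# (raw WER %, Voice Focus WER %) as listed in the benchmark chart description.
WER_RAW_VS_VF = {
    "AssemblyAI Universal-2 Live": (27.2, 24.5),
    "Cartesia Ink Whisper Live": (28.3, 22.3),
    "Deepgram Nova 3 Live": (22.2, 18.3),
    "Gladia Live": (30.9, 20.8),
    "Mistral Voxtral Mini Live": (21.9, 19.6),
    "Soniox STT Async v4 Live": (27.1, 15.4),
    "Speechmatics Live": (26.1, 14.9),
}


def relative_reduction(raw: float, enhanced: float) -> float:
    """Relative WER reduction in percent of the raw WER."""
    return (raw - enhanced) / raw * 100


reductions = {
    name: relative_reduction(raw, vf) for name, (raw, vf) in WER_RAW_VS_VF.items()
}
for name, value in sorted(reductions.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.1f}% lower WER")
```

The smallest and largest values round to 10% and 43%, matching the range quoted above. Note that these are relative reductions, not percentage-point differences.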

You can explore the full benchmark results for Voice Focus 2.0 here.

Tuning the enhancement level to your stack

No single operating point works across every stack. That's why Quail Voice Focus 2.0 comes with a configurable enhancement level parameter, giving you direct control over suppression aggressiveness for your specific ASR backend. Dial it up for noisy environments, dial it back where over-filtering is the bigger risk.

In the technical deep-dive we go deeper on how the insertion/deletion tradeoff shifts across different STT providers, how to find your optimal operating point, and the training methodology behind the model. It's full of the technical detail you need to make the right decisions for your pipeline.
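
One simple way to look for an operating point is to sweep the enhancement level over a labeled evaluation set and keep the level with the lowest WER for your ASR backend. The sketch below shows that loop under stated assumptions: `enhance` and `transcribe` are hypothetical callables you would wire to your own enhancement and STT calls, the 0 to 1 level range is assumed, and the toy stand-ins only mimic the insertion/deletion tradeoff. None of this is the ai-coustics SDK API.

```python
from typing import Callable, Sequence

# Toy "audio": words tagged with a made-up foreground score in [0, 1].
Token = tuple[str, float]


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def pick_operating_point(
    levels: Sequence[float],
    samples: Sequence[tuple[list[Token], list[str]]],
    enhance: Callable[[list[Token], float], list[Token]],
    transcribe: Callable[[list[Token]], list[str]],
) -> tuple[float, dict[float, float]]:
    """Return the level with the lowest corpus WER, plus WER per level."""
    wer_by_level: dict[float, float] = {}
    for level in levels:
        errors = words = 0
        for audio, reference in samples:
            hypothesis = transcribe(enhance(audio, level))
            errors += word_errors(reference, hypothesis)
            words += len(reference)
        wer_by_level[level] = errors / words
    return min(wer_by_level, key=wer_by_level.get), wer_by_level


# Hypothetical stand-ins: suppression drops every word scored below the level.
def toy_enhance(audio: list[Token], level: float) -> list[Token]:
    return [tok for tok in audio if tok[1] >= level]


def toy_transcribe(audio: list[Token]) -> list[str]:
    return [word for word, _ in audio]


audio = [("might", 0.2), ("as", 0.2), ("well", 0.2), ("walk", 0.2),
         ("yes", 0.9), ("i", 0.5), ("will", 0.9), ("have", 0.9),
         ("a", 0.5), ("lunch", 0.9), ("break", 0.9)]
reference = ["yes", "i", "will", "have", "a", "lunch", "break"]

best_level, table = pick_operating_point(
    [0.0, 0.4, 0.8], [(audio, reference)], toy_enhance, toy_transcribe
)
print(best_level, table)
```

In the toy data, the lowest level lets the background phrase through as insertions, the highest level starts deleting quiet foreground words, and the middle level wins. On real data the curve will differ per STT provider, which is why the sweep should be run against the backend you actually deploy.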

Native in LiveKit

We're also announcing a native integration with LiveKit, one of the most widely used real-time infrastructure layers for production voice agents. Developers already building on LiveKit now have ai-coustics audio intelligence available natively, with minimal setup.

"By integrating ai-coustics' Voice Focus model into LiveKit Agents, we're bringing a leading ASR-optimized speech enhancement and primary speaker isolation to developers. Unlike traditional noise suppression tuned for human ears, this model preserves critical phonetic and timing cues essential for downstream STT accuracy. It enables voice agents to reliably understand users in noisy, multi-speaker environments."
David Zhao, Co-Founder, LiveKit

The full ai-coustics stack now runs natively inside the framework. That means Quail, our flagship machine-optimized speech enhancement model built for single-speaker environments, processes and cleans audio before it ever reaches ASR. For multi-speaker scenarios, Quail Voice Focus 2.0 adds foreground speaker isolation with integrated VAD, ensuring only the right voice comes through.

Diagram illustrating the audio processing flow in LiveKit, including Input Audio, ai-coustics Audio Intelligence layer, STT (Speech-to-Text), LLM (Large Language Model), and TTS (Text-to-Speech).

By stabilizing the input layer, ai-coustics strengthens everything downstream. For teams building with LiveKit, that means:

No background voice interference: Voice Focus 2.0 ensures the agent responds to the right person, not the TV or a coworker in the background.
Integrated VAD: Combine speech enhancement and main-speaker voice activity detection in one module.
Better ASR reliability: Trained to preserve phonetic detail, not strip it. A safer default than perceptual enhancers for production pipelines.

Test the new model today

Quail Voice Focus 2.0 is available now in the ai-coustics SDK. Try it for free, read the benchmark of Voice Focus 2.0, or use it with any custom framework, LiveKit, or Pipecat. You can also talk to the team to get started.
