Introducing Quail Voice Focus 2.2: Primary speaker isolation in every situation

Written by

Tim Janke

Head of Machine Learning

Product

Jun 24, 2026

When deploying voice agents at scale, the audio your pipeline receives rarely looks like what you tested against. Without voice isolation, background speakers, voices from media devices and echo all reach the downstream system as intelligible speech. They get transcribed alongside the primary speaker, corrupt turn-taking, and can break the agent stack entirely.

Quail Voice Focus fixes this at the source by isolating the primary speaker in real time. With Quail Voice Focus 2.2 we've made that isolation more robust across the full range of conditions real-world audio throws at it, so the right speaker comes through whatever the situation.

Why speaker isolation matters

For modern voice AI systems, noisy audio is usually not the hardest problem. The real difficulty comes when the pipeline receives multiple intelligible voices at once: the user, someone nearby, a TV or laptop in the background, or echo bouncing back from the agent itself. To a VAD, ASR, or speech-to-speech model, all of those can look like valid speech, and without a way to decide which voice belongs to the interaction, the context gets corrupted and conversation flow breaks.

Traditional noise suppression algorithms can't solve this. They're designed to remove non-speech sounds like hum, traffic or room ambience, not to distinguish between competing voices.

Voice Focus works differently. Instead of enhancing every voice in the mixture, it identifies the primary speaker and suppresses everything else before it reaches your downstream systems. The result is fewer stray words in the transcript, more reliable turn-taking, and less irrelevant speech entering the LLM context.

Quail Voice Focus 2.2: Reliable speaker isolation, near or far

People talk to voice agents in all kinds of ways. Some hold a phone to their ear, others put the call on speakerphone, speak to a laptop from across the desk, or address an assistant from elsewhere in the room while their hands are busy. Often there is no competing voice at all, just one speaker who happens to be far from the microphone, with the natural reverberance that distance brings.

Voice Focus 2.2 improves isolation in exactly these conditions. The model identifies the dominant speaker in the scene without treating distance from the microphone as the deciding factor:

When a single speaker is the only voice present, near or far, it recognizes them as dominant and keeps them locked in clearly, instead of letting distance work against them.
When other voices compete, it preserves the strong primary speaker isolation Voice Focus 2.1 already had.

The result is a model that performs just as well as Voice Focus 2.1 on near-field and multi-speaker audio, while now reliably handling far-field, single-speaker situations. The examples below illustrate those three cases in practice.

Qualitative examples

Each example pairs the raw audio with the Quail Voice Focus 2.2 output. The transcript shown with each sample reads: "Picking an interpolation function in the second scheme is equivalent to picking the impulse response of the filter in the first scheme."

Single near-field speaker

[Audio samples: Raw audio, Quail Voice Focus 2.2]

Graphic representing the audio situation in the sample: A microphone with two circles around it, representing near field and far field. Speaker A is within the near field. Output: Keep A.

Single far-field speaker

[Audio samples: Raw audio, Quail Voice Focus 2.2]

Near-field and far-field compete

[Audio samples: Raw audio, Quail Voice Focus 2.2]

Benchmarks

We evaluated Voice Focus 2.2 on a real-world dataset built around the failure cases that matter most in production, including scenes with competing speakers, media audio and echo. The chart below shows word error rate (WER) across eight major STT providers, with and without Quail Voice Focus 2.2. See this page for a range of qualitative examples.

Figure: Bar charts comparing word error rates on commercial STT models across unenhanced audio, audio enhanced by ai-coustics Quail Voice Focus 2.2 S, and Quail Voice Focus 2.2 L. VF 2.2 L achieves the lowest total WER across all models, primarily through large reductions in insertion errors. Bars exceeding the 32% chart range are truncated; the true total WER is noted.

Full results, in percent. Each cell reads unenhanced / VF 2.2 S / VF 2.2 L.

Provider                     Deletions           Substitutions      Insertions           Total WER
AssemblyAI Universal-3 Live  4.3 / 10.5 / 9.0    8.7 / 4.6 / 4.3    37.1 / 1.9 / 1.5     50.1 / 17.0 / 14.8
Deepgram Nova 3 Live         1.9 / 4.2 / 4.0     6.6 / 6.9 / 5.9    66.4 / 5.5 / 6.6     74.8 / 17.3 / 16.4
Soniox STT Async v4 Live     1.3 / 3.5 / 3.2     4.8 / 4.2 / 4.3    73.9 / 5.4 / 5.0     80.0 / 13.1 / 12.5
Mistral Voxtral Mini Live    7.2 / 7.2 / 6.4     4.2 / 3.1 / 3.7    29.1 / 3.0 / 2.3     40.5 / 13.3 / 12.5
Cartesia Ink-Whisper Live    2.5 / 5.4 / 5.6     6.4 / 7.6 / 6.8    42.5 / 3.3 / 2.9     51.4 / 16.3 / 15.3
Gladia Live                  1.5 / 4.0 / 4.0     4.3 / 6.1 / 5.7    45.0 / 3.1 / 3.3     50.8 / 13.2 / 13.0
Speechmatics Live            <1 / 3.8 / 3.0      5.3 / 3.4 / 3.7    62.5 / ~5.4 / 5.1    68.5 / 12.6 / 11.7
Gradium Live                 2.7 / 6.4 / 5.2     8.0 / 7.3 / 6.6    42.3 / 4.4 / 4.3     53.0 / 18.1 / 16.1

On unprocessed audio, competing and stray speech create a large number of insertion errors across major STT providers. Those extra words are exactly what disrupt turn-taking and carry irrelevant content into the LLM context. Across the eight STT providers in the chart, raw audio produced WERs between roughly 40% and 80%, with insertion errors accounting for most of the total.
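To make the insertion effect concrete, here is a minimal sketch of how WER breaks down into substitutions, deletions and insertions. It is an illustration only: the function name, the lowercase and whitespace normalization, and the example sentences are assumptions, not the evaluation pipeline behind the chart. Because every error type is counted against the length of the reference transcript, stray words from a background voice can push WER past 100% on a short utterance.

```python
from dataclasses import dataclass


@dataclass
class WerBreakdown:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # All error types are divided by the number of reference words.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def wer_breakdown(reference: str, hypothesis: str) -> WerBreakdown:
    """Word-level edit distance that also tracks the error types."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (edits, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (e, s, d, n)          # match, no cost
            else:
                best = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, n))  # deletion
            e, s, d, n = cost[i][j - 1]
            best = min(best, (e + 1, s, d, n + 1))  # insertion
            cost[i][j] = best

    _, subs, dels, ins = cost[-1][-1]
    return WerBreakdown(subs, dels, ins, len(ref))


# Hypothetical example: a TV in the background adds six words to a
# five-word request, so the transcript has six insertions.
reference = "book a table for two"
raw = wer_breakdown(reference, "book a table for two tonight the game starts at nine")
print(raw, f"WER = {raw.wer:.0%}")
```

In this made-up case the primary speaker is transcribed perfectly, yet the utterance still scores above 100% WER purely from inserted words, which is the failure mode speaker isolation targets.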

Quail Voice Focus 2.2 reduces WER to between 11% and 18% across all providers. This substantial improvement comes from removing insertions caused by interfering speech while keeping deletions low.
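As a rough worked example, the total WER figures quoted in the chart description can be turned into relative reductions per provider. The numbers below are copied from that description; the helper name `relative_reduction` is an assumption made for illustration.

```python
# Total WER in percent, copied from the chart description:
# (unenhanced, VF 2.2 S, VF 2.2 L)
TOTAL_WER = {
    "AssemblyAI Universal-3 Live": (50.1, 17.0, 14.8),
    "Deepgram Nova 3 Live": (74.8, 17.3, 16.4),
    "Soniox STT Async v4 Live": (80.0, 13.1, 12.5),
    "Mistral Voxtral Mini Live": (40.5, 13.3, 12.5),
    "Cartesia Ink-Whisper Live": (51.4, 16.3, 15.3),
    "Gladia Live": (50.8, 13.2, 13.0),
    "Speechmatics Live": (68.5, 12.6, 11.7),
    "Gradium Live": (53.0, 18.1, 16.1),
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original WER that was removed."""
    return (before - after) / before


for provider, (raw, small, large) in TOTAL_WER.items():
    print(
        f"{provider:28s} "
        f"S: {relative_reduction(raw, small):.0%}  "
        f"L: {relative_reduction(raw, large):.0%}"
    )
```

Relative reduction is only one way to summarize the chart; the absolute drop in insertion errors is the figure that maps most directly to fewer stray words reaching the LLM context.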

Try Quail Voice Focus 2.2 today

Quail Voice Focus is our most in-demand model, and the one that teams like PolyAI, telli, and Phonely rely on to keep foreground speech intact before it ever reaches their ASR or VAD. As Razvan Kusztos, VP of Engineering at PolyAI, put it: "With ai-coustics on, we've had the best customer satisfaction scores of all the tests we've done."

It comes in two sizes: S, for high call volumes, constrained infrastructure, and edge deployments, and L, for the best isolation quality. Both run in real time on CPU at 30ms end-to-end latency, support 8kHz and 16kHz audio, and drop straight into your existing setup, including the native LiveKit and Pipecat integrations.

Voice Focus 2.2 is available now in the ai-coustics SDK. Test it free in under two minutes on our Developer Platform, drop it into your LiveKit or Pipecat pipeline, or talk to our team about enterprise deployments.
