Real-time benchmarks

Measured on live audio streams to quantify WER, background speaker suppression, and VAD stability under real-world conditions.

Lower WER in real time

Up to 43% relative WER reduction across major ASR providers.

Stable turn detection

Higher F1 score and balanced accuracy than Silero VAD.

Fewer background insertions

Cleaner input means higher ASR accuracy, smarter VAD, and steadier LLM responses.

Word error rates on commercial STT models

Chart series: Raw, Quail Voice Focus 2.1 S, Quail Voice Focus 2.1 L. Bar segments, left to right: deletions, substitutions, insertions. Lower % is better.

Input              AssemblyAI  Deepgram  Soniox  Mistral  Cartesia  Gladia  Speechmatics
Raw                47.7%       72.3%     77.1%   39.5%    52.9%     25.3%   65.9%
Voice Focus 2.1 S  17.5%       16.7%     15.8%   14.7%    17.8%     15.2%   13.4%
Voice Focus 2.1 L  15.1%       15.5%     13.1%   12.1%    16.7%     14.2%   12.5%

Raw: unprocessed microphone input with no enhancement. Voice Focus 2.1 S: our speech enhancement model optimized for machine understanding, 10x smaller than 2.0 and built for high call volumes and edge deployments. Voice Focus 2.1 L: our speech enhancement model optimized for machine understanding, with best-in-class quality at 25% lower compute than 2.0.
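The charts split each WER bar into deletions, substitutions, and insertions. As a rough illustration of how such a breakdown can be computed, here is a minimal Python sketch that aligns a hypothesis transcript against a reference with a word-level edit distance. The function name `wer_breakdown`, the lowercase-and-split normalization, and the sample sentences are assumptions made for this example; the benchmark's actual scoring pipeline is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class WerBreakdown:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / number of words in the reference
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def wer_breakdown(reference: str, hypothesis: str) -> WerBreakdown:
    """Word-level Levenshtein alignment that also counts each error type.

    Hypothetical helper: normalization here is only lowercase + whitespace split.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # match, no new error
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)      # reference word missing
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)     # extra hypothesis word
            # Tuples compare by total errors first, so min() picks a cheapest path.
            cost[i][j] = min(diagonal, deletion, insertion)

    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return WerBreakdown(subs, dels, ins, len(ref))


result = wer_breakdown(
    "please book a table for two",
    "please book table for two people tonight",
)
# prints: S=0 D=1 I=2 WER=50.0%
print(f"S={result.substitutions} D={result.deletions} "
      f"I={result.insertions} WER={result.wer:.1%}")
```

In the sample, the two trailing words are the kind of insertion a background speaker can cause. When several alignments have the same total cost, the split between error types depends on the tie-breaking rule, so breakdowns from different scoring tools may differ slightly.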

Word error rates on commercial STT models

Chart series: Raw, Quail Voice Focus 2.0. Bar segments, top to bottom: insertions, substitutions, deletions. Lower % is better.

Input               AssemblyAI  Cartesia  Deepgram  Gladia  Soniox  Speechmatics
Raw                 27.2%       28.3%     22.3%     30.9%   27.1%   26.1%
Voice Focus 2.0     24.6%       22.4%     18.3%     20.8%   15.3%   15.0%
Relative reduction  −9.6%       −20.8%    −17.9%    −32.7%  −43.5%  −42.5%

Raw: unprocessed microphone input with no enhancement. Quail Voice Focus 2.0: our speech enhancement model optimized for machine understanding.
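The relative reduction figures for Voice Focus 2.0 can be checked from the raw and enhanced WER values. The sketch below assumes the usual definition, (enhanced − raw) / raw, which matches the published numbers to one decimal place; the function name `relative_wer_change` is illustrative.

```python
def relative_wer_change(raw_wer: float, enhanced_wer: float) -> float:
    """Relative change in WER versus the raw input (negative means fewer errors)."""
    if raw_wer <= 0:
        raise ValueError("raw_wer must be positive")
    return (enhanced_wer - raw_wer) / raw_wer


# WER values in percent, copied from the Voice Focus 2.0 table.
RAW = {
    "AssemblyAI": 27.2, "Cartesia": 28.3, "Deepgram": 22.3,
    "Gladia": 30.9, "Soniox": 27.1, "Speechmatics": 26.1,
}
VOICE_FOCUS_2_0 = {
    "AssemblyAI": 24.6, "Cartesia": 22.4, "Deepgram": 18.3,
    "Gladia": 20.8, "Soniox": 15.3, "Speechmatics": 15.0,
}

for provider, raw in RAW.items():
    change = relative_wer_change(raw, VOICE_FOCUS_2_0[provider])
    print(f"{provider}: {change:+.1%}")
```

The largest reduction in this table, on Soniox, is the source of the "up to 43%" headline figure.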

Word error rates on commercial STT models

English subset

Chart series: Raw, Krisp, Quail. Bar segments, top to bottom: insertions, substitutions, deletions. Lower % is better.

Input  Deepgram Nova  Cartesia Ink-Whisper  Gladia Solaria  AssemblyAI Universal-2  ElevenLabs Scribe v1
Raw    8.8%           12.7%                 9.8%            7.5%                    9.6%
Krisp  11.2%          10.9%                 9.7%            9.5%                    9.7%
Quail  8.7%           8.8%                  7.7%            7.8%                    9.1%

Raw: unprocessed microphone input with no enhancement. Krisp: a perceptual denoiser optimized for human listening. Quail Voice Focus 2.0: our speech enhancement model optimized for machine understanding.

Word error rates on commercial STT models

German subset

Chart series: Raw, Krisp, Quail. Bar segments, top to bottom: insertions, substitutions, deletions. Lower % is better.

Input  Deepgram Nova  Cartesia Ink-Whisper  Gladia Solaria  AssemblyAI Universal-2  ElevenLabs Scribe v1
Raw    21.3%          16.9%                 17.5%           12.4%                   18.5%
Krisp  22.0%          19.9%                 20.6%           15.4%                   22.9%
Quail  19.4%          15.1%                 12.1%           12.3%                   19.2%

Raw: unprocessed microphone input with no enhancement. Krisp: a perceptual denoiser optimized for human listening. Quail Voice Focus 2.0: our speech enhancement model optimized for machine understanding.

VAD Performance Comparison

Model       F1 Score  Balanced Accuracy
Silero VAD  0.724     0.760
Quail VAD   0.767     0.777

Silero VAD: an open-source voice activity detector. Quail VAD: our voice activity detection model optimized for real-time voice AI pipelines.
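For readers unfamiliar with the two VAD metrics, the following Python sketch computes F1 score and balanced accuracy from per-frame speech/non-speech decisions. It assumes frame-level binary labels and uses made-up sample data; how the benchmark segments and labels audio is not stated here, and the helper names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), where True means 'speech' for a frame."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = sum(r and p for r, p in zip(reference, predicted))
    fp = sum((not r) and p for r, p in zip(reference, predicted))
    tn = sum((not r) and (not p) for r, p in zip(reference, predicted))
    fn = sum(r and (not p) for r, p in zip(reference, predicted))
    return tp, fp, tn, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of the speech recall and the non-speech recall."""
    speech_recall = tp / (tp + fn) if (tp + fn) else 0.0
    non_speech_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (speech_recall + non_speech_recall)


# Made-up frame labels for illustration only (True = speech).
reference = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]

tp, fp, tn, fn = confusion_counts(reference, predicted)
print(f"F1={f1_score(tp, fp, fn):.3f}  "
      f"balanced accuracy={balanced_accuracy(tp, fp, tn, fn):.3f}")
```

Balanced accuracy weights speech and non-speech frames equally, which matters when one class dominates a recording; F1 focuses on how well the speech class is detected.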

Bring real-time audio intelligence into your voice AI stack