Real-time benchmarks
Measured on live audio streams to quantify WER, background speaker suppression, and VAD stability under real-world conditions.
Lower WER in real time
Up to 43% relative WER reduction across major ASR providers.
Stable turn detection
Higher F1 score (0.767 vs. 0.724) and balanced accuracy (0.777 vs. 0.760) than Silero VAD.
Fewer background insertions
Cleaner input means higher ASR accuracy, smarter VAD, and a steadier LLM.
Word error rates on commercial STT models
Compared: Raw, Quail Voice Focus 2.1 S, Quail Voice Focus 2.1 L. Chart bars are segmented into deletions, substitutions, and insertions; lower % is better.

| | AssemblyAI | Deepgram | Soniox | Mistral | Cartesia | Gladia | Speechmatics |
|---|---|---|---|---|---|---|---|
| Raw | 47.7% | 72.3% | 77.1% | 39.5% | 52.9% | 25.3% | 65.9% |
| Voice Focus 2.1 S | 17.5% | 16.7% | 15.8% | 14.7% | 17.8% | 15.2% | 13.4% |
| Voice Focus 2.1 L | 15.1% | 15.5% | 13.1% | 12.1% | 16.7% | 14.2% | 12.5% |
Raw: unprocessed microphone input with no enhancement. Voice Focus 2.1 S: our speech enhancement model optimized for machine understanding, 10x smaller than 2.0 and built for high call volumes and edge deployments. Voice Focus 2.1 L: our speech enhancement model optimized for machine understanding, with best-in-class quality at 25% lower compute than 2.0.
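The charts split each word error rate into deletions, substitutions, and insertions. The following is a minimal sketch of how such a breakdown is commonly computed, assuming whitespace tokenization and a standard word-level Levenshtein alignment. The source does not describe its scoring pipeline or text normalization, so `wer_breakdown` is a hypothetical helper and not the method used for these benchmarks.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word error rate with error counts, via word-level edit distance.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (total_errors, substitutions, deletions, insertions)
    # for aligning the first i reference words with the first j hypothesis words.
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no new error
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion, key=lambda t: t[0])

    errors, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": errors / len(ref),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }
```

When several alignments share the same minimum error count, the split between error types can differ between scoring tools, while the total WER stays the same.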
Word error rates on commercial STT models
Compared: Raw, Quail Voice Focus 2.0. Chart bars are segmented into insertions, substitutions, and deletions; lower % is better.

| | AssemblyAI | Cartesia | Deepgram | Gladia | Soniox | Speechmatics |
|---|---|---|---|---|---|---|
| Raw | 27.2% | 28.3% | 22.3% | 30.9% | 27.1% | 26.1% |
| Voice Focus 2.0 | 24.6% | 22.4% | 18.3% | 20.8% | 15.3% | 15.0% |
| Relative reduction | −9.6% | −20.8% | −17.9% | −32.7% | −43.5% | −42.5% |
Raw: unprocessed microphone input with no enhancement. Quail Voice Focus 2.0: our speech enhancement model optimized for machine understanding.
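The "Relative reduction" row can be reproduced from the Raw and Voice Focus 2.0 rows. The sketch below assumes the usual definition of relative change, (enhanced − raw) / raw, which matches the published figures to one decimal place. The function name `relative_wer_change` is illustrative.

```python
def relative_wer_change(raw_wer: float, enhanced_wer: float) -> float:
    """Relative change in WER, in percent; negative values mean fewer errors."""
    if raw_wer <= 0:
        raise ValueError("raw_wer must be positive")
    return (enhanced_wer - raw_wer) / raw_wer * 100


# WER values (in %) copied from the Raw and Voice Focus 2.0 rows above.
raw = {
    "AssemblyAI": 27.2, "Cartesia": 28.3, "Deepgram": 22.3,
    "Gladia": 30.9, "Soniox": 27.1, "Speechmatics": 26.1,
}
voice_focus_2_0 = {
    "AssemblyAI": 24.6, "Cartesia": 22.4, "Deepgram": 18.3,
    "Gladia": 20.8, "Soniox": 15.3, "Speechmatics": 15.0,
}

for provider, raw_wer in raw.items():
    change = relative_wer_change(raw_wer, voice_focus_2_0[provider])
    print(f"{provider}: {change:.1f}%")
```

Note that these are relative figures: the Soniox result is an absolute drop of 11.8 percentage points (27.1% to 15.3%), which is a 43.5% reduction relative to the raw WER.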
Word error rates on commercial STT models, English subset
Compared: Raw, Krisp, Quail. Chart bars are segmented into insertions, substitutions, and deletions; lower % is better.

| | Deepgram Nova | Cartesia Ink-Whisper | Gladia Solaria | AssemblyAI Universal-2 | ElevenLabs Scribe v1 |
|---|---|---|---|---|---|
| Raw | 8.8% | 12.7% | 9.8% | 7.5% | 9.6% |
| Krisp | 11.2% | 10.9% | 9.7% | 9.5% | 9.7% |
| Quail | 8.7% | 8.8% | 7.7% | 7.8% | 9.1% |
Raw: unprocessed microphone input with no enhancement. Krisp: a perceptual denoiser optimized for human listening. Quail Voice Focus 2.0: our speech enhancement model optimized for machine understanding.
Word error rates on commercial STT models, German subset
Compared: Raw, Krisp, Quail. Chart bars are segmented into insertions, substitutions, and deletions; lower % is better.

| | Deepgram Nova | Cartesia Ink-Whisper | Gladia Solaria | AssemblyAI Universal-2 | ElevenLabs Scribe v1 |
|---|---|---|---|---|---|
| Raw | 21.3% | 16.9% | 17.5% | 12.4% | 18.5% |
| Krisp | 22.0% | 19.9% | 20.6% | 15.4% | 22.9% |
| Quail | 19.4% | 15.1% | 12.1% | 12.3% | 19.2% |
Raw: unprocessed microphone input with no enhancement. Krisp: a perceptual denoiser optimized for human listening. Quail Voice Focus 2.0: our speech enhancement model optimized for machine understanding.
VAD Performance Comparison
Compared: Silero VAD, Quail VAD.

| | F1 Score | Balanced Accuracy |
|---|---|---|
| Silero VAD | 0.724 | 0.760 |
| Quail VAD | 0.767 | 0.777 |
Silero VAD: an open-source voice activity detector. Quail VAD: our voice activity detection model optimized for real-time voice AI pipelines.
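For readers unfamiliar with the two metrics, here is a minimal sketch of their standard definitions, assuming each audio frame is labeled speech or non-speech and compared with a ground-truth label. The source does not state its frame size or labeling procedure, and the counts in the usage lines are invented for illustration, not benchmark data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the speech class."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of speech recall and non-speech recall.

    Unlike plain accuracy, this is not inflated when one class dominates,
    for example in recordings that are mostly silence.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in the ground truth")
    speech_recall = tp / (tp + fn)
    non_speech_recall = tn / (tn + fp)
    return (speech_recall + non_speech_recall) / 2


# Hypothetical frame counts: tp = speech detected as speech, fp = non-speech
# detected as speech, tn = non-speech detected as non-speech, fn = missed speech.
print(f1_score(tp=80, fp=20, fn=20))
print(balanced_accuracy(tp=80, fp=20, tn=80, fn=20))
```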
