Real-time benchmarks

Name: Quail Voice Focus Real-Time Audio Benchmark Results
Creator: ai-coustics

Measured on live audio streams to quantify word error rate (WER), background-speaker suppression, and voice activity detection (VAD) stability under real-world conditions.

Lower WER in real time

Up to 43% relative WER reduction across major ASR providers.

Stable turn detection

Higher F1 score and balanced accuracy than Silero VAD.

Fewer background insertions

Cleaner input means higher ASR accuracy, smarter VAD, and steadier LLM behavior.

Word error rates on commercial STT models

Chart: Raw vs. Quail Voice Focus 2.1 S vs. Quail Voice Focus 2.1 L (lower % is better). Bar segments, left to right: deletions, substitutions, insertions.

WERs

Across AssemblyAI, Deepgram, Soniox, Mistral, Cartesia, Gladia and Speechmatics, Quail Voice Focus 2.1 reduces WERs by up to 81%, improving transcription reliability in the presence of competing background speakers.
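The charts split each WER bar into deletions, substitutions, and insertions. As a rough illustration of how such a breakdown is typically derived (a minimal sketch, not the evaluation code behind these benchmarks), the following computes a word-level Levenshtein alignment and counts each error type. The function name `wer_breakdown` and the example sentences are hypothetical, and text normalization (casing, punctuation) is omitted.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Return WER and its deletion/substitution/insertion counts.

    Illustrative only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (total_edits, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # exact match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)      # reference word missing
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)     # extra hypothesis word
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    total, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": total / len(ref),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


# Hypothetical example: two extra words picked up from a background speaker.
result = wer_breakdown(
    "book a table for two",
    "book a table for two please thanks",
)
print(result)
```

In this made-up example the extra words count as insertions, the error type the page associates with background speech.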

Word error rates on commercial STT models

Chart: Raw vs. Quail Voice Focus 2.0 (lower % is better). Bar segments, top to bottom: insertions, substitutions, deletions.

WERs

Across AssemblyAI, Cartesia, Deepgram, Gladia, Soniox and Speechmatics, Quail Voice Focus 2.0 reduces WERs by up to 43%, improving transcription reliability in the presence of competing background speakers.

Word error rates on commercial STT models, English subset

Chart: Raw vs. Krisp vs. Quail (lower % is better). Bar segments, top to bottom: insertions, substitutions, deletions.

WERs on English subset

Quail achieves up to ~20% relative WER reduction across major STT models compared to raw audio and perceptual denoisers like Krisp.
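A relative WER reduction is measured against the baseline WER, not in percentage points. A minimal sketch of the arithmetic follows; the function name `relative_wer_reduction` and the input values are hypothetical and are not taken from the benchmark data.

```python
def relative_wer_reduction(baseline_wer: float, enhanced_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - enhanced_wer) / baseline_wer


# Hypothetical values for illustration only (not measured results):
# a raw-audio WER of 20% that falls to 16% after enhancement.
baseline = 0.20
enhanced = 0.16
relative = relative_wer_reduction(baseline, enhanced)
absolute_points = (baseline - enhanced) * 100

print(f"relative reduction: {relative:.0%}")
print(f"absolute reduction: {absolute_points:.1f} percentage points")
```

With these illustrative inputs, a 4-percentage-point drop corresponds to a 20% relative reduction, which is why relative figures look larger than the absolute change in WER.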

Word error rates on commercial STT models, German subset

Chart: Raw vs. Krisp vs. Quail (lower % is better). Bar segments, top to bottom: insertions, substitutions, deletions.

WERs on German subset

The gains increase on German: Quail achieves up to ~25% relative WER reduction across the same STT models, where Krisp's perceptual denoising introduces more substitutions and deletions.

VAD Performance Comparison

Chart: Silero VAD vs. Quail VAD

Outperforms Silero VAD

Quail VAD outperforms Silero VAD on both reported metrics: F1 score 0.767 vs. 0.724, and balanced accuracy 0.777 vs. 0.760. That means more accurate speech detection and fewer missed segments.
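For readers unfamiliar with the two metrics, here is a minimal sketch of how F1 score and balanced accuracy can be computed from binary speech/non-speech labels. It assumes frame-level labels (1 = speech, 0 = non-speech); the page does not state how the benchmark segments or scores audio, and the function name `vad_scores` and the sample labels are hypothetical.

```python
from typing import Sequence


def vad_scores(reference: Sequence[int], predicted: Sequence[int]) -> dict:
    """F1 and balanced accuracy for binary labels (1 = speech, 0 = non-speech)."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = tn = 0
    for ref, pred in zip(reference, predicted):
        if ref == 1 and pred == 1:
            tp += 1  # speech correctly detected
        elif ref == 0 and pred == 1:
            fp += 1  # false trigger on non-speech
        elif ref == 1 and pred == 0:
            fn += 1  # missed speech
        else:
            tn += 1  # non-speech correctly rejected

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Balanced accuracy averages the hit rate on speech and on non-speech,
    # so it is not inflated when one class dominates the recording.
    balanced_accuracy = (recall + specificity) / 2
    return {"f1": f1, "balanced_accuracy": balanced_accuracy}


# Hypothetical frame labels for illustration only.
reference_frames = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_frames = [1, 1, 1, 0, 1, 0, 0, 0]
print(vad_scores(reference_frames, predicted_frames))
```

F1 rewards detecting speech without false triggers, while balanced accuracy also credits correctly rejected non-speech, which is why the two numbers are reported side by side.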
