Key Findings
WERs vary by up to 10 points across APIs for the same data set and enhancement model; the overall ranking, however, is consistent.
Krisp BVC is unreliable and over-suppresses a large amount of foreground speech. This is evident from the high number of deletions in the quantitative benchmarks.
Krisp BVC telephony is more robust than BVC, but over-suppression remains a problem here as well: the model sometimes overconfidently selects the wrong speaker, which suppresses the main speaker for several seconds.
VoiceFocus 1.1 shows the lowest WERs overall. The model favors under-suppression over over-suppression, meaning some residual background speech or noise may remain audible. This is expected and by design:
ASR systems do not require perceptually pleasing audio to transcribe accurately. If enhancement introduces artifacts (spectral smearing, harmonic distortion, musical noise), the resulting feature shifts corrupt the phonetic evidence available to the ASR model and drive word substitutions.
We deem over-suppression the bigger risk for production systems, as it irreversibly removes information content that cannot be recovered by the LLM. We therefore down-weight under-suppression errors in our training and evaluation protocols.
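One way to encode this asymmetry is a weighted error rate that penalizes deletions more heavily than insertions. The sketch below illustrates the idea only; the weight values are placeholders, not our actual protocol parameters.

```python
# Sketch of an asymmetric, weighted WER. The weights are illustrative
# placeholders, not the actual protocol values: deletions (irreversibly
# lost content) cost more than insertions (residual background that the
# ASR picked up).
W_SUB, W_DEL, W_INS = 1.0, 1.5, 0.5

def weighted_wer(subs, dels, ins, n_ref_words):
    """Weighted error rate over a reference of n_ref_words words."""
    return (W_SUB * subs + W_DEL * dels + W_INS * ins) / n_ref_words

# An over-suppressing model (many deletions) now scores worse than an
# under-suppressing one (many insertions) with the same plain WER:
over = weighted_wer(subs=5, dels=20, ins=2, n_ref_words=100)   # 0.36
under = weighted_wer(subs=5, dels=2, ins=20, n_ref_words=100)  # 0.18
```

Under plain WER both hypothetical systems would score identically (27 errors each); the weighting separates them in the direction we care about.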
Quantitative Evaluation
Switchboard 250 (8 kHz)
These are 250 files extracted from the open-source Switchboard data set. We selected files that showed high WER caused by interference from secondary speakers.
The plot shows WER broken down into insertion, substitution, and deletion components for 5 different commercial API providers.
WER for the mix, i.e., the unprocessed audio, is driven by high insertions from background speakers (striped pattern).
Krisp BVC actually increases WER relative to unprocessed audio on 4 out of 5 providers (Deepgram, Soniox, AssemblyAI, Gladia). This is driven by a high number of deletions (dotted pattern), indicating strong over-suppression of foreground speech.
Krisp BVC telephony performs better than BVC.
VF 1.1 achieves the best WERs across all APIs. It reduces insertions far more strongly while keeping deletions and substitutions at reasonable levels:
Deepgram: ~43% relative reduction (40.6 → 23.1)
Soniox: ~41% (41.6 → 24.7)
AssemblyAI: ~25% (35.6 → 26.6)
Gladia: ~21% (36.6 → 29.1)
Speechmatics: ~40% (33.6 → 20.3)
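The relative reductions above follow directly from the absolute WER pairs; e.g. for the Deepgram pair:

```python
# Relative WER reduction: (WER_mix - WER_enhanced) / WER_mix.
# Numbers are the Deepgram pair reported above (mix 40.6 -> VF 1.1 23.1).
wer_mix, wer_vf = 40.6, 23.1
rel_reduction = (wer_mix - wer_vf) / wer_mix * 100  # ~43.1%
```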
Substitutions remain comparable between VF 1.1 and the unprocessed mix, indicating that the enhancement preserves the spectral characteristics of foreground speech without introducing distortion that would cause the ASR to mishear words.
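For reference, the insertion/substitution/deletion split we report comes from the standard Levenshtein alignment between reference and hypothesis. A minimal sketch (illustrative only; actual scoring uses the providers' transcripts against our references):

```python
# Minimal sketch: decompose WER into substitution, deletion, and
# insertion counts via Levenshtein alignment with backtrace.

def wer_components(ref, hyp):
    """Return (substitutions, deletions, insertions) for two word lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] to hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(m + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to count error types.
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            D += 1   # reference word with no hypothesis match
            i -= 1
        else:
            I += 1   # hypothesis word with no reference match
            j -= 1
    return S, D, I

ref = "the quick brown fox".split()
hyp = "the fast brown fox dog".split()  # one substitution, one insertion
S, D, I = wer_components(ref, hyp)
wer = (S + D + I) / len(ref)  # 0.5
```

Insertions are the component inflated by background speech leaking through, deletions the component inflated by over-suppression.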

Quantitative Evaluation
Internal Eval Data Set (8 and 16 kHz)
This is based on an internal data set created by playing clean recordings through an artificial mouth and recording them with different devices (such as laptops and phones), in combination with media devices playing in the background and/or real background speakers.
The plot shows WER broken down into insertion, substitution, and deletion components for 5 different commercial API providers.
WER for the mix, i.e., the unprocessed audio, is driven by high insertions from background speech.
Krisp BVC telephony has a lower WER than the mix: insertions are greatly reduced, but deletions now dominate the WER, suggesting over-suppression of foreground speech.
VF 1.1 achieves the best WERs across all APIs. It reduces insertions while only slightly increasing deletions in some cases. Insertions are slightly higher than for Krisp BVC telephony, but the lower deletion rate results in a better overall WER.
We did not include Krisp BVC in this evaluation after observing how unreliable it is on the Switchboard benchmark.

