Evaluation Summary
This notebook presents a comprehensive evaluation of Voice Focus 1.1 against Krisp BVC and Krisp BVC telephony across two datasets. The analysis includes representative examples and quantitative metrics based on internal development as of February 5, 2025.
Key Findings
- WERs vary by up to 10 points across APIs for the same data set and model; however, the overall ranking is consistent.
- Krisp BVC is unreliable and over-suppresses a large amount of foreground speech. This is evident from the high number of deletions in the quantitative benchmarks.
- Krisp BVC telephony is more robust than BVC, but over-suppression is a problem here as well. The model sometimes appears to select the wrong speaker with high confidence, which suppresses the main speaker for several seconds.
- Voice Focus 1.1 shows the lowest WERs overall. The model favors under-suppression over over-suppression, meaning some residual background speech or noise may remain audible. This is expected and by design:
- ASR systems don’t require perceptually pleasing audio to transcribe accurately. If enhancement introduces artifacts (spectral smearing, harmonic distortion, musical noise), it shifts the features the ASR model sees, which corrupts phonetic evidence and drives word substitutions.
- We deem over-suppression the bigger risk for production systems, as it irreversibly removes information that cannot be recovered by the LLM. We therefore down-weight under-suppression errors in our training and evaluation protocols.
Qualitative Examples
[Audio examples: five sets of clips. Sets 1 and 2 compare noisy audio, Voice Focus 1.1, Krisp BVC, and Krisp BVC telephony; sets 3 to 5 compare noisy audio, Voice Focus 1.1, and Krisp BVC telephony.]
Quantitative Evaluation
Switchboard 250 (8 kHz)
These are 250 files extracted from the open-source Switchboard data set. We selected files that showed high WER caused by interference from secondary speakers.
The plot shows WER broken down into insertion, substitution, and deletion components for 5 different commercial API providers.
- WER for mix, i.e., the unprocessed audio, is driven by high insertions from background speakers (striped pattern)
- Krisp BVC actually increases WER vs. unprocessed audio on 4 out of 5 providers (Deepgram, Soniox, AssemblyAI, Gladia). This is driven by a high number of deletions (dotted pattern), indicating strong over-suppression of foreground speakers.
- Krisp BVC telephony works better than BVC
- VF 1.1 achieves the best WERs across all APIs. It reduces insertions much more strongly while keeping deletions and substitutions at a reasonable level.
- Deepgram: ~43% relative reduction (40.6 → 23.1)
- Soniox: ~41% (41.6 → 24.7)
- AssemblyAI: ~25% (35.6 → 26.6)
- Gladia: ~21% (36.6 → 29.1)
- Speechmatics: ~40% (33.6 → 20.3)
- Substitutions remain comparable between VF 1.1 and the unprocessed mix, indicating that the enhancement preserves the spectral characteristics of foreground speech without introducing distortion that would cause the ASR to mishear words.
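The relative reductions quoted in the list above follow from the WER pairs by simple arithmetic. A minimal sketch, assuming the first value in each pair is the baseline WER and the second is the VF 1.1 WER (the helper name `relative_reduction` is illustrative and not part of the notebook):

```python
# WER pairs as quoted in the list above.
# Assumption: (baseline WER, VF 1.1 WER), both in percent.
wer_pairs = {
    "Deepgram": (40.6, 23.1),
    "Soniox": (41.6, 24.7),
    "AssemblyAI": (35.6, 26.6),
    "Gladia": (36.6, 29.1),
    "Speechmatics": (33.6, 20.3),
}


def relative_reduction(baseline: float, processed: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - processed) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline - processed) / baseline


for provider, (baseline, processed) in wer_pairs.items():
    print(f"{provider}: {relative_reduction(baseline, processed):.1f}% relative reduction")
```

Because the quoted WERs are themselves rounded to one decimal, the recomputed percentages can differ from the listed ones by about a point.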
Quantitative Evaluation
Internal Eval Data Set (8 and 16 kHz)
This is based on an internal data set created by playing clean recordings through an artificial mouth and recording them with different devices (such as laptops and phones), combined with media devices playing in the background and/or real background speakers.
The plot shows WER with insertions, substitutions, and deletions components for 5 different commercial API providers.
- WER for mix, i.e., the unprocessed audio, is driven by high insertions from background speech
- Krisp BVC telephony has a lower WER than mix. Insertions are greatly reduced, but deletions now dominate the WER, suggesting over-suppression of foreground speech.
- VF 1.1 reaches the best WERs across all APIs. It reduces insertions while only slightly increasing deletions in some cases. Insertions are slightly higher than for Krisp BVC telephony, but the lower deletion rate results in a better overall WER.
- Krisp BVC is excluded from this evaluation because of how unreliable it proved to be.
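The insertion, substitution, and deletion components discussed throughout this evaluation come from a word-level edit-distance alignment between the reference transcript and the ASR output. The sketch below shows one way to compute that breakdown. It assumes whitespace tokenization and no text normalization; the scoring pipeline actually used for the plots is not described in this notebook, and the names `WerBreakdown` and `wer_breakdown` are hypothetical.

```python
from typing import NamedTuple


class WerBreakdown(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        """WER in percent: 100 * (S + D + I) / number of reference words."""
        errors = self.substitutions + self.deletions + self.insertions
        return 100.0 * errors / self.ref_words


def _step(cell, sub=0, dele=0, ins=0):
    """Extend an alignment cell (total, S, D, I) by one edit operation."""
    total, s, d, i = cell
    return (total + sub + dele + ins, s + sub, d + dele, i + ins)


def wer_breakdown(reference: str, hypothesis: str) -> WerBreakdown:
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds (total edits, S, D, I) for aligning ref[:i-1] with hyp[:j].
    # Row 0: an empty reference needs j insertions.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Column 0: an empty hypothesis needs i deletions.
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            mismatch = 0 if ref[i - 1] == hyp[j - 1] else 1
            candidates = (
                _step(prev[j - 1], sub=mismatch),  # match or substitution
                _step(prev[j], dele=1),            # reference word missing
                _step(curr[j - 1], ins=1),         # extra hypothesis word
            )
            # Tuples compare by total edits first, so this picks a minimal alignment.
            curr.append(min(candidates))
        prev = curr
    _, s, d, n = prev[-1]
    return WerBreakdown(s, d, n, len(ref))


# Over-suppression drops foreground words, which shows up as deletions.
print(wer_breakdown("please call me back tomorrow", "please call"))
# Background speech leaking through shows up as insertions.
print(wer_breakdown("please call", "please call the game is on"))
```

When several alignments have the same total edit count, the split between substitutions, deletions, and insertions depends on tie-breaking, so component counts from different scoring tools may differ slightly even when the overall WER matches.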


