Key Findings
WERs vary by up to 10 points across APIs for the same data set and enhancement model; the overall ranking, however, is consistent.
Krisp BVC is unreliable and over-suppresses a large amount of foreground speech. This is evident from the high number of deletions in the quantitative benchmarks.
Krisp BVC telephony is more robust than BVC, but over-suppression remains a problem here as well: the model sometimes overconfidently selects the wrong speaker, which suppresses the main speaker for several seconds.
VoiceFocus 1.1 shows the lowest WERs overall. The model favors under-suppression over over-suppression, meaning some residual background speech or noise may remain audible. This is expected and by design:
ASR systems do not require perceptually pleasing audio to transcribe accurately. If enhancement introduces artifacts (spectral smearing, harmonic distortion, musical noise), the resulting feature shifts corrupt the phonetic evidence available to the ASR model and drive word substitutions.
We deem over-suppression the bigger risk for production systems, as it irreversibly removes information content that cannot be recovered by the LLM. We therefore down-weight under-suppression errors in our training and evaluation protocols.
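One way to encode this asymmetry is a weighted error rate that penalizes deletions more heavily than insertions. The sketch below illustrates the idea only; the weight values are placeholders, not our actual protocol parameters.

```python
# Sketch of an asymmetric, weighted WER. The weights are illustrative
# placeholders, not the actual protocol values: deletions (irreversibly
# lost content) cost more than insertions (residual background that the
# ASR picked up).
W_SUB, W_DEL, W_INS = 1.0, 1.5, 0.5

def weighted_wer(subs, dels, ins, n_ref_words):
    """Weighted error rate over a reference of n_ref_words words."""
    return (W_SUB * subs + W_DEL * dels + W_INS * ins) / n_ref_words

# An over-suppressing model (many deletions) now scores worse than an
# under-suppressing one (many insertions) with the same plain WER:
over = weighted_wer(subs=5, dels=20, ins=2, n_ref_words=100)   # 0.36
under = weighted_wer(subs=5, dels=2, ins=20, n_ref_words=100)  # 0.18
```

Under plain WER both hypothetical systems would score identically (27 errors each); the weighting separates them in the direction we care about.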
Quantitative Evaluation
Switchboard 250 (8 kHz)
These are 250 files extracted from the open-source Switchboard data set. We selected files that showed high WER caused by interference from secondary speakers.
The plot shows WER broken down into insertion, substitution, and deletion components for 5 different commercial API providers.
WER for the mix, i.e., the unprocessed audio, is driven by high insertions from background speakers (striped pattern).
Krisp BVC actually increases WER relative to unprocessed audio on 4 out of 5 providers (Deepgram, Soniox, AssemblyAI, Gladia). This is driven by a high number of deletions (dotted pattern), indicating strong over-suppression of foreground speech.
Krisp BVC telephony performs better than BVC.
VF 1.1 achieves the best WERs across all APIs. It reduces insertions far more strongly while keeping deletions and substitutions at reasonable levels:
Deepgram: ~43% relative reduction (40.6 → 23.1)
Soniox: ~41% (41.6 → 24.7)
AssemblyAI: ~25% (35.6 → 26.6)
Gladia: ~21% (36.6 → 29.1)
Speechmatics: ~40% (33.6 → 20.3)
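The relative reductions above follow directly from the absolute WER pairs; e.g. for the Deepgram pair:

```python
# Relative WER reduction: (WER_mix - WER_enhanced) / WER_mix.
# Numbers are the Deepgram pair reported above (mix 40.6 -> VF 1.1 23.1).
wer_mix, wer_vf = 40.6, 23.1
rel_reduction = (wer_mix - wer_vf) / wer_mix * 100  # ~43.1%
```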
Substitutions remain comparable between VF 1.1 and the unprocessed mix, indicating that the enhancement preserves the spectral characteristics of foreground speech without introducing distortion that would cause the ASR to mishear words.
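For reference, the insertion/substitution/deletion split we report comes from the standard Levenshtein alignment between reference and hypothesis. A minimal sketch (illustrative only; actual scoring uses the providers' transcripts against our references):

```python
# Minimal sketch: decompose WER into substitution, deletion, and
# insertion counts via Levenshtein alignment with backtrace.

def wer_components(ref, hyp):
    """Return (substitutions, deletions, insertions) for two word lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] to hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(m + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to count error types.
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            D += 1   # reference word with no hypothesis match
            i -= 1
        else:
            I += 1   # hypothesis word with no reference match
            j -= 1
    return S, D, I

ref = "the quick brown fox".split()
hyp = "the fast brown fox dog".split()  # one substitution, one insertion
S, D, I = wer_components(ref, hyp)
wer = (S + D + I) / len(ref)  # 0.5
```

Insertions are the component inflated by background speech leaking through, deletions the component inflated by over-suppression.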

Quantitative Evaluation
Internal Eval Data Set (8 and 16 kHz)
This is based on an internal data set created by playing clean recordings through an artificial mouth and recording them with different devices (such as laptops and phones), in combination with media devices playing in the background and/or real background speakers.
The plot shows WER broken down into insertion, substitution, and deletion components for 5 different commercial API providers.
WER for the mix, i.e., the unprocessed audio, is driven by high insertions from background speech.
Krisp BVC telephony has a lower WER than the mix: insertions are greatly reduced, but deletions now dominate the WER, suggesting over-suppression of foreground speech.
VF 1.1 achieves the best WERs across all APIs. It reduces insertions while only slightly increasing deletions in some cases. Insertions are slightly higher than for Krisp BVC telephony, but the lower deletion rate results in a better overall WER.
We did not include Krisp BVC in this evaluation after observing how unreliable it is on the Switchboard benchmark.

