Fixing the audio input for voice agents

Voice agents are revolutionizing the way we interact with technology – but they can only perform as well as the audio they receive.

These systems are built on a complex stack: voice capture, speech recognition (ASR), reasoning (LLMs) and text-to-speech (TTS). While each layer has improved dramatically, one foundational element remains critically underserved and can break the entire system: the quality of the audio input.

Poor audio leads to missed cues, poorly timed responses and a frustrating user experience. In high-stakes environments like customer service or sales, that frustration can quickly turn into lost trust, increased churn and reduced ROI.

Why audio quality fails

Despite improvements across the voice stack, issues like noise, echo, reverb and compression persist. This is largely due to constantly varying input conditions, such as users on iPhones in busy places, band-limited landlines or speakerphones in reverberant rooms. These factors impair ASR and voice activity detection (VAD), resulting in common failure modes (a toy sketch after the list illustrates the first two):

Problem 1: Over-transcription

  • Cause: VAD picks up background speech as user input
  • Impact: Irrelevant transcriptions, agent interruptions
  • Example: Chatter in a café is falsely detected as user speech, and the agent is interrupted

Problem 2: Under-transcription

  • Cause: Missed short/quiet user responses due to noise or low SNR
  • Impact: Repeated prompts, stalled conversation
  • Example: A short “yes” goes undetected, causing agent loops

Problem 3: Turn-taking errors

  • Cause: VAD misdetects speech boundaries
  • Impact: Agent responds too early or too late
  • Example: Agent cuts off the user or delays response due to background noise
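To make these failure modes concrete, here is a toy, self-contained Python sketch of a naive energy-threshold detector on synthetic audio. It is not Silero VAD or our SDK, and the signal levels and threshold are illustrative assumptions: a quiet response stays under the threshold (problem 2), while a burst of chatter exceeds it (problem 1).

```python
import numpy as np

SAMPLE_RATE = 16_000
FRAME_LEN = SAMPLE_RATE * 30 // 1000        # 30 ms frames

def naive_vad(audio: np.ndarray, threshold_db: float = -35.0) -> np.ndarray:
    """Flag each 30 ms frame as speech when its RMS energy clears a fixed threshold."""
    n = len(audio) // FRAME_LEN
    frames = audio[: n * FRAME_LEN].reshape(n, FRAME_LEN)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    return 20 * np.log10(rms) > threshold_db

rng = np.random.default_rng(0)
t = np.arange(4 * FRAME_LEN) / SAMPLE_RATE

clip = 0.005 * rng.standard_normal(2 * SAMPLE_RATE)                  # steady background noise
clip[4_000 : 4_000 + len(t)] += 0.01 * np.sin(2 * np.pi * 180 * t)   # quiet "yes"
clip[20_000 : 20_000 + len(t)] += 0.05 * rng.standard_normal(len(t)) # loud café chatter

decisions = naive_vad(clip)
# The quiet response never clears the threshold (under-transcription),
# while the chatter burst does (over-transcription).
print("frames flagged as speech:", np.flatnonzero(decisions))
```

Model-based detectors such as Silero VAD are far better than this toy, but the same trade-off between missing quiet speech and triggering on background noise remains when the input itself is degraded.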

ai|coustics SDK: built for voice agent pipelines

The ai|coustics SDK enhances real-time audio input for voice agents, optimizing speech clarity and VAD performance across environments. Its key characteristics are listed below, followed by a hypothetical integration sketch.

  • Fast integration via Rust-based SDK with our proprietary AirTen audio inference engine (no ONNX runtime required)
  • SDK wrappers available for Python, Node.js, Rust, and C++
  • Down to 10ms latency
  • Supports flexible deployment on both CPU and GPU, optimized for low latency and cost efficiency
  • CPU-efficient and fully compatible with edge and server deployments (1–2% CPU load on a single core of an Intel Xeon processor)
  • Adaptable enhancement strength for manual fine-tuning
  • Powered by Quail, our state-of-the-art real-time model for denoising, dereverberation and voice isolation
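As a rough illustration of where the SDK sits in a pipeline, the Python sketch below applies per-frame enhancement ahead of VAD and ASR. The names here (AicEnhancer, process_frame, strength) are placeholders, not the published SDK API, and the stand-in class simply passes audio through so the sketch runs on its own.

```python
import numpy as np

# Placeholder for the real SDK import (e.g. the Python wrapper mentioned above).
class AicEnhancer:
    """Stand-in for the SDK's enhancer; names and signature are assumptions."""
    def __init__(self, sample_rate: int, strength: float = 0.8):
        self.sample_rate = sample_rate
        self.strength = strength          # adaptable enhancement strength
    def process_frame(self, frame: np.ndarray) -> np.ndarray:
        return frame                      # real SDK: Quail denoise/dereverb/isolate

enhancer = AicEnhancer(sample_rate=16_000, strength=0.8)

def on_audio_frame(frame: np.ndarray) -> np.ndarray:
    """Called per 10 ms frame by the telephony/WebRTC stack; the enhanced
    frame is what VAD and ASR see downstream."""
    return enhancer.process_frame(frame)

# Example: one 10 ms frame of silence at 16 kHz.
print(on_audio_frame(np.zeros(160, dtype=np.float32)).shape)
```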

Benchmark: improved VAD performance with ai|coustics pre-processing

Figure 1: Voice agent benchmark.

To evaluate how our SDK influences downstream audio tasks, we conducted a controlled experiment on VAD performance using the MSDWild dataset – selected for its high background noise levels and challenging acoustic conditions, which make it realistic for voice agent scenarios. We tested the latest Silero VAD (v5) with and without ai|coustics SDK pre-processing; a sketch of the comparison loop follows.
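The sketch below shows the shape of that comparison. The Silero VAD calls follow the public silero-vad Python package; enhance() and the file name are placeholders standing in for the ai|coustics SDK pre-processing step and the MSDWild audio.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()  # Silero VAD v5

def enhance(wav):
    """Placeholder for ai|coustics SDK pre-processing; passes audio through here."""
    return wav

def detect(wav):
    """Return detected speech segments as (start, end) sample indices."""
    ts = get_speech_timestamps(wav, model, sampling_rate=16_000)
    return [(t["start"], t["end"]) for t in ts]

wav = read_audio("msdwild_clip.wav", sampling_rate=16_000)  # placeholder file
baseline = detect(wav)            # raw audio
processed = detect(enhance(wav))  # SDK-enhanced audio
```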

Results:

  • Applying our SDK reduced the false negative rate from 37% to 23%, i.e., roughly 40% fewer missed speech segments.
  • The true negative rate remained essentially unchanged, moving from 91% to 89%.
  • Overall, ROC-AUC improved from 0.866 to 0.902, indicating a better sensitivity-specificity tradeoff (Figure 2); a sketch of how these metrics are computed follows.
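For reference, the metrics above can be computed from frame-level reference labels and per-frame VAD speech probabilities roughly as follows. The random data at the end is purely illustrative and has no relation to the MSDWild results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def vad_metrics(labels: np.ndarray, probs: np.ndarray, threshold: float = 0.5):
    """False negative rate, true negative rate and ROC-AUC from frame labels."""
    pred = probs >= threshold
    fn_rate = np.mean(~pred[labels == 1])   # missed speech frames
    tn_rate = np.mean(~pred[labels == 0])   # correctly rejected non-speech
    return fn_rate, tn_rate, roc_auc_score(labels, probs)

# Toy illustration with synthetic labels/scores (not the benchmark data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 10_000)
probs = np.clip(labels * 0.6 + rng.normal(0.2, 0.25, 10_000), 0.0, 1.0)
print(vad_metrics(labels, probs))
```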

This suggests that the SDK effectively enhances speech segments while preserving non-speech regions, yielding measurable improvements in VAD performance even without model-specific tuning.

These results validate the SDK’s potential to enhance downstream components like VADs, particularly in noisy or reverberant environments. Our next evaluations will extend to ASR and turn-taking behavior in full agent pipelines.

What’s next

Our SDK is now in open beta (free access throughout August and September) for early adopters to test, integrate and shape its direction.

We’re also launching a self-serve developer platform in September to streamline access, testing and deployment at scale.

More functionality is coming soon: improved VAD, speaker separation, speech quality detection and automatic audio input testing.

We’re curious to learn more about challenges in the voice agent input layer. If you’re building in that space, we’d love to hear from you – feel free to book a call with our founder here.
