Fixing the audio input for voice agents

Voice agents are revolutionizing the way we interact with technology – but they can only perform as well as the audio they receive.

These systems are built on a complex stack: voice capture, speech recognition (ASR), reasoning (LLMs) and text-to-speech (TTS). While each layer has improved dramatically, one foundational element remains critically underserved and can break the entire system: the quality of the audio input.

Poor audio leads to missed cues, poorly timed responses and a frustrating user experience. In high-stakes environments like customer service or sales, that frustration can quickly turn into lost trust, increased churn and reduced ROI.

Why audio quality fails

Despite improvements across the voice stack, issues like noise, echo, reverb and compression persist. This is largely due to constantly varying input conditions, such as users on iPhones in busy places, band-limited landlines or speakerphones in reverberant rooms. These factors impair ASR and voice activity detection (VAD), resulting in common failure modes (a toy sketch after the list illustrates the first two):

Problem 1: Over-transcription

  • Cause: VAD picks up background speech as user input
  • Impact: Irrelevant transcriptions, agent interruptions
  • Example: Chatter in a café is falsely detected as user speech, and the agent is interrupted

Problem 2: Under-transcription

  • Cause: Missed short/quiet user responses due to noise or low SNR
  • Impact: Repeated prompts, stalled conversation
  • Example: A short “yes” goes undetected, causing agent loops

Problem 3: Turn-taking errors

  • Cause: VAD misdetects speech boundaries
  • Impact: Agent responds too early or too late
  • Example: Agent cuts off the user or delays response due to background noise
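To make these failure modes concrete, here is a toy, self-contained Python sketch of a naive energy-threshold detector on synthetic audio. It is not Silero VAD or our SDK, and the signal levels and threshold are illustrative assumptions: a quiet response stays under the threshold (problem 2), while a burst of chatter exceeds it (problem 1).

```python
import numpy as np

SAMPLE_RATE = 16_000
FRAME_LEN = SAMPLE_RATE * 30 // 1000        # 30 ms frames

def naive_vad(audio: np.ndarray, threshold_db: float = -35.0) -> np.ndarray:
    """Flag each 30 ms frame as speech when its RMS energy clears a fixed threshold."""
    n = len(audio) // FRAME_LEN
    frames = audio[: n * FRAME_LEN].reshape(n, FRAME_LEN)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    return 20 * np.log10(rms) > threshold_db

rng = np.random.default_rng(0)
t = np.arange(4 * FRAME_LEN) / SAMPLE_RATE

clip = 0.005 * rng.standard_normal(2 * SAMPLE_RATE)                  # steady background noise
clip[4_000 : 4_000 + len(t)] += 0.01 * np.sin(2 * np.pi * 180 * t)   # quiet "yes"
clip[20_000 : 20_000 + len(t)] += 0.05 * rng.standard_normal(len(t)) # loud café chatter

decisions = naive_vad(clip)
# The quiet response never clears the threshold (under-transcription),
# while the chatter burst does (over-transcription).
print("frames flagged as speech:", np.flatnonzero(decisions))
```

Model-based detectors such as Silero VAD are far better than this toy, but the same trade-off between missing quiet speech and triggering on background noise remains when the input itself is degraded.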

ai|coustics SDK: built for voice agent pipelines

The ai|coustics SDK enhances real-time audio input for voice agents, optimizing speech clarity and VAD performance across environments. Its key characteristics are listed below, followed by a hypothetical integration sketch.

  • Fast integration via Rust-based SDK with our proprietary AirTen audio inference engine (no ONNX runtime required)
  • SDK wrappers available for Python, Node.js, Rust, and C++
  • Down to 10ms latency
  • Supports flexible deployment on both CPU and GPU, optimized for low latency and cost efficiency
  • CPU-efficient and fully compatible with edge and server deployments (1–2% CPU load on a single core of an Intel Xeon processor)
  • Adaptable enhancement strength for manual fine-tuning
  • Powered by Quail, our state-of-the-art real-time model for denoising, dereverberation and voice isolation
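As a rough illustration of where the SDK sits in a pipeline, the Python sketch below applies per-frame enhancement ahead of VAD and ASR. The names here (AicEnhancer, process_frame, strength) are placeholders, not the published SDK API, and the stand-in class simply passes audio through so the sketch runs on its own.

```python
import numpy as np

# Placeholder for the real SDK import (e.g. the Python wrapper mentioned above).
class AicEnhancer:
    """Stand-in for the SDK's enhancer; names and signature are assumptions."""
    def __init__(self, sample_rate: int, strength: float = 0.8):
        self.sample_rate = sample_rate
        self.strength = strength          # adaptable enhancement strength
    def process_frame(self, frame: np.ndarray) -> np.ndarray:
        return frame                      # real SDK: Quail denoise/dereverb/isolate

enhancer = AicEnhancer(sample_rate=16_000, strength=0.8)

def on_audio_frame(frame: np.ndarray) -> np.ndarray:
    """Called per 10 ms frame by the telephony/WebRTC stack; the enhanced
    frame is what VAD and ASR see downstream."""
    return enhancer.process_frame(frame)

# Example: one 10 ms frame of silence at 16 kHz.
print(on_audio_frame(np.zeros(160, dtype=np.float32)).shape)
```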

Benchmark: improved VAD performance with ai|coustics pre-processing

Figure 1: Voice agent benchmark.

To evaluate how our SDK influences downstream audio tasks, we conducted a controlled experiment on VAD performance using the MSDWild dataset – selected for its high background noise levels and challenging acoustic conditions, which make it realistic for voice agent scenarios. We tested the latest Silero VAD (v5) with and without ai|coustics SDK pre-processing; a sketch of the comparison loop follows.
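The sketch below shows the shape of that comparison. The Silero VAD calls follow the public silero-vad Python package; enhance() and the file name are placeholders standing in for the ai|coustics SDK pre-processing step and the MSDWild audio.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()  # Silero VAD v5

def enhance(wav):
    """Placeholder for ai|coustics SDK pre-processing; passes audio through here."""
    return wav

def detect(wav):
    """Return detected speech segments as (start, end) sample indices."""
    ts = get_speech_timestamps(wav, model, sampling_rate=16_000)
    return [(t["start"], t["end"]) for t in ts]

wav = read_audio("msdwild_clip.wav", sampling_rate=16_000)  # placeholder file
baseline = detect(wav)            # raw audio
processed = detect(enhance(wav))  # SDK-enhanced audio
```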

Results:

  • Applying our SDK reduced the false negative rate from 37% to 23%, i.e., roughly 40% fewer missed speech segments.
  • The true negative rate remained essentially unchanged, moving from 91% to 89%.
  • Overall, ROC-AUC improved from 0.866 to 0.902, indicating a better sensitivity-specificity tradeoff (Figure 2); a sketch of how these metrics are computed follows.
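For reference, the metrics above can be computed from frame-level reference labels and per-frame VAD speech probabilities roughly as follows. The random data at the end is purely illustrative and has no relation to the MSDWild results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def vad_metrics(labels: np.ndarray, probs: np.ndarray, threshold: float = 0.5):
    """False negative rate, true negative rate and ROC-AUC from frame labels."""
    pred = probs >= threshold
    fn_rate = np.mean(~pred[labels == 1])   # missed speech frames
    tn_rate = np.mean(~pred[labels == 0])   # correctly rejected non-speech
    return fn_rate, tn_rate, roc_auc_score(labels, probs)

# Toy illustration with synthetic labels/scores (not the benchmark data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 10_000)
probs = np.clip(labels * 0.6 + rng.normal(0.2, 0.25, 10_000), 0.0, 1.0)
print(vad_metrics(labels, probs))
```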

This suggests that the SDK effectively enhances speech segments while preserving non-speech regions, yielding measurable improvements in VAD performance even without model-specific tuning.

These results validate the SDK’s potential to enhance downstream components like VADs, particularly in noisy or reverberant environments. Our next evaluations will extend to ASR and turn-taking behavior in full agent pipelines.

What’s next

Our SDK is now in open beta (free access throughout August and September) for early adopters to test, integrate and shape its direction.

We’re also launching a self-serve developer platform in September to streamline access, testing and deployment at scale.

More functionality is coming soon: improved VAD, speaker separation, speech quality detection and automatic audio input testing.

We’re curious to learn more about challenges in the voice agent input layer. If you’re building in that space, we’d love to hear from you – feel free to book a call with our founder here.
