
Fixing the audio input for voice agents

Voice agents are revolutionizing the way we interact with technology, but they can only perform as well as the audio they receive.

These systems are built on a complex stack: voice capture, speech recognition (ASR), reasoning (LLMs) and text-to-speech (TTS). While each layer has improved dramatically, one foundational element remains critically underserved and can break the entire system: the quality of the audio input.

Poor audio leads to missed cues, poorly timed responses and a frustrating user experience. In high-stakes environments like customer service or sales, that frustration can quickly turn into lost trust, increased churn and reduced ROI.

Why audio quality fails

Despite improvements across the voice stack, noise, echo, reverb and compression issues still persist. This is largely due to constantly varying input conditions: users on iPhones in busy places, band-limited landlines, or speakerphones in reverberant rooms. These factors impair ASR and voice activity detection (VAD), resulting in common failure modes:

Problem 1: Over-transcription

  • Cause: VAD picks up background speech as user input
  • Impact: Irrelevant transcriptions, agent interruptions
  • Example: Chatter in a café is falsely detected by VAD and the agent is interrupted

Problem 2: Under-transcription

  • Cause: Missed short/quiet user responses due to noise or low SNR
  • Impact: Repeated prompts, stalled conversation
  • Example: A short “yes” goes undetected, causing agent loops

Problem 3: Turn-taking errors

  • Cause: VAD misdetects speech boundaries
  • Impact: Agent responds too early or too late
  • Example: Agent cuts off the user or delays response due to background noise
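To make the low-SNR condition behind Problem 2 concrete, here is a small synthetic sketch (NumPy, toy signals rather than our test data) that mixes a quiet utterance with background noise at a target signal-to-noise ratio. At 0 dB the noise carries as much energy as the speech, which is the regime where short replies like “yes” start disappearing from VAD and ASR.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture has the requested speech-to-noise ratio."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000                  # one second at 16 kHz
speech = 0.05 * np.sin(2 * np.pi * 220 * t)     # quiet tone as a stand-in for "yes"
noise = rng.normal(size=t.shape)                # stand-in for cafe chatter
noisy = mix_at_snr(speech, noise, snr_db=0)     # 0 dB: noise as loud as the speech
```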

ai-coustics SDK: built for voice agent pipelines

The ai-coustics SDK enhances real-time audio input for voice agents, optimizing speech clarity and VAD performance across environments.

  • Fast integration via a Rust-based SDK with our proprietary AirTen audio inference engine (no ONNX runtime required)
  • SDK wrappers available for Python, Node.js, Rust, and C++
  • Latency down to 10 ms
  • Supports flexible deployment on both CPU and GPU, optimized for low latency and cost efficiency
  • CPU-efficient and fully compatible with edge and server deployments (1-2% CPU load on a single core of an Intel Xeon processor)
  • Adaptable enhancement strength for manual fine-tuning

Powered by Quail, our state-of-the-art real-time model for denoising, dereverberation and voice isolation, and supported by additional STT and VAD models that meet industry-leading standards.
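As a rough illustration of where the SDK sits in a voice agent pipeline, here is a minimal Python sketch. The module, class and parameter names (aic_sdk, QuailEnhancer, enhancement_strength, process_frame) are placeholders for illustration, not the documented API; see the developer portal for the real interface.

```python
from aic_sdk import QuailEnhancer  # hypothetical module and class names

# enhancement_strength maps to the adjustable enhancement strength above;
# the parameter name is assumed for illustration.
enhancer = QuailEnhancer(sample_rate=16_000, enhancement_strength=0.8)

def enhance_stream(frames):
    """Enhance raw PCM frames before they reach VAD and ASR."""
    for frame in frames:                      # e.g. 10 ms chunks of int16 samples
        yield enhancer.process_frame(frame)   # method name assumed
```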

Benchmark: improved VAD performance with Quail VAD

Most voice agents use the classic Silero VAD to detect speech and improve conversational flow. We tested the new Quail VAD model against it and found that Quail VAD demonstrates superior performance across key metrics.

[Figure: Silero VAD vs. ai-coustics Quail VAD, with Quail VAD ahead on both F1 Score and Balanced Accuracy]

We used the MSDWild dataset, which features realistic acoustic conditions and heavy background noise, to provide a challenging and representative benchmark for voice agent applications. Across F1 Score and Balanced Accuracy, among other key metrics, Quail VAD consistently outperformed Silero. You can read more in our dedicated Quail VAD technical blog.
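For readers unfamiliar with the metrics: both are computed over per-frame speech/non-speech decisions. A minimal scikit-learn sketch with toy labels (illustrative only, not benchmark data):

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]  # toy ground-truth frame labels (1 = speech)
y_pred = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]  # toy VAD decisions for the same frames

f1 = f1_score(y_true, y_pred)                    # harmonic mean of precision and recall
bacc = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall; robust to
                                                 # the speech/silence class imbalance
print(f"F1 = {f1:.2f}, balanced accuracy = {bacc:.2f}")
```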

How does Quail VAD improve your voice agent?

  • Increased ASR quality, which improves your voice agent’s recognition and reduces false transcriptions.
  • Improved turn-taking, with better conversational timing and speaker transitions across your system; see the end-of-turn sketch below.
  • A lightweight footprint, as Quail VAD is designed to run efficiently with minimal processing overhead.
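To see why VAD accuracy translates directly into timing, consider a common hangover-style end-of-turn detector, sketched generically below (this is the standard pattern, not Quail VAD’s internals). False positives from background chatter keep resetting the silence counter, so the agent responds late; missed quiet speech lets the counter expire mid-utterance, so the agent barges in.

```python
def end_of_turn_events(vad_frames, frame_ms=10, hangover_ms=500):
    """Yield True on the frame where the user's turn ends: after speech has been
    observed and `hangover_ms` of continuous silence has followed it.
    vad_frames: iterable of booleans, True = speech detected in that frame."""
    silence_ms = 0
    heard_speech = False
    for is_speech in vad_frames:
        if is_speech:
            heard_speech = True
            silence_ms = 0           # any detected speech resets the countdown
            yield False
        else:
            silence_ms += frame_ms
            if heard_speech and silence_ms >= hangover_ms:
                heard_speech = False
                silence_ms = 0
                yield True           # safe for the agent to take its turn
            else:
                yield False
```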

Benchmark: Best-in-class transcription with Quail STT

Speech-to-Text (STT) or Automatic Speech Recognition (ASR) systems are crucial to the overall performance of your voice agent, but they are easily scrambled by real-world acoustic challenges, like a busy café or train station. Many teams try to improve their STT pipelines with de-noising tools like Krisp, but these solutions are built for human ears, not for STT/ASR systems. In fact, Krisp and other de-noising solutions sometimes remove subtle phonetic detail, making transcripts less accurate even when they sound cleaner.

Quail STT addresses these challenges with an STT solution specifically designed for real-time streaming in complex audio environments. Rather than optimizing for perceptual audio quality (a “vibes” check), Quail STT preserves the machine-relevant phonetic structure that STT models rely on. As a result, it delivers consistently larger reductions in Word Error Rate (WER).
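As a refresher on how WER numbers are read: the metric counts substitutions, insertions and deletions against a reference transcript, and improvements are usually quoted as relative reductions. A toy sketch using the jiwer library (the transcripts are made up, not benchmark output):

```python
import jiwer  # pip install jiwer

reference    = "yes please book the table for two"
baseline_hyp = "yes please look the table for you"  # two substitutions -> WER 2/7
enhanced_hyp = "yes please book the table for you"  # one substitution  -> WER 1/7

wer_base = jiwer.wer(reference, baseline_hyp)
wer_enh  = jiwer.wer(reference, enhanced_hyp)
relative_drop = (wer_base - wer_enh) / wer_base     # 50% relative error reduction here
print(f"baseline={wer_base:.2f}, enhanced={wer_enh:.2f}, drop={relative_drop:.0%}")
```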

Take a look at Quail STT versus Krisp with five major STT providers: Deepgram, Cartesia, Gladia, AssemblyAI and ElevenLabs.

You can read the full technical breakdown of these benchmarks in our dedicated Quail STT blog, but key takeaways include:

  • 10-30% drop in total errors across various providers and systems
  • Best-in-class performance against both standard STT providers used standalone and Krisp’s de-noising solution
  • Biggest gains in complex real-world situations, like noisy environments, variable accents and low-quality microphones, where STT pipelines usually struggle most

How does Quail STT improve your voice agent?

  • Provider-agnostic, reducing substitutions, insertions and deletions regardless of your STT provider.
  • 10-25% reduction in Word Error Rate, for a higher-performing voice agent.
  • All-in-one audio enhancement, with VAD, STT and perceptual speech enhancement (SE) all delivered through one fast, lightweight Rust SDK.

We're now integrated with Pipecat

The Pipecat integration lets teams add ai-coustics enhancement to their voice agent pipelines as an audio-input filter, delivering cleaner audio to VAD and STT in real time and making agents more production-ready from the start.

Learn more in the filter overview on Pipecat.
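For orientation, here is a rough sketch of how an audio-input filter slots into a Pipecat transport, following the pattern Pipecat uses for its existing filters. The AicFilter import path and constructor arguments are assumptions on our part; check the filter overview for the exact names.

```python
from pipecat.audio.filters.aic_filter import AicFilter   # import path assumed
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Audio-input filters in Pipecat run on captured audio before it reaches the
# rest of the pipeline (VAD, STT, ...), so enhancement happens at the input.
transport = DailyTransport(
    room_url="https://example.daily.co/room",   # placeholder room URL
    token=None,
    bot_name="voice-agent",
    params=DailyParams(
        audio_in_enabled=True,
        audio_in_filter=AicFilter(),            # filter class name assumed
    ),
)
```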

What’s next

Our SDK, complete with Quail VAD and Quail STT, is available to test for free in our developer portal. Sign up to generate your SDK key and start testing today, and check out our tutorial video for a guide to getting started.

We’re curious to learn more about challenges in the voice agent input layer. If you’re building in that space, we’d love to hear from you. Feel free to book a call with our founder here.

