Engineering production-grade voice agents with PolyAI and ai-coustics

Written by

Nell Campbell

Head of Marketing

Case studies

Jun 1, 2026

Voice AI's hardest problem isn't sounding human, it's reliably hearing one. Few teams have met that problem at enterprise scale the way PolyAI has. The NVIDIA-backed company runs more than 2,000 live deployments across 75 languages for customers like Marriott, PG&E and Foot Locker, handling hundreds of thousands of customer calls a day across hospitality, financial services, retail and more. Almost the whole stack, including ASR, is built in-house by one of the strongest engineering teams in voice AI.

Key results from integrating ai-coustics into PolyAI, at a glance: +5% PolyScore (a measure of task completion); false barge-in rate down 40% under clean conditions and 15% on noisy calls; empty transcript rate (short utterances that fail to be detected as speech) down 30%.

The challenge

At PolyAI's volume, the audio layer is where voice AI either works or doesn't. Every other quality metric depends on it. By late 2025, the team had identified the audio foundation as the next place to invest. These were classic hard problems of voice at scale. A "yes" or "no" would slip past the VAD entirely and disappear into a "sorry, I didn't hear that" loop. Background noise would trip false interrupts. Sometimes both sides of the conversation would talk over each other until turn-taking collapsed.

The existing stack ran Silero VAD with smart-turn on top. It worked well, but PolyAI had been deliberately selective about what to build on it. Barge-in is the ability for a caller to interrupt the voice agent mid-sentence and have the agent stop, listen and respond to what was just said. It is a natural part of human conversation. Without it, the agent drones on regardless of what the caller does and the conversation feels mechanical.

The catch is that a single false trigger from background noise, a TV or an acoustic echo can have the agent respond to something the caller never said. Two or three of those in a call are enough to lose the caller. Generic denoising tools that the team had tried in the past hurt ASR more than they helped, because most enhancement is tuned for the human ear rather than the phonetic detail downstream models need. Solving that meant an audio layer built for machine understanding.
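To make the trade-off concrete, here is a minimal sketch of barge-in gating. It is not PolyAI's or ai-coustics' implementation. It assumes a VAD that emits one speech probability per audio frame, and the names `should_barge_in`, `threshold` and `min_speech_frames` are hypothetical, chosen only for illustration.

```python
from typing import Iterable


def should_barge_in(
    speech_probs: Iterable[float],
    threshold: float = 0.5,
    min_speech_frames: int = 10,
) -> bool:
    """Return True once enough consecutive frames look like speech.

    speech_probs: hypothetical per-frame VAD probabilities in [0, 1].
    threshold: probability at or above which a frame counts as speech.
    min_speech_frames: consecutive speech frames required to interrupt.
    """
    consecutive = 0
    for prob in speech_probs:
        if prob >= threshold:
            consecutive += 1
            if consecutive >= min_speech_frames:
                return True
        else:
            # Any non-speech frame resets the run, so short noise bursts
            # never accumulate into an interrupt.
            consecutive = 0
    return False
```

Raising `min_speech_frames` suppresses brief noise bursts, but it also makes a one-word "yes" easier to miss. A simple gate like this cannot escape that tension on a noisy signal, which is why the quality of the audio feeding the VAD matters so much.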

"With ai-coustics in the pipeline, we can address these failure modes at the source rather than engineering around them downstream."
Razvan Kusztos, VP of Engineering, PolyAI

Engineering by evaluation

PolyAI is rigorous about evaluation. They ran ai-coustics through the same discipline they bring to every infrastructure change: a structured A/B test against the Silero baseline and against Krisp, with WER, latency and concurrent call handling tracked as guardrails against regression.
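For readers less familiar with WER as a guardrail, the sketch below shows the standard word error rate calculation (word-level edit distance divided by the number of reference words) and a simple regression check. The `passes_guardrail` helper and its `max_regression` tolerance are assumptions for illustration, not PolyAI's actual thresholds or tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def passes_guardrail(
    baseline_wer: float,
    candidate_wer: float,
    max_regression: float = 0.01,  # hypothetical tolerance
) -> bool:
    """True if the candidate's WER is not worse than baseline + tolerance."""
    return candidate_wer - baseline_wer <= max_regression
```

Note that an empty transcript for a one-word reference such as "no" scores a WER of 1.0, so short-utterance failures show up directly in this metric.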

On top of that, the team built a two-layer qualitative review designed to answer the question those numbers couldn't, which is whether the conversation actually felt right. First, AI agents scored thousands of calls on turn-taking, naturalness and apparent caller frustration. Then the whole team committed to an hour a week of manual annotation on real calls as a check against LLM hallucination.
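One way to use manual annotation as a check on automated scoring is to measure how often the two agree on the same calls. The sketch below is an assumption about how such a check could look, not a description of PolyAI's tooling. The call IDs, labels and function names are hypothetical.

```python
from typing import Dict, List


def agreement_rate(llm_labels: Dict[str, str], human_labels: Dict[str, str]) -> float:
    """Share of jointly labeled calls where the LLM and the human agree."""
    shared = llm_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no calls were labeled by both the LLM and a human")
    matches = sum(llm_labels[call] == human_labels[call] for call in shared)
    return matches / len(shared)


def disagreements(llm_labels: Dict[str, str], human_labels: Dict[str, str]) -> List[str]:
    """Call IDs where the two labels differ, sorted for stable review order."""
    shared = llm_labels.keys() & human_labels.keys()
    return sorted(call for call in shared if llm_labels[call] != human_labels[call])


# Hypothetical labels for three calls; only two were reviewed by a human.
llm = {"call-1": "natural", "call-2": "frustrated", "call-3": "natural"}
human = {"call-1": "natural", "call-2": "natural"}
rate = agreement_rate(llm, human)
to_review = disagreements(llm, human)
```

A falling agreement rate, or a cluster of disagreements on one kind of call, suggests the automated scores need a closer look before they are trusted at scale.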

"With ai-coustics on, we've had the best customer satisfaction scores of all the tests we've done."
Razvan Kusztos, VP of Engineering, PolyAI

What the integration unlocked

PolyAI integrated the ai-coustics SDK directly into their dialog agent platform. The two engineering teams worked through a shared Slack channel, comparing model versions head to head and shipping fixes inside the same release cycle, in a matter of days. When ai-coustics released Quail Voice Focus 2.1, PolyAI was first to deploy it.

The effect on the stack was broad:

VAD migration: PolyAI moved from Silero to ai-coustics' VAD 2.0 in the majority of cases, retaining smart-turn on top of the enhanced audio stream.
Fewer short-utterance failures: Detection of short replies like "yes", "no" and "uh-huh" became markedly more reliable. These used to be missed and trigger "sorry, I didn't hear that" loops.
Fewer false interrupts: The classic hard real-world cases (a TV in the background, a car radio, a toddler crying) stopped breaking calls.

The next layer

The next layer is audio intelligence. As the data underneath the agent becomes richer, it can do more than clean the signal. ai-coustics is building real-time audio observability that scores every call as it happens and predicts where the agent is most likely to break, giving teams the signals to adapt the conversation as it unfolds.

An agent can then ask a caller to move somewhere quieter when the line degrades, switch enhancement profiles for different acoustic environments, or know when to slow down or hand off. It also lets teams quickly identify failed calls through post-call analysis. PolyAI is among the first teams testing it in production, and its commitment to shipping only what works at scale makes it the right team to build this with.

Curious what audio intelligence can do for your voice agents? Try the ai-coustics SDK in our Developer Platform, or talk to our team about enterprise deployments.