Introducing Tyto: Audio Insight into every call for Voice AI teams, at scale

Written by

Fabian Seipel

CEO and Co-founder

Product

Jun 11, 2026

Voice agents running at scale share a common blind spot. When something breaks or a customer complains, there's no fast way to tell if the fault lies in the stack or the audio feeding it.

The problem runs deeper than tooling. Real-world audio is unpredictable, and between the words leaving a user's mouth and the signal reaching your STT or LLM, it passes through a complex chain of noise, reverberant rooms, cheap microphones, codec compression, and network dropouts. What the user says and what the model receives can be two very different things. When they are, your agent was never set up to succeed.

Teams end up sampling calls manually or routing transcripts through an LLM judge. Both approaches are slow, expensive, and fundamentally not built for audio.

At ai-coustics, we sit upstream of this problem. Our voice isolation and VAD models are the audio input layer that voice stacks run on, covering two of the most failure-prone steps in any pipeline. Tyto is the next layer, giving you insight into whether the audio reaching your agent is likely to cause failure before you need to look at a transcript.

Introducing Tyto

Tyto is a lightweight model that runs on an audio file or stream and predicts whether the audio reaching your agent is likely to cause downstream failures. It sits at the front of the voice pipeline, before your VAD, ASR, or speech-to-speech model, and outputs a single score plus a detailed explanation of what's driving it.

Run it live during the call to steer agent behavior, or in post-call analysis to get back a ranked list of which calls had degraded audio and why, without sampling, manual review, or guesswork.

How it works: the Tyto Score

The Tyto Risk Score (risk_score) is the headline audio score. It predicts the likelihood that downstream models will fail, including speech-to-text, voice activity detection, turn-taking, and speech-to-speech models. Lower scores indicate less problematic audio.
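To make the post-call use of the score concrete, here is a minimal sketch of ranking calls by risk. It does not use the ai-coustics SDK: the `CallScore` type, the `rank_by_risk` helper, the 0.5 threshold, and the example values are all hypothetical, and it assumes risk_score lies between 0 and 1.

```python
from typing import NamedTuple


class CallScore(NamedTuple):
    """Hypothetical container for one call's headline score."""
    call_id: str
    risk_score: float  # assumed range 0..1; lower means less problematic audio


def rank_by_risk(scores, threshold=0.5):
    """Return the calls at or above `threshold`, riskiest first.

    The threshold is an illustrative value, not a recommended setting.
    """
    flagged = [s for s in scores if s.risk_score >= threshold]
    return sorted(flagged, key=lambda s: s.risk_score, reverse=True)


# Made-up scores, for illustration only.
calls = [
    CallScore("call-001", 0.12),
    CallScore("call-002", 0.87),
    CallScore("call-003", 0.55),
]

for s in rank_by_risk(calls):
    print(f"{s.call_id}: {s.risk_score:.2f}")
```

In practice the scores would come from running Tyto over each recording; the ranking step itself is just a filter and a sort.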

Tyto Dimensions: the methodology behind the score

Beyond the main score, Tyto classifies the type of degradation causing the problem. Every prediction comes with scores across a range of dimensions, all on a 0-1 scale. For every dimension except Speaker Loudness, higher means more severe. These dimensions are:

Noise: Ambient or environmental non-speech noise behind the speaker, relative to the speaker's level.
Speaker Reverb: Speaker distance and room reverberance. Low scores indicate dry, near-field audio (close mic); high scores indicate reverberant, far-field audio.
Speaker Loudness: The loudness level of the main speaker. This is the one neutral dimension. It is a level meter, not a degradation score.
Interfering Speech: Interference from additional live speakers audible in the audio.
Background Media: Interfering speech content from media devices such as TVs, radios or smartphones.
Packet Loss: Audio dropouts or discontinuities in the audio stream or file such as from network packet loss, jitter, frame erasure, or CPU overload.

These qualitative dimensions are near-orthogonal. For example, a call can be free of background noise yet severely degraded on packet loss.
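Because the dimensions are scored separately, one simple way to explain a risky call is to report its most severe degradation. The sketch below assumes the dimension scores arrive as a plain dictionary; the snake_case key names and the `dominant_degradation` helper are hypothetical, not the SDK's actual output format. Speaker Loudness is left out because it is a level meter rather than a degradation score.

```python
# Hypothetical key names for the five degradation dimensions.
# "speaker_loudness" is deliberately excluded: it is a neutral level meter.
DEGRADATION_DIMENSIONS = (
    "noise",
    "speaker_reverb",
    "interfering_speech",
    "background_media",
    "packet_loss",
)


def dominant_degradation(dimensions):
    """Return (name, score) of the most severe degradation dimension."""
    candidates = {
        name: dimensions[name]
        for name in DEGRADATION_DIMENSIONS
        if name in dimensions
    }
    if not candidates:
        raise ValueError("no degradation dimensions present")
    name = max(candidates, key=candidates.get)
    return name, candidates[name]


# Made-up example: a call with no background noise but heavy packet loss.
example = {
    "noise": 0.05,
    "speaker_reverb": 0.20,
    "speaker_loudness": 0.95,  # ignored by dominant_degradation
    "interfering_speech": 0.10,
    "background_media": 0.02,
    "packet_loss": 0.90,
}

print(dominant_degradation(example))
```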

Qualitative Examples with Tyto Scores

[Audio examples, one per dimension: Noise, Speaker Reverb, Speaker Loudness, Interfering Speech, Background Media, Packet Loss]

From insight to action

Tyto works at every stage of the voice pipeline, in real time and after the fact.

Post-call analysis: Run Tyto over your call archive to surface which calls had degraded audio, what type of degradation occurred, and how likely it was to impact performance. For teams running at scale, this replaces manual sampling and complements LLM-based review with an audio-native signal that's faster and cheaper to run.
Real-time adaptive behavior: In streaming mode, Tyto enables the agent stack to respond dynamically to changing audio conditions mid-call, whether that means adjusting VAD sensitivity, disabling barge-in in a noisy environment, or prompting the user to move somewhere quieter.
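The real-time case can be sketched as a small policy that maps the latest dimension scores to agent actions. Everything here is an assumption made for illustration: the dictionary keys, the action names, the thresholds, and the `choose_actions` helper are hypothetical and are not part of the ai-coustics SDK.

```python
def choose_actions(dimensions, noise_limit=0.6, interference_limit=0.5):
    """Map the latest dimension scores (0..1) to a list of agent actions.

    Thresholds and action names are illustrative placeholders.
    """
    actions = []

    noise = dimensions.get("noise", 0.0)
    # Treat live bystanders and media playback as one kind of interference.
    interference = max(
        dimensions.get("interfering_speech", 0.0),
        dimensions.get("background_media", 0.0),
    )

    if noise >= noise_limit:
        # In a noisy environment, stop noise from interrupting the agent
        # and ask the user to move somewhere quieter.
        actions.append("disable_barge_in")
        actions.append("prompt_user_to_move")

    if interference >= interference_limit:
        # Other voices can trigger the VAD, so make it less sensitive.
        actions.append("raise_vad_threshold")

    return actions


# Made-up mid-call readings.
print(choose_actions({"noise": 0.8, "interfering_speech": 0.1}))
print(choose_actions({"noise": 0.1, "background_media": 0.7}))
```

A production policy would likely smooth the scores over time before acting, so that a single noisy frame does not toggle agent behavior.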

Test out Tyto today

The next frontier for voice agent performance isn't better models; it's better visibility into where your agent is breaking. Tyto is the first step toward that: a lightweight, audio-native signal that tells you what's happening before it becomes a problem.

Tyto Documentation

Tyto is available now via the ai-coustics SDK, with Python support at launch. We're working closely with voice AI teams in early access and will be expanding support for additional languages and platforms over the coming days. If you're running voice agents in production and would like more bespoke advice, talk to our team.
