Meet Quail: the most advanced real-time speech enhancement model

Today, we’re introducing Quail – our most compact and efficient model yet, purpose-built for real-time and streaming speech enhancement.

Quail delivers exceptional speech clarity and natural sound, even in the most resource-constrained applications, using less than 1% of the processing power of our flagship models. This makes it possible for teams to build offline-first, low-latency products that previously required cloud processing.

Quail is ideal for live conferencing, voice AI agents, communication, audio devices, streaming, broadcast technology, and privacy-sensitive environments.

Solving the challenges of real-time on-device enhancement

Enhancing speech in real time and on audio devices comes with two major constraints:

  1. Low-latency processing: Applications like digital communication and AI voice agents require low-latency processing. Where a cloud model can process audio with the context of the full file, a real-time model must enhance audio in short frames as it arrives. This imposes strict latency and design trade-offs.
  2. Model capacity: The models must be small enough to run efficiently on devices such as laptops, phones, or smart speakers. This means it’s difficult to match the power and accuracy of large-scale cloud models.

These constraints make it significantly harder to deliver natural, high-quality results in real-time scenarios, but we’re bridging the gap.
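
To make the first constraint concrete, here is a minimal Python sketch of streaming, frame-by-frame processing. It is an illustration only: `enhance_frame` is a hypothetical stand-in that applies a fixed gain (it is not Quail or the ai|coustics SDK), and the 10 ms frame length at 48 kHz is an assumed value, not a published figure.

```python
from typing import Dict, Iterable, Iterator, List

SAMPLE_RATE_HZ = 48_000  # assumed for illustration
FRAME_MS = 10            # assumed frame length, not a published Quail figure
FRAME_SIZE = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 480 samples per frame


def enhance_frame(frame: List[float], state: Dict[str, int]) -> List[float]:
    """Hypothetical stand-in for a real-time model (not the ai|coustics SDK).

    It sees only the current frame plus whatever it kept in `state` from
    earlier frames. It never sees future audio.
    """
    state["frames_seen"] = state.get("frames_seen", 0) + 1
    return [0.5 * sample for sample in frame]  # placeholder: fixed gain


def stream_enhance(samples: Iterable[float]) -> Iterator[List[float]]:
    """Consume samples as they arrive and yield one processed frame at a time."""
    state: Dict[str, int] = {}
    buffer: List[float] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == FRAME_SIZE:
            yield enhance_frame(buffer, state)
            buffer = []
    if buffer:
        # Zero-pad the final partial frame so every frame has the same length.
        buffer.extend([0.0] * (FRAME_SIZE - len(buffer)))
        yield enhance_frame(buffer, state)
```

Each frame can be played out as soon as it has been processed, so the buffering delay is bounded by the frame length rather than by the length of the recording. The trade-off is that the model has to work without any future context.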

Introducing Quail: optimised for naturalness and efficiency

Quail is a new family of models designed specifically for real-time speech enhancement on audio devices and for streaming applications. It is available in two sizes:

  • Quail-S (small)
  • Quail-L (large)

These variants represent different trade-offs between sound quality and compute requirements. The architecture can be configured to suit your performance needs and latency goals.

Building Quail, we focused on three key innovations to overcome typical real-time device and streaming processing challenges:

  1. Focusing on what really matters
    Quail models isolate your voice from dynamic noise environments and remove late reverberation. This results in improved intelligibility while preserving the natural quality of speech.

  2. Realistic training conditions
    Instead of relying on unrealistic and overly aggressive data augmentation, we trained Quail on natural noise profiles, such as wind, and real room acoustics, so the models generalise better to real-world conditions. The result? Quail thrives in everyday recording environments.

  3. Tailor-made for efficient deployment
    Our Quail models use a highly optimised neural architecture that is ideal for streaming applications and fits even the tightest compute budgets typical of audio hardware.

Understanding Quail in context

How does Quail compare to other real-time and streaming speech enhancement tools? We benchmarked Quail against other leading tools using the widely recognized DAPS dataset:

[Figure: "Speech Quality Benchmark on DAPS Dataset". SIGMOS and PESQ results for IRIS Audio, Sanas, Hance, Krisp, Nvidia Broadcast, Quail-S and Quail-L, shown in ascending order.]

The DAPS (Device and Produced Speech) dataset is a publicly available benchmark for assessing how effectively speech enhancement models convert real-world recordings, often affected by noise and reverberation, into clean speech.

To measure model performance, we use two widely accepted industry metrics:

  1. PESQ (Perceptual Evaluation of Speech Quality): A full-reference metric that measures how closely the enhanced audio matches the original clean reference.
  2. SIGMOS: A no-reference metric that simulates human subjective listening tests – it measures perceived quality, not just similarity to the original.

Quail outperforms every tool we tested on both of these metrics.

Let’s get technical: Quail’s highlights

Quail offers users:

  • Real-time processing at 48 kHz with latency as low as 20 ms
  • High-quality speech at only 0.35 GMACs/sec (Quail-S) and 1.2 GMACs/sec (Quail-L), fitting even the tightest compute budgets
  • Significantly reduced distortion and over-suppression of speech segments compared to other real-time models
  • Robust performance, especially in dynamic noise environments and across different room types and geometries
  • Configurable architecture to meet specific quality, size, or latency constraints
  • Easy and lightweight deployment through the ai|coustics SDK for embedded, mobile, desktop and cloud environments
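
As a rough back-of-the-envelope check, the figures in the list above can be converted into per-frame numbers. The sketch below assumes, purely for illustration, that audio is processed in 20 ms hops. The 20 ms value is quoted as the minimum latency, not necessarily the frame length, so treat the result as an order-of-magnitude guide.

```python
SAMPLE_RATE_HZ = 48_000
HOP_MS = 20  # assumption: hop length equals the quoted minimum latency

# Compute figures quoted in the list above (giga multiply-accumulates per second).
GMACS_PER_SEC = {"Quail-S": 0.35, "Quail-L": 1.2}


def samples_per_hop(sample_rate_hz: int, hop_ms: int) -> int:
    """Number of audio samples that arrive during one hop."""
    return sample_rate_hz * hop_ms // 1000


def macs_per_hop(gmacs_per_sec: float, hop_ms: int) -> float:
    """Multiply-accumulate operations performed per hop at the given rate."""
    return gmacs_per_sec * 1e9 * hop_ms / 1000


print(f"{samples_per_hop(SAMPLE_RATE_HZ, HOP_MS)} samples per {HOP_MS} ms hop")
for name, gmacs in GMACS_PER_SEC.items():
    millions = macs_per_hop(gmacs, HOP_MS) / 1e6
    print(f"{name}: about {millions:.0f} million MACs per hop")
```

Under that assumption, each hop holds 960 samples, and the quoted rates work out to roughly 7 million MACs per hop for Quail-S and 24 million for Quail-L.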

Unlock real-time speech enhancement with Quail

Quail represents our commitment to making speech enhancement universally accessible – with natural quality, ultra-low latency, and deployment flexibility.

Stay tuned for more technical insights in the coming weeks and for the upcoming release of AirTen, our specialised audio machine learning inference engine, which will make our Quail models even faster.
