What is core agent stack?
The core agent stack is the set of components that make up a production voice agent. It typically consists of VAD, ASR, a language model with tool calling, and TTS wired together in a real-time orchestration framework. Additional components turn it into a full voice agent: telephony or WebRTC for transport, and speech enhancement for reliability.
What is an example of a core agent stack?
A typical voice agent built today uses LiveKit or Pipecat for transport and orchestration, ai-coustics for speech enhancement and VAD, and a core stack of Deepgram or Whisper for ASR, GPT-4 or Claude for reasoning, and ElevenLabs or Cartesia for TTS.
How does core agent stack work?
Audio flows from the caller's microphone through each stage: transport delivers it, enhancement cleans it, VAD segments it, ASR transcribes it, the LLM decides what to do, and TTS generates the response. Quality and latency at every stage determine how natural the conversation feels.
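The stage order above can be sketched as a simple pipeline. This is a minimal illustration only: every function here is a hypothetical stub standing in for a real component (enhancement model, VAD, ASR engine, LLM, TTS engine), not an actual SDK call.

```python
# Sketch of one conversational turn through the core agent stack.
# All stage implementations are hypothetical stand-ins.

def enhance(audio: bytes) -> bytes:
    # Speech enhancement: denoise the raw audio (stub: pass-through).
    return audio

def segment(audio: bytes) -> list[bytes]:
    # VAD: split the stream into speech segments (stub: one segment).
    return [audio]

def transcribe(seg: bytes) -> str:
    # ASR: turn a speech segment into text (stub: canned transcript).
    return "what's the weather tomorrow?"

def decide(transcript: str) -> str:
    # LLM with tool calling: choose and render a response (stub).
    return f"You asked: {transcript}"

def synthesize(text: str) -> bytes:
    # TTS: render the response text as audio (stub: raw bytes).
    return text.encode()

def handle_turn(raw_audio: bytes) -> list[bytes]:
    """Run caller audio through each stage in order."""
    clean = enhance(raw_audio)          # transport delivered, now clean it
    replies = []
    for seg in segment(clean):          # VAD segments the speech
        transcript = transcribe(seg)    # ASR transcribes each segment
        response = decide(transcript)   # LLM decides what to do
        replies.append(synthesize(response))  # TTS generates the reply
    return replies
```

In production each stage is a streaming component rather than a blocking function call, which is why per-stage latency compounds and matters so much for conversational feel.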
How does ai-coustics help core agent stack?
ai-coustics provides the real-time audio layer before the core agent stack. Quail, Rook, Quail VAD, and Quail Voice Focus plug in between transport and ASR, ensuring every downstream component gets clean, well-segmented speech. Combined with AirTen, this keeps voice agents accurate and responsive without adding GPU costs or latency.
