
How Synthesia scaled voice cloning quality by improving audio at the source

As the world’s most widely adopted AI-avatar platform, Synthesia helps teams turn simple text into engaging videos in minutes. Voice cloning sits at the heart of the experience. As the product scaled and adoption grew, it became clear that how voices were captured mattered just as much as how they were generated.

Unlike studio voice actors, Synthesia’s users record themselves in everyday environments. Many are speaking from home offices or meeting rooms, using laptop microphones that were never designed for professional audio. As a result, some level of acoustic imperfection is inevitable.

As Synthesia’s voice cloning technology evolved, being able to reliably learn a speaker’s identity across these conditions became essential to delivering voices that still felt recognisably human. Solving this challenge required looking beyond model improvements alone, to the quality and consistency of audio at the source.

Why audio quality matters for voice cloning

From an engineering perspective, Synthesia already had a well-defined voice cloning workflow: users record scripted samples, identity checks verify the speaker, and the recordings are validated before they enter the cloning pipeline. Over time, researchers observed that differences in acoustic quality could affect how consistently models reproduced a speaker’s voice.

Voice cloning models capture more than text. They rely on acoustic cues that define a speaker’s identity, and when recordings differ in reverberation, microphone distance, or capture device, those cues become harder to model consistently.

Not all imperfections have the same impact. While steady background noise can often be mitigated during training, sudden transient sounds can be far more disruptive.

“Our model is particularly sensitive to sudden transient sounds. Short events like a car beep that aren’t filtered out can be very harmful.”

As Synthesia reduced the required training audio to as little as thirty seconds to enable users to build voices more quickly, the tolerance for acoustic inconsistencies became even narrower.

Improving reliability through preprocessing

Rather than increasing model complexity to compensate for noisy inputs, Synthesia focused on standardising audio quality before training.

“If we can assume clean inputs, the modeling problem becomes much simpler. I’d rather solve audio quality upstream than build it into the model.”

To support this approach, Synthesia integrated ai-coustics as the first step in its audio processing workflow. As soon as audio is recorded and uploaded, it is enhanced to reduce background noise, minimise reverberation and stabilise the signal while preserving the speaker’s unique vocal characteristics.

Only then does the audio enter the voice cloning pipeline, where the resulting voice clone is reused across text-to-speech generation, translation and video rendering.
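To make the ordering concrete, here is a minimal sketch of an enhance-then-clone step. It is only an illustration of the workflow described above: enhance_speech and train_voice_clone are stand-ins (the real ai-coustics SDK and Synthesia’s cloning pipeline expose their own APIs, which are not shown here), and the enhancement stand-in merely peak-normalises so the example runs end to end.

```python
import numpy as np

def enhance_speech(samples: np.ndarray) -> np.ndarray:
    """Stand-in for the enhancement step. The real step also denoises,
    reduces reverberation and stabilises the signal; here we only
    peak-normalise so the sketch runs end to end."""
    peak = float(np.max(np.abs(samples))) or 1.0
    return samples / peak

def train_voice_clone(clean_samples: np.ndarray) -> str:
    """Stand-in for the cloning pipeline; returns a dummy voice ID."""
    return f"voice_{hash(clean_samples.tobytes()) & 0xFFFF:04x}"

# Ordering is the point: enhancement runs on the raw upload, and only the
# enhanced signal enters the cloning pipeline.
raw = np.random.default_rng(0).normal(scale=0.1, size=16_000)  # ~1 s of dummy audio at 16 kHz
print(train_voice_clone(enhance_speech(raw)))
```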

“We took the same audio, cleaned it through different providers, created a voice clone with each, and you could clearly hear the difference. The voice clone created with ai-coustics was just better.”

Cleaner inputs, more consistent voice clones

The impact was immediately audible. Voice clones trained on ai-coustics-processed audio sounded cleaner, more stable and closer to the original speaker. Comparing clones trained on the same recordings with and without preprocessing made the improvement clear.

Just as importantly, preprocessing reduced variability. Output quality became more consistent and speaker identity remained stable despite differences in recording conditions.

“Without speech enhancement, variation in recording conditions caused instability in both output quality and speaker identity.”

More broadly, the results reinforced a core principle for Voice AI systems: reliability starts where real-world audio enters the machine learning stack. By treating audio enhancement as foundational infrastructure, Synthesia strengthened its voice cloning pipeline and gave a growing user base greater confidence in an already innovative product.

Apply speech enhancement in your pipeline

If you’re working with voice models of your own, cleaner input audio can make a measurable difference. The ai-coustics SDK lets you test speech enhancement directly on your own speech data; to get started, sign up for the ai-coustics developer platform.
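One simple way to run that kind of test is to generate enhanced siblings of your existing recordings and compare the pairs, either by listening or by cloning from each. The sketch below only illustrates that harness: enhance_file is a placeholder where a call to the enhancement SDK would go (its actual API is not shown here), and the recordings directory name is assumed.

```python
from pathlib import Path
import shutil

def enhance_file(src: Path, dst: Path) -> None:
    """Placeholder: call your speech-enhancement SDK here. As written it
    just copies the file so the harness runs without one."""
    shutil.copyfile(src, dst)

def build_ab_set(recordings_dir: str) -> None:
    """Write a '<name>_enhanced.wav' sibling for every recording so you can
    listen to original/enhanced pairs, or clone from each and compare."""
    for src in sorted(Path(recordings_dir).glob("*.wav")):
        dst = src.with_name(f"{src.stem}_enhanced{src.suffix}")
        enhance_file(src, dst)
        print(f"{src.name} -> {dst.name}")

if __name__ == "__main__":
    build_ab_set("recordings")  # directory of your own speech samples (assumed name)
```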

And if you’re curious to see how high-quality voice cloning is used in practice, explore Synthesia’s platform for creating lifelike AI avatars and synthetic voices at scale.

