Scaling Audio Intelligence: How Sounder Builds a Multilingual World

When we first launched Sounder’s Audio Data Cloud (ADC) platform in 2022, our focus was narrow by design: support English-language podcast creators with best-in-class tools for transcription, analysis, and monetization. At the time, the AI landscape looked very different - before open-source breakthroughs like Whisper or Llama, most of our work required building bespoke solutions from the ground up.
In March 2024, Sounder was acquired by Triton Digital, expanding our reach to publishers and podcasters across the globe. This wasn’t just a new business opportunity - it was a turning point. Suddenly, our pipeline needed to support dozens of languages, dialects, and content formats while maintaining the high accuracy and reliability our partners expected.
It wasn’t as simple as translating what we already had. Spoken language is rich in cultural nuance, and the tools built for English don’t automatically work elsewhere. Supporting a multilingual ecosystem meant rethinking the entire stack—from how we transcribe audio to how we understand and categorize its meaning.
This led us to a core insight: we couldn’t just scale up our English solution. We had to build something fundamentally multilingual from the ground up.
Automatic Speech Recognition (ASR)
As we expanded our transcription capabilities beyond English, we faced a familiar but daunting question: should we engineer the solution ourselves or invest in off-the-shelf models? We had already built a high-performing English ASR system trained on thousands of hours of podcast content, carefully transcribed and annotated by our team. But replicating that level of customization and quality for dozens of new languages wasn't just difficult - it was resource-prohibitive.
The recent explosion of high-quality open-source ASR models allowed us to leapfrog some of that foundational work. We evaluated several leading options, including NVIDIA’s NeMo, Wav2Vec, and OpenAI’s Whisper. Among these, Whisper emerged as the clear front-runner: it is robust, supports nearly 100 languages out of the box, and includes built-in language detection and translation. More importantly, its transformer-based sequence-to-sequence architecture makes it a strong candidate for fine-tuning and optimization.
But Whisper isn't a plug-and-play solution for our needs. Podcast audio is uniquely challenging: long-form content filled with false starts, awkward pauses, spontaneous laughter, and overlapping speakers. The real work was building a system powerful enough to deliver precision at scale — without ever slowing us down.
Engineer It or Expense It?
OpenAI provides a fast, high-quality Whisper API. In our early evaluations, it delivered results 10x faster and with 20% fewer errors than running the open-source implementation ourselves. But that speed comes at a steep cost: our benchmarking on 142,000 hours of Spanish podcast audio showed that the in-house pipeline was approximately 30x more cost-efficient than the external API — a critical difference at our scale.
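For a sense of how that gap compounds at our scale, here is a back-of-the-envelope sketch. The $0.006-per-minute rate is the Whisper API's published list price at the time, and the $0.011-per-hour figure is the optimized in-house cost we report below; both are used purely for illustration, not as the exact inputs to our benchmark.

```python
# Back-of-the-envelope cost comparison (illustrative only).
# Assumes the Whisper API's published rate of $0.006 per minute of audio;
# the in-house figure uses the ~$0.011 per hour reported later in this post.
API_RATE_PER_MINUTE = 0.006          # USD per minute of audio (assumed list price)
IN_HOUSE_COST_PER_HOUR = 0.011       # USD per hour of audio (optimized pipeline)

api_cost_per_hour = API_RATE_PER_MINUTE * 60          # ~= $0.36 per hour
ratio = api_cost_per_hour / IN_HOUSE_COST_PER_HOUR    # ~= 33x

hours = 142_000  # size of the Spanish benchmark corpus
print(f"API: ${api_cost_per_hour * hours:,.0f}  vs  in-house: ${IN_HOUSE_COST_PER_HOUR * hours:,.0f}")
# API: $51,120  vs  in-house: $1,562  -> roughly a 30x gap
```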
The choice was clear: to deliver accurate, affordable multilingual ASR at scale, we had to go deeper into the open-source route—and make it work for us.
Optimize It!
To unlock Whisper’s full potential, we focused on optimizing it for speed, memory efficiency, and scalability. We run inference on the CTranslate2 backend and use WhisperS2T for enhanced heuristics, enabling quantization and layer fusion that speed up inference without sacrificing quality. These techniques accelerated inference by up to 3x while significantly reducing memory usage.
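As an illustration of what CTranslate2-backed inference looks like in practice, here is a minimal sketch using the open-source faster-whisper wrapper, which runs Whisper on CTranslate2. This is not our production pipeline; the model size, device, and quantization settings are assumptions.

```python
# Minimal sketch: Whisper inference on the CTranslate2 backend via faster-whisper.
# Not our production pipeline; model size, device, and compute_type are illustrative.
from faster_whisper import WhisperModel

# int8_float16 applies weight quantization, trading a little precision
# for a large reduction in memory use and latency on GPU.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("episode.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```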
We also added Voice Activity Detection (VAD) to skip over non-speech sections and integrated SpeechBrain models for robust external language identification. Beyond that, we fine-tuned our decoding strategies to reduce hallucinations and repetitions—common issues in long-form transcription. Features like adjustable beam size, maximum token length, and n-gram repetition control all contributed to more coherent, accurate output.
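To make these knobs concrete, here is a hedged sketch continuing the faster-whisper example above, followed by a SpeechBrain language-identification call. The parameter values and the VoxLingua107 model are illustrative assumptions, not our production settings, and some options depend on the library version.

```python
# Illustrative decoding controls (continuing the faster-whisper sketch above).
# Values are assumptions; some options require recent faster-whisper versions.
segments, info = model.transcribe(
    "episode.mp3",
    vad_filter=True,                      # skip non-speech sections before decoding
    vad_parameters={"min_silence_duration_ms": 500},
    beam_size=5,                          # wider beams trade speed for accuracy
    no_repeat_ngram_size=3,               # suppress repeated phrases
    condition_on_previous_text=False,     # reduces runaway repetition in long-form audio
)

# External language identification with a pretrained SpeechBrain classifier.
# The VoxLingua107 ECAPA checkpoint shown here is one public option, not necessarily ours.
from speechbrain.inference.classifiers import EncoderClassifier

lang_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="pretrained_models/lang-id",
)
_, _, _, labels = lang_id.classify_file("episode.wav")  # hypothetical file path
print("Predicted language:", labels[0])
```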
The Payoff
All of this culminated in a pipeline that gives us the best of both worlds: near state-of-the-art multilingual transcription quality with the flexibility and efficiency of an in-house system. Whisper’s latest large-v3 model, combined with our optimizations, lets us support dozens of languages while keeping costs low and performance high.
The numbers tell the story — our optimized pipeline achieves:
- Real-time factor: 99.85x (roughly one hour of audio transcribed in about 36 seconds)
- Cost efficiency: $0.011 per hour of content
- Word Error Rate (WER): 10.09%
ASR isn't just another technology layer — it's a bedrock that powers our entire ecosystem of contextual understanding, brand safety, and targeting. Our work in multilingual transcription has made it possible for us to serve creators and advertisers globally — at scale, and with confidence.
Natural Language Processing (NLP)
From a technical standpoint, going multilingual can look straightforward nowadays. Once you have multilingual ASR in place, analyzing transcripts might seem as simple as translating them and applying your existing English-language tools. Right?
Have you ever heard a native speaker of a non-English language say they can’t translate an expression into English because there’s no accurate way to convey the meaning? While translation tools are getting better, inherent barriers remain due to different linguistic properties and cultural contexts, which are often impossible to translate. Being able to work natively in the language exposes you to all its nuances and the cultural identity of its speakers — both fundamental when building accurate solutions for sensitive areas such as Brand Safety & Suitability and Contextual Targeting.
After careful thought and evaluation, we concluded that building on top of models trained natively in a plethora of languages, rather than translating and applying existing tools, would help us transcend these barriers and provide us with the necessary capabilities to deliver a market-leading solution.
To this end, we explored multiple open-source multilingual models that could serve as the foundation for our next-gen offering. So far, we have relied heavily on BERT or BERT-like models, which we adapted and customized using carefully curated datasets to deliver top-of-the-line context-aware products. Given our familiarity and the tooling we’ve built over the years, the natural path forward was to look into multilingual counterparts of the models we already use, such as mBERT or XLM-RoBERTa.
While the ability to reuse most of our tech stack was a major advantage, the challenge of acquiring the data necessary to adapt these models to our domains in a multilingual setting led us to explore other options.
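For context, adapting one of those multilingual encoders follows the standard Hugging Face fine-tuning recipe sketched below. The checkpoint, label set, and tiny toy dataset are placeholders, not our actual training setup.

```python
# Rough sketch: adapting a multilingual encoder (e.g. XLM-RoBERTa) to a text
# classification task with Hugging Face Transformers. The checkpoint, labels,
# and toy dataset are placeholders, not our production setup.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Stand-in for a carefully curated, language-diverse training set.
train_ds = Dataset.from_dict({
    "text": ["ejemplo de transcripción de un pódcast", "un extrait de podcast en français"],
    "label": [0, 1],
})

def tokenize(batch):
    # Truncate long transcript segments to the encoder's maximum input length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-contextual", num_train_epochs=1),
    train_dataset=train_ds.map(tokenize, batched=True),
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```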
The Generative AI Shift
The rapid rise of generative AI models and their ability to perform complex tasks across domains and languages caught our attention, as they allow us to quickly scale not only our language support but also our feature set.
The massive investments major tech organizations are making in generative AI have steered the industry — and the open-source community — toward it, and we believe it will continue to deliver cutting-edge capabilities for years to come.
Models such as LLaMA 3.1 70B have performed reasonably well on most of the tasks we tested, though some reliability issues persist around sensitive content due to their extensive built-in safety protections. However, they are expensive to run at reasonable speeds, requiring multiple GPUs per instance and driving costs up.
Smaller models like LLaMA 3.1 8B are fast and efficient but do not perform as well. And while they are smaller, these models are still one or two orders of magnitude larger than what we’ve used so far.
Scaling Smart with Fine-Tuning
To align these massive models with our product needs and elevate their performance to match that of our English-based ones, we explored recent parameter-efficient fine-tuning techniques, which significantly reduce the resources required.
Techniques such as LoRA and QLoRA make it possible to adapt large models using relatively inexpensive GPUs like the NVIDIA L4 or L40S, making our R&D faster and more cost-efficient.
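As a rough illustration of how lightweight this can be, the sketch below shows a QLoRA-style setup with the Hugging Face peft and bitsandbytes libraries: the base model is loaded in 4-bit precision and only small low-rank adapter matrices are trained. The model name, rank, and target modules are illustrative assumptions, not our production configuration.

```python
# Rough QLoRA-style sketch with Hugging Face transformers + peft + bitsandbytes.
# The model name, LoRA rank, and target modules are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",          # requires accepting the model license
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                       # QLoRA: keep frozen weights in 4-bit NF4
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                        # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # typically well under 1% of all weights
```

From here, the adapter-wrapped model can be trained with a standard Trainer-style loop on curated, task-specific examples, while the frozen 4-bit base weights keep memory within reach of a single mid-range GPU.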
To illustrate the effectiveness of fine-tuning on LLMs, we share the F1 scores of some of the leading proprietary and open-source models that we evaluated on our internal benchmark for contextual targeting (measured in Q3 2024).
Through careful dataset curation and QLoRA fine-tuning of LLaMA 3.1 8B, we were able to match the highest-performing proprietary model that we tested — all while running at a fraction of the cost.
Going multilingual wasn’t just a technical upgrade — it was a fundamental shift in how we think about audio intelligence at Sounder. From transcribing podcasts in dozens of languages to understanding their context and meaning, we’ve reimagined our pipeline to serve a truly global audience.
By combining cutting-edge open-source models with smart engineering and domain-specific tuning, we've built a system that delivers both scale and precision.
This transformation enables us to support creators and advertisers in every corner of the world — with the accuracy, efficiency, and cultural nuance they deserve.
And we’re just getting started. Connect with your Triton Representative to learn more!