A Diffusion Model for Speech-to-Text Just Went Open Source. Why Ad Captions Care

Interfaze open-sourced a speech-to-text model that transcribes by diffusion instead of word-by-word decoding: six languages from a tiny adapter on frozen models. Here is what the launch actually claims, why transcription quality sets the ceiling on auto-captions, and what captioning a video costs in Novoads today.

Mauricio Valdivia

Six languages from one 42M-parameter adapter

A small team just taught a frozen text model to hear, then published the whole recipe. Interfaze released diffusion-gemma-asr-small, a speech-to-text (ASR) model that transcribes through the diffusion decoder of DiffusionGemma, Google's 26B discrete-diffusion language model, instead of typing a transcript one word at a time. The launch post opens with "We taught a diffusion model to listen in six languages" and calls the model "the first multilingual audio diffusion ASR model with six languages from one adapter". The trained part is tiny: a ~42M-parameter adapter, or in the developer's words, "0.16% of the weights touched". The weights are on Hugging Face under an Apache 2.0 license.

If you make video ads, this is closer to home than it sounds. Every auto-caption you have ever shipped started life as a machine transcript, and the words burned into the frame can only be as good as the speech-to-text layer underneath. That layer just gained an open-source contender with a genuinely different architecture, and studying what it does well (and what it honestly does not) tells you a lot about where captioning is going.

What Interfaze actually released

The release is one model, one launch post, and an unusually candid training diary. The launch, as the developer reports it:

  • Decoder: discrete diffusion; a canvas of random tokens is refined into the transcript over a few parallel passes
  • Trained weights: a ~42M-parameter adapter on frozen backbones, "0.16% of the weights touched"
  • Languages: six from a single adapter (English, German, French, Spanish, Hindi, and Mandarin)
  • Speed: 3 to 5 times real-time depending on clip length, converging "in about 8 parallel passes"
  • Accuracy: a reported 6.6% word error rate on LibriSpeech test-clean, versus 8.3% for Whisfusion, the closest prior diffusion ASR system
  • License and access: Apache 2.0, weights and code on Hugging Face, adapter-only download

Three parts of the write-up matter for anyone who ships captioned video.

A decoder that denoises instead of typing

Conventional transcription decodes autoregressively: one token, then the next, so a longer transcript means more decode steps. According to the launch post, this model starts from a fixed canvas of random vocabulary tokens and anneals it into text, keeping confident predictions each pass and re-randomizing the rest. Interfaze reports that "the model converges in about 8 parallel passes" and that "transcription cost is set by the number of denoising steps, not the length of the transcript". Their concrete figure is roughly 0.7 to 1.5 seconds "of model time for a ~10-second clip, and it doesn't grow with how much someone says".
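
The annealing loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Interfaze's code: `toy_denoiser` is a hypothetical stand-in that already knows the transcript and simply grows more confident each pass, the canvas is sized to the target for simplicity, and the names `VOCAB`, `anneal`, and the 0.5 keep threshold are invented for the example.

```python
import random

# Toy vocabulary; the real model draws from the backbone's full token vocabulary.
VOCAB = ["the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog"]


def toy_denoiser(target, step, total_steps):
    """Hypothetical stand-in for the model: one (token, confidence) guess per slot.

    A real denoiser would condition on the audio features and the current canvas.
    This one simply knows the answer and grows more confident with every pass.
    """
    rng = random.Random(step)  # deterministic noise per pass
    floor = 0.5 * (step + 1) / total_steps
    return [(tok, floor + 0.5 * rng.random()) for tok in target]


def anneal(target, steps=8, threshold=0.5, seed=0):
    """Refine a random canvas into `target` in a fixed number of parallel passes."""
    rng = random.Random(seed)
    canvas = [rng.choice(VOCAB) for _ in target]  # start from random tokens
    locked = [False] * len(target)
    for step in range(steps):
        guesses = toy_denoiser(target, step, steps)
        for i, (tok, conf) in enumerate(guesses):
            if locked[i]:
                continue  # confident slots from earlier passes are kept
            if conf >= threshold:
                canvas[i], locked[i] = tok, True
            else:
                canvas[i] = rng.choice(VOCAB)  # re-randomize the uncertain slot
    return canvas


short = ["the", "fox"]
chatty = ["the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog"]
print(anneal(short))
print(anneal(chatty))
```

Both calls run the same eight passes; only the width of each pass changes with transcript length, which is the property the launch post is pointing at.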

Almost everything stays frozen

The 26B backbone and the audio encoder (a small frozen Whisper model used purely as a feature extractor) are never trained. Only a small projector plus LoRA adapters learn, which is how the team "trains ~42M parameters on top of a frozen 26B backbone". The diary is refreshingly honest about the failure path: the first attempt skipped the audio encoder entirely and "learned to ignore the audio and hallucinate fluent, confident nonsense", and later a model with improving metrics was really just emitting "the the the" on repeat. Their warning is worth keeping: "loss curves will lie to you about it."
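
The "0.16%" figure follows directly from the two reported parameter counts. A quick check, assuming round values of 42M and 26B and ignoring the small frozen audio encoder (the exact counts are not given in this post):

```python
TRAINABLE_PARAMS = 42e6  # ~42M adapter parameters, as reported
FROZEN_PARAMS = 26e9     # 26B frozen backbone, as reported (audio encoder ignored)


def trainable_fraction(trainable, frozen):
    """Share of all weights that receive gradient updates."""
    return trainable / (trainable + frozen)


pct = 100 * trainable_fraction(TRAINABLE_PARAMS, FROZEN_PARAMS)
print(f"{pct:.2f}% of the weights are trained")
```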

Honest numbers, including the ones that trail

By Interfaze's own tables, the model wins its peer group: "6.6% vs Whisfusion's 8.3% on LibriSpeech, with a smaller encoder". Against mainstream autoregressive Whisper it still trails: on the FLEURS English benchmark the post states "we're at 15.7%" while citing roughly 4 to 5% for the strongest Whisper variant. The developer's explanation is data, not architecture. "This model has seen about 219 hours" of audio while, per the same post, "Whisper-large-v3 was trained on something like five million hours of audio; whisper-small on ~680k". Their verdict on their own model: "It's hungry."
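
For readers who have not computed it by hand, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch follows; the only text normalization assumed here is lowercasing and whitespace splitting, whereas published benchmarks apply their own normalization rules, so this will not reproduce the figures above exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "try lumewell today for forty nine dollars",
    "try loom well today for forty dollars",
)
print(f"WER: {wer:.1%}")
```

The example sentence has three errors over seven reference words (one substitution, one insertion, one deletion), and two of them land on exactly the words an ad cares about: the brand name and the price.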

Why transcription quality is the ceiling on ad captions

Auto-captions feel like a solved commodity until you read them closely on a paid placement. The transcript is the invisible layer that decides whether captions help conversion or quietly sabotage it.

The caption is the most-read line of your creative

A caption is not subtitles for accessibility alone; on sound-off feeds it is the pitch. Every transcription error gets burned into every frame of the export. Nobody proofreads 40 caption files at 11pm before a launch, which is exactly when a mangled price ships.

Where ASR breaks in ad workflows

The failure modes are predictable, and they cluster exactly where ads are unusual speech:

  • Invented names. Your brand and product names are words no transcription model has seen. Untreated, "Lumewell" becomes "loom well".
  • Prices and offers. A spoken "$49" transcribed as "$40" is not a typo; it is a misleading claim on screen. Platform policies treat the creative as a whole, as posts like our guide to TikTok's AI ad disclosure rules make clear.
  • Language drift. Transcribers that default to English can silently translate Spanish or Portuguese audio into English captions mid-video. The voiceover says one thing, the screen says another.
  • Word timing. Karaoke-style captions, the standard look for UGC-style ads, need accurate word-level timestamps, not just correct text. A transcript with sloppy timing highlights the wrong word on every beat.
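
To make the word-timing point concrete, here is a small sketch of how a karaoke renderer might pick the word to highlight at a given playback time. The `TimedWord` structure, the function name, and the timestamps are hypothetical; real transcribers return word timings in their own formats.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedWord:
    text: str
    start: float  # seconds from the start of the clip
    end: float


def active_word_index(words: Sequence[TimedWord], t: float) -> Optional[int]:
    """Index of the word spoken at time t, or None during a pause."""
    starts = [w.start for w in words]  # assumed sorted by start time
    i = bisect_right(starts, t) - 1    # last word that started at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None


words = [
    TimedWord("Try", 0.00, 0.30),
    TimedWord("Lumewell", 0.35, 0.90),
    TimedWord("today", 1.00, 1.40),
]
print(active_word_index(words, 0.5))
print(active_word_index(words, 0.95))
```

If the timestamps drift by even a couple hundred milliseconds, the lookup lands on the neighboring word, which is the "wrong word on every beat" failure.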

Why per-clip speed compounds at ad volume

Ad accounts do not transcribe one video; they transcribe batches, and 10 to 20 variants per test cycle is normal practice. At Interfaze's reported speed, around a second of model time per 10-second clip, a 20-variant batch of 10-second clips comes to roughly 14 to 30 seconds of model time. And because diffusion decoding costs the same passes whether the script is terse or chatty, a fast talker no longer costs more to caption than a slow one. That property matters more to ad pipelines, which are short-clip and high-volume, than to podcast tooling.
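
The batch arithmetic is simple enough to write down. The sketch below assumes clips are processed one after another at the reported 0.7 to 1.5 seconds each, and uses made-up token counts (30 and 90) to contrast a terse and a chatty script; only the eight-pass figure comes from the launch post.

```python
def batch_model_seconds(n_clips, seconds_per_clip):
    """Total model time for a batch processed one clip after another."""
    return n_clips * seconds_per_clip


def decode_passes(transcript_tokens, diffusion_steps=None):
    """Autoregressive decoding needs one pass per token; diffusion a fixed count."""
    return diffusion_steps if diffusion_steps is not None else transcript_tokens


low = batch_model_seconds(20, 0.7)
high = batch_model_seconds(20, 1.5)
print(f"20 clips: {low:.0f} to {high:.0f} seconds of model time")

for tokens in (30, 90):  # illustrative terse vs chatty scripts
    print(tokens, "tokens:", decode_passes(tokens), "autoregressive passes vs",
          decode_passes(tokens, diffusion_steps=8), "diffusion passes")
```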

How captions work in Novoads today

Novoads captions every generated video from its own audio: transcription, word timing, and rendering are part of the pipeline, not an export step. To be clear about what this news does not change: Novoads runs Whisper-based transcription today, not diffusion ASR. Here is the current setup, with real numbers.

One word-timed default at 0 extra credits

The default karaoke preset transcribes the video's audio with Whisper at word-level timestamps and renders the moving, word-by-word captions on our own rendering service. It adds 0 credits to a video. When you generate an ad in Novoads, captions in this style are simply part of the output.

30 styled presets at flat per-render pricing

Beyond the default there are 30 styled presets, billed flat per render, never by video length:

Preset tier          Styles   Credits per render
Karaoke (default)    1        0
Basic                21       0.4
Dynamic (animated)   9        0.8

Worked out at test-batch scale: captioning 20 variants costs 0 credits in karaoke, 8 credits in a basic style, or 16 credits in an animated one. The flat price is the point; a 15-second clip and a 60-second clip cost the same to caption.
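
The same arithmetic as a small sketch, using the per-render prices from the table above. The dictionary keys and function name are illustrative, not a Novoads API; video length is deliberately not a parameter because pricing is flat per render.

```python
from decimal import Decimal

# Credits per render by preset tier, taken from the table above.
CREDITS_PER_RENDER = {
    "karaoke": Decimal("0"),
    "basic": Decimal("0.4"),
    "dynamic": Decimal("0.8"),
}


def caption_cost(tier: str, renders: int) -> Decimal:
    """Flat-rate caption cost: price per render times number of renders."""
    return CREDITS_PER_RENDER[tier] * renders


for tier in CREDITS_PER_RENDER:
    print(f"{tier}: {caption_cost(tier, 20)} credits for 20 variants")
```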

Two pipeline details that decide accuracy

Model choice gets the headlines, but two unglamorous details move caption accuracy more than anything else, and both are in the Novoads pipeline by default:

  • The script rides along as a hint. Your ad script is passed to the transcriber as a prompt, so invented brand names come back spelled the way you wrote them instead of the way they sound.
  • The language is pinned explicitly. Every transcription request carries the audio's language (or an explicit auto-detect) because a transcriber left on its English default will covertly translate non-English speech into English captions mid-transcript. For a platform whose voices span 11 languages and 31 regional accent variants, that single parameter is the difference between native captions and nonsense.
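
Both details amount to a couple of request parameters. The sketch below is not Novoads' code, which this post does not show. It builds an options dictionary whose keys mirror keyword arguments of the open-source `whisper` package's `transcribe()` function (`task`, `language`, `initial_prompt`, `word_timestamps`); the helper name, the `AUTO_DETECT` sentinel, and the sample script are assumptions made for illustration.

```python
AUTO_DETECT = "auto"  # hypothetical sentinel: the caller must ask for detection explicitly


def build_transcribe_options(script: str, language: str) -> dict:
    """Assemble Whisper-style transcription options with a script hint and pinned language."""
    if not language:
        raise ValueError("language must be an ISO code such as 'es', or AUTO_DETECT")
    return {
        "task": "transcribe",  # keep the audio's language; never translate
        # In the open-source whisper package, language=None triggers detection.
        "language": None if language == AUTO_DETECT else language,
        "initial_prompt": script,  # the ad script nudges spelling of brand names
        "word_timestamps": True,   # needed for karaoke-style highlighting
    }


options = build_transcribe_options("Prueba Lumewell hoy por 49 dólares.", "es")
print(options["language"], options["task"])
```

Making the language a required argument is the design choice that matters here: a missing value fails loudly at request time instead of falling back to a default nobody chose.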

What open-source ASR changes for ad makers

You will probably never run this model yourself, and you do not need to for the release to matter.

Usable today, for a narrow slice

If you run an in-house pipeline, it is genuinely testable this week. You get:

  • The model code and the ~42M adapter weights, public under Apache 2.0
  • An adapter-only download; the frozen backbones pull from their own repositories under their own licenses
  • Private, self-hostable transcription for short clips in its six languages, with no per-minute API bill

Not a migration signal yet

By its own benchmarks it trails the best autoregressive models on accuracy, and six languages is well short of what global ad localization needs. Treat it the way we treated the first wave of open-source voice models: a capability floor rising in public, not a drop-in replacement for production captioning.

The pattern is the story

The most transferable fact in the launch is that touching 0.16% of a frozen giant's weights was enough to teach it a new sense. Voice cloning went open source, image generation went open source, and now diffusion transcription has a public recipe. Every layer under AI-generated ads keeps commoditizing, which shifts the value of an ad platform to the layer that does not: orchestrating those models into a finished, on-brand, correctly-captioned creative.

The transcript is becoming a commodity. The caption is not.

Transcription keeps getting cheaper, faster, and more open, and this release is one more step down that curve. What stays scarce is everything wrapped around the transcript: the script hint that protects your brand name, the pinned language that keeps Spanish audio in Spanish, the word timing that makes a karaoke caption land on the beat, the styled render that matches your creative. That is the layer an ad platform earns its keep on. If you want the captioned version of your next ad without building any of this, Novoads generates the video, the voice, and the captions in one flow. Start for $1: three days of full access for one dollar, then $49/mo if you keep it. Cancel anytime.

Frequently Asked Questions

What is diffusion-gemma-asr-small?

It is an open-source speech-to-text model released by Interfaze. Instead of predicting a transcript word by word, it transcribes through the diffusion decoder of DiffusionGemma, Google's 26B discrete-diffusion language model, refining a canvas of random tokens into text over a few parallel passes. Interfaze trained only a ~42M-parameter adapter on top of frozen backbones and reports six supported languages. The weights are on Hugging Face under an Apache 2.0 license.

Is diffusion ASR more accurate than Whisper?

Not yet, by the developer's own numbers. Interfaze reports 6.6% word error rate on LibriSpeech test-clean, which beats Whisfusion, the closest prior diffusion ASR system, at 8.3%, but trails the autoregressive Whisper family in the launch post's own comparison table. Interfaze attributes the gap to training data: the model has seen about 219 hours of audio, versus the millions of hours behind large production models.

Why does a diffusion decoder matter for transcription speed?

Because the decoder refines the whole transcript at once, the compute cost is set by the number of denoising steps, not by how long the transcript is. Interfaze reports convergence in about 8 parallel passes and roughly 0.7 to 1.5 seconds of model time for a 10-second clip. For anyone transcribing large batches of short videos, per-clip time that does not grow with word count is a meaningful property.

Does Novoads use diffusion ASR for captions?

No. Captions in Novoads run on Whisper transcription today. The default karaoke preset transcribes your video's audio with word-level timestamps and renders the captions at 0 additional credits. On top of that there are 30 styled presets: 21 basic styles at 0.4 credits per render and 9 animated styles at 0.8 credits per render, flat regardless of video length.

Why does transcription accuracy matter so much for video ads?

Captions are burned into every frame, so a transcription error is not a typo; it is a permanent part of the creative. Misheard brand names, mangled prices, and captions silently translated into the wrong language are the usual failure modes. Ad workflows can reduce them by passing the script as a transcription hint and pinning the audio language explicitly, which is how Novoads runs its caption pipeline.

Mauricio Valdivia

Founder of Novoads

Mauricio is the founder of Novoads, where he works to democratize video advertising with AI for brands in Latin America.