Gemini Omni Flash Turns Text Into 10-Second Video With Sound for $0.10 a Second

Google launched Gemini Omni Flash, a model that generates video with synchronized audio from a text prompt at $0.10 per second. Here is what it changes for ad-makers, and where it sits next to the video models Novoads already runs.

Mauricio Valdivia

·11 min

A Ten-Second Scene With Sound Now Costs a Dollar

On June 30, a developer could open the Gemini API, type one sentence, and get back a ten-second video that already had sound baked in. No separate voice pass. No music track dragged in afterward. Google shipped Gemini Omni Flash, and the number on the label was $0.10 per second. A ten-second clip lands at a dollar.

Gemini Omni Flash is a text-to-video model that generates synchronized audio along with the picture, grounded in Gemini's real-world knowledge and tuned for better physics so motion holds together. Google launched it beside Nano Banana 2 Lite, a fast image model that turns out a picture in about four seconds for $0.034. The Decoder and other outlets reported both on June 30, 2026.

For anyone making ads, the headline is not the model. It is the price. So this piece does two things: it explains what actually launched, and it places the launch next to the video models a tool like Novoads already runs, because the gap between a cheap clip and a converting ad is the whole story.

What Gemini Omni Flash Actually Is

Strip away the launch-day noise and the model does three things worth understanding before you decide what it means for your ads.

Video and synchronized audio, from one text prompt

The core capability is that Gemini Omni Flash "creates video with synchronized audio from text input." That is the sentence that matters. Earlier text-to-video models handed you a silent clip and left the sound design as your problem. Omni Flash writes the picture and the audio in the same pass, from the same instruction. Google describes it as drawing on "Gemini's knowledge such as history, biology and narrative logic to construct compelling videos," which is a fancy way of saying the model knows how the world tends to look and sound, so the two tracks tend to agree.

For an ad-maker, the practical win is one fewer step between a generated scene and a usable asset. A poured drink can fizz. A door can thud. You are not stitching foley onto silence.

Grounded in Gemini's knowledge, tuned for physics

The second thing Google is selling is coherence. The model is "grounded in Gemini's real-world knowledge, with improved physics understanding for more coherent motion and interaction." In Google's own words, "Omni has an intuitive understanding of forces like gravity, kinetic energy, and fluid dynamics for more realistic movement."

Physics is where cheap video models usually give themselves away. A liquid that smears instead of pours, a product that melts into a hand instead of resting in it, a bottle that drifts a centimeter between frames: those are the tells a viewer's eye catches in the first second, and they are exactly the artifacts a better physics prior is meant to reduce. If you have compared model outputs before, this is the axis that separates a clip you can run from one you quietly delete. Our breakdown of Seedance 2.0 versus Veo for ads walks through why motion quality, not resolution, is the thing to judge.

Conversational editing and the ten-second ceiling

You do not have to re-roll the whole clip to change one thing. Google added "conversational video editing," which lets you "refine and edit videos using natural language," and you can "combine inputs like images, text and video to maintain control and consistency over your scene." That is a real workflow improvement: iterate by talking to the clip instead of regenerating from zero.

The catch is length. "Omni offers 10-second video generations currently, with longer durations coming soon." Ten seconds is a hook, a single beat, a product reveal. It is not a sixty-second testimonial. Keep that number in mind, because it decides what the model is good for and what it is not.

The Other Half of the Launch: Nano Banana 2 Lite

Gemini Omni Flash did not ship alone. The image model next to it is arguably the more quietly disruptive release, because it makes the front of the pipeline nearly disposable.

Four seconds, three cents an image

Nano Banana 2 Lite "delivers text-to-image outputs in 4 seconds" at "$0.034 per 1K image." The Decoder put it plainly: "Text-to-image generation takes four seconds and costs just $0.034 per image at 1K resolution." At three and a half cents, an image is close to a rounding error. You can generate fifty product mockups for the cost of a coffee and throw forty-nine of them away.
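
To make that arithmetic concrete, here is a minimal sketch in Python. It assumes the flat $0.034 per 1K image quoted above, with no volume discounts, taxes, or failed generations. The function name `image_batch_cost` is made up for illustration and is not part of any Google SDK.

```python
from decimal import Decimal

# Price quoted in the article for Nano Banana 2 Lite at 1K resolution.
# Assumption: a flat per-image rate with no discounts or retries.
PRICE_PER_1K_IMAGE = Decimal("0.034")


def image_batch_cost(num_images: int) -> Decimal:
    """Return the raw model cost in USD for a batch of 1K images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return PRICE_PER_1K_IMAGE * num_images


# Fifty product mockups, forty-nine of which get thrown away.
print(image_batch_cost(50))  # 1.700
```

Under those assumptions, fifty mockups come to $1.70, which is the "cost of a coffee" figure in practice.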

That matters for ads because the still is where an angle is born. Before you commit a video budget to an idea, a cheap, fast image tells you whether the composition, the palette, and the product framing even work.

The image-to-video chain Google is pitching

Google is not selling these as two separate toys. It recommends a chain: "Use Nano Banana 2 Lite as a high-speed image generation model, then pass that image as a reference to Gemini Omni Flash to animate it into a high-quality video." Storyboard first, motion second.

For a marketer that reads as a familiar loop: sketch the frame cheaply, lock the one you like, then spend the more expensive video seconds only on the survivor. It is the same discipline good creative teams already use, now priced so low that skipping it makes no sense.
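
The loop above can be sketched as a small budget calculation. This is an illustration under stated assumptions, not Google's pricing logic: it uses only the two list prices quoted in this piece ($0.034 per 1K image, $0.10 per second of video), ignores retries and taxes, and the name `chain_cost` is hypothetical.

```python
from decimal import Decimal

# List prices quoted in the article. Assumption: flat rates, no retries.
IMAGE_PRICE = Decimal("0.034")            # USD per 1K image (Nano Banana 2 Lite)
VIDEO_PRICE_PER_SECOND = Decimal("0.10")  # USD per second (Gemini Omni Flash)
MAX_CLIP_SECONDS = 10                     # current clip ceiling


def chain_cost(num_stills: int, num_survivors: int,
               clip_seconds: int = MAX_CLIP_SECONDS) -> Decimal:
    """Cost in USD of sketching num_stills images and animating only the survivors."""
    if num_stills < 0 or num_survivors < 0:
        raise ValueError("counts must be non-negative")
    if num_survivors > num_stills:
        raise ValueError("cannot animate more stills than were generated")
    if not 0 < clip_seconds <= MAX_CLIP_SECONDS:
        raise ValueError("clip_seconds must be between 1 and 10")
    stills = IMAGE_PRICE * num_stills
    video = VIDEO_PRICE_PER_SECOND * clip_seconds * num_survivors
    return stills + video


# Sketch fifty frames, animate only the one you like.
print(chain_cost(50, 1))   # 2.700
# Skip the discipline and animate all fifty.
print(chain_cost(50, 50))  # 51.700
```

On these numbers, filtering at the image stage costs $2.70 where animating every frame would cost $51.70. The stills are the cheap filter and the video seconds are where the money goes.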

What "Lite" gives up

"Lite" is a speed-and-cost tier, not the flagship. The trade is fidelity and headroom for throughput: it is built for ideation and volume, not for the single hero frame you will blow up on a billboard. That is the right tool for the front of a testing pipeline and the wrong tool for the one asset that has to be perfect. Knowing which job you are doing is most of the skill.

What This Means for People Making Ads

Here is where the launch stops being a spec sheet and starts touching a budget. Three shifts matter.

The raw clip keeps getting cheaper

Say you want to test ten angles for a skincare launch. Ten 10-second Gemini Omni Flash clips come to about $10 of raw model output at Google's $0.10 per second. That sounds almost negligible, and in isolation it is.
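
For readers who want to plug in their own test size, the calculation is short enough to write out. This is a rough sketch that assumes the $0.10 per second list price and the ten-second cap described above, with every clip succeeding on the first try; `raw_clip_cost` is an illustrative name, not an API call.

```python
from decimal import Decimal

# Assumption: flat list price from the article, no failed or re-rolled clips.
VIDEO_PRICE_PER_SECOND = Decimal("0.10")  # USD per second of output
MAX_CLIP_SECONDS = 10                     # current ceiling per generation


def raw_clip_cost(num_clips: int, seconds_per_clip: int = MAX_CLIP_SECONDS) -> Decimal:
    """Return the raw model cost in USD for num_clips clips of equal length."""
    if num_clips < 0:
        raise ValueError("num_clips must be non-negative")
    if not 0 < seconds_per_clip <= MAX_CLIP_SECONDS:
        raise ValueError("seconds_per_clip must be between 1 and 10")
    return VIDEO_PRICE_PER_SECOND * seconds_per_clip * num_clips


# Ten angles, one full-length clip each.
print(raw_clip_cost(10))  # 10.00
```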

But ten raw clips are not ten ads. Each one still needs a distinct hook, a script that sells a different benefit, an actor or scene that reads as native to the buyer, 9:16 framing, captions, and a beginning-middle-end that earns the last three seconds. The $10 is the cheapest line in the budget. The expensive part, the one that used to cost a week of briefs and hundreds of dollars a clip when you hired it out, is everything wrapped around the model call. Our guide to what UGC creators actually charge shows how steep that wrapped cost gets when a human does it.

Now run the same math in the other direction. A finished, ready-to-post clip inside an ad workflow like Novoads costs roughly $2 to $11 depending on the model you pick. Set that next to Google's dollar of raw output and the extra is not markup for markup's sake: it is the script, the actor, the accent, the captions, the format, and the credits accounting that turn a bare generation into an asset you can hand to a media buyer. The raw model falling to a dollar does not collapse that number, because the falling piece was never the expensive one. The lesson repeats every time a new model ships cheaper: the floor on raw video keeps dropping, and the work that sits on top of it keeps its value.

A clip is not an ad

This is the trap the price tempts you into. Cheap raw video makes it feel like the hard part is solved, so people generate a gorgeous ten-second scene, post it, and wonder why it does not convert. The scene was never the problem. The hook, the angle, and the trust signal were, and no amount of physics realism supplies those.

A converting ad is an argument delivered by someone the viewer believes, in the format the platform rewards, tested enough times to find the version that works. A text-to-video model gives you one ingredient of that. A good one, cheaper than ever, but one.

Where a text-to-video model actually helps

None of this means Omni Flash is useless for ads. It is genuinely good at the visual layer: B-roll, scene shots, a product moving through space, an environment that sets a mood. That is real work, and doing it from a prompt for a dollar a clip is a gift. It shines exactly where a spokesperson does not appear.

Where it does not help is the part of UGC that carries the sale: a real-seeming person talking to camera about your product, in your buyer's accent, holding the thing in their hand. That is a different job, and it is the one most performance ads are actually built on. Our comparison of AI video ad platforms covers where each type of tool fits.

Where It Sits Next to the Models Novoads Already Runs

A new frontier video model is not a threat to a model-agnostic workflow. It is a new option for it.

Google's Veo 3.1 is already in the stack

Novoads already runs Google Veo 3.1 as one of its video models, next to Seedance 2.0, Kling v3 Pro, Sora 2, and Sora 2 Pro. So a fresh Google video release is not foreign territory; it is the next entry in a lineup the workflow already speaks to. If you want the texture of how these models differ, our explainers on what Seedance 2.5 does and the open-weights LTX-2.3 model show how fast the frontier is moving, and how similar the headline pitches have become: every new model now promises synchronized audio and better physics.

Model-agnostic beats single-model

That similarity is the argument for not betting an ad workflow on any single model. The one that wins your next test is unpredictable, and it changes as new releases land. A tool that lets you pick per shot beats one welded to a single engine. Here is how a few of them map to the actual job in an ad:

| Model | Best-fit ad job | Runs in Novoads today |
| --- | --- | --- |
| Gemini Omni Flash | Short scene shots with sound | Not yet (Google API / fal) |
| Google Veo 3.1 | Premium scene and product shots | Yes |
| Seedance 2.0 | Cinematic product motion | Yes |
| Sora 2 / Sora 2 Pro | Stylized remix scenes | Yes |
| Kling v3 Pro | Motion-heavy B-roll | Yes |
| Talking actor | A person selling to camera | Yes |

The row that matters most is the last one. Every model above it makes scenes; only the talking-actor job makes the spokesperson, and that is the format most UGC ads are built around.

The part no raw model gives you

A raw API call, whether it is Omni Flash on the Gemini API or a model on fal, hands you a clip. It does not hand you an actor library, a script matched to a buyer, lip-sync, a native accent, vertical framing, captions, or the credits accounting to run a hundred variations without babysitting each render. Those are the pieces that stand between a model output and a launched ad, and they are the same pieces whether the underlying model costs ten cents a second or a dollar.

There is also a compliance layer that the raw price hides. Both new Google models tag their output with SynthID watermarking, and that is not a footnote for advertisers. TikTok, Meta, and other platforms increasingly ask you to disclose AI-generated content, and the label follows the file. A workflow that already thinks about format, captions, and disclosure is doing work the model call does not. The trend line is clear: as generation gets cheaper and more realistic, the value shifts up the stack, toward the layer that assembles, localizes, and ships the ad responsibly. That is the layer a raw model does not touch, and it is the reason a falling per-second price is good news for tools that live above the model, not a threat to them.

How Novoads Turns Model Horsepower Into an Ad

The frontier keeps getting cheaper. What it does not do is assemble itself into something a buyer clicks. That assembly is the product.

Script and actor, not just a prompt

In Novoads, you write or auto-generate a script and pick an AI actor whose age, gender, and accent match your audience. The tool then produces a UGC-style vertical video with voice, lip-sync, and captions, formatted 9:16 for TikTok, Reels, and Meta. The headline figure for time to a finished clip is about four minutes. You can also upload a product photo and turn it into an ad creative. The model underneath is a swappable component; the workflow is what makes the output an ad instead of a demo.

Native-local, in the buyer's own accent

The differentiator is depth, not just model count. Novoads makes native-local video ads in 30-plus languages with real regional accents, so a clip sounds like a creator from the buyer's own city rather than a translated script read by a generic voice. A physics-aware scene model does not solve that; the accent and the delivery are a different axis entirely, and it is the one that decides whether a testimonial reads as trustworthy. As platforms tighten disclosure rules for AI-generated ads, that native-local trust signal becomes more valuable, not less.

Volume is the real unlock

The reason to care about cheap models at all is testing, and testing means volume. A finished Novoads clip runs from roughly $2 to $11 depending on the model, which is a fraction of a hired shoot and cheap enough to run the many variations paid social actually rewards. You can produce your first AI UGC ad with Novoads for $1: it is a recurring $1 every three days that becomes the $49-a-month Inicial plan, and that first charge grants enough credits for about one video. Cancel anytime. The point is not one perfect clip. It is never running out of angles to test.

The Model Is the Cheap Part

Gemini Omni Flash is a real step: video and sound from one prompt, better physics, a dollar a clip. If your ads need scene shots and B-roll, it belongs in your toolkit, and it will keep getting cheaper and better. That is worth being excited about.

But the launch also clarifies something the low price tends to hide. When the raw clip costs a dollar, the value has already moved somewhere else: to the script angle, the actor the buyer believes, the accent, the format, and the discipline to test until one version beats the benchmark. The model is the cheap part. The ad is everything you build around it, and that is exactly the part a workflow, not a prompt, is for.

Frequently Asked Questions

What is Gemini Omni Flash?

Gemini Omni Flash is a Google model that creates video with synchronized audio from a text prompt, grounded in Gemini's real-world knowledge and with improved physics understanding for more coherent motion. Google launched it on June 30, 2026 for developers via the Gemini API and Google AI Studio, and it is also listed on fal. It currently generates clips up to ten seconds long.

How much does Gemini Omni Flash cost?

Google prices Gemini Omni Flash at $0.10 per second of video output, which The Decoder noted matches Veo 3.1 Fast. Because clips currently cap at ten seconds, a full clip is about a dollar of raw model output. That is the model call only; it does not include the script, actor, editing, captions, or the volume of variations a real ad test needs.

What is Nano Banana 2 Lite?

Nano Banana 2 Lite is the fast image model Google launched alongside Gemini Omni Flash. It delivers a text-to-image output in about four seconds and costs $0.034 per 1K image. Google recommends using it as a high-speed image generator and then passing that image to Gemini Omni Flash to animate it into video.

Does Gemini Omni Flash generate audio on its own?

Yes. The model's headline capability is that it creates the picture and synchronized audio together from a single text input, rather than handing you a silent clip that you score and voice separately. Google is rolling out the developer-facing generation first; some audio and speech editing controls have been staged more cautiously.

Can I use Gemini Omni Flash inside Novoads?

Not as of this writing. Novoads runs Google Veo 3.1, Seedance 2.0, Kling v3 Pro, Sora 2, and Sora 2 Pro as its video models, plus a talking-actor engine for spokesperson clips. Gemini Omni Flash is a new Google model available through the Gemini API and fal, and it is not one of the models Novoads offers today. The point of the workflow is that it stays model-agnostic, so frontier models get added as they earn their place.

Does a cheaper video model replace UGC ads?

No. A ten-second scene with sound is a component, not an ad. A UGC ad still needs a hook, a script angle, an actor whose face and accent match the buyer, vertical framing, captions, and enough variations to find the winner. Cheaper raw video lowers one line item; it does not remove the work that turns a clip into something that converts.

Key Takeaways

  • Google launched Gemini Omni Flash on June 30, 2026: it generates video with synchronized audio directly from a text prompt, grounded in Gemini's real-world knowledge and tuned for better physics so motion holds together.
  • It is priced at $0.10 per second of video output and currently caps at 10-second clips, which makes a full clip land at about a dollar of raw model cost.
  • It shipped with Nano Banana 2 Lite, a fast image model that returns a picture in about 4 seconds for $0.034, and Google pitches chaining the two: image first, then animate it.
  • The raw clip is the cheapest line in an ad budget. The expensive part is everything around it: the script angle, the actor, the accent, the 9:16 format, and the volume to test.
  • Novoads already runs Google's Veo 3.1 alongside Seedance 2.0, Kling, and Sora inside an ad-making workflow, so a new frontier model lowers the floor on raw video without replacing the wrapper that turns a clip into a converting ad.
Mauricio Valdivia

Founder of Novoads

Mauricio is the founder of Novoads, where he works to democratize video advertising with AI for brands in Latin America.