What Is LTX-2.3? The Open-Weights AI Video Model You Run Yourself

LTX-2.3 is Lightricks' open-weights audio-video foundation model: it generates synchronized sound and picture in one model and runs on your own hardware. Here is what it is, how its architecture works, and how it compares to the models you can run for ads today.

Mauricio Valdivia

10 min read

Most AI Video Models You Rent. This One Downloads.

Open the page for almost any frontier AI video model and the first thing you find is a button that says "Get API key." You rent access by the second, your prompts travel to someone else's servers, and the model itself stays a black box you are never allowed to hold. That is the default, and for most ad teams it is a perfectly fine default.

Lightricks took the other road. LTX-2.3 is the newest checkpoint in its open-weights LTX-2 family, an audio-video foundation model you can download in full and run on your own machine. Its own model card describes it as a model that generates synchronized video and audio together, "with open weights and a focus on practical, local execution." The underlying LTX-2 was introduced in a paper submitted in January 2026, and 2.3 is the current, upgraded release. Here is what it actually is, how the architecture pulls sound and picture out of one model, and how it stacks up against the managed models you can put to work on an ad this afternoon.

What LTX-2.3 Actually Is

Strip away the version number and LTX-2.3 is a single idea executed unusually openly: one model that produces a finished clip with its own soundtrack, shipped as weights you can keep. Three details make it worth a closer look than a routine point release.

One model, sound and picture together

Most of the video tools an ad team has used generate silent footage that you score, voice, and sync afterward. LTX-2.3 is built the other way. Lightricks calls it "a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model." The audio is not a second tool bolted on after the render. It is generated jointly with the picture, which is why a generated voice can land on a generated mouth without a separate lip-sync pass. For UGC-style ads, where a believable talking person is the whole format, aligning sound and picture is the part that usually eats an afternoon, and here it comes out of the model already done.

Open weights, made to run locally

The second detail is distribution. LTX-2.3 ships under the LTX-2 Community License Agreement with the weights published in the open, so you download the checkpoint and run it yourself. There is no license fee to self-host, and there is no per-clip API meter ticking while you iterate. Lightricks pairs that with an API Playground for anyone who wants to try it without a local setup, but the headline is that the model is yours to keep and run. That is a genuinely different ownership model from the hosted endpoints that power most of the field.

An update, not a reset

LTX-2.3 is not a from-scratch model. Lightricks frames it as "a significant update to the LTX-2 model with improved audio and visual quality as well as enhanced prompt adherence." Two of those three matter directly for ads: better audio quality means the voice track needs less cleanup, and better prompt adherence means fewer regenerations to get the shot you described. The model also grew between versions. The LTX-2 checkpoints are 19B parameters; the LTX-2.3 checkpoints are 22B. If you have tracked the model family before, treat 2.3 as the same architecture, turned up. This is the same lens we apply to every release, like the recent Seedance 2.5 explainer.

How LTX-2.3 Works Under the Hood

You do not need to read the model code to make ads, but the architecture explains why the audio lands where it does, so it is worth one section. The short version: two specialist networks share a brain.

The asymmetric dual-stream transformer

LTX-2 is built around what Lightricks describes as "the asymmetric dual-stream LTX-2 transformer (14B-parameter video stream, 5B-parameter audio stream) with bidirectional cross-modal attention for joint audio-video processing." Read that slowly. There are two streams, not one. The bigger stream, 14 billion parameters, handles the pixels. A smaller 5-billion-parameter stream handles the sound. Video carries far more information per second than audio, so giving the picture more capacity than the soundtrack is a deliberate, sensible asymmetry rather than an accident.
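The stream sizes quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only figures that appear in this article (14B video, 5B audio, 19B for LTX-2, 22B for LTX-2.3). The article does not say how the 22B of LTX-2.3 divides between the two streams, so that split is deliberately left out rather than guessed.

```python
# Parameter counts quoted in the article, in billions.
VIDEO_STREAM_B = 14
AUDIO_STREAM_B = 5
LTX2_TOTAL_B = 19
LTX23_TOTAL_B = 22


def share(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


# The two streams together account for the LTX-2 checkpoint size.
dual_stream_total = VIDEO_STREAM_B + AUDIO_STREAM_B
# How much of the dual-stream transformer is devoted to the picture.
video_share = share(VIDEO_STREAM_B, dual_stream_total)
# Relative growth in total parameters between the two releases.
growth = share(LTX23_TOTAL_B - LTX2_TOTAL_B, LTX2_TOTAL_B)

print(f"dual-stream total: {dual_stream_total}B")
print(f"video share of the transformer: {video_share:.1f}%")
print(f"growth from LTX-2 to LTX-2.3: {growth:.1f}%")
```

Roughly three quarters of the transformer sits on the video side, which matches the article's point that the picture gets more capacity than the soundtrack.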

Why the two streams talk to each other

The phrase that does the work is "bidirectional cross-modal attention." The two streams are not run side by side and stapled together at the end. They attend to each other while they generate, so the audio is shaped by what the video is doing and the video is shaped by what the audio is doing. That is the mechanism behind synchronized output: the mouth and the voice are produced by networks that can see each other, not stitched in post. For a spokesperson read or a product demo with a voiceover, that joint generation is the difference between a clip that feels recorded and one that feels assembled.
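To make "bidirectional cross-modal attention" concrete, here is a toy sketch in plain Python. It is not Lightricks' implementation: it skips the learned query, key, and value projections, assumes both streams share one tiny embedding width, and uses made-up token values. What it does show is the shape of the idea: each stream queries the other, and both updates read from the same pre-update state.

```python
import math

Vector = list[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: list[Vector], context: list[Vector]) -> list[Vector]:
    """Each query token becomes a weighted mix of the context tokens."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(context[0])
    mixed = []
    for q in queries:
        weights = softmax([dot(q, c) * scale for c in context])
        mixed.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(width)]
        )
    return mixed


def add(a: list[Vector], b: list[Vector]) -> list[Vector]:
    """Element-wise residual addition of two token lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bidirectional_step(video: list[Vector], audio: list[Vector]):
    """Video attends to audio and audio attends to video, from the same inputs."""
    new_video = add(video, cross_attend(video, audio))
    new_audio = add(audio, cross_attend(audio, video))
    return new_video, new_audio


# Hypothetical toy tokens: 3 video tokens and 2 audio tokens, width 2.
video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5], [1.0, -1.0]]

new_video, new_audio = bidirectional_step(video_tokens, audio_tokens)
print("video after attending to audio:", new_video)
print("audio after attending to video:", new_audio)
```

The two streams keep their own token counts, but every token in one stream ends up carrying information from the other, which is the property the synchronized output depends on.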

A multilingual text encoder

How does the model understand your prompt? Through a text encoder, and Lightricks uses a "Gemma 3-based multilingual encoder with multi-layer feature extraction and thinking tokens" that produces separate embeddings for the video and the audio. The multilingual part is the quiet, ad-relevant bit. A model whose prompt understanding is built on a multilingual encoder is positioned to take direction in more than one language, which is exactly the lever you want when one creative concept has to ship across markets, the same problem we walk through in AI for advertising.

The Open-Weights Tradeoff

Open weights are the headline, so it is worth being honest about both sides of the trade. Downloading a model is not the same as running one cheaply, and pretending otherwise helps nobody plan a budget.

What open weights actually get you

The upside is real, and it is not only a price story. Holding the weights yourself unlocks three things a rented endpoint almost never offers:

  • Fine-tune the base model on your own footage. The dev checkpoint is published as fully trainable, so the model can learn your category instead of guessing at it.
  • Train LoRA adapters for a recurring spokesperson, a specific product, or a house look you reuse across every campaign.
  • Run unlimited generations on hardware you control, with no external meter ticking while you iterate toward the shot.

For a studio with a signature style and an engineer to maintain it, that control is the entire reason to choose an open model over a rented one. You are not buying clips; you are buying a model you can bend.
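Why are LoRA adapters so much lighter than a full fine-tune? LoRA, as a general technique, freezes a weight matrix and trains two thin matrices whose product is added to it. The sketch below counts trainable parameters for a single layer. The layer width (4096) and rank (16) are hypothetical round numbers chosen for illustration, not LTX-2.3's actual layer shapes.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, for illustration only.
D_OUT, D_IN, RANK = 4096, 4096, 16

full = full_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)

print(f"full fine-tune: {full:,} trainable parameters")
print(f"LoRA adapter:   {lora:,} trainable parameters")
print(f"LoRA trains {100.0 * lora / full:.2f}% of the layer")
```

Under these assumed sizes the adapter trains less than one percent of the layer, which is why a spokesperson or house-look adapter is a much smaller job than retraining the base model.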

The hardware bill is the catch

Now the other side. A 22-billion-parameter audio-video model is heavy, and the cost you avoid at the API does not vanish; it moves to your own GPU and your own time. Lightricks clearly knows this, because the repository is full of ways to make a big model fit a smaller machine. It ships FP8 quantization to shrink the memory footprint, and a block-streaming mode that, in its words, "streams transformer blocks through the GPU one block at a time, so the full model runs on machines without enough memory to hold all its weights at once." Those are excellent engineering answers. They are also a tell: running this well is an engineering project, not a signup.
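A back-of-envelope estimate shows why both tricks matter. The sketch below counts weight storage only and rests on several assumptions: the unquantized checkpoint stores 2 bytes per parameter, FP8 stores 1 byte for every parameter, and the model splits into equally sized blocks. The block count of 48 is a hypothetical placeholder, not a figure from Lightricks. Activations, the text encoder, and any other components are ignored, so real memory use will be higher.

```python
# Bytes per parameter for each storage format (assumed uniform across the model).
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes): billions of params x bytes each."""
    return params_billion * BYTES_PER_PARAM[dtype]


def streamed_peak_gb(params_billion: float, dtype: str, num_blocks: int) -> float:
    """Weights resident at once if only one equally sized block is loaded."""
    return weight_gb(params_billion, dtype) / num_blocks


PARAMS_B = 22             # from the article
HYPOTHETICAL_BLOCKS = 48  # assumption for illustration, not a published figure

for dtype in ("bf16", "fp8"):
    total = weight_gb(PARAMS_B, dtype)
    peak = streamed_peak_gb(PARAMS_B, dtype, HYPOTHETICAL_BLOCKS)
    print(f"{dtype}: {total:.0f} GB of weights, "
          f"about {peak:.2f} GB resident per block when streamed")
```

Halving the bytes per parameter halves the weight footprint, and streaming trades that footprint for transfer time, since every block has to be moved onto the GPU again for each pass through the model.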

Fast mode: the distilled model

Speed is the third axis. Alongside the full model, Lightricks publishes a distilled checkpoint described as "the distilled version of the full model, 8 steps, CFG=1." Eight steps is fast. A full diffusion run can take dozens of denoising steps, so a distilled eight-step model cuts the work per clip several times over, which is the difference between iterating quickly and waiting around. The pattern is familiar from the rest of the field: keep the heavy model for final quality, reach for the distilled one while you are still finding the shot.
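The "8 steps, CFG=1" label compounds two savings. In standard classifier-free guidance, each denoising step runs the model twice (once with the prompt, once without) and blends the results; at a guidance scale of exactly 1 the blend reduces to the prompted prediction, so the second pass can be skipped. The sketch below counts forward passes under that standard scheme. The 40-step, scale-4 baseline is a hypothetical comparison point, not a documented LTX-2.3 default, and pipelines with extra guidance terms would need more passes per step.

```python
def forward_passes(steps: int, cfg_scale: float) -> int:
    """Model evaluations for one clip under plain classifier-free guidance."""
    # Scale 1 means the guided output equals the conditional prediction,
    # so the unconditional pass is unnecessary.
    passes_per_step = 1 if cfg_scale == 1.0 else 2
    return steps * passes_per_step


distilled = forward_passes(steps=8, cfg_scale=1.0)   # from the model card quote
baseline = forward_passes(steps=40, cfg_scale=4.0)   # hypothetical full-model run

print(f"distilled: {distilled} forward passes")
print(f"baseline:  {baseline} forward passes")
print(f"ratio:     {baseline / distilled:.0f}x fewer model evaluations")
```

Against that assumed baseline the distilled checkpoint needs about a tenth of the model evaluations, which is where the faster iteration comes from.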

What It Can Do That Matters for Ads

Architecture is interesting; capabilities pay the bills. Setting aside the raw text-to-video generation everyone expects, LTX-2.3 ships a set of pipelines that map cleanly onto real ad tasks:

  • Lip dub and re-voice an existing performance into a new script or a new language.
  • Audio-to-video, so a finished voiceover can drive the picture instead of the other way around.
  • Retake a single time region of a clip rather than regenerating the whole thing.
  • Camera-control LoRAs for named moves, like a dolly-in on a product or a locked-off static shot.

Each of those is worth a closer look, because each maps to a job an ad team already does by hand.

Lip dubbing and re-voicing

The most ad-shaped feature is lip dubbing. Lightricks publishes a LipDub pipeline for "lip dubbing, rephrasing, matching speaker identity," which is the exact operation behind a localized spot: take an existing performance, change the words, and reshape the mouth so the new audio looks spoken rather than dubbed. For a global advertiser, re-voicing one hero clip into several languages without it reading as a bad overdub is the difference between one ad and a catalog of them, and it is the same wedge that separates AI from a human UGC creator.

Audio-to-video and retakes

Two more pipelines matter for production rhythm. An audio-to-video pipeline conditions the generation on an input audio file, so you can start from a voiceover and let the picture follow the read. And a retake pipeline regenerates a specific time region of an existing video, which means a near-perfect take with one bad second becomes a one-second fix instead of a full regeneration and a fresh roll of the dice. That single capability is what turns a generation from a slot-machine pull into something closer to an editable asset.
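To see how much work a retake saves, it helps to translate a time region into frames. The helper below is a hypothetical illustration, not part of the LTX-2.3 pipeline or its API, and the 24 fps, 5-second clip is an assumed example. A real pipeline that works in a compressed latent space may snap the region to coarser boundaries than individual frames.

```python
import math


def frames_for_region(start_s: float, end_s: float, fps: int, total_frames: int) -> range:
    """Frame indices covered by a time region, clamped to the clip length."""
    if not 0 <= start_s < end_s:
        raise ValueError("region must satisfy 0 <= start_s < end_s")
    first = max(0, math.floor(start_s * fps))
    last = min(total_frames, math.ceil(end_s * fps))
    return range(first, last)


# Assumed example: a 5-second clip at 24 fps with one bad second in the middle.
FPS = 24
TOTAL_FRAMES = 5 * FPS

retake = frames_for_region(2.0, 3.0, FPS, TOTAL_FRAMES)
fraction = len(retake) / TOTAL_FRAMES

print(f"regenerate frames {retake.start} to {retake.stop - 1}")
print(f"that is {len(retake)} of {TOTAL_FRAMES} frames ({fraction:.0%} of the clip)")
```

In this example the fix touches a fifth of the clip and leaves the other four fifths exactly as approved.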

Camera moves on demand

Finally, control. Lightricks publishes a set of camera-control LoRAs (dolly in, dolly out, jib up, jib down, static, and more) plus motion and pose control adapters. For an ad, a directed push-in on a product or a steady locked-off shot for a testimonial is not garnish; it is grammar. Being able to call a camera move by name, rather than praying a prompt produces one, is the kind of control that separates a usable creative tool from a novelty.

Where It Sits vs the Models You Can Run Today

Here is the unhedged take: LTX-2.3 is the most interesting open release in the category, and it is the wrong first tool for most people making ads this quarter. Those two things are both true, and the table below is why. The axis is not quality; it is the operating model.

|                | LTX-2.3 (open weights) | Managed models on Novoads     |
| -------------- | ---------------------- | ----------------------------- |
| How you get it | Download the weights   | Hosted, no download           |
| Where it runs  | Your own GPU           | The platform's cloud          |
| Setup          | CUDA, PyTorch, FP8     | Upload a photo, write a script |
| Audio          | Native, joint in-model | Veo native audio              |
| Cost shape     | Hardware plus time     | Per-clip credits              |
| Best for       | Teams that self-host   | Marketers shipping today      |

The contrast is not about which model is smarter. It is about who does the work of running it. With LTX you own the model and the machine, which is power if you have the engineers and overhead if you do not. With a managed model you own neither, and in exchange you skip the entire infrastructure question.

Use each when

For a quick decision, skip the spec sheet and match the tool to the seat you sit in:

  • Reach for LTX-2.3 when you have GPU access and engineering time, you want to fine-tune or train LoRAs on a signature look, and keeping the weights in-house matters more than shipping the first ad fast.
  • Reach for a managed model when you need a finished ad this week, you would rather upload a product photo than configure CUDA, and availability and speed are themselves the feature, which is how AI fits a real advertising toolkit rather than becoming a side project.

The point is not loyalty to open or hosted. It is matching the operating model to the team, the same way you would compare the AI video ad platforms before betting a quarter on one.

How Novoads Solves the Same Job Without the Setup

LTX-2.3 is exciting, but to be straight about it: it is not on Novoads, and we are not going to pretend otherwise. Novoads runs a curated set of managed models (Seedance 2.0, Kling v3 Pro, Sora 2 and Sora 2 Pro, and Veo 3.1), and its whole job is to delete the setup that LTX hands you. There is no GPU to provision and no pipeline to wire. In Novoads the flow is built to skip the infrastructure entirely:

  • Upload a product photo and write or auto-generate a script, with no timeline to edit and no config to manage.
  • Pick an AI actor from more than 100 to hold and present the product on camera.
  • Generate a vertical ad with voice, lip-sync, and captions in about four minutes.
  • Localize the same spot across more than 30 languages with real regional accents.

The cost is concrete instead of estimated, because there is no hardware to amortize. A 5-second Seedance clip is about 3 credits, roughly $2, and heavier models land between there and about $11, still a fraction of the $200 to $500 a human creator charges per deliverable. You choose the model per placement, a fast Seedance clip for volume testing or Veo 3.1 when you need native sound, so the workflow stays the same while the engine swaps underneath. If the UGC format itself is new to you, our guide on what a UGC creator is covers the hook-demo-payoff shape these ads are built to carry, and the same logic applies whether the spot runs on Meta or as a TikTok ad.
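The comparison above reduces to simple division. The sketch below uses only the approximate dollar figures quoted in this article ($2 to $11 per clip, $200 to $500 per human deliverable) and asks how many generated clips fit inside one human deliverable's budget. The function name is a hypothetical helper, and real credit pricing may not scale this cleanly.

```python
# Approximate figures quoted in the article, in US dollars.
CLIP_COST_LOW, CLIP_COST_HIGH = 2, 11
HUMAN_COST_LOW, HUMAN_COST_HIGH = 200, 500


def clips_within(budget: int, cost_per_clip: int) -> int:
    """Whole clips that fit inside a budget at a given per-clip cost."""
    return budget // cost_per_clip


# Worst case: cheapest human deliverable versus the most expensive clip.
fewest = clips_within(HUMAN_COST_LOW, CLIP_COST_HIGH)
# Best case: priciest human deliverable versus the cheapest clip.
most = clips_within(HUMAN_COST_HIGH, CLIP_COST_LOW)

print(f"one human deliverable buys between {fewest} and {most} generated clips")
```

Even at the unfavorable end of both ranges, one human deliverable covers well over a dozen generated variations.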

One honest caveat: the trial is $1 for 3 days of access, then $49/month, and you can cancel anytime. It is a paid trial, not a giveaway, and it grants enough credits for roughly one video so you can see your own product in a finished ad before you commit.

The weights are open. The bottleneck moved.

LTX-2.3 is a real milestone, but it is easy to misread what kind of milestone it is. It does not make a finished ad cheaper to ship for most teams, because the API bill it removes is replaced by a GPU bill and an engineering project. What it actually does is hand the model itself, weights, audio, control, and all, to the people who want to own their stack rather than rent it. That is a gift to studios and researchers, and a distraction for a marketer who just needs ten variations by Friday. The open question was never whether the weights would open. It was where the work would go once they did, and the answer is clear: it moved from the render to the rig. Until you want to run the rig, make the ads you need now, and let the open-weights frontier prove itself on someone else's GPU.

Frequently Asked Questions

What is LTX-2.3 in one sentence?

It is Lightricks' open-weights audio-video foundation model that generates synchronized video and audio inside a single model and is built to run locally on your own hardware. LTX-2.3 is the newest checkpoint in the LTX-2 family, which was presented in a paper submitted in January 2026.

Does LTX-2.3 cost anything to run?

The weights are open and there is no license fee to download and self-host them under the LTX-2 Community License Agreement. What you pay for is the hardware and the work: a capable GPU and the engineering time to run a 22-billion-parameter model. So it is open to self-host, not zero-cost to operate.

What is the difference between LTX-2 and LTX-2.3?

Lightricks describes 2.3 as a significant update to LTX-2 with improved audio and visual quality and enhanced prompt adherence. The model also grew: the LTX-2 checkpoints are 19B parameters, while the LTX-2.3 checkpoints are 22B.

Can I use LTX-2.3 in Novoads?

No. Novoads does not run LTX. It runs managed models behind a no-setup workflow: Seedance 2.0, Kling v3 Pro, Sora 2 and Sora 2 Pro, and Veo 3.1. LTX is something you self-host or run on a platform that hosts it; you then bring the finished clip wherever you edit.

What can LTX-2.3 do that matters for ads?

It generates native audio with the video, dubs lips and re-voices a speaker, turns an audio track into video, retakes a single time region of an existing clip, and steers the camera with control LoRAs. It also uses a multilingual text encoder, which matters if you localize one spot across markets.

How much does an AI video ad cost to make?

It depends on the model. On Novoads a 5-second Seedance clip is about 3 credits, roughly $2, while heavier models like Veo 3.1 or a one-minute talking actor run closer to $7. There is no single price for a video; the range is about $2 to $11, which is still a fraction of the $200 to $500 a human UGC creator charges per deliverable.

Key Takeaways

  • LTX-2.3 is Lightricks' open-weights audio-video foundation model. It generates synchronized video and audio inside a single model, and you download the weights and run them on your own hardware instead of calling a hosted API.
  • It is a significant update to LTX-2, with improved audio and visual quality and better prompt adherence. The architecture is an asymmetric dual-stream transformer: a 14B-parameter video stream and a 5B-parameter audio stream that attend to each other.
  • Open weights are a real lever (download, fine-tune, train LoRAs, no per-clip API bill), but the catch is hardware. The 22B model needs a capable GPU, so Lightricks ships FP8 quantization and block streaming to make local runs feasible.
  • For ad work it carries useful pipelines: lip dubbing and re-voicing, audio-to-video, retaking a time region of a clip, and camera-control LoRAs. The distilled model runs in 8 steps for fast inference.
  • Novoads does not run LTX. What it runs today is managed models (Seedance 2.0, Kling v3 Pro, Sora 2 and Sora 2 Pro, Veo 3.1) behind a no-setup workflow, where a 5-second Seedance clip is about 3 credits, roughly $2.
Mauricio Valdivia

Founder of Novoads

Mauricio is the founder of Novoads, where he works to democratize video advertising with AI for brands in Latin America.