The state of the art in AI film production pipelines
Our current pipeline produces 8-second video clips that don't gel — characters drift, lighting mismatches, motion doesn't carry from one shot to the next. The films feel like slideshows, not films. This is the research brief we compiled to fix that: how the best AI filmmakers in the world actually make convincing films in 2026, written as a practical implementation guide we can copy.
- Character consistency across shots
- Scene, location and lighting continuity
- Motion continuity
- Model selection per shot type
- Prompt engineering for video
- Shot-list design
- Editing and post-production
- Production pipeline architecture
- Quality gates
- Real-world case studies
- Specific recommendations for bMovies
- Sources
01. Character consistency across shots
Character consistency is the single biggest unsolved problem in AI filmmaking. The 2025 survey Generative AI for Film Creation rates “consistent character body movement” as the top priority among AI filmmakers at 6.62/7, yet current tools deliver it at only 4.45/7 — a shortfall of roughly a third against demand. Every technique below is a workaround for that gap.
Reference-image conditioning
Every major model has converged on the same mechanism: give the model 2–5 clean reference images of your character at the start of every generation.
Runway Gen-4 References is the most mature: up to 3 images (front portrait, three-quarter, full-body / wardrobe, optional style grade) memorised as a session-level identity. Attach the same references every single generation — no persistent state between sessions.
Veo 3.1 Ingredients works similarly. Assemble 2–3 character references plus a lookbook of 8–10 frames defining lens, lighting and palette. Repeat identical vocabulary across every prompt to reduce semantic drift.
Grok Imagine Video Reference-to-Video (what bMovies uses today) accepts 1–7 reference images via URL alongside the text prompt. @image1, @image2 notation specifies distinct visual elements that synthesise into one scene. No built-in session state — every call re-uploads references.
Kling Elements (Kling 2.5 / 3.0) lets you upload character references as reusable session elements. Kling 3.0 (Feb 2026) introduced multi-shot sequences of 3–15s with subject consistency across camera angles — the first model to solve “same character from multiple angles” within a single generation call.
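The operational pattern is the same across providers: keep one fixed set of reference images and re-attach it to every generation call. A minimal sketch of that pattern, assuming a hypothetical reference-to-video endpoint; the URL, payload fields and auth header are illustrative stand-ins, not the actual xAI or Kling API.

```python
import requests

# Fixed character references, reused verbatim on every call (no session state).
LENA_REFS = [
    "https://assets.example.com/lena/front.png",
    "https://assets.example.com/lena/three_quarter.png",
    "https://assets.example.com/lena/full_body.png",
]

def generate_clip(prompt: str, api_url: str, api_key: str) -> bytes:
    """Submit one reference-to-video generation.
    Hypothetical payload shape: real providers differ, so check their docs
    for the actual field names, auth scheme and job-polling flow."""
    payload = {
        "prompt": prompt,               # canonical descriptor repeated verbatim
        "reference_images": LENA_REFS,  # re-attached on every single call
        "duration_seconds": 8,
    }
    resp = requests.post(
        api_url,
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.content  # or a job id to poll, depending on the provider
```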
LoRA and character fine-tuning
Higgsfield Soul ID is the most purpose-built implementation: upload 20+ photos, train a Soul ID in ~3 minutes, generate unlimited visuals with that identity locked across every style, pose and lighting. Cinema Studio 2.5 integrates Soul Cast natively for up to 3 characters, placing consistent characters at the centre of the workflow before the first frame. This is the closest thing to a solved problem in character consistency as of Q1 2026.
Starting-frame / ending-frame conditioning (image-to-video)
The most practical technique. Generate a reference still of your character in the correct pose, lighting and wardrobe, then feed it as the starting frame of an image-to-video model. Character appearance at frame 0 is guaranteed.
Kling O1 accepts both start and end image and synthesises motion between them — the strongest bidirectional conditioning available.
Generate stills first → validate character appearance → use stills as image-to-video seed frames. Never prompt raw text-to-video for character-critical shots.
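A sketch of that stills-first loop. The three callables are stand-ins for whichever image model, character check and image-to-video provider are wired in; nothing here is a specific vendor API.

```python
from typing import Callable

def render_shot(
    shot_prompt: str,
    character_refs: list[str],
    generate_still: Callable[[str, list[str]], bytes],   # image model call
    passes_character_check: Callable[[bytes], bool],     # e.g. face-embedding gate
    image_to_video: Callable[[bytes, str], bytes],       # video model call
    max_attempts: int = 5,
) -> bytes:
    """Stills-first loop: validate the character on a still, then seed image-to-video."""
    for _ in range(max_attempts):
        still = generate_still(shot_prompt, character_refs)
        if passes_character_check(still):
            return image_to_video(still, shot_prompt)
    raise RuntimeError("No still passed the character check; revise the prompt or refs")
```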
Practical checklist for a 20+ shot film
- Build a character bible — front, three-quarter, profile, full-body, two expressions. Six reference images minimum.
- Build a visual lookbook — 8–10 frames defining lens focal length, lighting, palette, grain.
- Write a canonical descriptor string — “Lena, female, early 30s, shoulder-length black hair, pale complexion, red scarf, leather jacket.” Verbatim in every prompt.
- For API pipelines: upload references once per session, reference in every call.
- Generate the reference still per shot first; use it as the image-to-video seed.
- Apply a unified colour grade in post to paper over residual drift.
02. Scene, location and lighting continuity
World-building before generating
Establish every environment as concept art before video generation begins. For each location generate 4–6 still images covering different angles and times of day — the environment reference library. Veo 3.1 Ingredients takes character + environment in the same call; Runway Gen-4 combines character, location and style-grade frame as three separate inputs.
Palette locking
Pick a palette at the start and encode it in every prompt: “teal–orange grade, neon magenta accent, filmic shadows, warm practical light sources.” Repeat verbatim. Tolerate ~15% histogram drift before flagging a shot as mismatched; beyond that, regenerate.
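A rough way to operationalise that drift tolerance, assuming OpenCV: compare HSV histograms of a shot frame against the scene's anchor frame and flag large distances. Mapping the ~15% tolerance onto a Bhattacharyya distance of 0.15 is an assumption to tune, not a published threshold.

```python
import cv2

def palette_drift(anchor_path: str, shot_path: str) -> float:
    """Return a histogram distance in [0, 1] between an anchor frame and a shot frame.
    0 = identical palette; flag shots above ~0.15 for regrade or regeneration."""
    hists = []
    for path in (anchor_path, shot_path):
        img = cv2.imread(path)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        # 2D histogram over hue and saturation; value is dominated by exposure noise.
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(h, h, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
        hists.append(h)
    # Bhattacharyya distance: 0 for identical histograms, 1 for no overlap.
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA)
```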
LUT-style grading in post
- Designate one shot as the grade anchor — typically the first hero shot.
- Use DaVinci Resolve's Color Match / Shot Match to pull all other clips to the anchor.
- Apply a film LUT on top (Film Riot, Ground Control, paid packs).
- Use AI colour matching (Color.io) to generate a 3D LUT from a reference frame and apply it across all clips.
The 2025 norm: apply an AI-adaptive LUT first (70–90% there), then finish with manual node corrections on faces and highlights.
Avoiding “every shot is a different building”
- Reuse the starting still — the environment in the seed frame carries into the video.
- Prop anchoring — place 1–2 distinctive props in every shot from a location (“the cracked neon sign on the left wall”).
- Camera-angle vocabulary — same angle + focal-length terms across all shots in a location.
- Don't regenerate anchors — once a location shot is approved, lock it and build all other shots in the scene around its starting frame.
03. Motion continuity
Last-frame → first-frame conditioning
The most reliable technique. Generate shot N to a clean final frame (ideally without motion blur). Extract. Feed as starting frame for shot N+1 with a motion directive: “continues stepping into light.” Preserves subject orientation, clothing state, spatial position, approximate lighting. Does not preserve velocity — include velocity language explicitly.
Veo 3.1 supports this natively. Warning: “image-to-video methods can produce stagnation or quality degradation when applied autoregressively” — chain too many shots from a single starting frame and you accumulate artifacts. Recommended: hard-cut every 3–4 shots and re-anchor from a fresh keyframe.
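A sketch of seed-frame extraction with OpenCV: take the sharpest of the final few frames (variance of the Laplacian as a rough no-motion-blur proxy) and save it as the next shot's starting frame.

```python
import cv2

def extract_seed_frame(video_path: str, out_path: str, tail_frames: int = 12) -> str:
    """Save the sharpest of the last `tail_frames` frames as the next shot's start frame.
    Sharpness is approximated by variance of the Laplacian (higher = less blur)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(0, total - tail_frames))
    best, best_score = None, -1.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        score = cv2.Laplacian(gray, cv2.CV_64F).var()
        if score > best_score:
            best, best_score = frame, score
    cap.release()
    if best is None:
        raise ValueError(f"Could not read frames from {video_path}")
    cv2.imwrite(out_path, best)
    return out_path
```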
Start-frame + end-frame (Kling O1)
Most controlled motion of any current model. Generate both frames as stills first, then use O1 to fill in the motion between them.
Keyframe interpolation
Midjourney (for keyframe stills) → Veo 3.1 interpolation is a documented 2025 workflow: generate clean start and end keyframes in Midjourney, submit to Veo's interpolation endpoint. Veo calculates the vector flow between frames, producing natural motion with predictable physics.
Practical motion-continuity checklist
- Respect the 180-degree rule across all shots in a scene — pick an axis of action, never cross it without a bridging shot.
- Maintain screen direction — if a character walks left→right in shot 4, they must continue left→right in shot 5 unless there's an explicit cut-to reason for reversal.
- Match eyelines — if A looks left in close-up, B must appear on the left in the reverse.
- Use match-action cues in prompts: “mid-swing,” “hand reaching the doorknob,” “one foot already through the door.”
- Metric: measure optical flow at shot boundaries; significant directional deviation is the objective indicator of broken motion continuity.
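A sketch of that boundary metric, assuming OpenCV: estimate the mean motion direction at the tail of shot N and the head of shot N+1 with Farneback optical flow, and flag large direction flips. The 90° threshold is an arbitrary starting point, not an established standard.

```python
import cv2
import numpy as np

def mean_flow_angle(video_path: str, from_start: bool, n_frames: int = 6) -> float:
    """Mean motion direction (degrees) over the first or last n_frames of a clip."""
    cap = cv2.VideoCapture(video_path)
    if not from_start:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(0, total - n_frames - 1))
    ok, prev = cap.read()
    if not ok:
        cap.release()
        raise ValueError(f"Could not read frames from {video_path}")
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    angles = []
    for _ in range(n_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        angles.append(np.degrees(np.arctan2(flow[..., 1].mean(), flow[..., 0].mean())))
        prev = gray
    cap.release()
    return float(np.mean(angles))

def direction_mismatch(shot_a: str, shot_b: str, threshold_deg: float = 90.0) -> bool:
    """True if screen direction flips by more than threshold_deg across the cut."""
    delta = abs(mean_flow_angle(shot_a, from_start=False)
                - mean_flow_angle(shot_b, from_start=True))
    return min(delta, 360.0 - delta) > threshold_deg
```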
04. Model selection per shot type
As of Q1 2026, no single model is best at everything. The average AI film production uses 3.14 different tools.
Google Veo 3.1
Strengths: native synchronised audio (the only model that does this natively); lip-sync accuracy <120ms; temporal consistency 8.9/10; best photorealism for talking heads; Ingredients references; interpolation between start / end keyframes.
Weaknesses: $0.15/sec (Fast) or $0.40/sec (standard) — the most expensive major model; native audio limits post-production flexibility; complex dialogue and musical instruments still produce inconsistent results.
Kling 2.5 / 3.0
Strengths: best character consistency across shots among API-accessible models; multi-shot sequences (3–15s) with subject consistency across camera angles; dual-keyframe (O1) for controlled motion arcs; mature image-to-video pipeline; ~$0.10/sec.
Weaknesses: aesthetic bias toward Asian character features; occasional over-sharpening; motion physics less convincing than Veo / Runway for physics-heavy shots.
Runway Gen-4.5
Strengths: best-in-class character consistency for multi-scene workflows via References; Director Mode that parses cinematography terminology; Motion Brush for per-pixel control of movement paths (unique); Act-Two for motion-capture-driven performance; the most complete professional editing platform.
Weaknesses: subscription-based credits pricing; less photorealistic than Veo for talking heads; not optimal for audio.
xAI Grok Imagine Video (what bMovies uses today)
Strengths: native audio; xAI API with no cold-start; $0.30/6s or $0.50/10s; 1–7 reference images for identity preservation; Aurora autoregressive architecture; integrates with the rest of the xAI stack.
Weaknesses: character consistency across separate generations is its documented weak point — no session state, every call re-supplies references; Morpheus benchmark found inconsistent physics simulation; warped objects, flickering lighting and unnatural movement are documented artifacts; 6–10s clip cap; no post-gen editing control. Positioned as a consumer tool, not a professional film pipeline anchor.
Hailuo 02 (MiniMax)
Strengths: #2 globally on Artificial Analysis (92.1); physics realism highest of any model (94/100 vs Veo 3's 83/100) — specialised physics critics trained on 120k labeled clips with PyBullet integration; native 1080p at $0.28/10s; deterministic cinematography from natural-language camera commands.
Weaknesses: no audio generation or lip-sync; 10s duration cap; struggles with fine-grained details in complex prompts.
Luma Ray 3
Strengths: world's first model with native HDR (16-bit EXR export); scene-level global illumination tracks highlights and shadows across frames; generation <90s at 1080p.
Weaknesses: less character-consistency tooling than Runway / Kling; stronger for B-roll and environment shots.
Selection matrix
| Shot type | Primary | Backup |
|---|---|---|
| Dialogue / talking head | Veo 3.1 | Runway Act-Two |
| Character action (multi-shot) | Kling 3.0 | Runway Gen-4.5 |
| Physics-heavy action | Hailuo 02 | Kling 3.0 |
| Establishing / environment | Luma Ray 3 | Hailuo 02 |
| Character-consistent close-ups | Higgsfield Cinema Studio | Runway Gen-4.5 |
| Motion-capture performance | Runway Act-Two | — |
| B-roll / product | Pika 2.5 | Grok Imagine Video |
05. Prompt engineering for video
The universal structure
Every prompt for every model should include these five layers, in order:
[Identity] [Cinematography] [Environment] [Performance/Action] [Style]
Example: “Lena, female, early 30s, black hair, red scarf — medium shot, 35mm, shallow depth of field, slow dolly push — rain-slicked alley, warm practical light from shopfront on left — turns to face camera, wide-eyed, backs against wall — cinematic, film grain, teal-orange grade, 24fps.”
- Identity — exact character descriptor (canonical descriptor string, copy-paste).
- Cinematography — shot size (ECU/CU/MS/MLS/WS), lens (24/35/50/85mm), camera movement (dolly/pan/tilt/crane/handheld/static), DoF.
- Environment — location, time of day, weather, lighting sources, key props.
- Performance — what the character does, how they move, emotional state.
- Style — film grain, LUT / grade direction, codec aesthetic (35mm, anamorphic, DV, 16mm).
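A trivial helper that keeps the five layers in order and the canonical descriptor verbatim; the field names are ours, not any model's API.

```python
from dataclasses import dataclass

@dataclass
class ShotPrompt:
    identity: str        # canonical descriptor string, copy-pasted verbatim
    cinematography: str  # shot size, lens, camera move, DoF
    environment: str     # location, time of day, light sources, props
    performance: str     # what the character does and how they feel
    style: str           # grain, grade, codec aesthetic, frame rate

    def render(self) -> str:
        return " -- ".join([self.identity, self.cinematography,
                            self.environment, self.performance, self.style])

LENA = ("Lena, female, early 30s, shoulder-length black hair, "
        "pale complexion, red scarf, leather jacket")

prompt = ShotPrompt(
    LENA,
    "medium shot, 35mm, shallow depth of field, slow dolly push",
    "rain-slicked alley, warm practical light from shopfront on left",
    "turns to face camera, wide-eyed, backs against wall",
    "cinematic, film grain, teal-orange grade, 24fps",
).render()
```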
Amateur vs. cinematic
Amateur: “A woman walks down a dark street at night.”
Cinematic: “Lena, early 30s, shoulder-length black hair, red scarf — medium shot, 50mm lens, tracking dolly alongside at hip height — rain-slicked cobblestone street, single amber streetlamp behind her casting long forward shadow, reflections fragmenting in puddles — walks with deliberate pace, head slightly bowed against rain, occasionally glances left — cinematic, 35mm film grain, desaturated cold with warm amber highlights, 24fps.”
Difference: physical specificity, camera specification, light sources (not just “dark”), character physical behaviour rather than abstract mood.
Model-specific tips
- Runway treats prompts as physics specifications. Use force / mass language. Motion Brush: paint foreground and background as separate motion vectors.
- Kling is an audio-visual choreographer; include timeline beat markers. Elements for reusable references. Negative prompts: “no morphing, no floating, no jitter.”
- Veo is a structured data renderer; separate cinematography, lighting, subject, environment, audio into distinct clauses to prevent concept bleed.
- Hailuo parses natural-language camera commands deterministically. Write “orbit 180°, dolly-zoom in, handheld shake” as you would brief a DP.
- Grok: prompt motion directives over composition (the reference image handles composition). Physics negatives: “no flickering, no warping, solid and rigid.”
06. Shot-list design
The classical structure (still applies to AI)
- Master shot (wide / long) — establishes characters, spatial relationships, environment. Generate first — visual bible for the scene.
- Coverage — medium shots of each character, individually and combined.
- Close-ups / extreme close-ups — faces, hands, significant props.
- Over-the-shoulder — A's face over B's shoulder; creates intimacy and sightline.
- Cutaways and inserts — details, environmental context, reaction shots.
For AI: generate the master shot first; use it as the reference frame for every other shot in the scene.
The Kuleshov effect and AI
Meaning in film is created by the juxtaposition of shots, not the content of individual shots. A neutral face cut after a bowl of soup reads as “hunger”; the same face after a coffin reads as “grief.”
You need fewer shots than you think if your shot selection is intentional. A five-minute film needs 40–80 shots, not 200.
Screen direction and the 180-degree rule
Mark an imaginary axis of action before generating any shots in a dialogue scene. All camera positions must remain on one side. For AI: choose “left” or “right” facing for each character and encode it in every prompt. Bridging shots (straight-on / neutral angle close-up) allow crossing the line without disorientation.
07. Editing and post-production
Colour-grading pipeline (DaVinci Resolve)
- Import all clips into one timeline.
- Designate one anchor shot (usually the first hero).
- Use Color Match / Shot Match to pull all clips to the anchor.
- Apply a single film LUT across the entire timeline on a shared node — imposes aesthetic unity.
- Shot-level corrections for faces, highlights, problem shots.
For AI video the LUT does more work than in live action because model-to-model variance is higher. A desaturated, slightly pushed mid-tone look (teal shadows, warm highlights) is the standard “cinematic AI film” grade.
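For the scripted (non-Resolve) path, ffmpeg's lut3d filter can apply a .cube LUT across every clip. A minimal wrapper, assuming ffmpeg is on PATH and the LUT file came out of the anchor-shot grade.

```python
import subprocess

def apply_lut(in_path: str, lut_path: str, out_path: str) -> None:
    """Apply a 3D LUT (.cube) to one clip via ffmpeg's lut3d filter, copying audio as-is."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_path,
         "-vf", f"lut3d={lut_path}",
         "-c:a", "copy", out_path],
        check=True,
    )
```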
Audio
- Music — Suno for songs with vocals; ElevenLabs Music for instrumental with precise BPM / key control and stem exports; MusicGen (Meta, open source) for background instrumental with audio conditioning.
- Voice / ADR — ElevenLabs cloned from 1–3 minutes of reference audio. Clone once per character; use throughout.
- SFX — ElevenLabs SFX (text-to-SFX) or Freesound + manual curation. Veo 3.1 native audio can supply ambient SFX as a byproduct.
Edit techniques that work for AI video
- Match cut — find a shot that ends on a shape / motion / direction that rhymes with the next shot's opening. Hides model inconsistencies.
- J-cut / L-cut — audio from the next / previous scene starts before / after the picture cut. Audio continuity gives the brain a bridge.
- Dissolve — 8–16 frame dissolves between thematically related shots mask model variance better than hard cuts.
- Cutaways — environmental detail shots give you breathing room to re-establish characters after a generation break.
- Colour continuity — the single most powerful “gel” in AI film is a unified colour grade. Most “disconnected clips” problems are ~50% colour-temperature mismatch, solvable in 30 minutes of grading.
08. Production pipeline architecture
Agent roles and sequential dependencies
Script → Scene plan → Character bible → Storyboard →
Reference stills → Video generation → Assembly →
Audio → Colour grade → Final output
| Stage | Agent role | Output |
|---|---|---|
| 1 | Writer | Scene-by-scene script with action lines, dialogue, scene headers |
| 2 | Director | Shot list per scene: type, angle, duration, screen direction, axis of action |
| 3 | Production designer | Character bibles, location lookbooks, colour palette |
| 4 | Storyboard | Panels (image gen from shot list) with camera notes |
| 5 | DP | Reference stills (one per shot, validated for character consistency) |
| 6 | Camera operator | Video generation from reference stills (image-to-video) |
| 7 | Editor | Assembly, cut decisions, transition choices, rough cut |
| 8 | Composer / sound | Music, SFX, ADR / voice |
| 9 | Colourist | Grade, LUT application, final polish |
Text-first vs. storyboard-first vs. reference-frame-first
- Text-first (current bMovies approach): the writer's prompts go directly to video generation. Problem: no reference images, so characters drift immediately.
- Storyboard-first: writer → director generates shot list → image model generates storyboard panels (one per shot) → panels become image-to-video seeds → video generation. 2025 standard. Adds ~20% cost; delivers consistency that makes the film watchable.
- Reference-frame-first (Higgsfield Cinema Studio): create the character and environment first, then write the shot list around what's achievable, then generate. Most consistent but least narratively flexible.
Recommendation: storyboard-first. Our 28-step pipeline needs a “generate reference stills” step before video generation; that one architectural change is the biggest single lever.
09. Quality gates
Automated checks to implement
- Face consistency — extract a frame from each shot, run face embeddings (dlib, DeepFace, InsightFace), and require cosine similarity >0.85 for the same character; flag anything <0.7 for regeneration (see the sketch after this list).
- Motion artifact detection — extract first / last frame, optical flow analysis, flag clips where flow reverses abruptly mid-clip. SSIM drops >30% usually indicate a generation seam.
- Colour / palette consistency — compute colour histogram of dominant region, flag divergence >15% from scene anchor.
- Audio-visual sync (Veo 3.1, Kling native audio) — detect speech onset, verify mouth-open state at corresponding frame. Flag divergence >3 frames (125ms at 24fps).
- Duration / motion sanity — reject clips <90% of requested duration or with near-zero optical flow in first / last 10% of frames.
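A sketch of the face-consistency gate, assuming the deepface package and OpenCV; the 0.85 / 0.7 thresholds are the ones above and will need tuning per embedding model.

```python
import os
import tempfile

import cv2
import numpy as np
from deepface import DeepFace

def face_similarity(ref_image: str, shot_video: str, frame_idx: int = 0) -> float:
    """Cosine similarity between the character reference and one frame of a generated shot.
    >0.85 = pass; <0.7 = flag the shot for regeneration."""
    cap = cv2.VideoCapture(shot_video)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return 0.0
    # Write the extracted frame to a scratch file so DeepFace can read it.
    frame_path = os.path.join(tempfile.gettempdir(), "shot_frame.png")
    cv2.imwrite(frame_path, frame)
    embeddings = []
    for path in (ref_image, frame_path):
        rep = DeepFace.represent(img_path=path, model_name="ArcFace",
                                 enforce_detection=False)
        embeddings.append(np.array(rep[0]["embedding"]))
    a, b = embeddings
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```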
Shy Kids used ~1 in 300 Sora generations in their final film. A reasonable production standard is generating 5–10 variations per shot and selecting 1. Budget for a 5:1 generation-to-use ratio minimum.
10. Real-world case studies
Shy Kids — “Air Head” (2024)
Sora → manual edit → colour grade → human VO and music. What worked: designing the protagonist as an intentionally surreal character (man with balloon head) that could tolerate identity drift. Lesson: design characters around what AI can maintain, not around realism. What didn't: ~300 generations for <2 minutes. 50% of shots required manual VFX.
“There isn't the feature set in place yet for full control over consistency — the closest we could get was being hyper-descriptive in prompts.”
Paul Trillo — “The Golden Record” (2024)
Sora → 11 clips with light colour correction → assembly. Achieved a vintage 16mm aesthetic from raw Sora output by developing specific stylistic prompt language. Transferable: invest in finding the prompt language that produces your target aesthetic, then apply it to every single shot.
Don Allen Stevenson — “Beyond Our Reality” (2024)
Started from random generative outputs and built the narrative retroactively. When a fox-crow hybrid came out with two legs instead of four, he kept it. Lesson: building the narrative around what the AI produces well is faster than fighting the AI.
The 2025 AI Film Festival (Runway)
6,000 submissions vs. 300 in 2023. No fully automated pipeline won without human intervention. Common thread in winners: unified colour grade, intentional shot selection, audio that matched visual mood.
OpenMontage (open-source)
The most complete open-source AI film pipeline as of early 2026: 52 production tools, 7-stage workflow, 14 video providers, 10 image providers, 4 TTS providers. Key architectural decision: the agent is the orchestrator, not code. YAML pipeline manifests and Markdown stage-director instructions.
11. Specific recommendations for bMovies
The core problem
We generate clips via Grok Imagine Video with no reference-image conditioning and no storyboard step. Each clip is a cold text-to-video call with no visual anchor. The result is guaranteed drift.
The 80 / 20 fix
- Add a reference-still step before video generation (highest impact). Generate a still of the character in the shot's pose / lighting / environment using Grok Imagine Image. Call Grok Imagine Video with that still as a reference (1–7 accepted). Locks character appearance at frame 0. ~$0.10 per shot added.
- Supplement Grok with Kling for character-critical shots. Kling 2.5 image-to-video via API (kie.ai / fal.ai / piapi.ai at $0.10/sec or less). Feed the reference still from Fix 1 as source image. Keep Grok for B-roll and scenes without named characters.
- Apply a unified LUT in post. DaVinci Resolve (free). Pick one hero shot as anchor. Apply a cinematic LUT across the timeline. This single step gels the film more than any generation-side change.
- Generate 3–5 variants of each shot and select the best. This alone increases usable output quality by 40–60%.
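A sketch of that generate-and-select loop; generate_variant and score_clip are placeholders for the provider call and whatever combined QC score the quality gates produce.

```python
from typing import Callable

def best_of_n(
    shot_prompt: str,
    generate_variant: Callable[[str], str],  # provider call; returns path to a generated clip
    score_clip: Callable[[str], float],      # combined QC score (face, motion, palette)
    n: int = 4,
) -> tuple[str, float]:
    """Generate n variants of one shot and keep the highest-scoring clip."""
    clips = [generate_variant(shot_prompt) for _ in range(n)]
    scored = [(clip, score_clip(clip)) for clip in clips]
    return max(scored, key=lambda pair: pair[1])
```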
The full recommended pipeline
Step 1: Script parsing → JSON scene/shot structure (existing)
Step 2: Character Bible — 6 reference images per character
Step 3: Location Lookbook — 4 reference images per location
Step 4: Shot list — type, angle, action, duration, screen direction
Step 5: Storyboard still per shot — seeded from character / location refs
Step 6: Quality-check storyboard — face similarity >0.85, palette >85%
Step 7: Video generation — image-to-video from storyboard still
Kling 2.5 for character shots; Hailuo 02 for physics;
Grok Imagine Video for B-roll; Veo 3.1 for dialogue
Step 8: Quality-check video — motion artifact, face consistency
Generate 3 variants per shot, score and select best
Step 9: Audio — ElevenLabs voices, Suno music, SFX
Step 10: Assembly — rough cut (DaVinci Resolve or ffmpeg)
Step 11: Colour grade — anchor shot grade + film LUT across timeline
Step 12: Audio mix — voice, SFX, music at correct levels
Step 13: Export + on-chain mint
Model budget for a 5-minute film (~60 shots)
| Step | Model | Cost |
|---|---|---|
| Character Bible (6 × 3 chars) | Grok Imagine Image | ~$1.50 |
| Location Lookbooks (4 × 5 locs) | Grok Imagine Image | ~$2.00 |
| Storyboard stills (60 × 3 variants) | Grok Imagine Image | ~$9.00 |
| Character video (30 × 8s, Kling) | Kling 2.5 | ~$24.00 |
| B-roll / physics (20 × 8s) | Hailuo 02 / Grok | ~$6.00 |
| Dialogue (10 × 8s, Veo Fast) | Veo 3.1 Fast | ~$12.00 |
| Voice (3 characters) | ElevenLabs | ~$5.00 |
| Music | Suno | ~$3.00 |
| Total | | ~$62.50 |
Current Grok-only pipeline: ~$30 for the same film, with significantly lower usable output. Delta: ~$32 for a film that has character consistency and feels like a coherent production.
Key code changes
- Add a `generate_reference_still` step (before video generation).
- Add `select_model_for_shot(shot)`, which routes by shot type (character-heavy → Kling; dialogue → Veo; physics → Hailuo; B-roll → Grok), as sketched below.
- Add a face-similarity check post-generation (DeepFace / InsightFace, CPU-OK).
- Add a `color_grade` step in post-production that applies a canonical LUT.
- Store character bibles and location lookbooks as persistent artifacts; don't regenerate per shot.
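A sketch of that routing; the shot-type labels are whatever the Director stage emits, and the model identifiers are placeholders for the actual provider adapters.

```python
# Shot-type → model routing, mirroring the recommendation above.
ROUTING = {
    "character": "kling-2.5",          # character-heavy shots
    "dialogue": "veo-3.1-fast",        # talking heads with native audio
    "physics": "hailuo-02",            # physics-heavy action
    "broll": "grok-imagine-video",     # B-roll and unnamed-character scenes
}

def select_model_for_shot(shot: dict) -> str:
    """Route by the shot's 'type' field (set by the Director stage); default to B-roll."""
    return ROUTING.get(shot.get("type"), ROUTING["broll"])
```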
12. Sources
The full source list with grouping by topic lives at docs/research/sources.md. Key references: