IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

1University of Oxford    2UC Merced    3NEC Labs America    4Atmanity Inc.    5Google DeepMind   

IllumiCraft takes a prompt and an input video and edits the scene illumination, conditioned on a static background image. It supports a variety of lighting setups, including spotlight effects. Below are results generated by IllumiCraft. For each example, the three columns show the input video, the static background image, and the generated video.

[Result gallery: 15 examples, each shown under two different lighting conditions; each row pairs an input video and static background image with the corresponding generated video.]

Abstract

Although diffusion-based models can generate high-quality, high-resolution video sequences from textual or image inputs, they lack explicit integration of geometric cues when controlling scene lighting and visual appearance across frames. To address this limitation, we propose IllumiCraft, an end-to-end diffusion framework that accepts three complementary inputs: (1) high-dynamic-range (HDR) video maps for detailed lighting control; (2) synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) to provide appearance cues; and (3) 3D point tracks that capture precise 3D geometry information. By integrating lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts. It supports both background-conditioned and text-conditioned video relighting and achieves higher fidelity than existing controllable video generation methods.
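
To make the three conditioning signals concrete, the following is a minimal sketch of what a single training clip could carry, written as a Python dataclass. The field names and tensor shapes are our own illustration, not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class IllumiCraftSample:
    """One training clip with the three complementary conditioning signals.

    Shapes follow (T, C, H, W) = (frames, channels, height, width).
    All names and shapes are illustrative assumptions, not the paper's schema.
    """
    video: torch.Tensor                 # (T, 3, H, W) target frames
    hdr_maps: torch.Tensor              # (T, 3, H, W) per-frame HDR lighting maps
    relit_frames: torch.Tensor          # (T, 3, H, W) synthetically relit appearance cues
    point_tracks: torch.Tensor          # (T, N, 3) 3D point tracks, N points per frame
    background: Optional[torch.Tensor]  # (3, H, W) static background reference, if any
    prompt: str                         # illumination-aware text prompt
```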

Method

Data Collection Mechanism of IllumiPipe. For each input video, our proposed IllumiPipe extracts five types of data: (1) HDR maps, (2) a foreground video and a mask video, (3) a relit video, (4) a background video, and (5) a 3D tracking video.
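
A minimal sketch of this per-video extraction loop is shown below. Every helper is a hypothetical stand-in (random or trivial placeholder) for an off-the-shelf model the pipeline could call; none of these function names come from the paper.

```python
import torch

# Hypothetical stand-ins for the models an IllumiPipe-style pipeline would
# call; each returns placeholder data with plausible shapes.
def estimate_hdr_maps(frames):              # per-frame HDR lighting maps
    return torch.rand_like(frames)

def segment_foreground(frames):             # foreground video + binary mask video
    masks = (frames.mean(dim=1, keepdim=True) > 0.5).float()
    return frames * masks, masks

def relight_randomized(foreground):         # relit video with random illumination
    gain = 0.5 + torch.rand(foreground.shape[0], 1, 1, 1)
    return (foreground * gain).clamp(0.0, 1.0)

def inpaint_background(frames, masks):      # background video (foreground removed)
    return frames * (1.0 - masks)

def track_points_3d(frames, n_points=256):  # 3D point tracks per frame
    return torch.rand(frames.shape[0], n_points, 3)

def illumipipe_extract(frames):
    """Extract the five IllumiPipe outputs for one (T, 3, H, W) clip in [0, 1]."""
    hdr_maps = estimate_hdr_maps(frames)            # (1) HDR maps
    fg, masks = segment_foreground(frames)          # (2) foreground + mask videos
    relit = relight_randomized(fg)                  # (3) relit video
    background = inpaint_background(frames, masks)  # (4) background video
    tracks = track_points_3d(frames)                # (5) 3D tracking video
    return hdr_maps, fg, masks, relit, background, tracks

# Example on a dummy 16-frame clip.
outputs = illumipipe_extract(torch.rand(16, 3, 64, 64))
```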

IllumiCraft Framework. The framework takes HDR maps, a relit foreground video, 3D tracking, and an optional background image; it models illumination, appearance, and geometry, and generates videos from an illumination-aware text prompt. The figure shows three illumination tokens; HDR maps, background images, and 3D tracking videos are all optional during training.
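
Since HDR maps, background images, and 3D tracking videos are all optional during training, one plausible implementation is classifier-free-guidance-style condition dropout. Below is a minimal sketch of assembling the conditioning token sequence under that assumption; the token shapes, dropout probability, and function name are ours, not the paper's.

```python
import torch

def build_condition_tokens(relit_tokens, hdr_tokens=None, bg_tokens=None,
                           track_tokens=None, p_drop=0.3):
    """Concatenate illumination, appearance, and geometry tokens for the denoiser.

    Each token tensor is (B, N_i, D). The relit-frame (appearance) tokens are
    always kept; each optional stream is independently dropped with probability
    `p_drop` at training time (our assumption about how optionality is handled).
    """
    streams = [relit_tokens]
    for tokens in (hdr_tokens, bg_tokens, track_tokens):
        if tokens is not None and torch.rand(()).item() > p_drop:
            streams.append(tokens)
    return torch.cat(streams, dim=1)  # (B, sum of kept N_i, D)

# Example: batch of 2 clips with 64-dim tokens.
B, D = 2, 64
cond = build_condition_tokens(
    relit_tokens=torch.randn(B, 32, D),
    hdr_tokens=torch.randn(B, 16, D),
    bg_tokens=torch.randn(B, 4, D),
    track_tokens=torch.randn(B, 8, D),
)
print(cond.shape)  # second dimension varies with which optional streams were kept
```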

Video

Visual Comparison

Visual results under the text-conditioned setting.

Visual results under the background-conditioned setting.

Performance Comparison

Text-conditioned video relighting.

Background-conditioned video relighting. ∗ denotes results evaluated on the first 16 frames.