We present the first method for training video world models that know when they don't know, enabling continuous-scale calibrated controllable video synthesis. Using proper scoring rules, our method generates dense confidence predictions at the subpatch (channel) level that are physically interpretable and aligned with observations.
Recent advances in generative video models have led to significant breakthroughs in high-fidelity video
synthesis,
specifically in controllable video generation where the generated video is conditioned on text and
action inputs.
This impressive leap in performance has paved the way for broad applications from instruction-guided
video editing
to world modeling in robotics.
Despite these exceptional capabilities, controllable video models often
hallucinate—generating future video frames that are misaligned with physical
reality—which raises
serious
concerns
in many tasks such as robot policy evaluation and planning. Moreover, state-of-the-art video models lack the ability to assess and express their confidence, further impeding hallucination mitigation.
To rigorously address this challenge, we propose an uncertainty quantification (UQ) method that trains continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch (channel) level, precisely localizing the uncertainty in each generated video frame. The effectiveness of our UQ method is underpinned by three core innovations:
First, our method introduces a novel framework that trains video models for correctness
and calibration
via strictly proper scoring rules.
Second, we estimate the video model's uncertainty in latent space, avoiding training instability and
prohibitive training
costs associated with pixel-space approaches.
Third, we map the dense latent-space uncertainty to interpretable pixel-level
uncertainty in the RGB space for
intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy
regions.
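To make the training objective concrete, below is a minimal PyTorch sketch of the first two innovations: per-channel confidences over the latent tokens are scored with the Brier score, a strictly proper scoring rule, against a binary correctness target derived by comparing the generated and ground-truth latents. The error threshold, tensor shapes, and correctness definition are illustrative assumptions rather than the exact formulation used by our method.

```python
import torch

def brier_loss(conf: torch.Tensor, gen_latents: torch.Tensor,
               gt_latents: torch.Tensor, err_thresh: float = 0.1) -> torch.Tensor:
    """Brier score (a strictly proper scoring rule) against a binary correctness target.

    conf:        predicted confidences in [0, 1], same shape as the latents,
                 i.e. one value per latent channel ("subpatch").
    gen_latents: latents of the generated video.
    gt_latents:  latents of the ground-truth video.
    A channel counts as 'correct' when the generated value lies within
    `err_thresh` of the ground truth; the threshold is an illustrative choice.
    """
    correct = (torch.abs(gen_latents - gt_latents) < err_thresh).float()
    return ((conf - correct) ** 2).mean()

# Toy usage with random tensors shaped (batch, frames, channels, height, width)
gt = torch.randn(2, 8, 16, 32, 32)
gen = gt + 0.05 * torch.randn_like(gt)            # generated latents deviate slightly
conf = torch.rand(2, 8, 16, 32, 32, requires_grad=True)
loss = brier_loss(conf, gen, gt)
loss.backward()   # gradients flow back to the confidence predictions (a toy leaf tensor here)
```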
Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world
evaluations,
we demonstrate that our method not only provides calibrated uncertainty estimates within the training
distribution,
but also enables effective out-of-distribution detection.
Our method enables simultaneous video generation and uncertainty quantification,
expressing dense estimates of the model's confidence in the accuracy of each subpatch
in the generated video.
The resulting uncertainty estimates localize potentially untrustworthy regions of the generated video.
We design a transformer-based uncertainty quantification probe
integrated within the video generation pipeline
to estimate the video model’s
confidence
directly in latent space.
This design choice provides a highly expressive, flexible framework for training video models efficiently.
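As an illustration of what such a probe could look like, the sketch below uses a small transformer encoder over flattened latent tokens and outputs a confidence value per latent channel. The token layout, layer sizes, and output head are assumptions made for this example and do not reflect the exact probe architecture.

```python
import torch
import torch.nn as nn

class LatentUQProbe(nn.Module):
    """Minimal transformer probe over latent tokens (sizes are illustrative)."""
    def __init__(self, token_dim: int = 16, d_model: int = 384,
                 n_layers: int = 4, n_heads: int = 6):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # one confidence logit per latent channel of each token
        self.head = nn.Linear(d_model, token_dim)

    def forward(self, latent_tokens: torch.Tensor) -> torch.Tensor:
        # latent_tokens: (B, N, token_dim) flattened spatiotemporal patches
        h = self.encoder(self.embed(latent_tokens))
        return torch.sigmoid(self.head(h))   # (B, N, token_dim) confidences in [0, 1]

# Usage sketch: tokens taken from the video model's latent space
probe = LatentUQProbe()
tokens = torch.randn(2, 1024, 16)             # (batch, num_patches, channels)
confidence = probe(tokens)                    # dense, subpatch-level confidence
```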
To demonstrate the generality of our approach across different model architectures,
we train (i) fixed-scale classification, (ii) multi-class classification,
and (iii) continuous-scale classification models using proper scoring rules and empirically show
calibration
across different test scenarios.
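One standard way to check calibration empirically is the expected calibration error (ECE), which bins predicted confidences and compares each bin's average confidence to its empirical accuracy. The sketch below is a generic NumPy implementation with an assumed bin count and binary correctness labels; it is not the exact evaluation protocol used in our experiments.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE: |mean confidence - empirical accuracy| per bin,
    weighted by bin occupancy. `conf` in [0, 1], `correct` in {0, 1}, both flat."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if mask.sum() == 0:
            continue
        gap = abs(conf[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / conf.size) * gap
    return float(ece)

# Example: probe confidences vs. binary correctness of each subpatch (toy data)
ece = expected_calibration_error(np.random.rand(10000),
                                 np.random.randint(0, 2, 10000))
```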
Our method provides dense interpretable visualizations
of the video model's confidence using a confidence heatmap, which
transforms the model's confidence predictions to the RGB color space using a color
map. The red regions of the heatmap highlight areas of high uncertainty,
corresponding to locations where the model is unsure if the
generated video matches the ground-truth video.
In contrast, the blue and green regions represent areas where the model is highly confident about the accuracy (blue) or inaccuracy (green) of the generated video.
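For illustration, the sketch below shows one simple way to turn a latent-resolution confidence map into such an overlay: nearest-neighbor upsampling to pixel resolution followed by a two-color encoding (red for uncertainty, blue for confidence) composited over the generated frame. This simplified coloring and the blending weight are assumptions and do not reproduce the exact red/blue/green scheme described above.

```python
import numpy as np

def confidence_heatmap(conf_latent: np.ndarray, frame: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Colorize a latent-resolution confidence map and overlay it on a frame.

    conf_latent: (h, w) confidences in [0, 1]; frame: (H, W, 3) uint8 RGB,
    with H, W integer multiples of h, w for this simple nearest upsample.
    """
    h, w = conf_latent.shape
    H, W, _ = frame.shape
    up = conf_latent.repeat(H // h, axis=0).repeat(W // w, axis=1)   # (H, W)
    # simple two-color encoding: red for low confidence, blue for high confidence
    heat = np.zeros_like(frame, dtype=np.float64)
    heat[..., 0] = (1.0 - up) * 255.0   # red channel encodes uncertainty
    heat[..., 2] = up * 255.0           # blue channel encodes confidence
    return (alpha * heat + (1.0 - alpha) * frame).astype(np.uint8)
```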
We conduct experiments on the Bridge and DROID datasets, examining the calibration and interpretability of our method across a broad variety of tasks; our main findings are summarized below.
Our method produces interpretable uncertainty estimates that are well-calibrated.
Here, we provide visualizations of the dense confidence heatmaps computed by our method, qualitatively showing alignment between the estimated uncertainty and the accuracy of the generated videos. Notably, as the generated video deviates further from the ground truth, the video model's uncertainty also increases, indicated by the growing intensity of the red regions in the heatmaps over time.
Our method produces interpretable, calibrated uncertainty estimates at a fine-grained level, capturing low-confidence regions of the video that contain hallucinations, such as objects appearing and disappearing, distortions, physically inconsistent interaction dynamics, and occlusions.
For example, our method identifies areas with inaccurate, blurry backgrounds, rigid-object deformation or elongation, and uncertainty due to unobservable properties such as mass and friction.
Here, we visualize the ground-truth, generated, composited uncertainty, and
uncertainty heatmap
videos, highlighting the effectiveness of our method on the Bridge and DROID datasets.
Here, we explore the performance of our method in out-of-distribution (OOD) detection at inference time, noting the importance of calibrated uncertainty estimates for reliable OOD detection.
We consider OOD conditions across five axes: background, lighting, environment clutter, target object (task), and robot end-effector, creating environment settings that differ noticeably from those seen in the Bridge dataset.
Under these conditions, we see that the video model struggles to generate
accurate videos,
with an observable degradation in the video quality over time.
Despite the distribution shift, our method captures the increasing
uncertainty of the video model, both spatially and temporally.
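As an illustrative sketch (not our exact procedure), calibrated confidences can be turned into an OOD detector by aggregating the dense subpatch confidences of a rollout into a single uncertainty score and flagging the rollout when that score exceeds a threshold fit on in-distribution data:

```python
import numpy as np

def ood_score(confidences: np.ndarray) -> float:
    """Aggregate dense subpatch confidences of one generated rollout into a
    single uncertainty score (higher = more likely out-of-distribution)."""
    return float(1.0 - confidences.mean())

def fit_threshold(in_dist_scores: np.ndarray, target_fpr: float = 0.05) -> float:
    """Choose a threshold so that roughly `target_fpr` of in-distribution
    rollouts are (incorrectly) flagged as OOD."""
    return float(np.quantile(in_dist_scores, 1.0 - target_fpr))

# Toy usage: high-confidence in-distribution rollouts vs. a low-confidence test rollout
in_dist_scores = np.array([ood_score(np.random.beta(8, 2, size=5000))
                           for _ in range(200)])
threshold = fit_threshold(in_dist_scores)
test_rollout_conf = np.random.beta(2, 8, size=5000)   # synthetic low-confidence rollout
is_ood = ood_score(test_rollout_conf) > threshold
```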
@misc{mei2025worldmodelsknowdont,
title={World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty},
author={Zhiting Mei and Tenny Yin and Micah Baker and Ola Shorinwa and Anirudha Majumdar},
year={2025},
eprint={2512.05927},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05927},
}