We present the first method for training video world models that know when they don't know, enabling continuous-scale calibrated controllable video synthesis. Using proper scoring rules, our method generates dense confidence predictions at the subpatch (channel) level that are physically interpretable and aligned with observations.
Recent advances in generative video models have led to significant breakthroughs in high-fidelity video
synthesis,
specifically in controllable video generation where the generated video is conditioned on text and
action inputs.
This impressive leap in performance has paved the way for broad applications from instruction-guided
video editing
to world modeling in robotics.
Despite these exceptional capabilities, controllable video models often
hallucinate—generating future video frames that are misaligned with physical
reality—which raises
serious
concerns
in many tasks such as robot policy evaluation and planning. Moreover, state-of-the-art video models lack the ability to assess and express their confidence, further impeding hallucination mitigation.
To rigorously address this challenge, we propose an uncertainty quantification (UQ) method that trains continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch (channel) level, precisely localizing the uncertainty in each generated video frame. The effectiveness of our UQ method is underpinned by three core innovations:
First, our method introduces a novel framework that trains video models for correctness
and calibration
via strictly proper scoring rules.
Second, we estimate the video model's uncertainty in latent space, avoiding training instability and
prohibitive training
costs associated with pixel-space approaches.
Third, we map the dense latent-space uncertainty to interpretable pixel-level
uncertainty in the RGB space for
intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy
regions.
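To make the training objective concrete, below is a minimal PyTorch sketch of the first two innovations: per-channel confidences over the latent tokens are scored with the Brier score, a strictly proper scoring rule, against a binary correctness target derived by comparing the generated and ground-truth latents. The error threshold, tensor shapes, and correctness definition are illustrative assumptions rather than the exact formulation used by our method.

```python
import torch

def brier_loss(conf: torch.Tensor, gen_latents: torch.Tensor,
               gt_latents: torch.Tensor, err_thresh: float = 0.1) -> torch.Tensor:
    """Brier score (a strictly proper scoring rule) against a binary correctness target.

    conf:        predicted confidences in [0, 1], same shape as the latents,
                 i.e. one value per latent channel ("subpatch").
    gen_latents: latents of the generated video.
    gt_latents:  latents of the ground-truth video.
    A channel counts as 'correct' when the generated value lies within
    `err_thresh` of the ground truth; the threshold is an illustrative choice.
    """
    correct = (torch.abs(gen_latents - gt_latents) < err_thresh).float()
    return ((conf - correct) ** 2).mean()

# Toy usage with random tensors shaped (batch, frames, channels, height, width)
gt = torch.randn(2, 8, 16, 32, 32)
gen = gt + 0.05 * torch.randn_like(gt)            # generated latents deviate slightly
conf = torch.rand(2, 8, 16, 32, 32, requires_grad=True)
loss = brier_loss(conf, gen, gt)
loss.backward()   # gradients flow back to the confidence predictions (a toy leaf tensor here)
```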
Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world
evaluations,
we demonstrate that our method not only provides calibrated uncertainty estimates within the training
distribution,
but also enables effective out-of-distribution detection.
Our method enables simultaneous video generation and uncertainty quantification,
expressing dense estimates of the model's confidence in the accuracy of each subpatch
in the generated video.
The resulting uncertainty estimates localize potentially untrustworthy regions of the generated video.
We design a transformer-based uncertainty quantification probe
integrated within the video generation pipeline
to estimate the video model’s
confidence
directly in latent space.
This design choice provides a highly expressive, flexible framework for training video models efficiently.
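As an illustration of what such a probe could look like, the sketch below uses a small transformer encoder over flattened latent tokens and outputs a confidence value per latent channel. The token layout, layer sizes, and output head are assumptions made for this example and do not reflect the exact probe architecture.

```python
import torch
import torch.nn as nn

class LatentUQProbe(nn.Module):
    """Minimal transformer probe over latent tokens (sizes are illustrative)."""
    def __init__(self, token_dim: int = 16, d_model: int = 384,
                 n_layers: int = 4, n_heads: int = 6):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # one confidence logit per latent channel of each token
        self.head = nn.Linear(d_model, token_dim)

    def forward(self, latent_tokens: torch.Tensor) -> torch.Tensor:
        # latent_tokens: (B, N, token_dim) flattened spatiotemporal patches
        h = self.encoder(self.embed(latent_tokens))
        return torch.sigmoid(self.head(h))   # (B, N, token_dim) confidences in [0, 1]

# Usage sketch: tokens taken from the video model's latent space
probe = LatentUQProbe()
tokens = torch.randn(2, 1024, 16)             # (batch, num_patches, channels)
confidence = probe(tokens)                    # dense, subpatch-level confidence
```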
To demonstrate the generality of our approach across different model architectures,
we train (i) fixed-scale classification, (ii) multi-class classification,
and (iii) continuous-scale classification models using proper scoring rules and empirically show
calibration
across different test scenarios.
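One standard way to check calibration empirically is the expected calibration error (ECE), which bins predicted confidences and compares each bin's average confidence to its empirical accuracy. The sketch below is a generic NumPy implementation with an assumed bin count and binary correctness labels; it is not the exact evaluation protocol used in our experiments.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE: |mean confidence - empirical accuracy| per bin,
    weighted by bin occupancy. `conf` in [0, 1], `correct` in {0, 1}, both flat."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if mask.sum() == 0:
            continue
        gap = abs(conf[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / conf.size) * gap
    return float(ece)

# Example: probe confidences vs. binary correctness of each subpatch (toy data)
ece = expected_calibration_error(np.random.rand(10000),
                                 np.random.randint(0, 2, 10000))
```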
Our method provides dense interpretable visualizations
of the video model's confidence using a confidence heatmap, which
transforms the model's confidence predictions to the RGB color space using a color
map. The red regions of the heatmap highlight areas of high uncertainty,
corresponding to locations where the model is unsure if the
generated video matches the ground-truth video.
In contrast, the blue and green regions represent areas where the model is highly confident about the accuracy (blue) or inaccuracy (green) of the generated video.
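For illustration, the sketch below shows one simple way to turn a latent-resolution confidence map into such an overlay: nearest-neighbor upsampling to pixel resolution followed by a two-color encoding (red for uncertainty, blue for confidence) composited over the generated frame. This simplified coloring and the blending weight are assumptions and do not reproduce the exact red/blue/green scheme described above.

```python
import numpy as np

def confidence_heatmap(conf_latent: np.ndarray, frame: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Colorize a latent-resolution confidence map and overlay it on a frame.

    conf_latent: (h, w) confidences in [0, 1]; frame: (H, W, 3) uint8 RGB,
    with H, W integer multiples of h, w for this simple nearest upsample.
    """
    h, w = conf_latent.shape
    H, W, _ = frame.shape
    up = conf_latent.repeat(H // h, axis=0).repeat(W // w, axis=1)   # (H, W)
    # simple two-color encoding: red for low confidence, blue for high confidence
    heat = np.zeros_like(frame, dtype=np.float64)
    heat[..., 0] = (1.0 - up) * 255.0   # red channel encodes uncertainty
    heat[..., 2] = up * 255.0           # blue channel encodes confidence
    return (alpha * heat + (1.0 - alpha) * frame).astype(np.uint8)
```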
We conduct experiments on the Bridge and DROID datasets, examining the calibration and interpretability of our method across a broad variety of tasks; our main findings are summarized below.
Our method produces interpretable uncertainty estimates that are well-calibrated.
Here, we provide visualizations of the dense confidence heatmaps computed by our method, qualitatively showing alignment between the estimated uncertainty and the accuracy of the generated videos. Notably, as the generated video deviates further from the ground truth, the video model's uncertainty also increases, indicated by the growing intensity of the red regions in the heatmaps over time.
Our method produces interpretable, calibrated uncertainty estimates at a fine-grained level, capturing low-confidence regions of the video that contain hallucinations, such as objects appearing and disappearing, distortions, physically inconsistent interaction dynamics, and occlusions.
For example, our method identifies areas with inaccurate, blurry backgrounds, rigid-object deformation or elongation, and uncertainty due to unobservable properties such as mass and friction.
Here, we visualize the ground-truth, generated, composited uncertainty, and
uncertainty heatmap
videos, highlighting the effectiveness of our method on the Bridge and DROID datasets.
Here, we explore the performance of our method in out-of-distribution (OOD) detection at inference time, noting the importance of calibrated uncertainty estimates for reliable OOD detection.
We consider OOD conditions across five axes: background, lighting, environment clutter, target object (task), and robot end-effector, creating environment settings that differ noticeably from those seen in the Bridge dataset.
Under these conditions, we see that the video model struggles to generate
accurate videos,
with an observable degradation in the video quality over time.
Despite the distribution shift, our method captures the increasing
uncertainty of the video model, both spatially and temporally.
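As an illustrative sketch (not our exact procedure), calibrated confidences can be turned into an OOD detector by aggregating the dense subpatch confidences of a rollout into a single uncertainty score and flagging the rollout when that score exceeds a threshold fit on in-distribution data:

```python
import numpy as np

def ood_score(confidences: np.ndarray) -> float:
    """Aggregate dense subpatch confidences of one generated rollout into a
    single uncertainty score (higher = more likely out-of-distribution)."""
    return float(1.0 - confidences.mean())

def fit_threshold(in_dist_scores: np.ndarray, target_fpr: float = 0.05) -> float:
    """Choose a threshold so that roughly `target_fpr` of in-distribution
    rollouts are (incorrectly) flagged as OOD."""
    return float(np.quantile(in_dist_scores, 1.0 - target_fpr))

# Toy usage: high-confidence in-distribution rollouts vs. a low-confidence test rollout
in_dist_scores = np.array([ood_score(np.random.beta(8, 2, size=5000))
                           for _ in range(200)])
threshold = fit_threshold(in_dist_scores)
test_rollout_conf = np.random.beta(2, 8, size=5000)   # synthetic low-confidence rollout
is_ood = ood_score(test_rollout_conf) > threshold
```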
@misc{mei2025worldmodelsknowdont,
title={World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty},
author={Zhiting Mei and Tenny Yin and Micah Baker and Ola Shorinwa and Anirudha Majumdar},
year={2025},
eprint={2512.05927},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05927},
}