HumANDiff: Articulated Noise Diffusion
for Motion-Consistent Human Video Generation


Tao Hu, Varun Jampani
Stability AI

TL;DR: HumANDiff is a model-agnostic framework that generates temporally consistent human videos from a single image by modeling noise directly on the 3D human body, enabling intrinsic motion control. Our method produces stable motion and preserves fine-grained details (e.g., clothing and identity), outperforming prior approaches while requiring no additional motion modules.

Abstract

Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motion faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances human motion control with three key designs: 1) Articulated motion-consistent noise sampling, which correlates the spatiotemporal distribution of latent noise by replacing unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning, which extends the standard training objective of video diffusion models by jointly predicting pixel appearances and the corresponding physical motions from the articulated noise. This enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning, which enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable, controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design and requires no modifications to the model architecture. During inference, HumANDiff performs image-to-video generation within a single framework, achieving intrinsic motion control without additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles.

Method Overview

HumANDiff is a model-agnostic framework that achieves high-fidelity, motion-consistent human video generation through three synergistic designs.



Core Technical Designs

  • Articulated Noise Sampling: Correlates the spatiotemporal distribution of latent noise by sampling it on the dense surface manifold of a human body template (e.g., SMPL); a minimal sampling sketch follows this list.
  • Joint Appearance-Motion Learning (JAML): Enhances training by jointly predicting pixel appearances and physical motions via a Motion Decoder; see the training-step sketch below.
  • Geometric Motion Consistency Learning (GMCL): Enforces physical motion consistency across frames using a novel geometric loss defined in the articulated noise space; see the loss sketch below.
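As a concrete illustration of the first design, below is a minimal sketch of how articulated noise could be sampled: draw one Gaussian sample per vertex of the body template, then splat those samples into each frame's latent grid through the posed, projected vertices. The function names, camera model, and splatting details are our assumptions for illustration, not the paper's implementation.

  # Minimal articulated-noise sketch (assumptions, not the authors' code):
  # SMPL-style posed vertices per frame, a pinhole camera, and a latent grid
  # of size (C, H, W).
  import torch

  def articulated_noise(verts_per_frame, cam, C, H, W, generator=None):
      """verts_per_frame: list of (V, 3) posed vertex tensors, one per frame.
      cam: (3, 3) intrinsics projecting camera-space points to pixel coords.
      Returns (T, C, H, W) noise correlated through the body surface."""
      V = verts_per_frame[0].shape[0]
      # One Gaussian sample per template vertex, shared across frames, so the
      # same body point carries the same noise as it moves.
      vert_noise = torch.randn(V, C, generator=generator)
      frames = []
      for verts in verts_per_frame:
          uvz = (cam @ verts.T).T                       # project to pixels
          uv = (uvz[:, :2] / uvz[:, 2:3]).round().long()
          inb = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
          # Fresh i.i.d. noise for the background ...
          frame = torch.randn(C, H, W, generator=generator)
          # ... and the shared surface noise wherever the body projects.
          # (A real implementation would z-buffer to keep front-facing
          # vertices only and renormalize overlapping splats to unit variance.)
          frame[:, uv[inb, 1], uv[inb, 0]] = vert_noise[inb].T
          frames.append(frame)
      return torch.stack(frames)                        # (T, C, H, W)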
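For the second design, here is a hedged sketch of what a joint appearance-motion training step could look like: the backbone keeps its usual denoising loss, while a motion decoder reads the same backbone features and regresses physical motion. The model interface, motion_head, and the motion-target encoding are hypothetical, not the paper's API.

  import torch.nn.functional as F

  def jaml_step(model, motion_head, scheduler, latents, artic_noise,
                timesteps, motion_targets):
      """latents / artic_noise: (B, T, C, H, W), with artic_noise coming from
      the articulated sampler above; motion_targets: (B, T, 3, H, W) per-pixel
      displacements of visible body points, rasterized from the posed template."""
      # Forward diffusion driven by articulated (not i.i.d.) noise.
      noisy = scheduler.add_noise(latents, artic_noise, timesteps)
      features, noise_pred = model(noisy, timesteps, return_features=True)
      loss_app = F.mse_loss(noise_pred, artic_noise)     # appearance branch
      motion_pred = motion_head(features)                # motion branch
      loss_motion = F.mse_loss(motion_pred, motion_targets)
      return loss_app + loss_motion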
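For the third design, one plausible form of the geometric motion consistency loss (the paper's exact definition may differ): warp each frame's prediction to the next frame through the body-surface correspondence and penalize disagreement on pixels the body covers in both frames.

  import torch.nn.functional as F

  def gmcl_loss(pred, corr_grid, valid):
      """pred: (B, T, C, H, W) per-frame predictions in the articulated noise
      space. corr_grid: (B, T-1, H, W, 2) normalized coords mapping each frame
      t+1 pixel back to the same body point in frame t (from the posed
      template). valid: (B, T-1, 1, H, W) mask of pixels covered in both."""
      B, T = pred.shape[:2]
      loss = 0.0
      for t in range(T - 1):
          # Nearest-neighbor warping keeps the noise statistics intact
          # (bilinear resampling would smooth the noise and shrink variance).
          warped = F.grid_sample(pred[:, t], corr_grid[:, t],
                                 mode="nearest", align_corners=False)
          diff = (pred[:, t + 1] - warped) * valid[:, t]
          loss = loss + diff.pow(2).mean()
      return loss / (T - 1)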

Key Advantages

  • Intrinsic Motion Control: Enables controllable human animation within a single framework, without additional motion modules or adapters.
  • Model-Agnostic & Efficient: Requires no modifications to the base model architecture (e.g., CogVideoX, Wan) and matches the base model's memory footprint and runtime; see the inference sketch after this list.
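Because only the initial noise changes, articulated noise can in principle be dropped into an off-the-shelf video diffusion pipeline as its initial latents. The sketch below is illustrative only: the checkpoint, latent layout, and latents keyword follow common diffusers conventions and are our assumptions, not HumANDiff's released interface.

  import torch
  from diffusers import CogVideoXImageToVideoPipeline
  from diffusers.utils import load_image

  pipe = CogVideoXImageToVideoPipeline.from_pretrained(
      "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
  ).to("cuda")

  image = load_image("person.png")  # hypothetical reference frame

  # 13 latent frames cover 49 output frames under CogVideoX's 4x temporal
  # compression; 16 latent channels at 480x720 with 8x spatial downsampling.
  # verts_per_frame: 13 posed body meshes sampled at the latent frame rate.
  latents = articulated_noise(verts_per_frame, cam, C=16, H=60, W=90)
  latents = latents.unsqueeze(0).to("cuda", dtype=torch.bfloat16)

  video = pipe(image=image, prompt="a person walking", num_frames=49,
               latents=latents).frames[0]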

Comparisons

Unlike prior methods that rely on additional motion adapters, HumANDiff achieves comparable or better results within a unified framework.

Insight

Replacing 2D optical flow with articulated 3D noise warping leads to significantly more consistent human motion generation, even without additional training.
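As a sketch of the swap this insight describes, the same nearest-neighbor noise warp can be driven either by estimated 2D optical flow or by the exact screen-space correspondence field rendered from the posed body; only the correspondence field changes. The function below is our illustration, not the paper's code.

  import torch

  def warp_noise(noise, corr_xy):
      """noise: (C, H, W); corr_xy: (H, W, 2) source-frame pixel coordinates
      for every target pixel. Nearest-neighbor gather, so the warped result
      keeps unit-variance Gaussian statistics (no bilinear smoothing)."""
      C, H, W = noise.shape
      x = corr_xy[..., 0].round().clamp(0, W - 1).long()
      y = corr_xy[..., 1].round().clamp(0, H - 1).long()
      return noise[:, y, x]                              # (C, H, W)

  # Only corr_xy differs between the two regimes:
  #   from 2D flow:  pixel_grid + estimated_flow  (accumulates drift; breaks
  #                  at occlusions and disocclusions)
  #   from 3D body:  previous-frame screen positions of the body points that
  #                  are visible in the current frame (exact on the surface)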

Generalization

Our method, trained on a limited set of fashion videos, generalizes well to unseen out-of-domain data (e.g., DeepFashion) and synthesizes human videos with novel poses from a single image, across diverse clothing styles and in-the-wild backgrounds.



BibTeX

If you find this project useful for your research, please cite the following:
  @misc{hu2026humandiff,
      title={HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation}, 
      author={Tao Hu and Varun Jampani},
      year={2026},
      eprint={2604.05961},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }