FashionEngine: Interactive 3D Human Generation and Editing via Multimodal Controls

Technical Report 2024

Tao Hu,    Fangzhou Hong,       Zhaoxi Chen,    Ziwei Liu*   
S-Lab, Nanyang Technological University      *corresponding author    

TL;DR: FashionEngine is the first work to build an interactive 3D human generation and editing system with multimodal controls (e.g., texts, images, and hand-drawn sketches) in a unified framework.

With FashionEngine, artist-designers can generate a view-consistent clothed human in three different ways (a): ① texts describing the human clothing; ② hand-drawn sketches on a warped human body template describing the clothing shape, such as the neckline shape, sleeve length, and the length of the lower garment; and ③ random appearance sampling. (b) Users can also edit the generated human interactively with multimodal controls (e.g., texts, reference images, and sketches). (c) Users can adjust the pose and shape of the edited humans and check the renderings from different camera viewpoints before exporting image or video assets. FashionEngine runs at ~9 FPS on an NVIDIA V100.

Abstract

We present FashionEngine, an interactive 3D human generation and editing system that creates 3D digital humans via user-friendly multimodal controls such as natural language, reference images, and hand-drawn sketches. FashionEngine automates 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, providing strong priors for diverse generation and editing tasks. 2) A Multimodality-UV Space that encodes the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, faithfully aligning the user's multimodal inputs with the implicit UV latent space for controllable 3D human editing. The Multimodality-UV Space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) A Multimodality-UV Aligned Sampler that learns to sample high-quality and diverse 3D humans from the diffusion prior. Extensive experiments validate FashionEngine's state-of-the-art performance on conditional generation and editing tasks. In addition, we present an interactive user interface for FashionEngine that enables both conditional and unconditional generation tasks, as well as editing tasks including pose/view/shape control, text-, image-, and sketch-driven 3D human editing, and 3D virtual try-on, in a unified framework.

Full Video Live Demo


*Check the video chapters for easier understanding.



Method Overview

1. 3D Human Prior Learning

We utilize a learned 3D human prior Z from the two-stage StructLDM approach, which consists of a structured autodecoder and a latent diffusion model, both learned in a semantic UV latent space.
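
Below is a minimal sketch of how sampling from such a two-stage prior could look; the module and method names (e.g., denoise_step, decode) are assumptions for illustration, not the released StructLDM API.

  # Hypothetical sketch: sampling a 3D human from a two-stage UV latent prior.
  # Module/method names are assumptions for illustration only.
  import torch

  def sample_human_from_prior(diffusion, autodecoder, renderer, camera, steps=50):
      # Start from Gaussian noise laid out in the semantic UV latent space,
      # e.g. a (1, C, H_uv, W_uv) latent map.
      z = torch.randn(1, 16, 128, 128)
      # Denoise with the latent diffusion prior Z.
      for t in reversed(range(steps)):
          z = diffusion.denoise_step(z, t)
      # Decode the UV latent into a renderable human with the structured
      # autodecoder, then render it from the requested camera.
      human = autodecoder.decode(z)
      return renderer.render(human, camera)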

2. Multimodality-UV Space

With the learned prior Z, we construct a Multimodality-UV Space for controllable 3D generation/editing, including an Appearance-Canonical Space, an Appearance-UV Space, a textual Semantics-UV Space, and a geometric Shape-UV Space.
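
As a rough illustration, the four aligned maps can be thought of as one UV-indexed record; the field names below are assumptions for exposition, not the paper's data structures.

  # Illustrative layout of the Multimodality-UV Space (field names are
  # assumptions, not the paper's implementation).
  from dataclasses import dataclass
  import torch

  @dataclass
  class MultimodalityUVSpace:
      appearance_canonical: torch.Tensor  # appearance in the canonical body space
      appearance_uv: torch.Tensor         # appearance unwrapped onto the UV map
      semantics_uv: torch.Tensor          # per-texel clothing semantics (text labels)
      shape_uv: torch.Tensor              # per-texel clothing shape/topology

      def part_mask(self, part_id: int) -> torch.Tensor:
          # Texels belonging to one clothing part; because all maps share the
          # same UV layout, this mask localizes edits across modalities.
          return self.semantics_uv == part_id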

3. Controllable Multimodal Generation

Taking texts or sketches in UV space as input, we present Text-UV Aligned Samplers and Sketch-UV Aligned Samplers, respectively, to sample latents from the learned human prior Z; the sampled latents are rendered into images by latent diffusion and rendering (Diff-Render). At the core of these aligned samplers are the proposed TextMatch and ShapeMatch modules for text- or sketch-aligned sampling.
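
One way a TextMatch-style selection could be realized is by scoring candidate samples from the prior against the text prompt and keeping the best match; this is a hedged sketch with assumed function names, not the actual TextMatch/ShapeMatch implementation.

  # Hedged sketch of text-aligned sampling by candidate matching (assumed API;
  # the actual TextMatch/ShapeMatch modules may work differently).
  import torch

  def text_aligned_sample(diffusion, diff_render, text_encoder, image_encoder,
                          prompt, num_candidates=8):
      text_feat = text_encoder(prompt)          # text embedding, shape (D,)
      best_z, best_score = None, float("-inf")
      for _ in range(num_candidates):
          z = diffusion.sample()                # candidate UV latent from prior Z
          image = diff_render(z)                # latent diffusion + rendering
          score = torch.cosine_similarity(image_encoder(image), text_feat, dim=-1)
          if score > best_score:
              best_z, best_score = z, score
      return best_z                             # latent best aligned with the text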

4. Controllable Multimodal Editing

Users can edit a generated human by typing texts, drawing sketches, or providing a reference image for style transfer. The multimodal editing inputs are faithfully transferred into UV space to edit the UV latent for 3D editing.
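
A minimal sketch of such a latent-space edit, assuming the user input has already been mapped to a UV mask and an edited latent (names and tensor layout are illustrative assumptions):

  # Illustrative mask-based edit of the UV latent (assumed tensor layout).
  import torch

  def edit_uv_latent(z_orig, z_edit, uv_mask):
      # z_orig, z_edit: (C, H_uv, W_uv) latents; uv_mask: (1, H_uv, W_uv) in {0, 1}.
      # The edit replaces the latent only inside the masked clothing region,
      # so the rest of the 3D human is preserved.
      return uv_mask * z_edit + (1.0 - uv_mask) * z_orig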

Sketch-based Editing Live Demo


Outline: 1. Sketch-based editing 2. Sketch-based search (Matching Style) 3. Image-based Editing 4. Shape/View Ctrl, Animation

Text-based Editing Live Demo


Outline: 1. Text-based generation/editing 2. Text-based search (Matching Style) 3. Image-based Editing

Acknowledgments

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

BibTeX

  
  @misc{hu2024fashionengine,
      title={FashionEngine: Interactive 3D Human Generation and Editing via Multimodal Controls}, 
      author={Tao Hu and Fangzhou Hong and Zhaoxi Chen and Ziwei Liu},
      year={2024},
      eprint={2404.01655},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }
  @misc{hu2024structldm,
      title={StructLDM: Structured Latent Diffusion for 3D Human Generation}, 
      author={Tao Hu and Fangzhou Hong and Ziwei Liu},
      year={2024},
      eprint={2404.01241},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }