1School of Computer Science and Technology, Beijing Institute of Technology
2State Key Laboratory of General Artificial Intelligence, BIGAI
3Deep Glint 4Zhejiang University 5The University of Hong Kong
(1) This paper focuses on recovering 3D human motions with absolute world positions from monocular inputs.💃
(2) The key idea is to use a diffusion-based multi-view lifting framework that leverages homologous 2D data pre-training to improve 3D motion capture quality.💪
(3) To recover absolute positions in world coordinates, we propose a new representation that decouples local pose from global movement and encodes ground-plane geometric priors to accelerate convergence.🎉
(a) Traditional framework for direct 3D motion regression. (b) Mocap-2-to-3: our multi-view lifting framework from monocular input, which leverages 2D pretraining to enhance 3D motion capture. (c) The model outputs global motions with absolute positions from monocular 2D pose input while maintaining OOD generalization capability. (d) The model also supports COCO-format keypoint estimation.
Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, which constrains out-of-distribution generalization. Second, estimating metric-scale poses from monocular input is inherently difficult.
To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions.
To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data.
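As a minimal sketch of this weight-transfer strategy (assuming a PyTorch implementation; the module definitions below are simplified placeholders, not the released architecture), the single-view checkpoint initializes the shared layers of the multi-view model while multi-view-only layers keep their random initialization:

import torch
import torch.nn as nn

# Hypothetical, simplified stand-ins for the two denoisers; the real
# architectures and module names are not specified on this page.
class SingleViewDenoiser(nn.Module):
    def __init__(self, d_in=34, d_hidden=256):  # 34 = 17 COCO joints x (u, v), an illustrative choice
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in))

class MultiViewDenoiser(nn.Module):
    def __init__(self, d_in=34, d_hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in))
        # Cross-view mixing layer with no single-view counterpart.
        self.cross_view = nn.MultiheadAttention(embed_dim=d_in, num_heads=2, batch_first=True)

single_view = SingleViewDenoiser()   # pretrained on large 2D motion datasets
multi_view = MultiViewDenoiser()     # fine-tuned afterwards on public 3D data
# Transfer the shared backbone weights; strict=False leaves the
# multi-view-only layers (cross_view) at their random initialization.
missing, unexpected = multi_view.load_state_dict(single_view.state_dict(), strict=False)

The fine-tuning stage would then condition this multi-view model on the observed view and pointmaps, as described in the pipeline below.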
Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference.
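A minimal sketch of such a decoupled representation, assuming the root (pelvis) is joint 0, the up-axis is z, and the ground plane sits at z = 0; the paper's exact parameterization may differ:

import numpy as np

def decouple_motion(joints_world):
    """Split world-space joints (T, J, 3) into local pose and global movement.

    Assumes joint 0 is the root, z is up, and the ground plane is z = 0.
    """
    root = joints_world[:, 0]                      # root trajectory, (T, 3)
    local_pose = joints_world - root[:, None]      # root-relative joints, (T, J, 3)
    # Global movement: per-frame planar displacement plus height above the ground.
    delta_xy = np.diff(root[:, :2], axis=0, prepend=root[:1, :2])
    height = root[:, 2:3]
    global_movement = np.concatenate([delta_xy, height], axis=-1)   # (T, 3)
    return local_pose, global_movement

Predicting local pose and global movement separately lets the absolute trajectory be accumulated progressively at inference, as noted above.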
Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability.
Our code will be made publicly available.
Pipeline. During training: (a) We first train a single-view 2D Motion Diffusion Model for an arbitrary view. (b) Its weights then initialize a Multi-view Diffusion Model, conditioned on 2D pose sequences from $V_0$ and pointmaps. During inference, the Multi-view Model generates motions for the other views. (c) We compute local poses and global movement to recover global coordinates $(u,v)$ for each view. (d) Multi-view triangulation then synthesizes 3D absolute poses, (e) resulting in full-body global human motion.
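For step (d), a standard linear (DLT) triangulation of one joint from the generated views might look as follows; this is an illustrative sketch, and the paper's exact solver and view weighting may differ:

import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """Linear (DLT) triangulation of one joint from V views.

    points_2d: (V, 2) pixel coordinates of the same joint in each view.
    proj_mats: (V, 3, 4) camera projection matrices.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                    # (2V, 4) homogeneous system
    _, _, vh = np.linalg.svd(A)
    X = vh[-1]                            # null-space direction
    return X[:3] / X[3]

Running this per joint and per frame over the synthesized views yields the 3D absolute poses of step (d), which are assembled into the full-body global motion of step (e).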
@article{wang2025mocap,
title={Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining},
author={Wang, Zhumei and Hu, Zechen and Guo, Ruoxi and Pi, Huaijin and Feng, Ziyong and Peng, Sida and Zhou, Xiaowei and Pei, Mingtao and Huang, Siyuan},
journal={arXiv preprint arXiv:2503.03222},
year={2025}
}