Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining


Zhumei Wang1,2  Zechen Hu3  Ruoxi Guo4  Huaijin Pi4,5  Ziyong Feng3  Sida Peng4  Xiaowei Zhou4  Mingtao Pei1  Siyuan Huang2 

1School of Computer Science and Technology, Beijing Institute of Technology   
2State Key Laboratory of General Artificial Intelligence, BIGAI   
3Deep Glint   4Zhejiang University   5The University of Hong Kong

TL;DR


(1) This paper focuses on recovering 3D human motions with absolute world positions from monocular inputs. 💃
(2) The key idea is a diffusion-based multi-view lifting framework that leverages 2D data pretraining to improve 3D motion capture quality. 💪
(3) To recover absolute positions in world coordinates, we propose a new representation that decouples local pose from global movement and encodes the ground plane to accelerate convergence. 🎉

Video



Abstract


(a) Traditional framework for direct 3D motion regression. (b) Mocap-2-to-3: our multi-view lifting framework for monocular input, which leverages 2D pretraining to enhance 3D motion capture. (c) The model outputs global motions with absolute positions from monocular 2D pose input while maintaining OOD generalization capability. (d) The model also supports COCO-format keypoint estimation.

Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose the complex 3D motion recovery problem into multi-view 2D motion synthesis. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.
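To make the decoupled representation concrete, below is a minimal NumPy sketch that splits a world-space joint sequence into a root-relative local pose and a global-movement term (planar root velocity plus root height over an assumed z = 0 ground plane). The array shapes, the pelvis-as-root convention, and this particular split are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def decouple_motion(joints):
    """Split a global joint sequence into local pose and global movement.

    joints: (T, J, 3) array of world-space joint positions; joint 0 is
            assumed to be the pelvis/root and the ground plane is assumed
            to be z = 0 (illustrative conventions only).
    """
    root = joints[:, 0]                    # (T, 3) root trajectory
    local_pose = joints - root[:, None]    # root-relative ("local") pose
    # Global movement: per-frame planar root displacement plus root height,
    # so the ground plane acts as an explicit geometric prior. Frame 0
    # stores the absolute starting position so the trajectory can be
    # re-integrated exactly.
    planar_velocity = np.diff(root[:, :2], axis=0, prepend=np.zeros((1, 2)))
    root_height = root[:, 2:3]
    global_movement = np.concatenate([planar_velocity, root_height], axis=-1)
    return local_pose, global_movement

def recompose_motion(local_pose, global_movement):
    """Inverse of decouple_motion: integrate planar velocities back to a trajectory."""
    planar = np.cumsum(global_movement[:, :2], axis=0)
    root = np.concatenate([planar, global_movement[:, 2:3]], axis=-1)
    return local_pose + root[:, None]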

Method


Pipeline. During training: (a) We first train a single-view 2D Motion Diffusion Model on arbitrary views. (b) Its weights are then used to initialize a Multi-view Diffusion Model, conditioned on 2D pose sequences from $V_0$ and pointmaps. During inference, the Multi-view Model generates motions for the other views. (c) We compute local poses and global movement to recover global coordinates $(u,v)$ for each view. (d) Multi-view triangulation is then used to synthesize 3D absolute poses, (e) resulting in full-body global human motion.
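Step (d) is a standard multi-view triangulation. The sketch below shows plain DLT triangulation with NumPy, assuming known 3x4 projection matrices for the input and generated views; the paper's exact triangulation and camera setup may differ.

import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint from several views.

    projections: list of 3x4 camera projection matrices, one per view.
    points_2d:   list of (u, v) pixel coordinates of the same keypoint.
    Returns the 3D point in world coordinates.
    """
    A = []
    for P, (u, v) in zip(projections, points_2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    A = np.stack(A)                        # (2V, 4) homogeneous system
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                             # null-space solution
    return X[:3] / X[3]                    # dehomogenize

def triangulate_motion(projections, keypoints_2d):
    """keypoints_2d: (V, T, J, 2) per-view 2D keypoint sequences -> (T, J, 3)."""
    V, T, J, _ = keypoints_2d.shape
    out = np.zeros((T, J, 3))
    for t in range(T):
        for j in range(J):
            out[t, j] = triangulate_point(projections, keypoints_2d[:, t, j])
    return out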


Comparison video





More Results










Citation


@article{wang2025mocap,
  title={Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining},
  author={Wang, Zhumei and Hu, Zechen and Guo, Ruoxi and Pi, Huaijin and Feng, Ziyong and Peng, Sida and Zhou, Xiaowei and Pei, Mingtao and Huang, Siyuan},
  journal={arXiv preprint arXiv:2503.03222},
  year={2025}
}