1School of Computer Science and Technology, Beijing Institute of Technology
2State Key Laboratory of General Artificial Intelligence, BIGAI
3Deep Glint 4Zhejiang University 5The University of Hong Kong
(1) This paper focuses on recovering 3D human motions with absolute world positions from monocular inputs.💃
(2) The key idea is to use a diffusion-based multi-view lifting framework that leverages homologous 2D data pre-training to improve 3D motion capture quality.💪
(3) To recover absolute positions in world coordinates, we propose a new representation that decouples local pose from global movement and encodes ground-plane geometric priors to accelerate convergence.🎉
(a) Traditional framework for direct 3D motion regression. (b) Mocap-2-to-3: our multi-view lifting framework from monocular input, which leverages 2D pretraining to enhance 3D motion capture. (c) The model outputs global motions with absolute positions from monocular 2D pose input while maintaining OOD generalization capability. (d) The model also supports COCO-format keypoint estimation.
Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, which constrains out-of-distribution generalization. Second, estimating metric-scale poses from monocular input is inherently difficult.
To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions.
To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data.
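As a minimal sketch of this weight-transfer strategy (assuming a PyTorch implementation; the module definitions below are simplified placeholders, not the released architecture), the single-view checkpoint initializes the shared layers of the multi-view model while multi-view-only layers keep their random initialization:

import torch
import torch.nn as nn

# Hypothetical, simplified stand-ins for the two denoisers; the real
# architectures and module names are not specified on this page.
class SingleViewDenoiser(nn.Module):
    def __init__(self, d_in=34, d_hidden=256):  # 34 = 17 COCO joints x (u, v), an illustrative choice
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in))

class MultiViewDenoiser(nn.Module):
    def __init__(self, d_in=34, d_hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in))
        # Cross-view mixing layer with no single-view counterpart.
        self.cross_view = nn.MultiheadAttention(embed_dim=d_in, num_heads=2, batch_first=True)

single_view = SingleViewDenoiser()   # pretrained on large 2D motion datasets
multi_view = MultiViewDenoiser()     # fine-tuned afterwards on public 3D data
# Transfer the shared backbone weights; strict=False leaves the
# multi-view-only layers (cross_view) at their random initialization.
missing, unexpected = multi_view.load_state_dict(single_view.state_dict(), strict=False)

The fine-tuning stage would then condition this multi-view model on the observed view and pointmaps, as described in the pipeline below.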
Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference.
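A minimal sketch of such a decoupled representation, assuming the root (pelvis) is joint 0, the up-axis is z, and the ground plane sits at z = 0; the paper's exact parameterization may differ:

import numpy as np

def decouple_motion(joints_world):
    """Split world-space joints (T, J, 3) into local pose and global movement.

    Assumes joint 0 is the root, z is up, and the ground plane is z = 0.
    """
    root = joints_world[:, 0]                      # root trajectory, (T, 3)
    local_pose = joints_world - root[:, None]      # root-relative joints, (T, J, 3)
    # Global movement: per-frame planar displacement plus height above the ground.
    delta_xy = np.diff(root[:, :2], axis=0, prepend=root[:1, :2])
    height = root[:, 2:3]
    global_movement = np.concatenate([delta_xy, height], axis=-1)   # (T, 3)
    return local_pose, global_movement

Predicting local pose and global movement separately lets the absolute trajectory be accumulated progressively at inference, as noted above.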
Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability.
Our code will be made publicly available.
Pipeline. During training: (a) We first train a single-view 2D Motion Diffusion Model for an arbitrary view. (b) Its weights then initialize a Multi-view Diffusion Model, conditioned on 2D pose sequences from $V_0$ and pointmaps. During inference, the Multi-view Model generates motions for the other views. (c) We compute local poses and global movement to recover global coordinates $(u,v)$ for each view. (d) Multi-view triangulation then synthesizes 3D absolute poses, (e) resulting in full-body global human motion.
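For step (d), a standard linear (DLT) triangulation of one joint from the generated views might look as follows; this is an illustrative sketch, and the paper's exact solver and view weighting may differ:

import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """Linear (DLT) triangulation of one joint from V views.

    points_2d: (V, 2) pixel coordinates of the same joint in each view.
    proj_mats: (V, 3, 4) camera projection matrices.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                    # (2V, 4) homogeneous system
    _, _, vh = np.linalg.svd(A)
    X = vh[-1]                            # null-space direction
    return X[:3] / X[3]

Running this per joint and per frame over the synthesized views yields the 3D absolute poses of step (d), which are assembled into the full-body global motion of step (e).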
@article{wang2025mocap,
title={Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining},
author={Wang, Zhumei and Hu, Zechen and Guo, Ruoxi and Pi, Huaijin and Feng, Ziyong and Peng, Sida and Zhou, Xiaowei and Pei, Mingtao and Huang, Siyuan},
journal={arXiv preprint arXiv:2503.03222},
year={2025}
}