Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining


Zhumei Wang1,2*  Zechen Hu3*  Ruoxi Guo4  Huaijin Pi5  Ziyong Feng3  Liang Zhang6†  Mingtao Pei1  Siyuan Huang2† 

1Beijing Institute of Technology    2State Key Laboratory of General Artificial Intelligence, BIGAI   
3Deep Glint   4Zhejiang University   5The University of Hong Kong   6Shandong Agricultural University

CVPR 2026

TL;DR


(1) This paper focuses on recovering 3D human motions with absolute world positions from monocular inputs. 💃
(2) The key idea is a diffusion-based multi-view lifting framework that leverages pre-training on homologous 2D data to improve 3D motion capture quality. 💪
(3) To recover absolute positions in world coordinates, we propose a new representation that decouples local pose from global movement and encodes the ground plane to accelerate convergence. 🎉
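As an illustration of the decoupling idea in (3), not the paper's actual representation, world-space joint trajectories can be split into a global root trajectory plus root-relative local poses (joint indices and array shapes here are hypothetical):

```python
import numpy as np

def decouple_motion(joints):
    """Split global motion into local pose + global movement.

    joints: (T, J, 3) world-space joint positions over T frames; joint 0 is the root.
    Returns (local_pose, root_traj): root-relative poses and the root trajectory.
    """
    root_traj = joints[:, 0]                    # (T, 3) global movement
    local_pose = joints - root_traj[:, None]    # (T, J, 3) root-relative pose
    return local_pose, root_traj

def recouple_motion(local_pose, root_traj):
    """Inverse operation: recover world-space joints."""
    return local_pose + root_traj[:, None]
```

Learning the two factors separately lets the local-pose branch benefit from 2D action priors while the trajectory branch handles metric-scale positioning.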

Video



Abstract


(a) Traditional framework for direct 3D motion regression. (b) Mocap-2-to-3: our multi-view lifting framework from monocular input, which leverages 2D pretraining to enhance 3D motion capture. (c) The model outputs SMPL-format global motions with absolute positions from monocular 2D pose input while maintaining out-of-distribution generalization capability. (d) Our model also supports COCO-format keypoint outputs.

Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data, followed by multi-view fine-tuning on 3D data, thus achieving a combination of strong priors and geometric constraints. Furthermore, to recover absolute poses, we introduce a novel human motion representation that decouples the learning of local pose and global movements, while encoding ground geometric priors to accelerate convergence, thereby yielding more precise positioning in the physical world. Experiments on in-the-wild benchmarks show that our method outperforms state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting strong generalization capability.

Method


Pipeline. During training: (a) we first train a single-view 2D Motion Diffusion Model. (b) Its weights then initialize a Multi-view Diffusion Model, conditioned on 2D pose sequences from $V_0$ and pointmaps. During inference, the Multi-view Model generates motions for the other views. (c) We compute local poses and global movement to recover global coordinates for each view. (d) Multi-view triangulation then synthesizes 3D absolute poses, (e) yielding full-body global human motion.
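Step (d) builds on standard multi-view triangulation. A minimal direct-linear-transform (DLT) sketch, assuming known 3×4 projection matrices for the synthesized views (this is generic geometry, not the paper's exact solver):

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Triangulate one 3D point from N views via linear DLT.

    proj_mats: list of 3x4 camera projection matrices, one per view.
    points_2d: list of (u, v) pixel observations, one per view.
    Returns the 3D point in world coordinates.
    """
    A = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    A = np.asarray(A)
    # Least-squares solution: right-singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

Applying this per joint and per frame across the generated views yields the absolute 3D pose sequence.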


Comparison video





More Results










Citation


@article{wang2025mocap,
  title={Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining},
  author={Wang, Zhumei and Hu, Zechen and Guo, Ruoxi and Pi, Huaijin and Feng, Ziyong and Zhang, Liang and Pei, Mingtao and Huang, Siyuan},
  journal={arXiv preprint arXiv:2503.03222},
  year={2025}
}