MOCHI - Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

1 Institute of Robotics, Athena RC, Greece  ·  2 School of ECE, NTUA, Greece
3 HERON, Athens, Greece  ·  4 Max Planck Institute for Intelligent Systems, Tübingen, Germany
5 ETH Zurich, Switzerland  ·  6 Google, Switzerland
tl;dr: MOCHI learns to output FLAME registrations from calibrated multi-view images, without needing registrations during training. An optional test-time optimization step further improves fidelity.

Teaser

Given calibrated multi-view images, MOCHI predicts a canonical FLAME-topology mesh without requiring precomputed registrations during training.

Abstract

Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods still rely on registration-heavy supervision pipelines. MOCHI addresses this by training directly on raw scans without requiring registered training data. It enforces topological consistency through a pseudo-linear inverse kinematic solver and uses dense semantic guidance from a 2D keypoint predictor trained only on synthetic data. We further replace unstable point-to-surface supervision with pointmap- and normal-based losses for smoother gradients and improved fidelity. An optional test-time optimization stage refines each sample in a few dozen iterations, bridging feed-forward efficiency and iterative precision.
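As an illustration of the pointmap- and normal-based supervision mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that the predicted mesh and the raw scan have each been rendered into a per-pixel 3D point map and a normal map of shape (H, W, 3), together with a boolean validity mask. The function names `pointmap_loss` and `normal_loss`, the L1 distance, and the cosine form of the normal term are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def pointmap_loss(pred_points, scan_points, valid):
    """Mean L1 distance between two (H, W, 3) point maps over valid pixels.

    Hypothetical form: the exact distance used by MOCHI is not stated here.
    """
    diff = np.abs(pred_points - scan_points).sum(axis=-1)  # (H, W)
    n = valid.sum()
    return float((diff * valid).sum() / n) if n > 0 else 0.0


def normal_loss(pred_normals, scan_normals, valid, eps=1e-8):
    """Mean (1 - cosine similarity) between two (H, W, 3) normal maps.

    Hypothetical form: normals are re-normalized before comparison.
    """
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

    cos = (unit(pred_normals) * unit(scan_normals)).sum(axis=-1)  # (H, W)
    n = valid.sum()
    return float(((1.0 - cos) * valid).sum() / n) if n > 0 else 0.0


if __name__ == "__main__":
    # Toy 2x2 maps: every predicted point is offset by 1 unit per axis.
    pred_pts = np.zeros((2, 2, 3))
    scan_pts = np.ones((2, 2, 3))
    mask = np.ones((2, 2), dtype=bool)
    print(pointmap_loss(pred_pts, scan_pts, mask))

    # Identical normals give a loss close to zero.
    normals = np.tile(np.array([0.0, 0.0, 1.0]), (2, 2, 1))
    print(normal_loss(normals, normals, mask))
```

Both terms are defined per pixel on a regular image grid, so they avoid the nearest-surface search that a point-to-surface distance requires. This may be one reason such losses give smoother gradients, though the sketch itself makes no claim about the paper's actual formulation.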

@inproceedings{filntisis2026mochi,
  title = {MOCHI: Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence},
  author = {Filntisis, Panagiotis P. and Retsinas, George and Danecek, Radek and Sklyarova, Vanessa and Maragos, Petros and Bolkart, Timo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026},
}