Yes – by late 2025 there are models that report higher accuracy than GVHMR, but it depends on which setting you care about (global/world-grounded vs camera-space, single-shot vs multi-shot, biomechanics, etc.).
Here’s the short version by category.
GVHMR (SIGGRAPH Asia 2024) introduced Gravity-View coordinates and was clearly SOTA at the time for world-grounded motion from monocular video, outperforming WHAM and others on RICH, EMDB, and 3DPW while being quite fast.
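For intuition, the Gravity-View idea anchors each frame's output in a coordinate frame built from the gravity direction and the camera's viewing direction, so recovered motion stays upright and view-consistent. Here is a minimal numpy sketch, assuming both directions are known; the axis conventions are illustrative, not GVHMR's exact definition:

```python
import numpy as np

def gravity_view_rotation(gravity_dir, cam_forward):
    """Build a rotation whose +y axis opposes gravity and whose +z axis is
    the camera forward direction projected onto the horizontal plane.
    Illustrative axis conventions only -- not GVHMR's exact construction.
    Assumes the camera is not looking straight up or down."""
    up = -gravity_dir / np.linalg.norm(gravity_dir)      # +y opposes gravity
    fwd = cam_forward - np.dot(cam_forward, up) * up     # drop the vertical part
    fwd = fwd / np.linalg.norm(fwd)                      # +z: horizontal view dir
    right = np.cross(up, fwd)                            # +x completes the frame
    return np.stack([right, up, fwd], axis=1)            # columns are the axes

# Example: gravity straight down, camera pitched slightly downward.
R = gravity_view_rotation(np.array([0.0, -1.0, 0.0]),
                          np.array([0.0, -0.2, 1.0]))
```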
Since then:
WATCH jointly models camera and human trajectories and directly compares against GVHMR and WHAM on the same benchmarks and metrics.
On EMDB-2 (world space), using DPVO camera trajectories:
- GVHMR: WA-MPJPE₁₀₀ ≈ 111.0 mm, W-MPJPE₁₀₀ ≈ 276.5 mm, RTE ≈ 2.0 m
- WATCH (w/ cam traj.): WA-MPJPE₁₀₀ ≈ 107.6 mm, W-MPJPE₁₀₀ ≈ 272.2 mm, RTE ≈ 1.9 m, and lower jitter/foot-sliding.
With GT gyro data, WATCH improves further (WA-MPJPE₁₀₀ ≈ 106.4 mm vs 109.1 mm for GVHMR, plus lower RTE and jitter).
In camera space (EMDB-1 / RICH / 3DPW), WATCH also shows slightly better MPJPE / PVE and smoother motion than GVHMR on most benchmarks, while using a similar training recipe.
The authors explicitly state that WATCH achieves superior global and camera-space performance compared to GVHMR on RICH and EMDB.
👉 Conclusion: In the same “world-grounded from monocular video” regime, WATCH is numerically more accurate than GVHMR on the standard global-motion benchmarks (RICH, EMDB, 3DPW) while also improving smoothness.
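For reference, the WA-/W-MPJPE₁₀₀ numbers above both split the sequence into 100-frame segments and measure mean per-joint error in world space after a rigid alignment; they differ in how much of the segment the alignment may use (the whole segment vs. only its start). Here is a sketch of the common definitions, assuming (T, J, 3) joint arrays; exact conventions vary slightly between papers (e.g., first-frame vs. first-two-frame alignment):

```python
import numpy as np

def rigid_align(src, dst):
    """Kabsch: rotation R and translation t minimizing ||R @ src + t - dst||."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def w_mpjpe_100(pred, gt, seg=100):
    """W-MPJPE@100: per 100-frame segment, align prediction to ground truth
    using the first frame only, then average the per-joint error (mm if
    inputs are in mm).  pred, gt: (T, J, 3) world-space joints."""
    errs = []
    for s in range(0, len(pred) - seg + 1, seg):
        p, g = pred[s:s+seg], gt[s:s+seg]
        R, t = rigid_align(p[0], g[0])                 # first-frame alignment
        errs.append(np.linalg.norm(p @ R.T + t - g, axis=-1).mean())
    return float(np.mean(errs))

def wa_mpjpe_100(pred, gt, seg=100):
    """WA-MPJPE@100: same, but the rigid alignment may use the whole segment,
    so it forgives global drift within each 100-frame window."""
    errs = []
    for s in range(0, len(pred) - seg + 1, seg):
        p, g = pred[s:s+seg], gt[s:s+seg]
        R, t = rigid_align(p.reshape(-1, 3), g.reshape(-1, 3))
        errs.append(np.linalg.norm(p @ R.T + t - g, axis=-1).mean())
    return float(np.mean(errs))
```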
HumanMM: Global Human Motion Recovery from Multi-shot Videos targets multi-shot sequences (multiple cuts) from a single camera and compares directly to SLAHMR, WHAM, and GVHMR on their ms-Motion benchmark.
On ms-Motion (multi-shot AIST & Human3.6M), the paper's results table shows:
- Comparing the GVHMR and HumanMM rows: HumanMM achieves substantially lower PA-MPJPE, WA-MPJPE, RTE, and ROE across shot counts, indicating more accurate global trajectories and orientations than GVHMR on this benchmark.
👉 Conclusion: For multi-shot monocular global motion, HumanMM is clearly more accurate than GVHMR on the published benchmark.
WATCH’s paper notes that some camera-trajectory–centric approaches like TRAM and variants of PromptHMR-vid can achieve even better numerical global-motion metrics on some datasets by relying heavily on high-precision SLAM trajectories. However, they may suffer from physically implausible poses and discontinuities, so “better accuracy” depends on whether you care more about raw error numbers or physical realism.
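To see why camera-trajectory quality can dominate these world-space metrics: world joints are simply camera-space joints pushed through the estimated camera trajectory, so SLAM drift moves every joint rigidly and shows up directly in WA-/W-MPJPE and RTE. A generic sketch of that composition (not any specific paper's code):

```python
import numpy as np

def to_world(cam_joints, R_wc, t_wc):
    """Lift per-frame camera-space joints into world space with a SLAM
    camera trajectory.  cam_joints: (T, J, 3), R_wc: (T, 3, 3) camera-to-
    world rotations, t_wc: (T, 3) camera positions in world coordinates.
    Any drift in (R_wc, t_wc) rotates/translates whole frames at once,
    so global-metric error is dominated by trajectory quality."""
    return np.einsum('tij,tnj->tni', R_wc, cam_joints) + t_wc[:, None, :]
```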
If by “human motion recovery” you mainly mean pose and shape in camera coordinates from monocular images/videos, there are also newer, more accurate methods than GVHMR:
PromptHMR is a transformer-based, promptable HMR model that processes full images and accepts spatial & semantic prompts (boxes, masks, language). The paper reports state-of-the-art accuracy for camera-space pose/shape on standard HPS benchmarks (e.g., 3DPW) with both image and video versions.
It doesn’t focus on world-grounded trajectories; it’s more about flexible, highly accurate 3D pose/shape from monocular inputs, especially in crowded/complex scenes.
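To make “promptable” concrete, here is a purely hypothetical interface sketch; every name below is a placeholder rather than PromptHMR's actual API (check the official repo for the real entry points):

```python
# Purely hypothetical sketch -- every name here is a placeholder, not the
# real PromptHMR API.  It only illustrates the idea: encode the full image
# once, then let each spatial/semantic prompt select one person to recover.
def recover_people(model, image, person_boxes, descriptions=None):
    results = []
    for i, box in enumerate(person_boxes):
        prompt = {
            "box": box,                                        # (4,) xyxy pixels
            "text": descriptions[i] if descriptions else None, # e.g. "person in red"
        }
        results.append(model(image, prompt))                   # -> per-person pose/shape
    return results
```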
BioPose aims at biomechanically accurate 3D poses from monocular videos, combining an MQ-HMR backbone, a Neural IK stage, and 2D-informed refinement. It shows significantly reduced joint errors against motion-capture ground truth and prior learning-based HPE/HMR methods on biomechanics benchmarks.
This is a different notion of “accuracy” (matching real joint locations / kinematics rather than 3DPW-style metrics), but if you care about biomechanics, BioPose can be “more accurate” in that sense than general HMR models like GVHMR.
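As a flavor of what a 2D-informed refinement stage typically does (a generic sketch under standard assumptions, not BioPose's actual implementation): nudge the 3D joints so their perspective projection matches detected 2D keypoints, while a regularizer keeps the result near the initial estimate.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_with_2d(joints3d, joints2d, K, w_reg=1e-2):
    """Generic 2D-informed refinement (illustrative only -- not BioPose's
    actual stage): adjust camera-space 3D joints so their perspective
    projection matches detected 2D keypoints, with a soft prior keeping
    them near the initial estimate.
    joints3d: (J, 3) initial joints, joints2d: (J, 2) pixel detections,
    K: (3, 3) camera intrinsics."""
    def residuals(x):
        j3d = x.reshape(-1, 3)
        proj = (K @ j3d.T).T                   # project to the image plane
        proj = proj[:, :2] / proj[:, 2:3]      # perspective divide
        reproj = (proj - joints2d).ravel()     # match the 2D detections
        reg = w_reg * (j3d - joints3d).ravel() # stay close to the initial pose
        return np.concatenate([reproj, reg])

    return least_squares(residuals, joints3d.ravel()).x.reshape(-1, 3)
```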
So, as of November 2025:
- Yes – for world-grounded monocular motion, both WATCH (2025) and HumanMM (CVPR 2025) report better quantitative performance than GVHMR on their respective benchmarks, under comparable conditions.
- For camera-space pose/shape, methods like PromptHMR surpass older HMR models on standard 3DPW / Human3.6M metrics, and BioPose improves biomechanical realism relative to prior work.
- GVHMR is still a strong, widely used baseline and is often used inside later pipelines (e.g., as a motion estimator), but it’s no longer the clear top performer on global-motion benchmarks.
If you tell me your exact use-case (single long shot vs multi-shot, static vs moving camera, need for strict biomechanics, etc.), I can suggest which of WATCH / HumanMM / PromptHMR / BioPose (or combinations) is likely the best fit.