4D reconstruction of equine family animals (e.g., horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observations. In this work, we propose a novel framework called 4DEquine that disentangles the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth, pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To support training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. Despite being trained only on synthetic data, 4DEquine achieves state-of-the-art performance on the real-world APT36K and AiM datasets, demonstrating the effectiveness of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction networks.
The 4DEquine framework processes inputs through two disentangled branches to produce the final 4D reconstruction. In the motion recovery branch (AniMoFormer), an input video is processed frame-by-frame by a spatial transformer to extract features, which are then passed into a temporal transformer to model local motion context. These features feed into a VAREN transformer decoder and undergo a post-optimization step, yielding an accurate, temporally smooth sequence of VAREN shape and pose parameters. Concurrently, in the appearance reconstruction branch (EquineGS), the network takes a single input image alongside a canonical up-sampled point cloud initialized from the VAREN template mesh. These inputs are processed through a dual-stream architecture, using DINOv3 to extract multi-scale image features and positional encoders for point features, then fused within multiple Dual-Stream Transformer Gaussian (DSTG) blocks and passed through an MLP to predict the attributes of a canonical 3D Gaussian avatar. Finally, the generated Gaussian points are animated into per-frame poses using Linear Blend Skinning (LBS) driven by AniMoFormer's motion parameters, yielding the complete 4D reconstruction. For more details, please refer to our paper.
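The final animation step above follows standard Linear Blend Skinning: each canonical Gaussian center is moved by a weighted blend of per-joint rigid transforms. The sketch below is a minimal NumPy illustration of that generic LBS formulation, not the paper's implementation; the function name `lbs` and the array shapes are our own assumptions for illustration.

```python
import numpy as np

def lbs(points, weights, joint_transforms):
    """Animate canonical points with Linear Blend Skinning (illustrative sketch).

    points:           (N, 3) canonical Gaussian centers
    weights:          (N, J) skinning weights; each row sums to 1
    joint_transforms: (J, 4, 4) per-joint rigid transforms for one frame
    returns:          (N, 3) posed points
    """
    # Blend the per-joint 4x4 transforms by the skinning weights -> (N, 4, 4)
    blended = np.einsum('nj,jab->nab', weights, joint_transforms)
    # Lift points to homogeneous coordinates -> (N, 4)
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    # Apply each point's blended transform -> (N, 4), then drop the w component
    posed = np.einsum('nab,nb->na', blended, homo)
    return posed[:, :3]

# Example: two points, one joint translated by (1, 0, 0) moves both points with it.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
w = np.ones((2, 1))
T = np.eye(4)[None].copy()
T[0, 0, 3] = 1.0
print(lbs(pts, w, T))  # each point shifted by +1 along x
```

In the full pipeline this per-frame transform set would come from AniMoFormer's predicted pose parameters, applied independently at every frame of the video.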
@article{lyu20264dequine,
  title={4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video},
  author={Lyu, Jin and An, Liang and Cheng, Pujin and Liu, Yebin and Tang, Xiaoying},
  journal={},
  year={2026}
}
Thanks to RomanHauksson for the website template.