AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer


Jin Lyu *,1 , Tianyi Zhu *,2 , Yi Gu 3 , Li Lin 1,4 , Pujin Cheng 1,4 , Yebin Liu 5 , Xiaoying Tang †,1 , Liang An †,5
1 Southern University of Science and Technology
2 China Mobile Communications Company Limited Research Institute
3 The Hong Kong University of Science and Technology
4 The University of Hong Kong
5 Tsinghua University

*Equal Contribution, †Corresponding Author

Qualitative Results

Abstract

Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and the limited multi-species datasets leave this problem underexplored. To this end, this paper presents AniMer, which estimates animal pose and shape using a family aware Transformer, enhancing the reconstruction accuracy for diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, with either 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on the out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and of CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications.

AniMer

The goal of AniMer is to estimate the pose and shape of an animal from a single image. Given an image $I \in \mathbb{R}^{H \times W \times 3}$, we first utilize a ViT encoder to extract image feature tokens $\mathbf{F} \in \mathbb{R}^{192 \times 1280}$, while the class token interacts with the image tokens to capture animal family information. We then feed the feature tokens $\mathbf{F}$ into the SMAL Transformer decoder to obtain a feature vector $\boldsymbol{f} \in \mathbb{R}^{1 \times 1280}$. Finally, regression heads regress the shape parameters $\hat{\beta}$, pose parameters $\hat{\theta}$, and camera parameters $\hat{\pi}$. At the same time, the class token is fed into a predictor head for animal family supervised contrastive learning. For more details, please refer to our paper.
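The family supervised contrastive objective on class-token embeddings can be illustrated with a small sketch. This is a hedged NumPy illustration of a standard supervised contrastive loss, not the paper's implementation; the embedding dimension, temperature, and normalization choices below are assumptions.

```python
import numpy as np

def family_supcon_loss(z: np.ndarray, families: np.ndarray, tau: float = 0.07) -> float:
    """Supervised contrastive loss over a batch of class-token embeddings.

    z        : (N, D) embeddings (e.g. predictor-head outputs of the class token)
    families : (N,) integer family labels; same label -> positive pair
    tau      : temperature (0.07 is a common default, assumed here)
    """
    # L2-normalize so similarity is cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                      # (N, N) scaled similarities
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)        # exclude self-similarity

    # Log-softmax over all other samples in the batch (numerically stable)
    logits = sim - sim.max(axis=1, keepdims=True)
    denom = (np.exp(logits) * not_self).sum(axis=1, keepdims=True)
    log_prob = logits - np.log(denom)

    # Average log-probability over same-family positives, per anchor
    pos = (families[:, None] == families[None, :]) & not_self
    per_anchor = -(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())
```

Intuitively, the loss is small when embeddings of the same quadrupedal family cluster together and different families are pushed apart, which is the discriminative behavior the family aware design aims for.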

CtrlAni3D Generation Pipeline

The generation pipeline of the CtrlAni3D dataset contains three parts: (a) text prompt generation, (b) conditional image generation, and (c) post-processing. For more details, please refer to our paper.
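Step (a) can be sketched as composing prompts from attribute pools. This is a purely illustrative sketch: the attribute pools, template wording, and function names below are hypothetical, not the ones used to build CtrlAni3D.

```python
import itertools

# Hypothetical attribute pools (illustrative values only)
FAMILIES = ["Felidae", "Canidae", "Bovidae"]
SCENES = ["in a grassland", "on a snowy field"]
LIGHTING = ["at golden hour", "under overcast light"]

def make_prompt(family: str, scene: str, lighting: str) -> str:
    # Step (a): compose a text prompt. In the full pipeline, this prompt and a
    # rendered SMAL condition image would drive conditional diffusion-based
    # generation (b), followed by filtering/post-processing (c).
    return f"a photo of a {family} quadruped {scene}, {lighting}"

prompts = [make_prompt(f, s, l)
           for f, s, l in itertools.product(FAMILIES, SCENES, LIGHTING)]
```

Enumerating the product of attribute pools is one simple way to obtain diverse, balanced prompts across families before sampling images.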

CtrlAni3D Dataset Samples

BibTeX citation

    @misc{lyu2024animeranimalposeshape,
      title={AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer},
      author={Jin Lyu and Tianyi Zhu and Yi Gu and Li Lin and Pujin Cheng and Yebin Liu and Xiaoying Tang and Liang An},
      year={2024},
      eprint={2412.00837},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.00837},
    }