ANUBIS: Representation-Centric Survey of Supervised Skeletal Action Recognition and the New Benchmark

Yang Liu*1,6 Jiyao Yang*2 Madhawa Perera3 Pan Ji4 Dongwoo Kim5 Min Xu6 Tianyang Wang2 Saeed Anwar7 Tom Gedeon8 Lei Wang†3,9 Zhenyue Qin†1
1Australian National University 2University of Alabama at Birmingham 3Data61/CSIRO 4OPPO US Research Center 5POSTECH 6Carnegie Mellon University 7University of Western Australia 8Curtin University 9Griffith University
*Equal contribution      †Corresponding authors

Overview

  • Representation-centric Survey: A systematic review that categorizes methods by input representation (Joint, Bone, Motion, Derived) and analyzes their roles in spatial and temporal modeling.
  • ANUBIS Dataset: A large-scale benchmark with 102 action classes and 80 viewpoints (including systematic rear-view capture), containing complex multi-person, aggressive, and socially relevant actions.
  • Comprehensive Benchmarking: Evaluation of state-of-the-art models on ANUBIS, revealing representation-dependent performance trends and highlighting the need for task-aware, semantically aligned fusion.
ANUBIS Key Contributions Overview

Evolution of skeleton-based action recognition methods from 2014 to 2025.

Dataset

102
Classes
80
Subjects
66,232
Clips
80
Views
32
Joints
Dataset Overview

Built for hard viewpoints and dense interaction

The ANUBIS dataset represents a significant advancement in skeletal action recognition benchmarking. It addresses critical limitations in existing datasets by providing comprehensive multi-viewpoint coverage including systematic rear-view perspectives.

Our dataset captures 102 distinct action classes performed by 80 subjects across 80 synchronized viewpoints, resulting in over 66,000 video clips. Using Microsoft Azure Kinect devices, we provide superior skeleton extraction accuracy with 32-joint 3D skeletal data.

Key Dataset Features

  • 80 synchronized viewpoints with systematic rear-view coverage
  • 32-joint 3D skeletal representation with precise temporal alignment
  • 102 action classes spanning social, aggressive, and independent behaviors
  • Multi-person interaction scenarios with complex dynamics
  • Microsoft Azure Kinect for superior skeleton extraction accuracy
  • Comprehensive evaluation protocols for representation learning

Dataset Examples

Each sample includes RGB video, 3D skeletal data, and depth information for comprehensive multi-modal action analysis.

Punch to Face

Multi-person • Aggressive actions
RGB Video
3D Skeleton
Depth Video

Cheers and Drinks

Multi-person • Social Interaction
RGB Video
3D Skeleton
Depth Video

Bow

Multi-person • Social Interaction
RGB Video
3D Skeleton
Depth Video

Benchmark Results

Best NTU60 X-View 97.8% Skateformer ECCV 2024
Best NTU120 X-Set 92.1% DeGCN TIP 2024
Best ANUBIS Top-1 60.33% LA-GCN TMM 2025
Best ANUBIS Top-5 86.87% LA-GCN TMM 2025

Note: While NTU60 and NTU120 results are approaching saturation, ANUBIS presents a substantially more challenging benchmark with significant room for improvement, highlighting the need for more robust skeletal action recognition methods.

Leaderboard

Representative Skeleton Action Recognition Methods

Performance across NTU60, NTU120, and ANUBIS datasets

Method Venue Features NTU60 NTU120 ANUBIS Params
(M)
GFLOPs
X-Sub X-View X-Sub X-Set Top-1 Top-5
 
STGCN AAAI'18 Joint 81.5 88.3 - - 50.25 79.96 3.4 45.23
Motif-STGCN AAAI'19 Joint 84.2 90.2 - - 55.76 83.96 1.78 27.10
 
2s-AGCN CVPR'19 Joint+Bone 88.5 95.1 - - 57.26 84.86 3.47 47.84
MS-G3D CVPR'20 Joint+Bone 91.5 96.2 86.9 88.4 54.17 82.05 3.8 62.72
GCN-NAS AAAI'20 Joint+Bone 89.4 95.7 - - 56.40 84.37 6.57 93.64
HD-GCN ICCV'23 Joint+Bone 93.4 97.2 90.1 91.6 51.33 80.96 8.8 12.74
 
RA-GCN ICIP'19 Joint+Motion 85.9 93.5 - - 41.87 73.45 10.26 135.52
 
Shift-GCN CVPR'20 Joint+Bone+Motion 90.7 96.5 85.9 87.6 26.84 57.50 0.73 6.16
Decoupling-GCN ECCV'20 Joint+Bone+Motion 90.8 96.6 86.5 88.1 52.32 80.06 3.63 32.98
CTR-GCN ICCV'21 Joint+Bone+Motion 92.4 96.8 88.9 90.4 37.90 70.40 10.07 141.68
STTFormer arxiv'22 Joint+Bone+Motion 92.3 96.5 88.3 89.2 57.77 85.30 6.6 109.08
Hyperformer arxiv'22 Joint+Bone+Motion 92.9 96.5 89.9 91.3 47.51 77.77 3.1 32.68
InfoGCN CVPR'22 Joint+Bone+Motion 93.0 97.1 89.4 90.7 46.99 76.69 1.6 19.97
BlockGCN CVPR'24 Joint+Bone+Motion 93.1 97.0 90.3 91.5 54.46 81.73 2.5 36.79
Skateformer ECCV'24 Joint+Bone+Motion 93.5☆ 97.8★ 89.8 91.4 45.02 75.43 3.8 8.93
DS-STGCN TIP'24 Joint+Bone+Motion 93.2 97.5☆ 89.4 91.2 52.43 81.96 1.4 14.09
DeGCN TIP'24 Joint+Bone+Motion 93.6★ 97.4 91.0★ 92.1★ 60.16☆ 85.63☆ 1.4 9.75
ProtoGCN CVPR'25 Joint+Bone+Motion 93.5☆ 97.5☆ 90.4 91.9☆ 47.56 78.10 4.2 29.88
LA-GCN TMM'25 Joint+Bone+Motion 93.5☆ 97.2 90.7☆ 91.8 60.33★ 86.87★ 3.4 28.32
Error Analysis

ANUBIS Dataset Confusion Matrix

LA-GCN detailed classification performance analysis

LA-GCN Confusion Matrix

What this view reveals

The confusion matrix shows the detailed classification performance of LA-GCN on the ANUBIS dataset. It highlights which action classes remain challenging and where the model tends to make classification errors.

These patterns help identify the next bottlenecks for robust skeletal action recognition under complex viewpoints and interactions.

Citation

@article{liu2026PR,
      title={Representation-Centric Survey of Supervised Skeletal Action Recognition and the New Benchmark}, 
      author={Yang Liu and Jiyao Yang and Madhawa Perera and Pan Ji and Dongwoo Kim and Min Xu and Tianyang Wang and Saeed Anwar and Tom Gedeon and Lei Wang and Zhenyue Qin},
      journal={Pattern Recognition},
      year={2026},
}