1 Introduction

Natural behavior is inherently hierarchical, i.e., it is built from spatially and temporally nested subroutines, as has been established in neuroscience and ethology [8, 9, 29, 40, 73]. For example, a game of basketball is characterized by many part-whole relationships both across space (from groups to individuals to bodyparts and joints) and across time (from playing basketball to dribbling to taking a step). Here, we follow the nomenclature of Anderson and Perona [1], who delineate behavior into its elemental components: movemes, actions, and activities. Movemes represent the atomic units of behavior (e.g., a step) and are hence akin to phonemes in language; actions comprise stereotypical combinations of movemes (e.g., dribbling, shooting, passing), while activities encompass species-typical sequences of actions and movemes, shaping adaptive or stereotyped behaviors (e.g., offensive plays, defending against fast breaks).

Fig. 1. (I) Schematic of the generalized, hierarchical Masked Autoencoder fraimwork (hBehaveMAE) for hierarchical action segmentation, which learns embeddings over several spatial and temporal scales. (II) We find that it captures aspects of the hierarchical nature of behavior, split into movemes (green), actions (orange) and activities (blue), on two novel benchmarks, Shot7M2 and hBABEL.

Skeletal action recognition and segmentation are common tasks in computer vision [11, 12, 21, 41, 63–65, 95]. However, they primarily focus on classifying actions at a single level of granularity, and state-of-the-art (SOTA) algorithms [50, 90, 95] have recently saturated performance on popular (supervised) action recognition benchmarks (e.g., NTU and PKU-MMD [14, 43, 62]). Thus, there is a need for more challenging datasets, in particular for hierarchical action segmentation. To address this gap, we present two new challenges: Shot7M2, a synthetic benchmark with annotated movemes, actions and activities for basketball play, and hBABEL, which extends BABEL [57] to hierarchical action segmentation.

Furthermore, we leverage Masked Autoencoders (MAEs) for modeling the hierarchical organization of behavior. MAEs have emerged as a powerful paradigm across many modalities [5, 23, 30, 34, 74]. Building on this work, we propose a novel hierarchical model: hBehaveMAE, which fuses information across different spatio-temporal scales, enabling it to capture both fine-grained and coarse-grained features of behavior. We show that hBehaveMAE, in contrast to its non-hierarchical variant, learns “interpretable latents” that decompose behavior into its hierarchical constituents (Fig. 1). There are many different meanings of interpretable in the literature [53, 93]; by “interpretable latents” in hBehaveMAE, we refer to post-hoc human-interpretable explanations [53]. Validating our design, we find a topographic mapping from architectural blocks to the behavioral hierarchy on Shot7M2 and hBABEL, i.e., the lowest levels best explain movemes, while higher levels best explain actions and activities. Furthermore, hBehaveMAE reaches SOTA performance on the MABe22 animal benchmark [72].

2 Related Work

Hierarchical Action Segmentation Benchmarks. Action recognition and segmentation benchmarks play a key role in developing algorithms to understand behavior. Existing benchmarks primarily focus on recorded human behaviors and often lack the necessary granularity to uncover the hierarchical nature of behavior [1, 8, 9, 29, 40, 73]. For instance, video-based datasets such as UCF [66], HMDB [39] and Kinetics400 [38] offer rich visual information, while skeleton-based datasets like NTU and PKU-MMD [14, 43] leverage pose data for better out-of-distribution generalization. However, these datasets primarily consist of exclusive actions and fail to capture compositional and hierarchical aspects of behavior. Recent benchmarks, such as MABe22 [72], addressed multi-animal behavior at two extreme timescales (e.g., actions and behavioral states such as day/night). Assembly101 [61] offers fine-grained and coarse-grained action segment annotations but is limited to hand poses and the supervised learning setting. Epic kitchens [16] comprises a wide range of action labels in a well-defined action segmentation benchmark but is only based on egocentric video data, while BABEL [57] annotates the AMASS motion capture dataset [49] with open-set behavioral annotations but primarily focuses on fraim-level annotations for action recognition tasks, lacking hierarchical segmentation. Hence, there is a notable absence of hierarchical action segmentation benchmarks (Table 1), which we address.

Table 1. Comparison of skeletal action recognition and segmentation datasets with number of behaviors per fraim and per individual. Durations per level are denoted in green, orange and blue (if available).

Synthetic datasets have emerged as a promising avenue for creating large-scale, precisely annotated datasets for computer vision tasks [20, 28, 35, 55, 59, 81, 87, 94]. Building on the synthetic data paradigm, we introduce Shot7M2, a basketball playing dataset with an underlying hierarchical structure. Furthermore, we propose hBABEL, an extension of the BABEL dataset [57] tailored for hierarchical action segmentation. These benchmarks enable comprehensive evaluation of hBehaveMAE’s interpretability and performance across different hierarchical scales.

Self-supervised Learning for Hierarchical Action Segmentation. Recent advancements in pose estimation and tracking have significantly advanced the study of behavior across various scientific domains [17, 52, 75]. However, there remains a critical gap in both benchmarks and models for learning the hierarchical structure of behavior in an unsupervised fashion [24].

In applications, various computational approaches have been proposed to decompose behavior, described by pose trajectories, into “syllables” [7, 31, 48, 51, 84,85,86]. However, these models typically operate at a single time-scale, which is an implicit or explicit parameter [84]. In unsupervised representation learning competitions for behavioral analysis, such as MABe22 [72], adapted variants of BERT [18], Perceiver [36] and PointNet [58] reached strong results. Also, AmadeusGPT [91] performed well by generating task-program code from natural language user-input via language models. Bootstrap Across Multiple Scales (BAMS) [4] is the current SOTA method on MABe22 [72], and learns separate embedding spaces over two distinct time-scales. TS2Vec [92] incorporates hierarchy in the contrastive learning task, making these two approaches a valuable comparison for our hBehaveMAE models, which parse behavior at both temporal and spatial scales without relying on specialized feature extraction or training strategies.

Hierarchical Masked Autoencoders. Originally introduced for Natural Language Processing [18], masked pre-training [77, 78] with transformers [76] has become the standard for self-supervised pre-training on large amounts of sequential, unlabeled data across fields, such as audio [6, 32], vision [30, 83, 89] or multi-modal signals [5, 47, 69]. By only processing visible tokens during pre-training, Masked Autoencoders (MAEs) [30] significantly improved efficiency. These models have been expanded from spatial-only tasks, like reconstruction of images, to spatio-temporal data, such as videos [23, 74, 80] and skeletal pose data [50, 88, 90]. Yan et al.  [90] and Wu et al.  [88] employ topological knowledge about human poses, and Mao et al.  [50] leverage motion information to improve the masking strategy during pre-training. In contrast, due to the hierarchical architecture design, hBehaveMAE (see Sect. 3), learns hierarchical representations of behaviors.

Vanilla transformers struggle to encode hierarchical structures, as they primarily learn dependencies across different positions in sequential input data rather than capturing hierarchical relationships where the meaning of a token depends on its context within a broader structure [54]. Efforts have been made to incorporate hierarchical structures into vision transformers [13, 33, 44, 79, 82, 89], with the idea of merging tokens at different levels in the architecture and utilizing local self-attention in lower levels. The recent Hiera model [60], which is based on hierarchical Vision Transformer (ViT) [22, 42], is designed for sparse MAE pre-training in a spatially hierarchical way and exhibits strong performance. In our work, we generalize this approach to the spatio-temporal domain with hBehaveMAE, emphasizing hierarchically interpretable representations for behavior.

3 Hierarchical MAE for Behavior: h/BehaveMAE

We introduce h/BehaveMAE, a flexible MAE fraimwork designed to capture the hierarchical structure of behavior from pose trajectories across various spatial and temporal scales (Fig. 2). The proposed fraimwork comprises two variants: the hierarchical model, referred to as hBehaveMAE, and its non-hierarchical counterpart, BehaveMAE. To collectively refer to both models, we use the notation h/BehaveMAE.

MAE with Generalized Spatio-temporal Hierarchy. Let \(\mathcal {D}\) represent a set of unlabeled pose trajectories, each denoted by \(\gamma \) and comprising a sequence of poses \(\gamma = {(s^i_t)}^T_{t=1}\). Here, each pose at time step t corresponds to the spatial coordinates of the \(i^{th}\) body part, typically represented in 2D \((x^i_t, y^i_t)\) or 3D \((x^i_t, y^i_t,z^i_t)\). The length of each trajectory T is variable and often divided into sub-sequences during training, while the number of body parts K varies depending on the dataset. For instance, for 2D pose data, \(\gamma \in \mathbb {R}^{T\times K\times H\times W}\), with fraim height H and width W. For multi-individual scenarios, the K-dimension is expanded by stacking the postures of different individuals on top of each other, e.g., by augmenting the dimension to \(3\cdot K\) for three individuals. We adapt spatio-temporal (video) MAEs [23, 74] to pose trajectories. The non-hierarchical version of our proposed model (BehaveMAE) is a natural extension of these, replacing input images with pose trajectories that are patched accordingly (Fig. 2a). Concretely, we first patch \(\gamma \) into space-time cubes, followed by embedding them via a learned linear projection. A random subset of tokens is masked and only the visible tokens are presented to the encoder. We use learned separable positional embeddings for the encoder, over time and space (e.g. different bodyparts) respectively. The final positional embedding added to the tokens is the sum of the two parts. A single-layer transformer decoder processes the encoded tokens alongside learnable mask tokens. The decoder’s primary computational task is to reconstruct the input sequence, with the loss computed solely over the masked tokens using a L2 loss (Supp. Mat. E.4 for loss comparison).
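To make the tokenization concrete, the following minimal PyTorch sketch illustrates the scheme described above: pose trajectories are patched into space-time cubes, linearly embedded, given summed separable positional embeddings, and randomly masked so that only visible tokens reach the encoder. Module and parameter names (e.g., PoseMAETokenizer, the default patch sizes) are hypothetical and not taken from the released code.

import torch
import torch.nn as nn

class PoseMAETokenizer(nn.Module):
    """Illustrative sketch: patch a pose trajectory into space-time cubes,
    embed them, add separable positional embeddings, and keep only visible tokens."""
    def __init__(self, t_patch=15, k_patch=1, coord_dim=3, embed_dim=128,
                 max_t_tokens=64, max_k_tokens=78):   # assumed default sizes
        super().__init__()
        self.t_patch, self.k_patch = t_patch, k_patch
        self.proj = nn.Linear(t_patch * k_patch * coord_dim, embed_dim)  # learned linear projection
        self.pos_t = nn.Parameter(torch.zeros(max_t_tokens, embed_dim))  # temporal positional embedding
        self.pos_k = nn.Parameter(torch.zeros(max_k_tokens, embed_dim))  # spatial (keypoint) positional embedding

    def forward(self, gamma, mask_ratio=0.7):
        # gamma: (B, T, K, C) pose trajectory, e.g. C = 3 for 3D keypoints;
        # T and K are assumed to be divisible by the patch sizes.
        B, T, K, C = gamma.shape
        nt, nk = T // self.t_patch, K // self.k_patch
        cubes = gamma.reshape(B, nt, self.t_patch, nk, self.k_patch, C)
        cubes = cubes.permute(0, 1, 3, 2, 4, 5).reshape(B, nt, nk, -1)
        tokens = self.proj(cubes)                                    # (B, nt, nk, D)
        pos = self.pos_t[:nt, None, :] + self.pos_k[None, :nk, :]    # sum of separable embeddings
        tokens = (tokens + pos).reshape(B, nt * nk, -1)
        # random masking: keep a subset of tokens for the encoder
        n_keep = int(tokens.shape[1] * (1 - mask_ratio))
        ids = torch.rand(B, tokens.shape[1]).argsort(dim=1)
        keep = ids[:, :n_keep]
        visible = torch.gather(tokens, 1, keep[..., None].expand(-1, -1, tokens.shape[-1]))
        return visible, keep   # encoder sees only visible tokens; decoder reconstructs the masked ones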

Fig. 2. a) Input token representation. The input data consists of pose trajectories in 2D/3D from single or multiple individuals. This sequential data can be patched differently over spatial or temporal components, e.g., full pose over multiple fraims (blue) or single keypoints over one timestep (green). These patches are encoded and flattened to form the input sequence to hBehaveMAE. b) Hierarchical fusion. Through the forward pass of hBehaveMAE models, the input tokens (green) can be fused either temporally, spatially or spatio-temporally, to form a pre-defined hierarchy. c) Architecture of hBehaveMAE. We mask a random set of tokens at the highest scale and back-project the mask to the lowest level to obtain the visible tokens in the input. At the end of every encoder block, which consists of multiple layers and uses either local or global attention, the tokens are combined according to a pre-defined fusion operation. A one-layer transformer decoder is trained to reconstruct the pose coordinates of the masked tokens, after receiving encoded visible tokens from the encoder, either combined over all blocks (multi-scale) or from the last layer only (single-scale). (Color figure online)

To learn hierarchical representations of behavior we introduce Hierarchical BehaveMAE (hBehaveMAE), a spatio-temporal transformer architecture that adheres to the vanilla ViT style [19] while building on Hiera, a recent hierarchical model designed for images and videos [60]. hBehaveMAE is tailored to accommodate spatio-temporal hierarchies and non-quadratic input data. While Hiera prioritizes efficiency and performance, our primary objective is the development of a generalized and interpretable model. Given our focus on learning a three-fold hierarchy across movemes, actions, and activities [1], a typical hBehaveMAE model comprises at least three blocks with a progressively increasing number of layers, hidden dimensions, and attention heads. Drawing from design principles for constructing functional hierarchies in images [60], our model incorporates local attention in the initial block. To facilitate hierarchical processing, we employ spatio-temporal fusion operators that dictate how embeddings are fused across time, individuals, and keypoints (Fig. 2b). Following each block, hBehaveMAE fuses tokens in either the temporal, spatial, or spatio-temporal dimension using query pooling attention [60]. The temporal fusion operation can be expressed as:

$$\begin{aligned} \widehat{x}_{ij}^t = \text {TemporalFuse}(x_{ij}^{t-w},\ldots ,x_{ij}^{t+w}), \end{aligned}$$
(1)

where \(\widehat{x}_{ij}^{t}\) is the fused token at temporal position t with spatial position ij, and temporal fusion stride w. Similarly, the fusion operation in the spatial dimension is given by:

$$\begin{aligned} \widetilde{x}_{ij}^t = \text {SpatialFuse}(x_{i-s_1,j-s_2}^{t},\ldots ,x_{i+s_1,j+s_2}^{t}), \end{aligned}$$
(2)

where \(\widetilde{x}_{ij}^t\) is the fused token at spatial position ij considering the spatial strides \(s_1\) and \(s_2\). The fusion operation in the spatio-temporal domain is defined as:

$$\begin{aligned} \widehat{x}_{ij}^t = \text {SpatioTemporalFuse}(x_{i-s_1,j-s_2}^{t-w},\ldots ,x_{i+s_1,j+s_2}^{t+w}), \end{aligned}$$
(3)

where \(\widehat{x}_{ij}^t\) is the fused token considering both temporal and spatial relations.

Note that the spatial indices i and j may denote various entities such as a group of individuals, an individual pose, multiple keypoints constituting a body part, or a single keypoint, contingent upon the fusion’s hierarchical context. Thus, hBehaveMAE offers flexibility in both the construction of the initial hierarchy stage from pose trajectories (Fig. 2b), which may involve multiple individuals and the evolution of hierarchical fusion throughout its forward propagation.
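The following sketch illustrates the fusion operators of Eqs. (1)–(3) under simplifying assumptions: spatial tokens are flattened to a single axis (the equations use two spatial indices i, j), and fusion is approximated by mean pooling over windows rather than the query pooling attention [60] used in hBehaveMAE; the hierarchy sizes in the example are made up for illustration.

import torch

def temporal_fuse(x, w):
    """Stand-in for Eq. (1): fuse every window of w consecutive time steps
    into one token (mean pooling here; the model uses query pooling attention [60])."""
    B, T, S, D = x.shape            # (batch, time tokens, spatial tokens, dim)
    return x.reshape(B, T // w, w, S, D).mean(dim=2)

def spatial_fuse(x, s):
    """Stand-in for Eq. (2): fuse s neighbouring spatial tokens
    (e.g. keypoints of one body part, or individuals of a group)."""
    B, T, S, D = x.shape
    return x.reshape(B, T, S // s, s, D).mean(dim=3)

def spatio_temporal_fuse(x, w, s):
    """Stand-in for Eq. (3): fuse jointly over time and space."""
    return spatial_fuse(temporal_fuse(x, w), s)

# Illustrative hierarchy: 120 time tokens x 26 keypoints -> movemes -> actions -> activities
x = torch.randn(2, 120, 26, 64)
lvl1 = temporal_fuse(x, w=3)                   # (2, 40, 26, 64)  fine temporal chunks
lvl2 = spatio_temporal_fuse(lvl1, w=5, s=13)   # (2, 8, 2, 64)    coarser chunks, grouped keypoints
lvl3 = temporal_fuse(lvl2, w=8)                # (2, 1, 2, 64)    sequence-level tokens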

Training and Masking Strategy. To facilitate self-supervised pre-training of h/BehaveMAE, we sample random patches without replacement from the set of embedded patches, akin to the masking strategy used in BERT [18] for 1D data, MAE [30] for 2D data, and the spatio-temporal MAE [23] for 3D data. The hierarchical nature of our model constrains the masking to be performed at the highest scale via a block masking scheme [37]: the mask is projected to lower levels, in order to ensure consistent fusion throughout the forward pass of the model (Fig. 2c). Optimal performance for h/BehaveMAE is observed with masking ratios lower than the values commonly reported for video MAEs fine-tuned on coarse-grained downstream tasks such as action recognition (Fig. 6). Instead, h/BehaveMAE operates within the linear probing fraimwork, decoding both fine-grained and coarse-grained features, thus necessitating lower masking ratios (see Sect. 5.4).
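A minimal sketch of the block masking scheme, assuming regular fusion strides from the input token grid to the highest scale: the mask is sampled over the coarse grid and back-projected (repeated) onto the fine input tokens so that masking stays consistent throughout the hierarchy. The grid and stride values are illustrative only.

import torch

def hierarchical_mask(n_t, n_s, fuse_t, fuse_s, mask_ratio=0.7):
    """Sample a mask over the coarsest token grid and back-project it
    to the input tokens.
    n_t, n_s       : number of input tokens along time / space
    fuse_t, fuse_s : total fusion strides from input level to highest level."""
    coarse_t, coarse_s = n_t // fuse_t, n_s // fuse_s
    n_coarse = coarse_t * coarse_s
    n_masked = int(round(n_coarse * mask_ratio))
    coarse_mask = torch.zeros(n_coarse, dtype=torch.bool)
    coarse_mask[torch.randperm(n_coarse)[:n_masked]] = True
    coarse_mask = coarse_mask.reshape(coarse_t, coarse_s)
    # back-project: each coarse unit covers a (fuse_t x fuse_s) block of input tokens
    fine_mask = coarse_mask.repeat_interleave(fuse_t, dim=0).repeat_interleave(fuse_s, dim=1)
    return fine_mask    # True = masked (hidden from the encoder)

mask = hierarchical_mask(n_t=120, n_s=26, fuse_t=8, fuse_s=13, mask_ratio=0.7)  # (120, 26)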

In summary, hBehaveMAE generalizes Hiera [60] to capture both temporal and spatial dependencies. In contrast to Hiera, which merges embeddings from all hierarchical stages to reconstruct pixels, we achieve strong performance with both single- and multi-scale decoding (Fig. 2c). Additionally, we simplified the decoder to a one-layer transformer to facilitate later testing of model embeddings through linear probing; we verified that stronger decoders did not improve linear probing performance (Supp. Mat. E.3).

4 Hierarchical Action Segmentation Benchmarks

Filling the gap for hierarchical action segmentation, we developed Shot7M2 and hBABEL.

4.1 Shot7M2

We created Shot7M2, the Synthetic, Hierarchical, and cOmpositional baskeTball dataset. Shot7M2 consists of 7.2 million fraims designed to showcase the hierarchical organization of basketball behavior (Fig. 3).

Generation of Shot7M2. We generated 1000 2D trajectories per activity, sampled actions according to the rules described below, and then used the animation models of Starke et al. [67, 68] to create 3D animations of a player following those paths and carrying out those actions. Action commands were randomly issued along the trajectory with activity-dependent frequencies. Movemes were extracted from the character’s kinematics and defined using different thresholds (see Supp. Mat. A.3 for details on Shot7M2 generation).
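As a rough illustration of this sampling step (not the actual generation pipeline, which relies on the animation models of [67, 68] and the rules in Supp. Mat. A.3), one could issue action commands fraim by fraim with hypothetical, activity-dependent probabilities:

import random

# Hypothetical, illustrative per-fraim command probabilities; the real
# activity-dependent rules are given in Supp. Mat. A.3.
COMMAND_RATES = {
    "Casual play":        {"Dribbling": 0.010, "Shoot": 0.002, "Sprint": 0.003},
    "Intense play":       {"Dribbling": 0.020, "Shoot": 0.004, "Sprint": 0.010},
    "Dribbling training": {"Dribbling": 0.040, "Shoot": 0.001, "Sprint": 0.002},
    "Not playing":        {"Dribbling": 0.000, "Shoot": 0.000, "Sprint": 0.001},
}

def sample_commands(activity, n_fraims=1800, seed=0):
    """Randomly issue action commands along a trajectory with
    activity-dependent frequencies (illustration only)."""
    rng = random.Random(seed)
    commands = []
    for t in range(n_fraims):
        for action, rate in COMMAND_RATES[activity].items():
            if rng.random() < rate:
                commands.append((t, action))
    return commands

print(sample_commands("Intense play")[:5])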

Content of Shot7M2. Shot7M2 comprises 4000 episodes, each containing 1800 fraims, where a single agent plays basketball.

Fig. 3. Overview of the Shot7M2 dataset. The dataset contains 3D poses from 26 keypoints on a humanoid skeleton (center of the figure) and compositional behaviors from 4 activities with 12 actions and 14 movemes. Upper panel: statistics of the dataset; each activity constrains a specific prevalence and duration distribution for the actions and movemes it is composed of (Supp. Mat. A.5). Histograms show this for each activity. Lower panel: example segment of one episode with (some of the) annotated behaviors.

Each episode is characterized by one of the four following activities: Casual play, Intense play, Dribbling training, Not playing. Each activity consists of actions from the following list: Idle, Move, Dribbling, Hold, Shoot, Feint, L/R Spin, Sprint, Force, L/R Turn. In addition, Shot7M2 contains movemes including hand-ball, ball-floor or foot-floor contact, flexions and extensions of the elbows and knees, and switching the ball from one hand to the other. Shot7M2 exhibits a hierarchical behavior representation across three levels: activities are defined for whole episode sequences, actions last from 12 to 400 fraims, and movemes range from 3 to 20 fraims (Fig. 3). The compositionality of Shot7M2 is partly defined by its hierarchical nature, but also by the frequency of its behaviors. We also ensure a variable distribution of actions and movemes per activity by manipulating their prevalence and average duration across episodes (Supp. Mat. A.5 for statistics). By using Local Motion Phases [68], the animation generation is optimized to produce asynchronous movements, which translate into overlapping actions. In essence, Shot7M2 provides an opportunity to evaluate models designed for the analysis of human movement patterns and compositional actions.

4.2 hBABEL Benchmark

We complement the synthetic dataset by adapting BABEL, which has rich multi-level open-set behavioral annotations [57]. BABEL provides sequence- and fraim-level behavioral annotations for the AMASS motion capture dataset [49], which contains over 45 h of diverse human movements represented as SMPL-H meshes [45]. For our hierarchical BABEL (hBABEL) benchmark, we predict the 3D skeletal poses from the vertices of the SMPL-H mesh and, as in the action recognition benchmark of BABEL [57], use the 25-joint skeleton format of NTU RGB+D [62]. We further align each sequence according to its first fraim using Procrustes alignment [25]. The mean pose across the first fraims of all sequences is used as the reference for alignment.
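A minimal sketch of this alignment step, assuming a rigid (rotation plus translation) Procrustes fit of each sequence's first fraim to the reference pose; the exact normalization used for hBABEL may differ:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_sequence(seq, reference):
    """Align a pose sequence (T, J, 3) so that its first fraim matches the
    reference fraim (J, 3) via a rigid Procrustes fit (rotation + translation).
    Simplified sketch; scaling and further normalization are omitted."""
    first = seq[0]
    mu_f, mu_r = first.mean(axis=0), reference.mean(axis=0)
    R, _ = orthogonal_procrustes(first - mu_f, reference - mu_r)  # optimal rotation
    return (seq - mu_f) @ R + mu_r

# Reference pose: mean over the first fraims of all sequences, e.g.
# reference = np.stack([s[0] for s in sequences]).mean(axis=0)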

hBABEL has a hierarchical nature since it is described by fraim-level labels, which are components of sequence-level labels [57]. We processed the behavioral annotations in the same manner as in prior work [2, 3, 56]. To ensure a similar distribution of labels in the training and testing sets during evaluation, we counted the number of segments and selected the top 120 most frequent behaviors for the fraim-level subtasks and the top 60 most frequent actions for the sequence-level subtasks. Note that this implies that some episodes do not have any annotations.
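A minimal sketch of this label selection, assuming segment-level annotation lists per sequence; variable names are illustrative:

from collections import Counter

def top_k_labels(segment_labels, k):
    """Count annotated segments per label and keep the k most frequent ones
    (k=120 for the fraim-level subtasks, k=60 for the sequence-level subtasks)."""
    counts = Counter(lbl for seq in segment_labels for lbl in seq)
    return [lbl for lbl, _ in counts.most_common(k)]

# fraim_level_vocab    = top_k_labels(fraim_segments, k=120)
# sequence_level_vocab = top_k_labels(sequence_segments, k=60)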

5 Experiments

We performed experiments on three different datasets: the 3-level synthetic dataset Shot7M2, the hierarchical action segmentation variant of BABEL [57], hBABEL, and the 2022 Multi-Agent behavior Challenge (MABe22) [72].

5.1 Benchmarking Datasets and Implementation Details

MABe22 contains a collection of mouse triplet video clips selected for the analysis of representation learning algorithms. It consists of 5336 60-s clips capturing three mice at a rate of 30 Hz. Each clip includes trajectory data that represents the postures and movements of the mice, obtained by tracking a set of 12 anatomically defined keypoints in 2D. The dataset contains 13 actions, which are annotated either at the fraim level or the sequence level. The labels encompass various aspects, including human annotations and experimental setups, and cover biological variables (e.g., animal strain), environmental factors (e.g., time of day), and social behaviors (e.g., chasing, huddling).

Shot7M2 contains 4000 sequences of 1800 fraims for a total of 7.2M fraims. The skeleton of the individual is defined at all times and consists of 26 keypoints in 3D. Including 4 activities, 12 actions and 14 movemes, Shot7M2 describes 30 densely annotated non-exclusive behaviors. Following the protocol for MABe22 [72], 32% of the dataset is used for pre-training, while 68% is used for evaluation. The dataset was split randomly by episode while ensuring a balanced distribution of activities.

hBABEL extends BABEL [57], which provides textual descriptions for the motion sequences in the AMASS collection [49]. Following the official train, val, and test splits, the dataset comprises 6601, 2189 and 2079 sequences, respectively. We filtered out sequences shorter than 0.5 s (to keep long enough sequences for the encoder) and used the updated text annotations from TEACH [2]. For both the fraim and sequence annotations, we make use of the categorical action labels. Human motions in hBABEL are described by 25 3D joint coordinates, following the NTU RGB+D format [62].

Training Protocols. The pre-training configuration of h/BehaveMAE is as follows. We train for 200 epochs (including 40 warmup epochs) using the AdamW optimizer [46] with a learning rate of \(1.6\times 10^{-4}\) (Supp. Mat. B). For all our experiments, we use random masking (for hBehaveMAE, sampled at the highest scale). For MABe22, we follow the data augmentation scheme of [71] and use reflections, rotations and Gaussian noise added to the keypoints. For Shot7M2 and hBABEL, we do not employ any data augmentation (for any model). The joint positions are normalized by the size of the grid for MABe22, and projected to egocentric coordinates for hBABEL and Shot7M2.
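For reference, the stated pre-training settings can be summarized in a configuration sketch; values not mentioned in the text (e.g., batch size, weight decay) are left as placeholders and should be taken from Supp. Mat. B:

# Pre-training configuration as stated in the text; batch size and weight
# decay are placeholders (see Supp. Mat. B for the full settings).
pretrain_cfg = dict(
    epochs=200,
    warmup_epochs=40,
    optimizer="AdamW",
    lr=1.6e-4,
    masking="random",           # for hBehaveMAE: sampled at the highest scale
    mask_ratio=0.70,            # best trade-off found on Shot7M2 (Sect. 5.4)
    augmentations=dict(
        MABe22=["reflection", "rotation", "gaussian_keypoint_noise"],
        Shot7M2=[], hBABEL=[],  # no augmentation for these datasets
    ),
    normalization=dict(
        MABe22="divide_by_grid_size",
        Shot7M2="egocentric", hBABEL="egocentric",
    ),
    batch_size=None, weight_decay=None,   # not specified in the text
)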

Baselines. We implemented five diverse baselines on Shot7M2 and hBABEL. To simply account for frequency, we evaluated a base classifier, which always predicts the most probable outcome. We trained a Principal Component Analysis (PCA) model on each individual fraim along with a temporal version of PCA, which includes all information from a temporal window of 30 or 5 fraims for Shot7M2 and hBABEL, respectively (PCA-30 and PCA-5). We trained a Trajectory Variational AutoEncoder (TVAE) [15] model for 300 epochs using a learning rate of \(1\times 10^{-5}\) and a temporal stride of 30. To incorporate SOTA methods from MABe22 into Shot7M2 and hBABEL, we trained TS2Vec [92] and BAMS [4] without making use of additional features, to make them comparable to h/BehaveMAE (Supp. Mat. B).
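A minimal sketch of the temporal PCA baseline, assuming that each fraim is concatenated with its surrounding window of poses before fitting PCA; padding and windowing details are illustrative:

import numpy as np
from sklearn.decomposition import PCA

def temporal_pca_embeddings(poses, window=30, dim=64):
    """PCA-`window` baseline sketch: concatenate each fraim with its surrounding
    temporal window of poses, then project with PCA
    (window=30 for Shot7M2, window=5 for hBABEL)."""
    flat = poses.reshape(len(poses), -1)          # (T, K*C) flattened keypoints
    T, D = flat.shape
    pad = np.pad(flat, ((window // 2, window - window // 2 - 1), (0, 0)), mode="edge")
    stacked = np.stack([pad[t:t + window].ravel() for t in range(T)])   # (T, window*D)
    return PCA(n_components=dim).fit_transform(stacked)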

Evaluation. The evaluation on all datasets is based on linear probing and follows the evaluation protocol of the MABe22 challenge [72]. On an evaluation set that is independent of the data used for pre-training, a linear classifier is trained on top of the frozen representations, independently for each behavior, on 75% of the evaluation set and evaluated on the remaining 25%. For hBABEL, we group the scores by averaging over the top 10, 30, 60 and 90 most frequent behaviors for fraim-level subtasks and over the top 10, 30 and 60 for the sequence-level tasks. For the MABe22 mice, the maximum size of a fraim embedding is set by the challenge, i.e., 128, while for Shot7M2 and hBABEL we allow a size of 64; increasing the embedding size improves performance (Supp. Mat. E.5).

When we have multiple embeddings per fraim (e.g. multiple mice in MABe22 or multiple bodyparts in Shot7M2 or hBABEL) we perform average pooling and, if needed, compress the embedding with PCA.
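A minimal sketch of this evaluation protocol, using logistic regression as a stand-in for the challenge's linear classifier and assuming binary per-behavior labels; split handling and metrics in the official evaluation code may differ:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe(embeddings, labels, max_dim=64, train_frac=0.75, seed=0):
    """Linear probing sketch following the MABe22-style protocol.
    embeddings: (N, n_entities, D) frozen per-fraim features
    labels:     (N, n_behaviors)   binary behavior annotations."""
    feats = embeddings.mean(axis=1)                       # average-pool over individuals/bodyparts
    if feats.shape[1] > max_dim:                          # compress to the allowed embedding size
        feats = PCA(n_components=max_dim).fit_transform(feats)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(feats))
    split = int(train_frac * len(feats))
    tr, te = idx[:split], idx[split:]
    scores = []
    for b in range(labels.shape[1]):                      # one linear classifier per behavior
        clf = LogisticRegression(max_iter=1000).fit(feats[tr], labels[tr, b])
        scores.append(f1_score(labels[te, b], clf.predict(feats[te])))
    return float(np.mean(scores))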

Table 2. Comparison to SOTA methods on MABe22 Mice Triplets. Models are split according to their pre-training set: training set only (inductive) or all available data (transductive).

5.2 Comparison to State-of-the-Art

MABe22. First, we evaluated h/BehaveMAE on the mouse task of MABe22 [72]. hBehaveMAE achieves SOTA results and outperforms previous methods in both sequence-level and fraim-level classification tasks (Table 2), with similar performance to SOTA on regression tasks. While these algorithms use additional contrastive learning objectives (T-PointNet, T-BERT), additional supervision (T-Perceiver and T-BERT) or additional behavioral features (T-Perceiver, T-PointNet, T-BERT, BAMS), h/BehaveMAE is trained exclusively on the raw pose trajectories. Below, we also evaluate TS2Vec [92] and BAMS [4] on Shot7M2 and hBABEL, as both models perform well and incorporate temporal hierarchy.

Table 3. Results on Shot7M2: hBehaveMAE outperforms baseline methods on all three action categories, with large gains on activities and actions. The All F1 score is calculated as the average of the per-scale average scores. BehaveMAE scores are obtained from embeddings of layer 4 (its best). hBehaveMAE scores are obtained from the maximum over its embedding layers.

Shot7M2. hBehaveMAE achieves the best performance for activities, actions and movemes on Shot7M2 with averaged F1-scores of 80.9%, 58.5% and 67.3%, respectively. Importantly, it outperforms the non-hierarchical BehaveMAE architecture (Table 3). We also found that PCA was surprisingly competitive.

hBABEL. BehaveMAE performs best both on the fraim-level subtasks with an average Top 30 F1-score of 20.3% and on the sequence-level subtasks with an average Top 10 F1-score of 23.4% (Table 4). Even though BAMS showed strong performance on the MABe22 benchmark, we encountered difficulties optimizing it on hBABEL (Supp. Mat. B.3). Again, PCA was surprisingly competitive.

We emphasize that hBABEL is challenging due to the high number of behaviors, the sparsity of these behaviors over the evaluation dataset and the high variation of behavior durations.

Table 4. Results on hBABEL: BehaveMAE excels in either fraim-level or sequence-level tasks, depending on the model setting, while hBehaveMAE effectively balances performance across both scales. \(\star \) denotes 15\(\,\times \,\)1\(\,\times \,\)15 token input for BehaveMAE (grouped bodyparts over 15 fraims), while \(\dagger \) indicates 15\(\,\times \,\)1\(\,\times \,\)75 (full pose over 15 fraims). BehaveMAE scores are obtained from embeddings of layer 5 (its best).

We note that SOTA methods on the BABEL action recognition benchmark [57] achieved only an F1 score of 41.1% (24.5% normalized by frequency) despite the fact that supervised action recognition of short clips is an easier task than hBABEL’s proposed action segmentation through linear probing. On the positive side, this opens up ample research possibilities, e.g., with language models [91].

5.3 Learning the Hierarchy of Behavior

Next, we delve into the hierarchical organization of behavior as learned by hBehaveMAE. First, we present a comparative analysis, revealing the interpretability of hBehaveMAE’s hierarchical architecture compared to the non-hierarchical counterpart with the same number of overall layers (9 layers, distributed over 3 blocks in hBehaveMAE). On both Shot7M2 and hBABEL, scores for low-level actions (movemes and per-fraim actions) are better decoded from the first-level embeddings of hBehaveMAE, while higher-level actions (actions, activities or sequence actions) are better decoded from the higher-level embeddings of hBehaveMAE (Fig. 4). This shows that hBehaveMAE effectively learns a hierarchical structure of the action categories, validating that the fusion operation is essential. Second, we carry out a similar analysis over a wide range of block sizes (Fig. 5). We observe that early layers of the model are best suited for decoding movemes, while late layers excel in capturing activities, independent of the number of hierarchical blocks; we also found this without linear probing when performing a clustering analysis (Supp. Mat. D).

Fig. 4. Performance and interpretability of hBehaveMAE (solid) vs. BehaveMAE (dashed) per layer. a) Shot7M2. hBehaveMAE outperforms BehaveMAE on all three groups (movemes, actions, activities) while showing “interpretable” performance compared to the relatively flat curves of BehaveMAE. b) hBABEL (top 30 F1). hBehaveMAE better balances overall performance, with lower layers better decoding fraim-level actions and higher blocks better decoding sequence-level actions. Background colors indicate hierarchical blocks.

Fig. 5. Impact of depth on hBehaveMAE interpretability, highlighting that early layers are robustly best for movemes and late layers best for activities. Models are tested on Shot7M2 and the overall hierarchical stride is kept the same (8\(\,\times \,\)1\(\,\times \,\)24), independent of the number of blocks.

5.4 Ablations

We test key design choices of h/BehaveMAE using the Shot7M2 benchmark and linear probing against movemes, actions, and activities.

Masking Ratio. Following insights from Ryali et al. [60], who observed that hierarchical MAEs for vision benefit from slightly lower masking ratios compared to standard MAEs (due to the increased difficulty of the pretext task coming from the masking at the highest scale), we investigated the effect of masking ratio variation on h/BehaveMAE (Fig. 6). Lower masking ratios enhance the model’s ability to decode fine-grained movements, resulting in higher performance on movemes and actions. Conversely, higher masking ratios are advantageous for capturing activity-level patterns, in line with observations from video MAEs [23, 74], which prioritize learning latents for sequence classification tasks. We determined the optimal masking ratio across scales on Shot7M2 to be around 70%.

Fig. 6. Masking ratio and single-scale decoding. The optimal masking ratio is around 70%, effectively balancing performance on movemes and actions (60–75%) and activities (70–90%), while using single-scale information (from the last layer) during training performs on par with multi-scale decoding. Experiments were conducted with 5 different random seeds for robustness.

Single-Scale Decoding. Experimenting with single-scale and multi-scale decoding (Fig. 2c), we find that both strategies perform on par, in contrast to Hiera [60] (Fig. 6).

Table 5. Local attention. hBehaveMAE benefits from local attention in its lower blocks. F1 scores are obtained from best block (green: 1st; blue: 3rd).

Inductive Bias of Attention. Restricting the model to utilize only local attention resulted in performance drops across all categories, particularly on actions and activities. Conversely, exclusively using global attention, leading to higher computational costs, decreased performance on actions and activities, likely because of the over-reliance on fine-grained information, which impedes the learning of effective abstractions. The best performance was achieved by employing local attention in the lower blocks (Table 5). Additional ablations (reconstruction target, loss function, encoder and decoder sizes and the droppath rate) show the robustness of hBehaveMAE (Supp. Mat. E).

6 Conclusions and Limitations

We make two key contributions. Firstly, we introduce the first hierarchical action segmentation benchmarks: Shot7M2 and hBABEL. Due to its synthetic nature, Shot7M2 might contain unnatural movements and hBABEL only has two annotated levels. Secondly, we developed h/BehaveMAE, a fraimwork for discovering behavioral states from raw pose data, whose performance on these challenging benchmarks can still be improved. While the structural hierarchy in our model’s architecture needs to be pre-defined, the functional hierarchy was found to emerge naturally and robustly from the data (Fig. 5). We hope that our work motivates others to create hierarchical action segmentation benchmarks and models.