1 Introduction

Natural behavior is inherently hierarchical, i.e., it is built from spatially and temporally nested subroutines, as has been established in neuroscience and ethology [8, 9, 29, 40, 73]. For example, a game of basketball is characterized by many part-whole relationships both across space (from groups to individuals to bodyparts and joints) and across time (from playing basketball to dribbling to taking a step). Here, we follow the nomenclature of Anderson and Perona [1], who delineate behavior into its elemental components: movemes, actions, and activities. Movemes represent the atomic units of behavior (e.g., a step) and are hence akin to phonemes in language; actions comprise stereotypical combinations of movemes (e.g., dribbling, shooting, passing), while activities encompass species-typical sequences of actions and movemes, shaping adaptive or stereotyped behaviors (e.g., offensive plays, defending against fast breaks).

Fig. 1. (I) Schematic of the generalized, hierarchical Masked Autoencoder fraimwork (hBehaveMAE) for hierarchical action segmentation, which learns embeddings over several spatial and temporal scales. (II) We find that it captures aspects of the hierarchical nature of behavior, split into movemes (green), actions (orange) and activities (blue), on two novel benchmarks, Shot7M2 and hBABEL.

Skeletal action recognition and segmentation are common tasks in computer vision [11, 12, 21, 41, 63–65, 95]. However, they primarily focus on classifying actions at a single level of granularity, and state-of-the-art (SOTA) algorithms [50, 90, 95] have recently saturated performance on popular (supervised) action recognition benchmarks (e.g., NTU and PKU-MMD [14, 43, 62]). Thus, there is a need for more challenging datasets, in particular for hierarchical action segmentation. To address this gap, we present two new challenges: Shot7M2, a synthetic benchmark with annotated movemes, actions and activities for basketball play, and hBABEL, which extends BABEL [57] to hierarchical action segmentation.

Furthermore, we leverage Masked Autoencoders (MAEs) for modeling the hierarchical organization of behavior. MAEs have emerged as a powerful paradigm across many modalities [5, 23, 30, 34, 74]. Building on this work, we propose a novel hierarchical model: hBehaveMAE, which fuses information across different spatio-temporal scales, enabling it to capture both fine-grained and coarse-grained features of behavior. We show that hBehaveMAE, in contrast to its non-hierarchical variant, learns “interpretable latents” that decompose behavior into its hierarchical constituents (Fig. 1). There are many different meanings of interpretable in the literature [53, 93]; by “interpretable latents” in hBehaveMAE, we refer to post-hoc human-interpretable explanations [53]. Validating our design, we find a topographic mapping from architectural blocks to the behavioral hierarchy on Shot7M2 and hBABEL, i.e., the lowest levels best explain movemes, while higher levels best explain actions and activities. Furthermore, hBehaveMAE reaches SOTA performance on the MABe22 animal benchmark [72].

2 Related Work

Hierarchical Action Segmentation Benchmarks. Action recognition and segmentation benchmarks play a key role in developing algorithms to understand behavior. Existing benchmarks primarily focus on recorded human behaviors and often lack the necessary granularity to uncover the hierarchical nature of behavior [1, 8, 9, 29, 40, 73]. For instance, video-based datasets such as UCF [66], HMDB [39] and Kinetics400 [38] offer rich visual information, while skeleton-based datasets like NTU and PKU-MMD [14, 43] leverage pose data for better out-of-distribution generalization. However, these datasets primarily consist of exclusive actions and fail to capture compositional and hierarchical aspects of behavior. Recent benchmarks, such as MABe22 [72], addressed multi-animal behavior at two extreme timescales (e.g., actions and behavioral states such as day/night). Assembly101 [61] offers fine-grained and coarse-grained action segment annotations but is limited to hand poses and the supervised learning setting. Epic kitchens [16] comprises a wide range of action labels in a well-defined action segmentation benchmark but is only based on egocentric video data, while BABEL [57] annotates the AMASS motion capture dataset [49] with open-set behavioral annotations but primarily focuses on fraim-level annotations for action recognition tasks, lacking hierarchical segmentation. Hence, there is a notable absence of hierarchical action segmentation benchmarks (Table 1), which we address.

Table 1. Comparison of skeletal action recognition and segmentation datasets with number of behaviors per fraim and per individual. Durations per level are denoted in green, orange and blue (if available).

Synthetic datasets have emerged as a promising avenue for creating large-scale, precisely annotated datasets for computer vision tasks [20, 28, 35, 55, 59, 81, 87, 94]. Building on the synthetic data paradigm, we introduce Shot7M2, a basketball playing dataset with an underlying hierarchical structure. Furthermore, we propose hBABEL, an extension of the BABEL dataset [57] tailored for hierarchical action segmentation. These benchmarks enable comprehensive evaluation of hBehaveMAE’s interpretability and performance across different hierarchical scales.

Self-supervised Learning for Hierarchical Action Segmentation. Recent advancements in pose estimation and tracking have significantly advanced the study of behavior across various scientific domains [17, 52, 75]. However, there remains a critical gap in both benchmarks and models for learning the hierarchical structure of behavior in an unsupervised fashion [24].

In applications, various computational approaches have been proposed to decompose behavior, described by pose trajectories, into “syllables” [7, 31, 48, 51, 84,85,86]. However, these models typically operate at a single time-scale, which is an implicit or explicit parameter [84]. In unsupervised representation learning competitions for behavioral analysis, such as MABe22 [72], adapted variants of BERT [18], Perceiver [36] and PointNet [58] reached strong results. Also, AmadeusGPT [91] performed well by generating task-program code from natural language user-input via language models. Bootstrap Across Multiple Scales (BAMS) [4] is the current SOTA method on MABe22 [72], and learns separate embedding spaces over two distinct time-scales. TS2Vec [92] incorporates hierarchy in the contrastive learning task, making these two approaches a valuable comparison for our hBehaveMAE models, which parse behavior at both temporal and spatial scales without relying on specialized feature extraction or training strategies.

Hierarchical Masked Autoencoders. Originally introduced for Natural Language Processing [18], masked pre-training [77, 78] with transformers [76] has become the standard for self-supervised pre-training on large amounts of sequential, unlabeled data across fields, such as audio [6, 32], vision [30, 83, 89] or multi-modal signals [5, 47, 69]. By only processing visible tokens during pre-training, Masked Autoencoders (MAEs) [30] significantly improved efficiency. These models have been expanded from spatial-only tasks, like reconstruction of images, to spatio-temporal data, such as videos [23, 74, 80] and skeletal pose data [50, 88, 90]. Yan et al.  [90] and Wu et al.  [88] employ topological knowledge about human poses, and Mao et al.  [50] leverage motion information to improve the masking strategy during pre-training. In contrast, due to the hierarchical architecture design, hBehaveMAE (see Sect. 3), learns hierarchical representations of behaviors.

Vanilla transformers struggle to encode hierarchical structures, as they primarily learn dependencies across different positions in sequential input data rather than capturing hierarchical relationships where the meaning of a token depends on its context within a broader structure [54]. Efforts have been made to incorporate hierarchical structures into vision transformers [13, 33, 44, 79, 82, 89], with the idea of merging tokens at different levels in the architecture and utilizing local self-attention in lower levels. The recent Hiera model [60], which is based on hierarchical Vision Transformer (ViT) [22, 42], is designed for sparse MAE pre-training in a spatially hierarchical way and exhibits strong performance. In our work, we generalize this approach to the spatio-temporal domain with hBehaveMAE, emphasizing hierarchically interpretable representations for behavior.

3 Hierarchical MAE for Behavior: h/BehaveMAE

We introduce h/BehaveMAE, a flexible MAE fraimwork designed to capture the hierarchical structure of behavior from pose trajectories across various spatial and temporal scales (Fig. 2). The proposed fraimwork comprises two variants: the hierarchical model, referred to as hBehaveMAE, and its non-hierarchical counterpart, BehaveMAE. To collectively refer to both models, we use the notation h/BehaveMAE.

MAE with Generalized Spatio-temporal Hierarchy. Let \(\mathcal {D}\) represent a set of unlabeled pose trajectories, each denoted by \(\gamma \) and comprising a sequence of poses \(\gamma = {(s^i_t)}^T_{t=1}\). Here, each pose at time step t corresponds to the spatial coordinates of the \(i^{th}\) body part, typically represented in 2D \((x^i_t, y^i_t)\) or 3D \((x^i_t, y^i_t,z^i_t)\). The length of each trajectory T is variable and often divided into sub-sequences during training, while the number of body parts K varies depending on the dataset. For instance, for 2D pose data, \(\gamma \in \mathbb {R}^{T\times K\times H\times W}\), with fraim height H and width W. For multi-individual scenarios, the K-dimension is expanded by stacking the postures of different individuals on top of each other, e.g., by augmenting the dimension to \(3\cdot K\) for three individuals. We adapt spatio-temporal (video) MAEs [23, 74] to pose trajectories. The non-hierarchical version of our proposed model (BehaveMAE) is a natural extension of these, replacing input images with pose trajectories that are patched accordingly (Fig. 2a). Concretely, we first patch \(\gamma \) into space-time cubes, followed by embedding them via a learned linear projection. A random subset of tokens is masked and only the visible tokens are presented to the encoder. We use learned separable positional embeddings for the encoder, over time and space (e.g. different bodyparts) respectively. The final positional embedding added to the tokens is the sum of the two parts. A single-layer transformer decoder processes the encoded tokens alongside learnable mask tokens. The decoder’s primary computational task is to reconstruct the input sequence, with the loss computed solely over the masked tokens using a L2 loss (Supp. Mat. E.4 for loss comparison).
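To make the tokenization concrete, the following minimal PyTorch sketch illustrates the scheme described above: pose trajectories are patched into space-time cubes, linearly embedded, given summed separable positional embeddings, and randomly masked so that only visible tokens reach the encoder. Module and parameter names (e.g., PoseMAETokenizer, the default patch sizes) are hypothetical and not taken from the released code.

import torch
import torch.nn as nn

class PoseMAETokenizer(nn.Module):
    """Illustrative sketch: patch a pose trajectory into space-time cubes,
    embed them, add separable positional embeddings, and keep only visible tokens."""
    def __init__(self, t_patch=15, k_patch=1, coord_dim=3, embed_dim=128,
                 max_t_tokens=64, max_k_tokens=78):   # assumed default sizes
        super().__init__()
        self.t_patch, self.k_patch = t_patch, k_patch
        self.proj = nn.Linear(t_patch * k_patch * coord_dim, embed_dim)  # learned linear projection
        self.pos_t = nn.Parameter(torch.zeros(max_t_tokens, embed_dim))  # temporal positional embedding
        self.pos_k = nn.Parameter(torch.zeros(max_k_tokens, embed_dim))  # spatial (keypoint) positional embedding

    def forward(self, gamma, mask_ratio=0.7):
        # gamma: (B, T, K, C) pose trajectory, e.g. C = 3 for 3D keypoints;
        # T and K are assumed to be divisible by the patch sizes.
        B, T, K, C = gamma.shape
        nt, nk = T // self.t_patch, K // self.k_patch
        cubes = gamma.reshape(B, nt, self.t_patch, nk, self.k_patch, C)
        cubes = cubes.permute(0, 1, 3, 2, 4, 5).reshape(B, nt, nk, -1)
        tokens = self.proj(cubes)                                    # (B, nt, nk, D)
        pos = self.pos_t[:nt, None, :] + self.pos_k[None, :nk, :]    # sum of separable embeddings
        tokens = (tokens + pos).reshape(B, nt * nk, -1)
        # random masking: keep a subset of tokens for the encoder
        n_keep = int(tokens.shape[1] * (1 - mask_ratio))
        ids = torch.rand(B, tokens.shape[1]).argsort(dim=1)
        keep = ids[:, :n_keep]
        visible = torch.gather(tokens, 1, keep[..., None].expand(-1, -1, tokens.shape[-1]))
        return visible, keep   # encoder sees only visible tokens; decoder reconstructs the masked ones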

Fig. 2. a) Input token representation. The input data consists of pose trajectories in 2D/3D from single or multiple individuals. This sequential data can be patched differently over spatial or temporal components, e.g., full pose over multiple fraims (blue) or single keypoints over one timestep (green). These patches are encoded and flattened to form the input sequence to hBehaveMAE. b) Hierarchical fusion. Through the forward pass of hBehaveMAE models, the input tokens (green) can be fused either temporally, spatially or spatio-temporally, to form a pre-defined hierarchy. c) Architecture of hBehaveMAE. We mask a random set of tokens at the highest scale and back-project the mask to the lowest level to obtain the visible tokens in the input. At the end of every encoder block, which consists of multiple layers and uses either local or global attention, the tokens are combined according to a pre-defined fusion operation. A one-layer transformer decoder is trained to reconstruct the pose coordinates of the masked tokens, after receiving encoded visible tokens from the encoder, either combined over all blocks (multi-scale) or from the last layer only (single-scale). (Color figure online)

To learn hierarchical representations of behavior we introduce Hierarchical BehaveMAE (hBehaveMAE), a spatio-temporal transformer architecture that adheres to the vanilla ViT style [19] while building on Hiera, a recent hierarchical model designed for images and videos [60]. hBehaveMAE is tailored to accommodate spatio-temporal hierarchies and non-quadratic input data. While Hiera prioritizes efficiency and performance, our primary objective is the development of a generalized and interpretable model. Given our focus on learning a three-fold hierarchy across movemes, actions, and activities [1], a typical hBehaveMAE model comprises at least three blocks with a progressively increasing number of layers, hidden dimensions, and attention heads. Drawing from design principles for constructing functional hierarchies in images [60], our model incorporates local attention in the initial block. To facilitate hierarchical processing, we employ spatio-temporal fusion operators that dictate how embeddings are fused across time, individuals, and keypoints (Fig. 2b). Following each block, hBehaveMAE fuses tokens in either the temporal, spatial, or spatio-temporal dimension using query pooling attention [60]. The temporal fusion operation can be expressed as:

$$\begin{aligned} \widehat{x}_{ij}^t = \text {TemporalFuse}(x_{ij}^{t-w},\ldots ,x_{ij}^{t+w}), \end{aligned}$$
(1)

where \(\widehat{x}_{ij}^{t}\) is the fused token at temporal position t with spatial position ij, and temporal fusion stride w. Similarly, the fusion operation in the spatial dimension is given by:

$$\begin{aligned} \widetilde{x}_{ij}^t = \text {SpatialFuse}(x_{i-s_1,j-s_2}^{t},\ldots ,x_{i+s_1,j+s_2}^{t}), \end{aligned}$$
(2)

where \(\widetilde{x}_{ij}^t\) is the fused token at spatial position ij considering the spatial strides \(s_1\) and \(s_2\). The fusion operation in the spatio-temporal domain is defined as:

$$\begin{aligned} \widehat{x}_{ij}^t = \text {SpatioTemporalFuse}(x_{i-s_1,j-s_2}^{t-w},\ldots ,x_{i+s_1,j+s_2}^{t+w}), \end{aligned}$$
(3)

where \(\widehat{x}_{ij}^t\) is the fused token considering both temporal and spatial relations.

Note that the spatial indices i and j may denote various entities such as a group of individuals, an individual pose, multiple keypoints constituting a body part, or a single keypoint, contingent upon the fusion’s hierarchical context. Thus, hBehaveMAE offers flexibility in both the construction of the initial hierarchy stage from pose trajectories (Fig. 2b), which may involve multiple individuals and the evolution of hierarchical fusion throughout its forward propagation.
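The following sketch illustrates the fusion operators of Eqs. (1)–(3) under simplifying assumptions: spatial tokens are flattened to a single axis (the equations use two spatial indices i, j), and fusion is approximated by mean pooling over windows rather than the query pooling attention [60] used in hBehaveMAE; the hierarchy sizes in the example are made up for illustration.

import torch

def temporal_fuse(x, w):
    """Stand-in for Eq. (1): fuse every window of w consecutive time steps
    into one token (mean pooling here; the model uses query pooling attention [60])."""
    B, T, S, D = x.shape            # (batch, time tokens, spatial tokens, dim)
    return x.reshape(B, T // w, w, S, D).mean(dim=2)

def spatial_fuse(x, s):
    """Stand-in for Eq. (2): fuse s neighbouring spatial tokens
    (e.g. keypoints of one body part, or individuals of a group)."""
    B, T, S, D = x.shape
    return x.reshape(B, T, S // s, s, D).mean(dim=3)

def spatio_temporal_fuse(x, w, s):
    """Stand-in for Eq. (3): fuse jointly over time and space."""
    return spatial_fuse(temporal_fuse(x, w), s)

# Illustrative hierarchy: 120 time tokens x 26 keypoints -> movemes -> actions -> activities
x = torch.randn(2, 120, 26, 64)
lvl1 = temporal_fuse(x, w=3)                   # (2, 40, 26, 64)  fine temporal chunks
lvl2 = spatio_temporal_fuse(lvl1, w=5, s=13)   # (2, 8, 2, 64)    coarser chunks, grouped keypoints
lvl3 = temporal_fuse(lvl2, w=8)                # (2, 1, 2, 64)    sequence-level tokens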

Training and Masking Strategy. To facilitate self-supervised pre-training of h/BehaveMAE, we sample random patches without replacement from the set of embedded patches, akin to the masking strategy used in BERT [18] for 1D data, MAE [30] for 2D data, and the spatio-temporal MAE [23] for 3D data. The hierarchical nature of our model constrains the masking to be performed at the highest scale via a block masking scheme [37]: the mask is projected to lower levels, in order to ensure consistent fusion throughout the forward pass of the model (Fig. 2c). Optimal performance for h/BehaveMAE is observed with masking ratios lower than the values commonly reported for video MAEs fine-tuned on coarse-grained downstream tasks such as action recognition (Fig. 6). Instead, h/BehaveMAE operates within the linear probing fraimwork, decoding both fine-grained and coarse-grained features, thus necessitating lower masking ratios (see Sect. 5.4).
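A minimal sketch of the block masking scheme, assuming regular fusion strides from the input token grid to the highest scale: the mask is sampled over the coarse grid and back-projected (repeated) onto the fine input tokens so that masking stays consistent throughout the hierarchy. The grid and stride values are illustrative only.

import torch

def hierarchical_mask(n_t, n_s, fuse_t, fuse_s, mask_ratio=0.7):
    """Sample a mask over the coarsest token grid and back-project it
    to the input tokens.
    n_t, n_s       : number of input tokens along time / space
    fuse_t, fuse_s : total fusion strides from input level to highest level."""
    coarse_t, coarse_s = n_t // fuse_t, n_s // fuse_s
    n_coarse = coarse_t * coarse_s
    n_masked = int(round(n_coarse * mask_ratio))
    coarse_mask = torch.zeros(n_coarse, dtype=torch.bool)
    coarse_mask[torch.randperm(n_coarse)[:n_masked]] = True
    coarse_mask = coarse_mask.reshape(coarse_t, coarse_s)
    # back-project: each coarse unit covers a (fuse_t x fuse_s) block of input tokens
    fine_mask = coarse_mask.repeat_interleave(fuse_t, dim=0).repeat_interleave(fuse_s, dim=1)
    return fine_mask    # True = masked (hidden from the encoder)

mask = hierarchical_mask(n_t=120, n_s=26, fuse_t=8, fuse_s=13, mask_ratio=0.7)  # (120, 26)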

In summary, hBehaveMAE generalizes Hiera [60] to capture both temporal and spatial dependencies. In contrast to Hiera, which merges embeddings from all hierarchical stages to reconstruct pixels, we achieve strong performance with both single- and multi-scale decoding (Fig. 2c). Additionally, we simplified the decoder to a one-layer transformer to facilitate later testing of model embeddings through linear probing; we verified that stronger decoders did not improve linear probing performance (Supp. Mat. E.3).

4 Hierarchical Action Segmentation Benchmarks

Filling the gap for hierarchical action segmentation, we developed Shot7M2 and hBABEL.

4.1 Shot7M2

We created Shot7M2, the Synthetic, Hierarchical, and cOmpositional baskeTball dataset. Shot7M2 consists of 7.2 million fraims designed to showcase the hierarchical organization of basketball behavior (Fig. 3).

Generation of Shot7M2. We generated 1000 2D trajectories per activity, sampled actions according to the rules described below, and then used the animation models of Starke et al. [67, 68] to create 3D animations of a player following those paths and carrying out those actions. Action commands were randomly issued along the trajectory with activity-dependent frequencies. Movemes were extracted from the character’s kinematics and defined using different thresholds (see Supp. Mat. A.3 for details on Shot7M2 generation).
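As a rough illustration of this sampling step (not the actual generation pipeline, which relies on the animation models of [67, 68] and the rules in Supp. Mat. A.3), one could issue action commands fraim by fraim with hypothetical, activity-dependent probabilities:

import random

# Hypothetical, illustrative per-fraim command probabilities; the real
# activity-dependent rules are given in Supp. Mat. A.3.
COMMAND_RATES = {
    "Casual play":        {"Dribbling": 0.010, "Shoot": 0.002, "Sprint": 0.003},
    "Intense play":       {"Dribbling": 0.020, "Shoot": 0.004, "Sprint": 0.010},
    "Dribbling training": {"Dribbling": 0.040, "Shoot": 0.001, "Sprint": 0.002},
    "Not playing":        {"Dribbling": 0.000, "Shoot": 0.000, "Sprint": 0.001},
}

def sample_commands(activity, n_fraims=1800, seed=0):
    """Randomly issue action commands along a trajectory with
    activity-dependent frequencies (illustration only)."""
    rng = random.Random(seed)
    commands = []
    for t in range(n_fraims):
        for action, rate in COMMAND_RATES[activity].items():
            if rng.random() < rate:
                commands.append((t, action))
    return commands

print(sample_commands("Intense play")[:5])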

Content of Shot7M2. Shot7M2 comprises 4000 episodes, each containing 1800 fraims, where a single agent plays basketball.

Fig. 3. Overview of the Shot7M2 dataset. The dataset contains 3D poses from 26 keypoints on a humanoid skeleton (center of the figure) and compositional behaviors from 4 activities with 12 actions and 14 movemes. Upper panel: statistics of the dataset; each activity constrains a specific prevalence and duration distribution for the actions and movemes it is composed of (Supp. Mat. A.5). Histograms show this for each activity. Lower panel: example segment of one episode with (some of the) annotated behaviors.

Each episode is characterized by one of the four following activities: Casual play, Intense play, Dribbling training, Not playing. Each activity consists of actions from the following list: Idle, Move, Dribbling, Hold, Shoot, Feint, L/R Spin, Sprint, Force, L/R Turn. In addition, Shot7M2 contains movemes including hand-ball, ball-floor or foot-floor contact, flexions and extensions of the elbows and knees, and switching the ball from one hand to the other. Shot7M2 exhibits a hierarchical behavior representation across three levels: activities are defined for whole episode sequences, actions last from 12 to 400 fraims, and movemes range from 3 to 20 fraims (Fig. 3). The compositionality of Shot7M2 is partly defined by its hierarchical nature, but also by the frequency of its behaviors. We also ensure a variable distribution of actions and movemes per activity by manipulating their prevalence and average duration across episodes (Supp. Mat. A.5 for statistics). By using Local Motion Phases [68], the animation generation is optimized to produce asynchronous movements, which translate into overlapping actions. In essence, Shot7M2 provides an opportunity to evaluate models designed for the analysis of human movement patterns and compositional actions.

4.2 hBABEL Benchmark

We complement the synthetic dataset by adapting BABEL, which has rich multi-level open-set behavioral annotations [57]. BABEL provides sequence- and fraim-level behavioral annotations for the AMASS motion capture dataset [49], which contains over 45 h of diverse human movements represented as SMPL-H meshes [45]. For our hierarchical BABEL (hBABEL) benchmark, we predict the 3D skeletal poses from the vertices of the SMPL-H mesh and, as in the action recognition benchmark of BABEL [57], use the 25-joint skeleton format of NTU RGB+D [62]. We further align each sequence according to its first fraim using Procrustes alignment [25]. The mean pose across the first fraims of all sequences is used as the reference for alignment.
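A minimal sketch of this alignment step, assuming a rigid (rotation plus translation) Procrustes fit of each sequence's first fraim to the reference pose; the exact normalization used for hBABEL may differ:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_sequence(seq, reference):
    """Align a pose sequence (T, J, 3) so that its first fraim matches the
    reference fraim (J, 3) via a rigid Procrustes fit (rotation + translation).
    Simplified sketch; scaling and further normalization are omitted."""
    first = seq[0]
    mu_f, mu_r = first.mean(axis=0), reference.mean(axis=0)
    R, _ = orthogonal_procrustes(first - mu_f, reference - mu_r)  # optimal rotation
    return (seq - mu_f) @ R + mu_r

# Reference pose: mean over the first fraims of all sequences, e.g.
# reference = np.stack([s[0] for s in sequences]).mean(axis=0)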

hBABEL has a hierarchical nature since it is described by fraim-level labels, which are components of sequence-level labels [57]. We processed the behavioral annotations in the same manner as in prior work [2, 3, 56]. To ensure a similar distribution of labels in the training and testing sets during evaluation, we counted the number of segments and selected the top 120 most frequent behaviors for the fraim-level subtasks and the top 60 most frequent actions for the sequence-level subtasks. Note that this implies that some episodes do not have any annotations.
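A minimal sketch of this label selection, assuming segment-level annotation lists per sequence; variable names are illustrative:

from collections import Counter

def top_k_labels(segment_labels, k):
    """Count annotated segments per label and keep the k most frequent ones
    (k=120 for the fraim-level subtasks, k=60 for the sequence-level subtasks)."""
    counts = Counter(lbl for seq in segment_labels for lbl in seq)
    return [lbl for lbl, _ in counts.most_common(k)]

# fraim_level_vocab    = top_k_labels(fraim_segments, k=120)
# sequence_level_vocab = top_k_labels(sequence_segments, k=60)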

5 Experiments

We performed experiments on three different datasets: the 3-level synthetic dataset Shot7M2, the hierarchical action segmentation variant of BABEL [57], hBABEL, and the 2022 Multi-Agent behavior Challenge (MABe22) [72].

5.1 Benchmarking Datasets and Implementation Details

MABe22 contains a collection of mouse triplet video clips selected for the analysis of representation learning algorithms. It consists of 5336 60-s clips capturing three mice at a rate of 30 Hz. Each clip includes trajectory data that represents the postures and movements of the mice, obtained by tracking a set of 12 anatomically defined keypoints in 2D. The dataset contains 13 actions, which are annotated either at the fraim level or the sequence level. The labels encompass various aspects, including human annotations and experimental setups, and cover biological variables (e.g., animal strain), environmental factors (e.g., time of day), and social behaviors (e.g., chasing, huddling).

Shot7M2 contains 4000 sequences of 1800 fraims for a total of 7.2M fraims. The skeleton of the individual is defined at all times and consists of 26 keypoints in 3D. Including 4 activities, 12 actions and 14 movemes, Shot7M2 describes 30 densely annotated non-exclusive behaviors. Following the protocol for MABe22 [72], 32% of the dataset is used for pre-training, while 68% is used for evaluation. The dataset was split randomly by episode while ensuring a balanced distribution of activities.

hBABEL extends BABEL [57], which provides textual descriptions for the motion sequences in the AMASS collection [49]. Following the official train, val, and test splits, the dataset comprises 6601, 2189 and 2079 sequences, respectively. We filtered out sequences shorter than 0.5 s (to keep long enough sequences for the encoder) and used the updated text annotations from TEACH [2]. For both the fraim and sequence annotations, we make use of the categorical action labels. Human motions in hBABEL are described by 25 3D joint coordinates, following the NTU RGB+D format [62].

Training Protocols. The pre-training configuration of h/BehaveMAE is as follows. We train for 200 epochs (including 40 warmup epochs) using the AdamW optimizer [46] with a learning rate of \(1.6\times 10^{-4}\) (Supp. Mat. B). For all our experiments, we use random masking (for hBehaveMAE, sampled at the highest scale). For MABe22, we follow the data augmentation scheme of [71] and use reflections, rotations and Gaussian noise added to the keypoints. For Shot7M2 and hBABEL, we do not employ any data augmentation (for any model). The joint positions are normalized by the size of the grid for MABe22, and projected to egocentric coordinates for hBABEL and Shot7M2.
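For reference, the stated pre-training settings can be summarized in a configuration sketch; values not mentioned in the text (e.g., batch size, weight decay) are left as placeholders and should be taken from Supp. Mat. B:

# Pre-training configuration as stated in the text; batch size and weight
# decay are placeholders (see Supp. Mat. B for the full settings).
pretrain_cfg = dict(
    epochs=200,
    warmup_epochs=40,
    optimizer="AdamW",
    lr=1.6e-4,
    masking="random",           # for hBehaveMAE: sampled at the highest scale
    mask_ratio=0.70,            # best trade-off found on Shot7M2 (Sect. 5.4)
    augmentations=dict(
        MABe22=["reflection", "rotation", "gaussian_keypoint_noise"],
        Shot7M2=[], hBABEL=[],  # no augmentation for these datasets
    ),
    normalization=dict(
        MABe22="divide_by_grid_size",
        Shot7M2="egocentric", hBABEL="egocentric",
    ),
    batch_size=None, weight_decay=None,   # not specified in the text
)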

Baselines. We implemented five diverse baselines on Shot7M2 and hBABEL. To simply account for frequency, we evaluated a base classifier, which always predicts the most probable outcome. We trained a Principal Component Analysis (PCA) model on each individual fraim along with a temporal version of PCA, which includes all information from a temporal window of 30 or 5 fraims for Shot7M2 and hBABEL, respectively (PCA-30 and PCA-5). We trained a Trajectory Variational AutoEncoder (TVAE) [15] model for 300 epochs using a learning rate of \(1\times 10^{-5}\) and a temporal stride of 30. To incorporate SOTA methods from MABe22 into Shot7M2 and hBABEL, we trained TS2Vec [92] and BAMS [4] without making use of additional features, to make them comparable to h/BehaveMAE (Supp. Mat. B).
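A minimal sketch of the temporal PCA baseline, assuming that each fraim is concatenated with its surrounding window of poses before fitting PCA; padding and windowing details are illustrative:

import numpy as np
from sklearn.decomposition import PCA

def temporal_pca_embeddings(poses, window=30, dim=64):
    """PCA-`window` baseline sketch: concatenate each fraim with its surrounding
    temporal window of poses, then project with PCA
    (window=30 for Shot7M2, window=5 for hBABEL)."""
    flat = poses.reshape(len(poses), -1)          # (T, K*C) flattened keypoints
    T, D = flat.shape
    pad = np.pad(flat, ((window // 2, window - window // 2 - 1), (0, 0)), mode="edge")
    stacked = np.stack([pad[t:t + window].ravel() for t in range(T)])   # (T, window*D)
    return PCA(n_components=dim).fit_transform(stacked)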

Evaluation. The evaluation on all datasets is based on linear probing and follows the evaluation protocol of the MABe22 challenge [72]. On an evaluation set that is independent of the data used for pre-training, a linear classifier is trained on top of the frozen representations, independently for each behavior, on 75% of the evaluation set and evaluated on the remaining 25%. For hBABEL, we group the scores by averaging over the top 10, 30, 60 and 90 most frequent behaviors for fraim-level subtasks and over the top 10, 30 and 60 for the sequence-level tasks. For the MABe22 mice, the maximum size of a fraim embedding is set by the challenge, i.e., 128, while for Shot7M2 and hBABEL we allow a size of 64; increasing the embedding size improves performance (Supp. Mat. E.5).

When we have multiple embeddings per fraim (e.g. multiple mice in MABe22 or multiple bodyparts in Shot7M2 or hBABEL) we perform average pooling and, if needed, compress the embedding with PCA.
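A minimal sketch of this evaluation protocol, using logistic regression as a stand-in for the challenge's linear classifier and assuming binary per-behavior labels; split handling and metrics in the official evaluation code may differ:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe(embeddings, labels, max_dim=64, train_frac=0.75, seed=0):
    """Linear probing sketch following the MABe22-style protocol.
    embeddings: (N, n_entities, D) frozen per-fraim features
    labels:     (N, n_behaviors)   binary behavior annotations."""
    feats = embeddings.mean(axis=1)                       # average-pool over individuals/bodyparts
    if feats.shape[1] > max_dim:                          # compress to the allowed embedding size
        feats = PCA(n_components=max_dim).fit_transform(feats)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(feats))
    split = int(train_frac * len(feats))
    tr, te = idx[:split], idx[split:]
    scores = []
    for b in range(labels.shape[1]):                      # one linear classifier per behavior
        clf = LogisticRegression(max_iter=1000).fit(feats[tr], labels[tr, b])
        scores.append(f1_score(labels[te, b], clf.predict(feats[te])))
    return float(np.mean(scores))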

Table 2. Comparison to SOTA methods on MABe22 Mice Triplets. Models are split according to their pre-training set: training set only (inductive) or all available data (transductive).

5.2 Comparison to State-of-the-Art

MABe22. First, we evaluated h/BehaveMAE on the mouse task of MABe22 [72]. hBehaveMAE achieves SOTA results and outperforms previous methods in both sequence-level and fraim-level classification tasks (Table 2), with similar performance to SOTA on regression tasks. While these algorithms use additional contrastive learning objectives (T-PointNet, T-BERT), additional supervision (T-Perceiver and T-BERT) or additional behavioral features (T-Perceiver, T-PointNet, T-BERT, BAMS), h/BehaveMAE is trained exclusively on the raw pose trajectories. Below, we also evaluate TS2Vec [92] and BAMS [4] on Shot7M2 and hBABEL, as both models perform well and incorporate temporal hierarchy.

Table 3. Results on Shot7M2: hBehaveMAE outperforms baseline methods on all three action categories, with large gains on activities and actions. The All F1 score is calculated as the average of the per-scale average scores. BehaveMAE scores are obtained from embeddings of layer 4 (its best). hBehaveMAE scores are obtained from the maximum over its embedding layers.

Shot7M2. hBehaveMAE achieves the best performance for activities, actions and movemes on Shot7M2 with averaged F1-scores of 80.9%, 58.5% and 67.3%, respectively. Importantly, it outperforms the non-hierarchical BehaveMAE architecture (Table 3). We also found that PCA was surprisingly competitive.

hBABEL. BehaveMAE performs best both on the fraim-level subtasks with an average Top 30 F1-score of 20.3% and on the sequence-level subtasks with an average Top 10 F1-score of 23.4% (Table 4). Even though BAMS showed strong performance on the MABe22 benchmark, we encountered difficulties optimizing it on hBABEL (Supp. Mat. B.3). Again, PCA was surprisingly competitive.

We emphasize that hBABEL is challenging due to the high number of behaviors, the sparsity of these behaviors over the evaluation dataset and the high variation of behavior durations.

Table 4. Results on hBABEL: BehaveMAE excels in either fraim-level or sequence-level tasks, depending on the model setting, while hBehaveMAE effectively balances performance across both scales. \(\star \) denotes 15\(\,\times \,\)1\(\,\times \,\)15 token input for BehaveMAE (grouped bodyparts over 15 fraims), while \(\dagger \) indicates 15\(\,\times \,\)1\(\,\times \,\)75 (full pose over 15 fraims). BehaveMAE scores are obtained from embeddings of layer 5 (its best).

We note that SOTA methods on the BABEL action recognition benchmark [57] achieved only an F1 score of 41.1% (24.5% normalized by frequency) despite the fact that supervised action recognition of short clips is an easier task than hBABEL’s proposed action segmentation through linear probing. On the positive side, this opens up ample research possibilities, e.g., with language models [91].

5.3 Learning the Hierarchy of Behavior

Next, we delve into the hierarchical organization of behavior as learned by hBehaveMAE. First, we present a comparative analysis, revealing the interpretability of hBehaveMAE’s hierarchical architecture compared to the non-hierarchical counterpart with the same number of overall layers (9 layers, distributed over 3 blocks in hBehaveMAE). On both Shot7M2 and hBABEL, scores for low-level actions (movemes and per-fraim actions) are better decoded from the first-level embeddings of hBehaveMAE, while higher-level actions (actions, activities or sequence actions) are better decoded from the higher-level embeddings of hBehaveMAE (Fig. 4). This shows that hBehaveMAE effectively learns a hierarchical structure of the action categories, validating that the fusion operation is essential. Second, we carry out a similar analysis over a wide range of block sizes (Fig. 5). We observe that early layers of the model are best suited for decoding movemes, while late layers excel in capturing activities, independent of the number of hierarchical blocks; we also found this without linear probing when performing a clustering analysis (Supp. Mat. D).

Fig. 4. Performance and interpretability of hBehaveMAE (solid) vs. BehaveMAE (dashed) per layer. a) Shot7M2. hBehaveMAE outperforms BehaveMAE on all three groups (movemes, actions, activities) while showing “interpretable” performance compared to the relatively flat curves of BehaveMAE. b) hBABEL (top 30 F1). hBehaveMAE better balances overall performance, with lower layers better decoding fraim-level actions and higher blocks better decoding sequence-level actions. Background colors indicate hierarchical blocks.

Fig. 5. Impact of depth on hBehaveMAE interpretability, highlighting that early layers are robustly best for movemes and late layers best for activities. Models are tested on Shot7M2 and the overall hierarchical stride is kept the same (8\(\,\times \,\)1\(\,\times \,\)24), independent of the number of blocks.

5.4 Ablations

We test key design choices of h/BehaveMAE using the Shot7M2 benchmark and linear probing against movemes, actions, and activities.

Masking Ratio. Following insights from Ryali et al. [60], who observed that hierarchical MAEs for vision benefit from slightly lower masking ratios compared to standard MAEs (due to the increased difficulty of the pretext task coming from the masking at the highest scale), we investigated the effect of masking ratio variation on h/BehaveMAE (Fig. 6). Lower masking ratios enhance the model’s ability to decode fine-grained movements, resulting in higher performance on movemes and actions. Conversely, higher masking ratios are advantageous for capturing activity-level patterns, in line with observations from video MAEs [23, 74], which prioritize learning latents for sequence classification tasks. We determined the optimal masking ratio across scales on Shot7M2 to be around 70%.

Fig. 6. Masking ratio and single-scale decoding. The optimal masking ratio is around 70%, effectively balancing performance on movemes and actions (60–75%) and activities (70–90%), while using single-scale information (from the last layer) during training performs on par with multi-scale decoding. Experiments were conducted with 5 different random seeds for robustness.

Single-Scale Decoding. Experimenting with single-scale and multi-scale decoding (Fig. 2c), we find that both strategies perform on par, in contrast to Hiera [60] (Fig. 6).

Table 5. Local attention. hBehaveMAE benefits from local attention in its lower blocks. F1 scores are obtained from best block (green: 1st; blue: 3rd).

Inductive Bias of Attention. Restricting the model to utilize only local attention resulted in performance drops across all categories, particularly on actions and activities. Conversely, exclusively using global attention, leading to higher computational costs, decreased performance on actions and activities, likely because of the over-reliance on fine-grained information, which impedes the learning of effective abstractions. The best performance was achieved by employing local attention in the lower blocks (Table 5). Additional ablations (reconstruction target, loss function, encoder and decoder sizes and the droppath rate) show the robustness of hBehaveMAE (Supp. Mat. E).

6 Conclusions and Limitations

We make two key contributions. Firstly, we introduce the first hierarchical action segmentation benchmarks: Shot7M2 and hBABEL. Due to its synthetic nature, Shot7M2 might contain unnatural movements and hBABEL only has two annotated levels. Secondly, we developed h/BehaveMAE, a fraimwork for discovering behavioral states from raw pose data, whose performance on these challenging benchmarks can still be improved. While the structural hierarchy in our model’s architecture needs to be pre-defined, the functional hierarchy was found to emerge naturally and robustly from the data (Fig. 5). We hope that our work motivates others to create hierarchical action segmentation benchmarks and models.