Chronologically Accurate Retrieval for Temporal Grounding of
Motion-Language Models

to appear at ECCV 2024

LY Corporation

Chronologically Accurate Retrieval (CAR). We present a simple strategy to reinforce motion-language models and achieve better temporal alignment between text and motion.

Abstract

With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance on conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with shuffled sequences of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity of considering temporal elements in motion-language alignment.

Proposal

CAR: Chronologically Accurate Retrieval

To analyze text-motion alignment, we use the text annotations provided in HumanML3D. We decompose the text descriptions into events using an off-the-shelf LLM, and leverage these event descriptions to test the chronological understanding of motion-language models. We shuffle the order of events in the text descriptions to create negative samples for the models to distinguish. CAR measures the success rate of models in retrieving the description with the correct order.
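The CAR test above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `text_encode` stands in for an arbitrary text encoder, event decomposition is assumed to have already been done by the LLM, and events in each list are assumed distinct so a shuffle always changes the order.

```python
import random

import numpy as np


def shuffle_events(events, rng):
    """Return a copy of the event list in a different chronological order
    (assumes at least two distinct events, so a new order always exists)."""
    shuffled = events[:]
    while shuffled == events:
        rng.shuffle(shuffled)
    return shuffled


def car_success_rate(motion_embs, text_encode, event_lists, rng=None):
    """CAR score: fraction of motions whose embedding is closer (by cosine
    similarity) to the correctly ordered caption than to a shuffled version."""
    rng = rng or random.Random(0)

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    hits = 0
    for motion, events in zip(motion_embs, event_lists):
        positive = text_encode(" then ".join(events))
        negative = text_encode(" then ".join(shuffle_events(events, rng)))
        hits += cos(motion, positive) > cos(motion, negative)
    return hits / len(motion_embs)
```

A model with no order sensitivity embeds the original and shuffled captions identically and cannot score above chance here, which is exactly the failure mode CAR is designed to expose.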

Framework

We propose a simple solution to reinforce motion-language models in terms of chronology. In addition to the original text descriptions, we use the shuffled texts as negative samples during contrastive learning. These shuffled negatives are an important resource for models to comprehend the correct order of events.
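One way to realize this idea is an InfoNCE-style loss in which the shuffled captions are appended to the negative pool. The sketch below is a hedged illustration under assumed names (`motion_embs`, `text_embs`, `shuffled_embs` are precomputed embedding batches), not the paper's exact training objective:

```python
import numpy as np


def info_nce_with_shuffled(motion_embs, text_embs, shuffled_embs, temperature=0.1):
    """InfoNCE loss over a batch where motion i pairs with caption i, and the
    chronologically shuffled captions serve as additional negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    m = normalize(motion_embs)                                 # (B, D)
    t = normalize(np.concatenate([text_embs, shuffled_embs]))  # (2B, D)
    logits = m @ t.T / temperature                             # (B, 2B)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(m))
    # positives sit on the diagonal of the first B columns
    return -log_prob[idx, idx].mean()
```

Because each shuffled caption shares its vocabulary with the matching original and differs only in event order, minimizing this loss pushes the encoders to separate captions by chronology rather than by word content alone.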

Experiments

Text-Motion Retrieval

Comparison of retrieval between TMR and the proposed method. Given a motion, the models are asked to retrieve the correct description from the original and shuffled texts. The model reinforced by our proposal is better at distinguishing the correct order of events.

Motion Generation

Comparison of motions generated by T2M-GPT and by the same model reinforced with our text encoder. The motions produced by our method capture all the events indicated in the description, in the correct order.

BibTeX

@InProceedings{Fujiwara_2024_ECCV,
  author    = {Kent Fujiwara and Mikihiro Tanaka and Qing Yu},
  title     = {Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models},
  booktitle = {Proc. of the European Conf. on Computer Vision (ECCV)},
  year      = {2024},
}