# Official models of "MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description"

## Overview

MoChat is a Multimodal Large Language Model (MLLM) for human motion understanding with precise spatio-temporal grounding. Unlike conventional motion analysis systems, MoChat integrates:

- **Motion Understanding**: Performs fundamental motion comprehension and summarization.
- **Spatial Limb Grounding**: Accurately locates the body parts involved in described movements.
- **Temporal Action Grounding**: Precisely identifies the time boundaries corresponding to specific motion descriptions.

## Models

We provide the following trained models for download (a download sketch follows the list):

- **[Joints-Grouped Skeleton Encoder](https://huggingface.co/CSUBioGroup/MoChat/blob/main/JGSE_epoch120)** for motion sequence representation.
- Two variants of the motion comprehension model:
  - [MoChat](https://huggingface.co/CSUBioGroup/MoChat/tree/main/MoChat): Base model.
  - [MoChat-R](https://huggingface.co/CSUBioGroup/MoChat/tree/main/MoChat-R): Extended model with a regression head.

## Resources

- **Codebase**: [GitHub](https://github.com/CSUBioGroup/MoChat)
- **Paper**: [arXiv](https://arxiv.org/abs/2410.11404)
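
A minimal sketch, assuming you only want to fetch the released weights with `huggingface_hub`: the repo id, checkpoint filename, and folder names come from the links above, while the local directory name is an arbitrary choice. Refer to the codebase for how the checkpoints are actually loaded.

```python
# Sketch: download the MoChat checkpoints from the Hugging Face Hub.
# Repo id and file/folder names are taken from the model links above;
# "checkpoints" is a hypothetical local directory.
from huggingface_hub import hf_hub_download, snapshot_download

# Joints-Grouped Skeleton Encoder checkpoint (single file).
encoder_path = hf_hub_download(
    repo_id="CSUBioGroup/MoChat",
    filename="JGSE_epoch120",
)

# A full model variant (use "MoChat-R/*" for the regression-head variant).
mochat_dir = snapshot_download(
    repo_id="CSUBioGroup/MoChat",
    allow_patterns=["MoChat/*"],
    local_dir="checkpoints",
)

print(encoder_path)
print(mochat_dir)
```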