LanguageBind-MLP Model
Model Description
This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the RU-AI project, which introduces a large multimodal dataset for AI-generated content detection.
This model leverages LanguageBind's multi-modal semantic alignment capabilities to identify whether content is human-generated or machine-generated across different modalities.
Model Details
- Model Type: Multi-modal classification model based on LanguageBind
- Architecture: LanguageBind with an MLP classifier head (sketched below)
- Paper: RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
- GitHub Repository: ZhihaoZhang97/RU-AI
- Accepted at: WWW'25 Resource Track
- Modalities Supported: Text, Image, and Audio
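The classifier is a small MLP applied on top of LanguageBind embeddings. Below is a minimal sketch of that design, assuming a 768-dimensional embedding, a 256-unit hidden layer, and a binary human-vs-machine output; the actual layer sizes are defined in the repository.

import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    # Binary human-vs-machine head over LanguageBind embeddings.
    # Layer sizes here are assumptions; the real ones live in the repository.
    def __init__(self, embed_dim=768, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # class 0 = human, class 1 = machine
        )

    def forward(self, embeddings):
        # embeddings: (batch, embed_dim) from any LanguageBind encoder;
        # the shared embedding space lets one head serve all three modalities.
        return self.mlp(embeddings)

head = MLPClassifier()
logits = head(torch.randn(4, 768))  # random stand-in for real embeddings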
Intended Use
This model is designed for detecting AI-generated content in:
- Text: Identifying AI-written articles, essays, responses, and general text
- Images: Detecting images generated by models such as Stable Diffusion and DALL-E
- Audio: Identifying synthetic speech from TTS models
Use Cases
- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection
Training Data
The model was trained on the RU-AI dataset, which includes:
- 245,895 real/human-generated samples
- 1,229,475 machine-generated samples
- Multiple data sources: COCO, Flickr8k, Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions
The dataset is publicly available on Zenodo.
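With five machine-generated samples for every human-generated one, training has to contend with class imbalance. One common remedy is inverse-frequency class weighting; the sketch below illustrates it on the dataset's counts (an assumption for illustration; the paper's actual training recipe may differ).

import torch
import torch.nn as nn

n_real, n_machine = 245_895, 1_229_475
total = n_real + n_machine

# Inverse-frequency weights: the rarer human class counts ~5x as much.
weights = torch.tensor([total / n_real, total / n_machine])
criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())
print(weights / weights.sum())  # tensor([0.8333, 0.1667])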
Requirements
Hardware
- NVIDIA GPU with at least 16GB VRAM (RTX 3090 24GB or higher recommended)
- At least 500GB disk space for the full dataset
Software
- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6
Installation
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI
# Create virtual environment
conda create -n ruai python=3.8
conda activate ruai
# Install dependencies
pip3 install -r requirements.txt
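After installing, a quick sanity check with plain PyTorch calls confirms that your environment meets the version and GPU requirements above:

import torch

print(torch.__version__)          # expect >= 1.13.1
print(torch.version.cuda)         # expect >= 11.6
print(torch.cuda.is_available())  # True if the GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))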
Usage
Model Inference
# See infer_languagebind_model.py in the GitHub repository
python infer_languagebind_model.py
Before running inference, you need to:
- Download the dataset or prepare your own data
- Update the data paths in infer_languagebind_model.py: image_data_paths, audio_data_paths, and text_data (illustrated below)
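A hypothetical illustration of those three variables, assuming they hold lists of file paths and raw strings; the script itself is the authoritative reference for the expected format:

# Hypothetical values; consult infer_languagebind_model.py for the real format.
image_data_paths = ["./data/images/real_0001.jpg", "./data/images/generated_0001.jpg"]
audio_data_paths = ["./data/audio/real_0001.wav", "./data/audio/generated_0001.wav"]
text_data = ["A human-written caption.", "A machine-generated caption."]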
Quick Start with Sample Data
# Download Flickr8k sample data
python ./download_flickr.py
# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
Model Performance
The model detects AI-generated content across multiple modalities with a single classifier, leveraging LanguageBind's language-based semantic alignment, which maps text, image, and audio into a shared embedding space.
For detailed performance metrics and evaluation results, please refer to the paper.
Limitations
- The model's performance depends on the quality and diversity of training data
- May not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across different modalities
- Requires significant computational resources for inference
Ethical Considerations
This model is intended for research and legitimate content verification purposes. Users should:
- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in training data
- Use the model responsibly; do not use it to censor content without human oversight
- Understand that detection is probabilistic and may produce false positives/negatives
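On the last point, a minimal sketch of score-based handling, assuming the model emits two-class logits (the threshold and names are illustrative, not part of the released code):

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.3, 1.1]])          # stand-in two-class model output
p_machine = F.softmax(logits, dim=-1)[0, 1]  # probability of "machine-generated"

# Surface the score and route uncertain cases to a human reviewer
# rather than acting automatically on a hard label.
if p_machine > 0.9:
    print(f"Likely machine-generated (p={p_machine:.2f}); flag for review")
else:
    print(f"Inconclusive (p={p_machine:.2f}); treat as unverified")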
Citation
If you use this model in your research, please cite:
@misc{huang2024ruai,
      title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
      author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
      year={2024},
      eprint={2406.04906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Acknowledgments
This work builds upon:
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
- ImageBind: One Embedding Space To Bind Them All
We appreciate the open-source community for the datasets and models that made this work possible.
License
Please refer to the GitHub repository for license information.
Contact
For questions and issues:
- Open an issue on the GitHub repository
- Refer to the paper for contact information of the authors