## [VHASR: A Multimodal Speech Recognition System With Vision Hotwords](https://arxiv.org/abs/2410.00822)
This repository provides the VHASR model trained on OpenImages.
Our paper is available at https://arxiv.org/abs/2410.00822.
Our code is available at https://github.com/193746/VHASR/tree/main; please refer to the repository for details about training and testing.
### Inference
If you are interested in our work, you can train your own model on large-scale data and then run inference with the following command. Note that the CLIP config files must be placed in '{model_file}/clip_config', as in the four pretrained models we provide (see the setup sketch after the command).
```sh
cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/infer.py \
    --model_name "{path_to_model_folder}" \
    --speech_path "{path_to_speech}" \
    --image_path "{path_to_image}" \
    --merge_method 3
```
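To illustrate the clip_config requirement, here is a minimal setup sketch. The `{path_to_pretrained_model}` placeholder and the copy step are assumptions for illustration; any of the four provided pretrained models can serve as the source of the CLIP config files.

```sh
# Hypothetical setup: {path_to_pretrained_model} stands in for one of the four
# provided pretrained models, which already contain a clip_config folder.
mkdir -p "{path_to_model_folder}/clip_config"
# Copy the CLIP config files into your own model folder:
cp "{path_to_pretrained_model}/clip_config/"* "{path_to_model_folder}/clip_config/"
```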
### Citation
```bibtex
@misc{hu2024vhasrmultimodalspeechrecognition,
      title={VHASR: A Multimodal Speech Recognition System With Vision Hotwords},
      author={Jiliang Hu and Zuchao Li and Ping Wang and Haojun Ai and Lefei Zhang and Hai Zhao},
      year={2024},
      eprint={2410.00822},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.00822},
}
```
### License: cc-by-nc-4.0