张绍磊 committed · cf6a785 · Parent: 54f2fa1
update

README.md CHANGED
@@ -6,13 +6,15 @@ tags:
 ---
 # Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
 
-
+[](https://arxiv.org/abs/2506.13642)
+[](https://github.com/ictnlp/Stream-Omni)
 [](https://huggingface.co/ICTNLP/stream-omni-8b)
 [](https://huggingface.co/datasets/ICTNLP/InstructOmni)
 [](https://github.com/ictnlp/Stream-Omni)
 
 > [**Shaolei Zhang**](https://zhangshaolei1998.github.io/), [**Shoutao Guo**](https://scholar.google.com.hk/citations?user=XwHtPyAAAAAJ), [**Qingkai Fang**](https://fangqingkai.github.io/), [**Yan Zhou**](https://zhouyan19.github.io/zhouyan/), [**Yang Feng**](https://people.ucas.edu.cn/~yangfeng?language=en)\*
 
+The introduction and usage of Stream-Omni refer to [https://github.com/ictnlp/Stream-Omni](https://github.com/ictnlp/Stream-Omni).
 
 Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features💡:
 - **Omni Interaction**: Support any multimodal inputs including text, vision, and speech, and generate both text and speech responses.
@@ -32,6 +34,3 @@ Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously
 > [!NOTE]
 >
 > **Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.**
-
-
-The introduction and usage of Stream-Omni refer to [https://github.com/ictnlp/Stream-Omni](https://github.com/ictnlp/Stream-Omni).