whisperkittools generated README.md
README.md CHANGED
@@ -13,22 +13,44 @@ tags:
 
 
 
+## Dataset: `librispeech`
+
+### Quality Evaluation
+
+| | WER | QoI (%) | File Size (MB) |
+|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|----------:|-----------------:|
+| [WhisperOpenAIAPI/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech) | 2.85 | 100 | 3100 |
+| [WhisperKit/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 2.48 | 95.2 | 3100 |
+| [WhisperKit/openai_whisper-large-v3_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo/librispeech) | 2.44 | 95.4 | 3100 |
+| [WhisperKit/openai_whisper-large-v3_turbo_1018MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo_1018MB/librispeech) | 2.49 | 94.8 | 1018 |
+| [WhisperKit/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2/librispeech) | 3.28 | 96.6 | 3100 |
+| [WhisperKit/openai_whisper-large-v2_1050MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_1050MB/librispeech) | 3.32 | 95 | 1050 |
+| [WhisperKit/openai_whisper-large-v2_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo/librispeech) | 3.24 | 96.6 | 3100 |
+| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB/librispeech) | 3.33 | 94.9 | 1022 |
+| [WhisperKit/openai_whisper-small.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small.en/librispeech) | 4.31 | 85.9 | 483 |
+| [WhisperKit/openai_whisper-small](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small/librispeech) | 3.98 | 82.9 | 483 |
+| [WhisperKit/openai_whisper-base.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/librispeech) | 4.76 | 75.5 | 145 |
+| [WhisperKit/openai_whisper-base](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base/librispeech) | 6.11 | 67.1 | 145 |
+| [WhisperKit/openai_whisper-tiny.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/librispeech) | 6.72 | 64 | 66 |
+| [WhisperKit/openai_whisper-tiny](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny/librispeech) | 8.94 | 52.4 | 66 |
+
+
 We believe that rigorously measuring the "quality of inference" is necessary for developers and
 enterprises to make informed decisions when opting to use optimized or compressed variants of
 any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper
 implementations and benchmark them using a consistent evaluation harness:
 
-Server-side
-- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text) ($0.36
+Server-side:
+- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text) ($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)
 
-On-device
-- `WhisperKit`: Argmax's
+On-device:
+- `WhisperKit`: Argmax's implementation [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L100) [[Repo]](https://github.com/argmaxinc/WhisperKit)
 - `whisper.cpp`: A C++ implementation from ggerganov [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L212) [[Repo]](https://github.com/ggerganov/whisper.cpp)
 - `WhisperMLX`: A Python implementation from Apple MLX [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [[Repo]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)
 
 `WhisperOpenAIAPI` sets the reference and we assume that it is using the equivalent of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)
 in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below)
-which is a stricter metric compared to dataset average WER. A 100% `qoi` preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon
+which is a stricter metric compared to dataset average [Word Error Rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate). A 100% `qoi` preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon
 where per-example known behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages might stay flat
 across updates). Pseudocode for `qoi`:
 
@@ -61,28 +83,6 @@ evaluation in under 1 hour regardless of the Whisper implementation. Older Apple
 
 
 
-## Dataset: `librispeech`
-
-### Quality Evaluation
-
-| | WER | QoI (%) | File Size (MB) |
-|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|----------:|-----------------:|
-| [WhisperOpenAIAPI/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech) | 2.85 | 100 | 3100 |
-| [WhisperKit/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 2.48 | 95.2 | 3100 |
-| [WhisperKit/openai_whisper-large-v3_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo/librispeech) | 2.44 | 95.4 | 3100 |
-| [WhisperKit/openai_whisper-large-v3_turbo_1018MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo_1018MB/librispeech) | 2.49 | 94.8 | 1018 |
-| [WhisperKit/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2/librispeech) | 3.28 | 96.6 | 3100 |
-| [WhisperKit/openai_whisper-large-v2_1050MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_1050MB/librispeech) | 3.32 | 95 | 1050 |
-| [WhisperKit/openai_whisper-large-v2_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo/librispeech) | 3.24 | 96.6 | 3100 |
-| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB/librispeech) | 3.33 | 94.9 | 1022 |
-| [WhisperKit/openai_whisper-small.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small.en/librispeech) | 4.31 | 85.9 | 483 |
-| [WhisperKit/openai_whisper-small](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small/librispeech) | 3.98 | 82.9 | 483 |
-| [WhisperKit/openai_whisper-base.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/librispeech) | 4.76 | 75.5 | 145 |
-| [WhisperKit/openai_whisper-base](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base/librispeech) | 6.11 | 67.1 | 145 |
-| [WhisperKit/openai_whisper-tiny.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/librispeech) | 6.72 | 64 | 66 |
-| [WhisperKit/openai_whisper-tiny](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny/librispeech) | 8.94 | 52.4 | 66 |
-
-
 ### Glossary
 
 - `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription