In search of a “good” speech-to-text (STT) model for the German language, I have evaluated all free (as in free beer and open source) models.
tl;dr: As of January 2022, NeMo-ASR's Conformer-Transducer model is the overall leader on GPU (WER 5.77 / CER 1.46), while the Jaco-Assistant/Scribosermo model is still a very good choice for CPU (WER 9.43 / CER 3.66).
Vendor / Architecture | Model | WER (%) | CER (%) | RTF | Comment |
---|---|---|---|---|---|
Jaco-Assistant / Scribosermo | full / Scorer: D37CV | 9.43 | 3.66 | 0.078 | CPU 8 cores |
Jaco-Assistant / Scribosermo | quantized / Scorer: D37CV | 9.51 | 3.70 | 0.096 | CPU 8 cores |
Mozilla DeepSpeech | deepspeech-german v0.9.0 | 27.93 | 11.36 | 0.209 | |
Mozilla DeepSpeech | Polyglot | 14.45 | 11.36 | 0.241 | |
Silero | v4 large | 18.98 | 6.67 | 0.009 | RTF is not a typo |
Wav2Vec | jonatasgrosman / wav2vec2-large-xlsr-53-german | 10.87 | 2.68 | 0.06 | Batch size 1 |
Vosk | 0.21 | 12.84 | 4.56 | 0.292 | |
Nvidia NeMo-ASR | Conformer-CTC 1.5.0 | 7.39 | 1.80 | 0.064 | GPU w/Apex-AMP |
Nvidia NeMo-ASR | Conformer-Transducer 1.6.0 | 5.77 | 1.46 | 0.127 | GPU w/Apex-AMP |
Nvidia NeMo-ASR | Conformer-Transducer 1.5.0 | 6.20 | 1.62 | 0.124 | GPU w/Apex-AMP |
Nvidia NeMo-ASR | Citrinet-1024 1.5.0 | 8.24 | 2.32 | 0.069 | GPU w/Apex-AMP |
Nvidia NeMo-ASR | Contextnet-1024 1.4.0 | 6.68 | 1.77 | 0.098 | GPU w/Apex-AMP |
Nvidia NeMo-ASR | Quartznet-15x15 1.0.0rc1 | 13.23 | 3.53 | 0.064 | GPU w/Apex-AMP |
Conclusion
On GPU, NeMo-ASR's models lead the pack. The Conformer-Transducer model gives you the best WER and CER; the Contextnet-1024 and Conformer-CTC models are runners-up, with still very good values and an even better RTF than the Transducer model (0.098 and 0.064 versus 0.127).
On CPU, both Jaco-Assistant/Scribosermo models, full and quantized, give you good WER/CER values and good performance. (Note: the Jaco website claims a WER of 7.5%, while I got “only” 9.4%.) Silero is blazing fast (RTF 0.009), but its WER of 19% makes it impractical for daily use.
Notes on methodology
Word error rate (WER) and character error rate (CER) were calculated with the PyPI package jiwer==2.2.0 on the Common Voice test set provided by Hugging Face (huggingface/common_voice/de/6.1.0, retrieved with the PyPI package datasets==1.13.3). The real-time factor (RTF) was calculated by running inference on the first 1,000 records of the same dataset. Pre- and post-processing times (loading audio files, sample rate conversion, normalizing results, etc.) were excluded.
Evaluation was performed on an Nvidia Xavier AGX 32GB with JetPack 4.6, with MAXN mode and jetson-clocks enabled.
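To make the two error metrics concrete, here is a minimal pure-Python sketch of how WER and CER can be computed. It is not the jiwer code used for the numbers above. The function names `edit_distance` and `error_rate` are made up for illustration, and pooling the edits over all utterances (corpus level) is an assumption about the aggregation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Error rate in percent: total edits / total reference units.

    unit="word" gives WER, unit="char" gives CER.
    Assumption: edits are pooled over all utterances (corpus level).
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    edits = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    if total == 0:
        raise ValueError("references contain no units")
    return 100.0 * edits / total


# Toy data, not taken from Common Voice.
refs = ["das ist ein test", "guten morgen"]
hyps = ["das ist ein text", "guten morgen"]
print(round(error_rate(refs, hyps, unit="word"), 2))  # 16.67 (1 wrong word out of 6)
print(round(error_rate(refs, hyps, unit="char"), 2))  # 3.57 (1 wrong character out of 28)
```

Both reference and hypothesis should be normalized the same way (case, punctuation) before scoring, otherwise the error rates are inflated by formatting differences.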
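The RTF measurement can be sketched in the same spirit. The snippet assumes the usual definition of the real-time factor (inference time divided by audio duration, so lower is faster) and times only the model call, in line with excluding pre- and post-processing. `real_time_factor` and the `transcribe` callable are hypothetical names, not part of any of the evaluated toolkits.

```python
import time


def real_time_factor(transcribe, clips):
    """Total inference time divided by total audio duration.

    transcribe: hypothetical callable taking (samples, sample_rate).
    clips: iterable of (samples, sample_rate) pairs that are already
           loaded and resampled, so only inference is timed.
    """
    inference_seconds = 0.0
    audio_seconds = 0.0
    for samples, sample_rate in clips:
        start = time.perf_counter()
        transcribe(samples, sample_rate)
        inference_seconds += time.perf_counter() - start
        audio_seconds += len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("no audio to measure")
    return inference_seconds / audio_seconds
```

An RTF of 0.127 means that one second of audio takes about 0.127 seconds to transcribe. On a GPU, asynchronous execution can make the call return before the work is finished, so the wrapped function should block until the result is available.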