# german-stt-evaluation

Evaluation of STT models for the German language


In search of a “good” STT model for the German language, I have evaluated all free (as in free beer and open source) models.

tl;dr: As of January 2022, NeMo-ASR's Conformer-Transducer model is the overall leader (WER 5.77 / CER 1.46) on GPU, while the Jaco-Assistant/Scribosermo model is still a very good choice for CPU (WER 9.43 / CER 3.66).

| Vendor / Architecture | Model | WER (%) | CER (%) | RTF | Comment |
|---|---|---:|---:|---:|---|
| Jaco-Assistant / Scribosermo | full / Scorer: D37CV | 9.43 | 3.66 | 0.078 | CPU, 8 cores |
| Jaco-Assistant / Scribosermo | quantized / Scorer: D37CV | 9.51 | 3.70 | 0.096 | CPU, 8 cores |
| Mozilla DeepSpeech | deepspeech-german v0.9.0 | 27.93 | 11.36 | 0.209 | |
| Mozilla DeepSpeech | Polyglot | 14.45 | 11.36 | 0.241 | |
| Silero | v4 large | 18.98 | 6.67 | 0.009 | RTF is not a typo |
| Wav2Vec | jonatasgrosman / wav2vec2-large-xlsr-53-german | 10.87 | 2.68 | 0.06 | Batch size 1 |
| Vosk | 0.21 | 12.84 | 4.56 | 0.292 | |
| Nvidia NeMo-ASR | Conformer-CTC 1.5.0 | 7.39 | 1.80 | 0.064 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Conformer-Transducer 1.6.0 | 5.77 | 1.46 | 0.127 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Conformer-Transducer 1.5.0 | 6.20 | 1.62 | 0.124 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Citrinet-1024 1.5.0 | 8.24 | 2.32 | 0.069 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Contextnet-1024 1.4.0 | 6.68 | 1.77 | 0.098 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Quartznet-15x15 1.0.0rc1 | 13.23 | 3.53 | 0.064 | GPU w/ Apex-AMP |

## Conclusion

On GPU, NeMo-ASR's models lead the pack. The Conformer-Transducer model gives you the best WER and CER; the Contextnet-1024 and Conformer-CTC models are runners-up with still very good values and an even better RTF than the Transducer model.
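
For orientation, running one of these NeMo checkpoints takes only a few lines. A minimal sketch, not the actual evaluation script; the pretrained model name is an assumption following NVIDIA's NGC naming scheme, and the audio file is a placeholder:

```python
# Minimal sketch: transcribe a German WAV file with a pretrained NeMo model.
# "stt_de_conformer_transducer_large" is an assumed NGC checkpoint name; swap in
# the checkpoint you actually want to evaluate.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_de_conformer_transducer_large")
model = model.cuda().eval()

# transcribe() takes a list of paths to 16 kHz mono WAV files; transducer models
# may return a (best_hypotheses, all_hypotheses) tuple instead of a plain list.
result = model.transcribe(["sample_de.wav"])
print(result)
```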

On CPU, both Jaco-Assistant/Scribosermo models (full and quantized) give you good WER/CER values and good performance. (Note: the Jaco website claims a WER of 7.5%, while I got “only” 9.4%.) Silero is blazing fast, but a WER of 19% makes it impractical for daily use.
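
Silero can be loaded directly via torch.hub. A minimal sketch following the snakers4/silero-models examples, not the exact evaluation script; the audio file name is a placeholder:

```python
# Minimal sketch: CPU inference with Silero's German STT model via torch.hub.
import torch

device = torch.device("cpu")
model, decoder, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_stt",
    language="de",
    device=device,
)
read_batch, split_into_batches, read_audio, prepare_model_input = utils

batch = read_batch(["sample_de.wav"])              # placeholder audio file
model_input = prepare_model_input(batch, device=device)

output = model(model_input)
for example in output:
    print(decoder(example.cpu()))                  # greedy decoding to text
```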

## Notes on methodology

Word error rate (WER) and character error rate (CER) were calculated (with the PyPI package jiwer==2.2.0) on the Common Voice test dataset provided by Hugging Face (huggingface/common_voice/de/6.1.0, retrieved with the PyPI package datasets==1.13.3). The real-time factor (RTF) was calculated by running inference on the first 1,000 records of the same dataset. Pre- and post-processing times (loading audio files, sample-rate conversion, normalizing results, etc.) were excluded.
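
For reference, the scoring loop roughly fits together as shown below. A minimal sketch, assuming a jiwer version that provides `jiwer.cer()` (not every 2.x release does); `transcribe()` and `clip_duration_seconds()` are hypothetical placeholders for the model under test and an audio-length lookup:

```python
# Minimal sketch of the metric calculation: WER/CER over the Common Voice German
# test split, RTF over the first 1,000 records (inference time / audio duration).
import time

import jiwer
from datasets import load_dataset


def transcribe(audio_path: str) -> str:
    """Hypothetical placeholder for the STT model under test."""
    raise NotImplementedError


def clip_duration_seconds(audio_path: str) -> float:
    """Hypothetical placeholder returning the clip length in seconds."""
    raise NotImplementedError


test_set = load_dataset("common_voice", "de", split="test")  # Common Voice de, 6.1.0

references, hypotheses = [], []
inference_seconds = audio_seconds = 0.0

for i, record in enumerate(test_set):
    if i < 1000:  # RTF is measured on the first 1,000 records only
        start = time.perf_counter()
        hypothesis = transcribe(record["path"])
        inference_seconds += time.perf_counter() - start
        audio_seconds += clip_duration_seconds(record["path"])
    else:
        hypothesis = transcribe(record["path"])
    references.append(record["sentence"])
    hypotheses.append(hypothesis)

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
print("RTF:", inference_seconds / audio_seconds)
```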

Evaluation was performed on an Nvidia Jetson AGX Xavier (32 GB) with JetPack 4.6, MAXN power mode, and jetson_clocks enabled.

You like this page? Then don't be shy: go to the repository and click the star button.