# german-stt-evaluation

Evaluation of STT models for the German language


In search of a “good” STT model for the German language, I have evaluated all free (as in free beer and open source) models.

tl;dr: As of January 2022, NeMo-ASR's Conformer-Transducer model is the overall leader (WER 5.77 / CER 1.46) on GPU, while the Jaco-Assistant/Scribosermo model is still a very good choice for CPU (WER 9.43 / CER 3.66).

| Vendor / Architecture | Model | WER (%) | CER (%) | RTF | Comment |
|---|---|---:|---:|---:|---|
| Jaco-Assistant / Scribosermo | full / Scorer: D37CV | 9.43 | 3.66 | 0.078 | CPU, 8 cores |
| Jaco-Assistant / Scribosermo | quantized / Scorer: D37CV | 9.51 | 3.70 | 0.096 | CPU, 8 cores |
| Mozilla DeepSpeech | deepspeech-german v0.9.0 | 27.93 | 11.36 | 0.209 | |
| Mozilla DeepSpeech | Polyglot | 14.45 | 11.36 | 0.241 | |
| Silero | v4 large | 18.98 | 6.67 | 0.009 | RTF is not a typo |
| Wav2Vec | jonatasgrosman / wav2vec2-large-xlsr-53-german | 10.87 | 2.68 | 0.06 | Batch size 1 |
| Vosk | 0.21 | 12.84 | 4.56 | 0.292 | |
| Nvidia NeMo-ASR | Conformer-CTC 1.5.0 | 7.39 | 1.80 | 0.064 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Conformer-Transducer 1.6.0 | 5.77 | 1.46 | 0.127 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Conformer-Transducer 1.5.0 | 6.20 | 1.62 | 0.124 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Citrinet-1024 1.5.0 | 8.24 | 2.32 | 0.069 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Contextnet-1024 1.4.0 | 6.68 | 1.77 | 0.098 | GPU w/ Apex-AMP |
| Nvidia NeMo-ASR | Quartznet-15x15 1.0.0rc1 | 13.23 | 3.53 | 0.064 | GPU w/ Apex-AMP |

## Conclusion

On GPU, NeMo-ASR's models lead the pack. The Conformer-Transducer model gives you the best WER and CER; the Contextnet-1024 and Conformer-CTC models are runners-up with still very good values and an even better RTF than the Transducer model.
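
For orientation, running one of these NeMo checkpoints takes only a few lines. A minimal sketch, not the actual evaluation script; the pretrained model name is an assumption following NVIDIA's NGC naming scheme, and the audio file is a placeholder:

```python
# Minimal sketch: transcribe a German WAV file with a pretrained NeMo model.
# "stt_de_conformer_transducer_large" is an assumed NGC checkpoint name; swap in
# the checkpoint you actually want to evaluate.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_de_conformer_transducer_large")
model = model.cuda().eval()

# transcribe() takes a list of paths to 16 kHz mono WAV files; transducer models
# may return a (best_hypotheses, all_hypotheses) tuple instead of a plain list.
result = model.transcribe(["sample_de.wav"])
print(result)
```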

On CPU, both Jaco-Assistant/Scribosermo models (full and quantized) give you good WER/CER values and good performance. (Note: the Jaco website claims a WER of 7.5%, while I got “only” 9.4%.) Silero is blazing fast, but a WER of 19% makes it impractical for daily use.
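
Silero can be loaded directly via torch.hub. A minimal sketch following the snakers4/silero-models examples, not the exact evaluation script; the audio file name is a placeholder:

```python
# Minimal sketch: CPU inference with Silero's German STT model via torch.hub.
import torch

device = torch.device("cpu")
model, decoder, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_stt",
    language="de",
    device=device,
)
read_batch, split_into_batches, read_audio, prepare_model_input = utils

batch = read_batch(["sample_de.wav"])              # placeholder audio file
model_input = prepare_model_input(batch, device=device)

output = model(model_input)
for example in output:
    print(decoder(example.cpu()))                  # greedy decoding to text
```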

## Notes on methodology

Word error rate (WER) and character error rate (CER) were calculated (with the PyPI package jiwer==2.2.0) on the Common Voice test dataset provided by Hugging Face (huggingface/common_voice/de/6.1.0, retrieved with the PyPI package datasets==1.13.3). The real-time factor (RTF) was calculated by running inference on the first 1,000 records of the same dataset. Pre- and post-processing times (loading audio files, sample-rate conversion, normalizing results, etc.) were excluded.
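
For reference, the scoring loop roughly fits together as shown below. A minimal sketch, assuming a jiwer version that provides `jiwer.cer()` (not every 2.x release does); `transcribe()` and `clip_duration_seconds()` are hypothetical placeholders for the model under test and an audio-length lookup:

```python
# Minimal sketch of the metric calculation: WER/CER over the Common Voice German
# test split, RTF over the first 1,000 records (inference time / audio duration).
import time

import jiwer
from datasets import load_dataset


def transcribe(audio_path: str) -> str:
    """Hypothetical placeholder for the STT model under test."""
    raise NotImplementedError


def clip_duration_seconds(audio_path: str) -> float:
    """Hypothetical placeholder returning the clip length in seconds."""
    raise NotImplementedError


test_set = load_dataset("common_voice", "de", split="test")  # Common Voice de, 6.1.0

references, hypotheses = [], []
inference_seconds = audio_seconds = 0.0

for i, record in enumerate(test_set):
    if i < 1000:  # RTF is measured on the first 1,000 records only
        start = time.perf_counter()
        hypothesis = transcribe(record["path"])
        inference_seconds += time.perf_counter() - start
        audio_seconds += clip_duration_seconds(record["path"])
    else:
        hypothesis = transcribe(record["path"])
    references.append(record["sentence"])
    hypotheses.append(hypothesis)

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
print("RTF:", inference_seconds / audio_seconds)
```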

Evaluation was performed on an Nvidia Jetson AGX Xavier (32 GB) with JetPack 4.6, MAXN power mode, and jetson_clocks enabled.

You like this page? Then don't be shy: go to the repository and click the star button.