by jilijeanlouis
94 days ago
Author here. We built this because we kept seeing different word error rates (WER) reported for the same models, depending on who ran the test and how. Normalization rules turned out to be a big reason for the discrepancy, so we decided to release a fully reproducible evaluation framework.

You can test it yourself with our full repo. It includes: the normalization rules we use; scoring scripts; dataset coverage (conversational, noisy, multilingual); and the full eval pipeline.

We also published a detailed comparison using this framework across 8 leading STT providers, 7 datasets, and 74 hours of audio. You can see it here: https://www.gladia.io/competitors/benchmarks

Feedback welcome!
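To make the normalization point concrete, here is a minimal sketch of how the same transcript pair can score very differently depending on what is done to the text before scoring. This is not the framework's code: `normalize` is a toy rule set (lowercase, drop punctuation, collapse whitespace) and `wer` is a plain word-level edit distance, both written for illustration only.

```python
import re


def normalize(text: str) -> str:
    """Toy normalizer (an assumption, not the framework's rules):
    lowercase, replace punctuation except apostrophes with spaces,
    and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance
    (substitutions + deletions + insertions) divided by the
    number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "Hello, world. It's a test."
hypothesis = "hello world it's a test"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))

print(f"raw WER:        {raw_wer:.2f}")
print(f"normalized WER: {normalized_wer:.2f}")
```

With these inputs, the raw comparison counts casing and punctuation differences as word errors, while the normalized comparison does not. Real normalization rules also have to decide on numbers, abbreviations, and contractions, which is where two evaluators can easily diverge.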