The best speech-to-text systems are already transformer-based neural networks anyway, so in theory using a combined model should only be an improvement.