Accuracy of different ASR providers

Hi!

We have a rather odd problem. We are using Microsoft's Azure cloud ASR because we did some benchmarking tests by sending phone-recorded sentences here:

It gave great results, and we were eager to start working straight from Spokestack with the TFWakewordAzureASR profile. The weird thing is that in the application the accuracy was horrible. At first we thought it was a config or mic problem within our app, but then we downloaded spokestack-android-control-room, which showed very accurate results with the Android ASR (which is supposed to be way less accurate) and horrible results again with the Azure ASR.

Very confused now :stuck_out_tongue:

If anyone sees something we're missing, it would be greatly appreciated!

First, a quick note on this. I’d be interested to know where you found the accuracy comparison, but in general, don’t trust published accuracy numbers. They often come from lab tests on clean data, and your mileage will almost certainly vary.

I had a longer response ready to go, but then I realized I should ask a question before sending it: what language/locale did you use for your benchmark testing, and does it match the language set on your phone?

We did benchmarking in our own environments! But I get your point. We use the Azure ASR mainly for its dictionary anyway :slight_smile:

My phone is set to English for exactly this reason (I realized that the Android ASR would automatically switch to recognizing Dutch instead of the intended English). The Azure ASR is set to en-US, with westeurope as the region.

Ah, gotcha. I think it was the “supposed to be” that threw me off—I thought you were referring to accuracy numbers published somewhere.

I thought that if you were using a locale other than en-US, you might be running into a bug in our locale handling for Azure, and that might still be the case since I don’t know what locale they use by default in the westeurope region. We’ll publish a new version of the library soon that should address this.

Aside from that, though, here are a couple other thoughts:

  • You said you used “recorded” sentences for your initial tests. Doing ASR on full utterances is different enough from streaming the results as audio comes in that MS might be using a different model or a different scorer for the two tasks, and if so, the one used for full utterances would likely be more heavyweight (read: more accurate). If you have an easy way to benchmark streaming ASR outside of Spokestack, that would be a more direct comparison (see the streaming sketch at the end of this post).

  • MS may have changed things since Spokestack’s Azure recognizer was written and tested, in such a way that our audio preprocessing now negatively affects its performance. You could try copying the TFWakewordAzureASR profile into a new class (see the sketch after this list) and removing the following stages:

    • AutomaticGainControl
    • AcousticNoiseSuppressor

    and the pre-emphasis configuration property. If ASR works better for you without them, try adding them back in one by one to see if you can identify the stage that’s the problem. Note that leaving some or all of them out might negatively impact your wake word’s performance.
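Here’s roughly what that stripped-down profile could look like. This is a sketch from memory of the spokestack-android source, so treat the stage class names, the pre-emphasis property key, and the MyAzureASRProfile name as assumptions and check them against the version of the library you’re using:

```java
import io.spokestack.spokestack.PipelineProfile;
import io.spokestack.spokestack.SpeechPipeline;

// A copy of TFWakewordAzureASR with AutomaticGainControl,
// AcousticNoiseSuppressor, and the pre-emphasis property removed
// (left here commented out so you can add them back one at a time).
// Stage class names are from memory; verify against your version.
public class MyAzureASRProfile implements PipelineProfile {
    @Override
    public SpeechPipeline.Builder apply(SpeechPipeline.Builder builder) {
        return builder
            .setInputClass("io.spokestack.spokestack.android.MicrophoneInput")
            // .addStageClass("io.spokestack.spokestack.webrtc.AutomaticGainControl")
            // .addStageClass("io.spokestack.spokestack.webrtc.AcousticNoiseSuppressor")
            .addStageClass("io.spokestack.spokestack.webrtc.VoiceActivityDetector")
            .addStageClass("io.spokestack.spokestack.wakeword.WakewordTrigger")
            // .setProperty("pre-emphasis", 0.97)
            .addStageClass("io.spokestack.spokestack.ActivationTimeout")
            .addStageClass(
                "io.spokestack.spokestack.microsoft.AzureSpeechRecognizer");
    }
}
```

You’d then build your pipeline with `.useProfile("com.example.MyAzureASRProfile")` in place of the stock profile.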

If I had to guess, I’d say a difference in performance would be more likely due to the first point than the second, but I try not to rule anything out.
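To test the first point outside of Spokestack, something like the following should work against Microsoft’s Java SDK (com.microsoft.cognitiveservices.speech). It’s a sketch, not production code: the key, region, and locale are placeholders, and it assumes your test files are 16 kHz, 16-bit mono WAVs with a standard 44-byte header, which is what the SDK’s default push-stream format expects.

```java
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechRecognizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import com.microsoft.cognitiveservices.speech.audio.AudioInputStream;
import com.microsoft.cognitiveservices.speech.audio.PushAudioInputStream;

import java.io.FileInputStream;
import java.util.Arrays;

public class StreamingBenchmark {
    public static void main(String[] args) throws Exception {
        SpeechConfig config =
            SpeechConfig.fromSubscription("<your-key>", "westeurope");
        config.setSpeechRecognitionLanguage("en-US");

        // Default push-stream format: 16 kHz, 16-bit, mono PCM.
        PushAudioInputStream pushStream = AudioInputStream.createPushStream();
        SpeechRecognizer recognizer = new SpeechRecognizer(
            config, AudioConfig.fromStreamInput(pushStream));

        // Print partial and final results as they arrive.
        recognizer.recognizing.addEventListener(
            (s, e) -> System.out.println("partial: " + e.getResult().getText()));
        recognizer.recognized.addEventListener(
            (s, e) -> System.out.println("final:   " + e.getResult().getText()));

        recognizer.startContinuousRecognitionAsync().get();

        // Feed the file in small chunks to approximate live mic input.
        try (FileInputStream wav = new FileInputStream(args[0])) {
            wav.skip(44); // assumes a standard 44-byte WAV header
            byte[] chunk = new byte[3200]; // 100 ms of 16 kHz 16-bit mono
            int read;
            while ((read = wav.read(chunk)) > 0) {
                pushStream.write(Arrays.copyOf(chunk, read));
                Thread.sleep(100); // crude real-time pacing
            }
        }
        pushStream.close();

        Thread.sleep(2000); // let the last results arrive
        recognizer.stopContinuousRecognitionAsync().get();
        recognizer.close();
        config.close();
    }
}
```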

An update to this: I’ve been looking at the Azure ASR component today and have found similar accuracy problems no matter what I try (including, but not limited to, the profile adjustments I mentioned above). I can get it to work as expected if I speak clearly/loudly enough, so it doesn’t seem to be a technical issue with the way we’re streaming the audio to them. As of now, I assume they simply have a bad model, scorer, or API backend in production, but there’s still a chance we’re doing something wrong, or that there’s some device-specific preprocessing we could be doing to improve the performance on various devices. The latter is a good reason the Android ASR can easily outperform third-party providers—they can tune their models to specific devices transparently to the user.

You can, of course, copy the AzureSpeechRecognizer class, modify it, and include your customized version in your pipeline; if you come up with anything that improves performance, we’d be happy to include it in the repository. Here’s Microsoft’s quickstart, but their documentation is spread out over several sections of the site.
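If you do modify it, wiring your copy into the pipeline is just a matter of referencing it by class name. A minimal sketch, assuming the profile above; com.example.MyAzureSpeechRecognizer is a hypothetical name for your fork, and the property keys are recalled from our docs, so verify them:

```java
import io.spokestack.spokestack.SpeechPipeline;

public class PipelineSetup {
    // Builds a pipeline around a hypothetical fork of the recognizer.
    // In the profile sketch above, you'd replace the stock
    // AzureSpeechRecognizer stage with your own class, e.g.
    // .addStageClass("com.example.MyAzureSpeechRecognizer").
    // Property keys below are from memory of the docs; verify them.
    public static SpeechPipeline build() {
        return new SpeechPipeline.Builder()
            .useProfile("com.example.MyAzureASRProfile")
            .setProperty("azure-api-key", "<your-key>")
            .setProperty("azure-region", "westeurope")
            .setProperty("locale", "en-US")
            .build();
    }
}
```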


So we found out that sending a whole file through the SDK gives way better results than streaming it! Maybe they use different models, or for some reason postprocessing doesn’t work for streaming. Hope this helps!
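For anyone who wants to reproduce the comparison, here’s the whole-file counterpart to the streaming sketch above, with the same caveats (the key, region, and locale are placeholders). Note that `recognizeOnceAsync` stops after the first recognized utterance, so it’s best suited to single-sentence test files:

```java
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechRecognitionResult;
import com.microsoft.cognitiveservices.speech.SpeechRecognizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;

public class WholeFileBenchmark {
    public static void main(String[] args) throws Exception {
        SpeechConfig config =
            SpeechConfig.fromSubscription("<your-key>", "westeurope");
        config.setSpeechRecognitionLanguage("en-US");

        // Hand the SDK the complete WAV file instead of streaming it.
        try (SpeechRecognizer recognizer = new SpeechRecognizer(
                config, AudioConfig.fromWavFileInput(args[0]))) {
            SpeechRecognitionResult result =
                recognizer.recognizeOnceAsync().get();
            System.out.println("transcript: " + result.getText());
        }
        config.close();
    }
}
```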


Thanks for reporting back and confirming! It’s unfortunate that the whole-file approach doesn’t give the same opportunities for a responsive UI (there are no partial results to display, and all the network latency happens after the user stops speaking). Spokestack is built around the streaming paradigm, so I’m not sure whether we’ll add a separate component for collecting files to send through ASR, but we’ll consider it if it’s widely useful.
