Custom on-cloud ASR Integration


I am currently trying to integrate an on-cloud ASR so that the wakeword model triggers a recording, which is then sent to the cloud, with the response forwarded back into the pipeline to trigger the NLU stage within Spokestack. I plan to record through Flutter after didActivate() has been called, and then, once the response has been retrieved, pass the data into the NLU. However, I cannot figure out what format the data should take (JSON, YAML, TOML, plain text, etc.) or how to pass it into the NLU.

I am developing a Flutter plugin for this, so the implementation has to be done on both Android and iOS. Our NLU model is being transitioned to Azure, which I’ve noticed is not supported by Spokestack, hence the on-cloud alternative. Any feedback is highly appreciated! :slight_smile:

Going through the source code (spokestack-ios/NLUService.swift at master · spokestack/spokestack-ios · GitHub), I have found the following function definition:
@objc func classify(utterance: String, context: [String : Any]) -> Void
I presume the parameter utterance, in this case, refers to the raw string output that would come from an ASR? Would I be correct to assume this is where I pass the data, and what/why do we need to pass a context alongside it?

Hi @RayElementa, and welcome!

There are a few different topics here, so let me try to address them one at a time:

In general, running ASR on a recording is going to be a less natural user experience than streaming the audio to the ASR provider (you have to wait until the user is done speaking to start ASR, and you can’t display partial results). It’s doable this way, but I’d recommend streaming.

You might run into microphone sharing problems, especially on Android, if you try to use Spokestack for part of your speech processing and then break out of it to record the utterance itself. In Spokestack, speech is processed in “stages”; I’d recommend looking at how pipeline stages are designed and following the example of some existing ASR stages (Android | iOS) to write one that does what you want.
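To give a feel for what that looks like on iOS, here’s a rough sketch of a custom cloud ASR stage. It assumes spokestack-ios’s `SpeechProcessor` protocol with `startStreaming`/`stopStreaming`/`process(_:)` members as in the repo at the time of writing (check the current definition before relying on this), and `MyCloudASRClient` is a hypothetical client for your provider, not a real API:

```swift
import Foundation
import Spokestack

/// Hypothetical streaming client for your cloud ASR provider;
/// replace with your provider's real SDK calls.
class MyCloudASRClient {
    func open() { /* open a websocket/gRPC stream */ }
    func send(_ audio: Data) { /* stream one audio frame */ }
    func close(onTranscript: (String) -> Void) { /* finish and deliver the result */ }
}

/// Sketch of a custom ASR stage. The pipeline calls `process(_:)` with each
/// audio frame; this stage streams frames while the pipeline is active and
/// reports the transcript back through the shared SpeechContext.
class CloudASR: NSObject, SpeechProcessor {
    var configuration: SpeechConfiguration
    var context: SpeechContext
    private let client = MyCloudASRClient()
    private var streaming = false

    init(_ configuration: SpeechConfiguration, context: SpeechContext) {
        self.configuration = configuration
        self.context = context
    }

    func startStreaming() {}
    func stopStreaming() {}

    func process(_ frame: Data) {
        if context.isActive && !streaming {
            streaming = true
            client.open()           // wakeword activated the pipeline
        }
        if streaming {
            client.send(frame)
        }
        if !context.isActive && streaming {
            streaming = false       // pipeline deactivated: finalize
            client.close { transcript in
                self.context.transcript = transcript
                self.context.dispatch(.recognize)
            }
        }
    }
}
```

The exact way a stage publishes its result (`context.transcript` plus a `recognize` event dispatch here) is my reading of the existing iOS ASR stages; follow whatever the current implementations do.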

You say here that you want to use Spokestack for NLU, but later on you say that you’re transitioning your NLU to Azure. Do you mean that you’re using Azure for ASR, or NLU? If it’s ASR, we already support Azure ASR on Android (code). You’d be looking for the TFWakewordAzureASR pipeline profile. It should be possible to port this code to Swift using Microsoft’s SDK to make the iOS version. We’d be happy to have PRs to add new ASR providers into Spokestack.
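For the iOS port, the recognition itself via Microsoft’s SDK might look roughly like this. This is a minimal sketch using the Speech SDK’s one-shot recognition against the default microphone; inside a Spokestack stage you’d more likely feed the pipeline’s audio frames through an `SPXPushAudioInputStream` instead. The key and region strings are placeholders:

```swift
import MicrosoftCognitiveServicesSpeech

func recognizeOnce() throws {
    // Placeholder credentials for your Azure Speech resource.
    let config = try SPXSpeechConfiguration(subscription: "YOUR_KEY",
                                            region: "YOUR_REGION")
    // Default initializer captures from the device microphone.
    let audioConfig = SPXAudioConfiguration()
    let recognizer = try SPXSpeechRecognizer(speechConfiguration: config,
                                             audioConfiguration: audioConfig)

    // Blocks until a single utterance is recognized.
    let result = try recognizer.recognizeOnce()
    if result.reason == .recognizedSpeech {
        print(result.text ?? "")
    }
}
```

The Android Azure ASR stage linked above does the streaming equivalent of this, so it’s a good reference for the push-stream wiring.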

Yes, classify is the main interface for performing manual NLU, as covered in our guides (unified Spokestack interface for Android | standalone NLU on Android | standalone NLU on iOS). An utterance is the ASR result, and context is an optional map that can be used to pass extra information to third-party NLU services that can make use of it. It’s not currently used by Spokestack NLU, but it’s included in the interface so that we don’t have to make a breaking change to include utterance context in the future. Your link is to the protocol for NLU, not the iOS NLU implementation, which provides a default value for the context.
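To make the call site concrete, here’s a minimal sketch of handing a cloud ASR transcript to NLU through this interface. It assumes `nlu` is some `NLUService` implementation you’ve already constructed (Spokestack’s, or your own Azure-backed class), and the result comes back asynchronously through your delegate rather than as a return value:

```swift
// Called once your cloud ASR returns its final transcript.
func onCloudASRResult(_ transcript: String, nlu: NLUService) {
    // `context` isn't used by Spokestack NLU today, so an empty
    // map is fine; a third-party service could read extra keys here.
    nlu.classify(utterance: transcript, context: [:])
    // The classification result is delivered to your registered
    // delegate, not returned from this call.
}
```

If you write your own Azure-backed class, conforming it to `NLUService` like this means the rest of your plugin code doesn’t care which NLU provider is behind it.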

Since you’re writing a Flutter plugin, it might be useful to take a look at React Native Spokestack to see how an existing cross-platform implementation makes use of both native libraries.

Hope this helps.