Spokestack Android

I found Spokestack quite interesting, and it seems I'll go with it. I was playing around with Rasa, but Spokestack seems perfect for me. I have some questions, though. I'm following the Android tutorials (no Tray) and it works OK; however, I can't find a way to set the ASR to Google Cloud. Any help?
I followed this

It doesn't seem to work. I checked the builder implementation, but I'm not sure. If I add this:

spokestack = Spokestack.Builder()
    .setProperty("google-credentials", googleJson)

it doesn't seem to work. I can create a GoogleSpeechRecognizer(config), but there's no option to add it.
Another question about the models: do I always need to download them to the device, or is there an easy way to use a URL instead?

Could someone explain what Spokestack cloud NLU is? I need to download the files, right? So how does cloud NLU work, or what is it?
What happens if I want to change my NLU model (add/remove things)? Do I need to "import" it, then download it again and release a new version?
And what are the custom dictionary and pronunciation features?

And last, when I use my NLU (built from a Rasa YAML file), if I say something like "what is no ah random word", Spokestack returns one of my intents with around 0.8–0.9 confidence, which is weird. I'd rather tell the user "I could not understand" than perform the wrong action.
Thanks a lot! Looking forward to implementing this (already working on it).

Hi @zenyagami, and thanks for posting!

There are quite a few topics to cover here, so I’ll separate them into sections:

Changing ASR providers

The document you linked explains the credentials required for each provider, but setting the credentials themselves doesn’t change your provider from the default. This could be clearer in the docs, so thanks for bringing it to our attention.

The easiest way to change providers is to use one of our preconfigured profiles—in this case, TFWakewordGoogleASR if you’re using wakeword, or a different *GoogleASR profile if you’re not. Profiles are explained a bit in the speech pipeline documentation.

To use a profile in the Spokestack builder, you’ll want to keep your existing credentials/locale config and add


to your builder’s call chain.
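For example, something like this (a sketch only: the profile's fully qualified class name is an assumption to verify against the library, and the property calls stand in for your existing credentials/locale config):

```kotlin
// Sketch: keep your existing properties and select the wake word + Google ASR
// profile by its fully qualified class name (package path assumed here).
spokestack = Spokestack.Builder()
    .setProperty("google-credentials", googleJson)
    .setProperty("locale", "en-US")
    .withPipelineProfile("io.spokestack.spokestack.profile.TFWakewordGoogleASR")
    .build()
```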

Downloading models

I'm not sure which models you're talking about in this section, but the same advice applies to both wake word and NLU: you can either distribute the models directly with your app, in which case the app has to decompress them to an external directory when it launches for the first time, or have the app download the models on first launch. Either choice is a one-time operation; once the models are there, they can be accessed from the same filesystem path on every startup.

Note, however, that if you store them in the app’s cache directory like we mention in some of our guides, the user can choose to clear the cache at will, forcing you to re-download/re-decompress, so you’ll need to check for the files’ existence before you use them.
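As a concrete illustration of that check, here's a small sketch (the file names and the `fetch` callback are hypothetical stand-ins for your own download/decompress code; only java.io is used):

```kotlin
import java.io.File

// Sketch: make sure each model file is present before starting the pipeline,
// re-fetching anything a user's cache clear removed. `fetch` is a hypothetical
// callback standing in for your download or asset-decompression code.
fun ensureModels(dir: File, names: List<String>, fetch: (File) -> Unit): List<File> =
    names.map { name ->
        val model = File(dir, name)
        if (!model.exists()) {
            dir.mkdirs()
            fetch(model) // one-time operation per missing file
        }
        model
    }
```

On subsequent launches the files are found at the same path, so `fetch` is never called again unless the cache was cleared.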

Cloud NLU and changing models

Spokestack does have a cloud NLU service that runs inference without storing models locally, but we don’t currently include a component for it in our mobile libraries; models run entirely on-device like you’ve mentioned.

If you change your model, you’ll need your app to re-download it. How this is managed on the app side is up to you, though; it doesn’t necessarily require releasing a new version if you have another mechanism for periodically checking for new models and downloading them automatically.

and what is custom dictionary and pronunciation?

I’m not sure what this means in the context of NLU; could you give a little more information? Are you talking about TTS here? If so, see our TTS documentation—our mobile libraries don’t currently expose a custom dictionary feature, but you can customize pronunciation by using SSML or Speech Markdown as described there.

NLU performance

The way NLU works in general makes configuring it almost as much art as science. It’s hard to diagnose without a little more information about your configuration and the sample utterances you’re trying, but a phrase starting with “what is” does seem likely to resemble one or two intents in a lot of common configurations, so the confidence scores, while perhaps disappointing, don’t surprise me much.

It sounds like you want to include what’s often known as a “fallback intent” in your configuration, something that I don’t believe Rasa supports by default, but which you can create on your own and include in your YAML. A fallback intent contains phrases that have nothing to do with your application’s domain, so creating a good one really depends on the other intents you have in your configuration and how strict you want to be with matching them.

Other strategies for deciding that you don't have a good match include rejecting anything under a certain confidence threshold, rejecting results where two intents have very similar confidence scores (though using the confidence score doesn't sound like an option for your case above), or rejecting NLU results that are missing a value for an important slot.
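As a sketch of the threshold and missing-slot checks (the `Result` class here is a hypothetical stand-in, not Spokestack's actual NLU result type, and the 0.85 threshold is an arbitrary starting point to tune per app):

```kotlin
// Hypothetical stand-in for an NLU result; field names are illustrative only.
data class Result(val intent: String, val confidence: Float, val slots: Map<String, String?>)

// Route to a fallback when confidence is low or a required slot is missing.
fun dispatch(result: Result, requiredSlots: List<String> = emptyList(), threshold: Float = 0.85f): String =
    when {
        result.confidence < threshold -> "fallback"
        requiredSlots.any { result.slots[it] == null } -> "fallback"
        else -> result.intent
    }
```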

Hope some of that helps! Feel free to follow up if you have any other questions.


Thanks a lot for the reply. About the custom dictionary and pronunciation: I saw them listed on your pricing page as a feature of the Maker plan, so I have no idea what they are.
I'm currently using "io.spokestack:spokestack-android:11.0.2", and withPipelineProfile doesn't seem to be available.
I dug into the code yesterday and couldn't find any information.
Is it possible to use a listener for the wake word without using ASR?
Right now I have my own implementation and state machine. This is how it works:

command → go to the NLU mapper; if the recognized command is missing a slot:
ask the user to provide the data for the slot (using a voice command)
user input → do an action (I'm using a state machine, so there's no need for NLU)

I have control and it works great, and using
works great and does what I want, so I can keep my own voice recognition (the user can change the voice language but keep the device in another language).
Is any callback possible without ASR? I want to display my own "UI", let's say like Google's.
I found that the speech event fires "ACTIVATE", but only when ASR is enabled. It would be awesome to have a callback for the wake word without ASR. Is this possible? I tried to dive into the SDK, but I can't find exactly how to make the trigger work. Thanks a lot; Spokestack works great for me. I'm already implementing it in my app, and after I finish my models (for other languages) I'll switch to the Maker subscription! Thanks a lot.


Ah, that’s a few versions old now. withPipelineProfile() was introduced in 11.2.0; the current version is 11.4.1.

In 11.0.2, you can use


but you won’t be able to use that in the middle of a builder call chain; you’ll have to store the Spokestack builder in a variable to be able to call build() after you set the pipeline profile.
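That workaround might look roughly like this (a sketch only; the `pipelineBuilder` accessor, `useProfile` signature, and profile class path should all be checked against the 11.0.2 API docs):

```kotlin
// Sketch for 11.0.2: the profile is set via the pipeline builder, which breaks
// the Spokestack.Builder call chain, so hold the builder in a variable first.
val builder = Spokestack.Builder()
    .setProperty("google-credentials", googleJson)
builder.pipelineBuilder
    .useProfile("io.spokestack.spokestack.profile.TFWakewordGoogleASR")
val spokestack = builder.build()
```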

Is it possible to use a listener for the wake word without using ASR?

Yes. To see how, take a look at our speech pipeline profiles, for example the wake word/Google ASR profile. Notice that it’s pure configuration—which input class to use, a list of pipeline stages, and configuration properties that are relevant to those stages.

To do what you’re asking, you can either:

  1. Create your own profile class that omits the ASR stage and pass its full class name to withPipelineProfile or pipelineBuilder.useProfile as you would one of Spokestack’s built-in profiles, or
  2. Grab pipelineBuilder wherever you’re setting up Spokestack and call setInputClass/setStageClasses on it just like a profile does.
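A sketch of option 1, assuming the stage classes used by the built-in wake word profiles (verify every class name here against your library version before relying on it):

```kotlin
import io.spokestack.spokestack.PipelineProfile
import io.spokestack.spokestack.SpeechPipeline

// Sketch of a wake-word-only profile: same shape as TFWakewordGoogleASR,
// but with no ASR stage at the end. Class names are placeholders to verify.
class WakewordOnlyProfile : PipelineProfile {
    override fun apply(builder: SpeechPipeline.Builder): SpeechPipeline.Builder =
        builder
            .setInputClass("io.spokestack.spokestack.android.MicrophoneInput")
            .setStageClasses(listOf(
                "io.spokestack.spokestack.webrtc.AcousticNoiseSuppressor",
                "io.spokestack.spokestack.webrtc.VoiceActivityDetector",
                "io.spokestack.spokestack.wakeword.WakewordTrigger"
                // no *SpeechRecognizer stage here
            ))
}
```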

Now, Spokestack is designed to handle ASR for you, so it controls the microphone as long as the speech pipeline is running. Without an ASR stage, you’ll get ACTIVATE speech events when the wake word is detected, but you of course won’t get any speech transcripts (those come in PARTIAL_RECOGNIZE and RECOGNIZE events). You also might have to manually stop the pipeline (Spokestack.stop()) to allow an external ASR to use the microphone, then start() it again when the external ASR is finished so that the wake word can be recognized again.

In general, I’d recommend keeping all audio processing within the Spokestack pipeline, but of course I’m biased :slight_smile: . Sharing the microphone between different subsystems can be tricky, as we learned while implementing AndroidSpeechRecognizer. If you’re using Google Cloud ASR, you can set the language for ASR independently of the device’s language using the locale configuration property you mentioned in your first message.


A couple more notes I thought of after sending my earlier reply:

  1. If you do want to use an external ASR, you might take a look at using PreASRMicrophoneInput as your input class and calling Spokestack.deactivate() yourself when your ASR is complete. This might end up being more straightforward and efficient than using MicrophoneInput and calling start/stop all the time.
  2. If none of our *SpeechRecognizer classes does what you need, you can also wrap the ASR code you need in a class that implements SpeechProcessor and add that to the speech pipeline as a stage instead of using a Spokestack-provided class for ASR. See the existing *SpeechRecognizer classes (Android | Google | Azure) for examples. This would allow you to keep processing inside the pipeline so you don’t have to work around it for ASR.
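A skeleton of option 2 might look like this (the SpeechContext calls are assumptions based on the events described in this thread, and the external-ASR parts are placeholders; see the linked *SpeechRecognizer sources for the real pattern):

```kotlin
import java.nio.ByteBuffer
import io.spokestack.spokestack.SpeechConfig
import io.spokestack.spokestack.SpeechContext
import io.spokestack.spokestack.SpeechProcessor

// Sketch: wrap an external ASR as a pipeline stage; not a real client.
class ExternalAsr(config: SpeechConfig) : SpeechProcessor {
    override fun process(context: SpeechContext, frame: ByteBuffer) {
        if (context.isActive) {
            // stream `frame` to the external ASR while the wake word keeps
            // the pipeline active
        } else {
            // finalize the request; when the transcript comes back:
            // context.transcript = finalText
            // context.dispatch(SpeechContext.Event.RECOGNIZE)
        }
    }

    override fun close() {
        // release resources held by the external ASR client
    }
}
```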

We've considered adding a tutorial on how to create your own speech pipeline stage for a while now; would that be helpful?


Thanks a lot! I updated my version of Spokestack (I had used the one from the sample). I used
and just removed the recognition pipeline, and it works as expected. Why did I want this? I need to set this up on the main screen, where I trigger an event and launch the "voice UI"; with the current pipeline I wouldn't be able to. I also get more control over whether to use speech recognition based on internet connectivity, premium status, and so on. But it's working now.
And I like that I have control and can use spokestack?.nlu?.classify(text) to get NLU results just as I would from the Spokestack pipeline, which is awesome!!!
So far I'm happy I found Spokestack, and I'm implementing this and the downloads so I can create my Maker account.
Just as a suggestion: it would be nice to have constants or dedicated methods for items like sample-rate and wake-detect-path; it can be a little complicated having these as map items.

And yes, a tutorial on pipeline stages would be nice!
Just a heads-up for anyone creating their own pipeline profile: since the class is referenced by a string, I had to add the @Keep annotation to the class, or ProGuard will remove it and the app will crash:

@Keep
class CustomTFWakewordGoogleASR : PipelineProfile

Great, glad everything’s working for you—thanks for the feedback!

One thing I wanted to note: you’ve mentioned processing multiple languages, so you might be working with separate NLU models for different languages. Our current NLU framework is English-centric; it won’t refuse to build models in another language, but those models might not perform as well as English models. Multilingualism is something we’re actively working on.

For your note on configuration properties: we've considered dedicated methods for those, but that would lead to an explosion of methods on either the configuration object or the Spokestack builder, and a new user would still have to realize which ones are required and call them one at a time to maintain the builder pattern, just as with the map-key version. As someone who's just gone through setup from scratch, can you give a bit more detail about how this would help you? Is it mainly for detecting misspellings in the property names?
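One lightweight stopgap on the app side, rather than new builder methods, is declaring the keys as constants so the compiler catches typos (only key strings mentioned in this thread are listed; the object name is made up):

```kotlin
// App-side constants for Spokestack property keys, to avoid misspelled
// map keys; key strings are the ones used elsewhere in this thread.
object SpokestackKeys {
    const val SAMPLE_RATE = "sample-rate"
    const val WAKE_DETECT_PATH = "wake-detect-path"
    const val GOOGLE_CREDENTIALS = "google-credentials"
}
```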


Thanks a lot for the response. I think it's good practice and avoids misspelling errors; if a class changes, it might not be easy to figure out what went wrong, but overall it's fine, and I'm happy with how it is now. Thanks a lot.
So if I upload other languages, it will work, but maybe not as well, right? Is there anything I can do to help improve this?
And one last question: how should I "import" the NLU? I'm uploading a Rasa NLU YAML file, which looks something like this:

- intent: search_place
  examples: |
    - Find [Restaurants](place) 
    - search for [gas stations](place)   
    - find the nearest [pharmacy](place) from me

The metadata seems to pick up the slots; however, if I say "find the nearest pharmacy place from me", exactly as in my intent, it detects the search_place intent but without any slot. I'm kind of new to NLU and AI, so maybe I'm missing something.
And last, if I upload, for example, "nlu_test.yaml", I can see it in my account, but what can I do if I want to delete or update the existing one? Should I upload it with the same name? Will that overwrite the old one, or do I always need to upload a new file with a different name? Thanks a lot!!

A model trained with non-English data should “work” as in “not throw any errors”, but I really can’t make any performance claims about it. Performance is partially dependent on the base data we train from, which is English for now.

We haven’t finished testing multilingual models yet, but if you’d like, we can try an experiment with one of your training files—I’ll send you a DM with details.

In general, you’ll want to provide as many examples of your slot values as you want to recognize, or at least as many as you can. Since the slots are done inline with examples in Rasa’s basic training data format, this can get repetitive, but it should help performance. You can also check out our in-house training data format if you’re interested in organizing things a little differently—we allow entities to be stored in separate files, which can be convenient if you’re working with a lot of different values.

If you’re testing with one of the examples from your training file, though, and the slot’s not being recognized, something else might be going on. I’ll ask for a bit more debugging info for that in the DM too.

If you keep the file name the same (capitalization doesn’t matter), a new upload should overwrite an old one, but as you’ve noticed, there’s no delete button for models you want to just get rid of. This is a UI issue on our end; I’ll bring it up with the team.

Awesome, thanks a lot! I'm moving to your in-house NLU format (my mistake, I didn't see it before), and I want to see if it improves recognition over the Rasa format. I have a question about the slots: the values here are just examples, right? For example, if I want to search for a "place"/city, I just need some examples like:

type = "entity"
values = [
    "Irish bar",
    "The mall",
    "gas station",
    "central park"
]
Is that correct? Thanks a lot, looking forward to releasing this soon!! :smiley:

Yep, that looks right. If you want to maximize performance, though, or reuse the location entity across different intents, you might find that you want to include more samples in a separate file. This is documented in the entity section farther down the page.

One thing I see in that particular list of examples: that’s a pretty diverse mix of location names. There are generic nouns like “home” and “place”, a few company names, and generic location terms like “mall” and “airport”, as well as “Irish bar” which is a descriptor + a generic location. This might work as is, but it’s not a category with very obvious boundaries, so if you notice a lot of false positives (things getting picked up as locations when you don’t mean them to), you might consider adding more examples or even splitting them up into separate slots like personal_location, business_location, company_location or something similar and treating all those slots the same in your app.

You mention cities in your text too; I’d definitely split those out into a different slot.

Making a fallback intent right now!

Do you have any recommendations regarding:

  • How many utterances should it include?
  • How long should these utterances be? (with regard to the length of the other intents, perhaps?)
  • Should every word in these utterances sound totally different from the words in the other intents?

Can I use the AMAZON.FallbackIntent somehow? In their editor, it seems to have no utterances, though.


I’ll answer the second one first: You can’t use Amazon’s fallback because, as you’ve noted, they don’t provide any training utterances. It’s possible they’re using a threshold on their model’s confidence to determine whether a given phrase should classify as one of your intents or the fallback. That’s something I’d recommend exploring with Spokestack too, probably before adding a new intent.

For an explicit fallback intent, here’s what I’d start with (experiment with these guidelines if they’re not working well for your specific app):

  • 10-20 utterances
  • different lengths (some a few words, some sentence-length, some longer than you expect your users’ utterances to be to capture the case where a user starts talking to someone else in the middle of talking to your app)
  • mostly different from the intents you want to capture, except for function words (things like pronouns and prepositions). This will vary a bit based on your app’s specific domain—if you have an air travel app, “book me a flight” might be included in one of your intents, but “he’s flighty” or “I had the best flight of beers yesterday” might be in your fallback.

Some of your fallback utterances can even be nonsense—ungrammatical sentences to capture ASR errors and give the model a class to return when it’s really unsure of the answer.

It’s also great if you have any logs of actual user conversations where a fallback should’ve fired: definitely include those.

Hope this helps—again, though, experimentation is key!


I'll try to make a good fallback intent, because the NLU confuses me. I wanted to work with thresholds, but for some reason the confidence is always super high, even when the sentences are totally different!

Some examples.
I/MainActivity: NLU classification: jump
intent: jump (confidence: 0.9975889)
stepNr: null
utterance: what’s going on
I/MainActivity: NLU classification: jump

I/MainActivity: intent: jump (confidence: 0.9998779)
stepNr: null
utterance: blue global religious
I/MainActivity: NLU classification: jump

I/MainActivity: intent: jump (confidence: 0.9802807)
stepNr: null
utterance: what’s up

I thought about the length as well, but they have similar lengths.
The utterances for the jump intent:
“go to (number)”
“jump to (number)”
“navigate to (number)”

Am I missing something?

EDIT: Just realized that the numbers can go from 0 to 9999, so would that mean all the things that sound like one of those numbers will cause this high confidence?

In general, word sounds don't factor into the equation, only word similarity—which is a function of the contexts the words are typically used in. In the case of integer slots, there is one extra caveat: the canon field will cause non-numeric words to be included in the generated training data as a hedge against potential ASR errors. For example, an utterance of I want {number} could generate training utterances of both I want two and I want to. You could try disabling that by setting canon = false in your number slot, but I don't think that's the problem here.

To be honest, I’m a bit surprised at these confidence values, especially the second one. How many other intents are in your model, and how different are they from the jump utterances?

Hi Josh,

"domain": "skills",
"intents": [
  { "name": "invokeProtocol", "description": "", "implicit_slots": [], "slots": [] },
  { "name": "next", "description": "", "implicit_slots": [], "slots": [] },
  { "name": "previous", "description": "", "implicit_slots": [], "slots": [] },
  { "name": "repeat", "description": "", "implicit_slots": [], "slots": [] },
  { "name": "jump", "description": "", "implicit_slots": [], "slots": [
      { "name": "stepNr", "capture_name": "stepNr", "description": "", "type": "integer",
        "facets": "{\"ordinals\": false, \"range\": [0, 99999]}" }
    ] },
  { "name": "makeANote", "description": "", "implicit_slots": [], "slots": [] },
  { "name": "makeAStepNote", "description": "", "implicit_slots": [], "slots": [] }
],
"tags": [

This is from my metadata file. I made it with Alexa Skills because I was under the impression that the built-in fallback intent there could be included; I see now that that's not the case. I can make the model again, but this time doing it the Spokestack way with TOML files, and add that canon parameter.

Sounds like a plan. I’d be interested to know if it works better just by being written in TOML; perhaps there’s room for improvement in the Alexa conversion.

Hi Josh, making it in TOML indeed fixed that issue... super weird. Now we get slightly more "normal" values. Another thing was very interesting: when we changed result.confidence from a val to a var, the results appeared to be much more natural. Could it be that it reiterates, and the val was blocking that since it's immutable?

Definitely interesting... As for your second question, it depends on how your code is structured. You shouldn't be able to reassign something you've declared as a val. The library is written in Java and returns a completely new instance of NLUResult for each classification, so if you're storing results somewhere for reuse, a var would probably be best.