Feature request (Android)

Hello! I’m happy that I found Spokestack and have already started integrating it into my app. So far it’s great, and I’ll keep improving my setup over time. It works well, and I understand the “pipeline” idea: devs just need to “activate” it and Spokestack does everything. Sadly (at least for me, and I think for several other devs), it isn’t flexible enough when you want full control (I’ll explain why below).
I would like to request splitting it into three modules:

  • NLU recognition
  • wake word detection
  • Spokestack TTS

Unfortunately, all three rely on a single initialization, and the callbacks must be registered on that one object. This is hard for me because if I want to use Spokestack TTS while the NLU model files haven’t been downloaded yet, I still have to initialize everything but somehow skip the missing files, which is quite a complex task. The same goes for the wake word.
I want to use coroutines and have a base TTS abstraction like Aimybox, or use clean architecture, but that’s a little hard when the Spokestack initialization owns the callbacks. I can register a callback from another class, but then I have to make sure to clean it up, and that can get complicated down the road. It would be nice to have three independent modules so these concerns stay isolated.
Another example is the wake word. On my Pixel 5 there is a bug (from Google) where the “beep” for voice activation doesn’t play. With the pipeline’s wake word I can handle the detection event and show the UI, process something, or do whatever else is needed before activating the microphone. I’ve found a workaround for now, but I think it would be better if these three components could work independently of a single Spokestack pipeline.

For example, here is what I did for TTS with Azure:

import android.content.Context
import com.microsoft.cognitiveservices.speech.ResultReason
import com.microsoft.cognitiveservices.speech.SpeechConfig
import com.microsoft.cognitiveservices.speech.SpeechSynthesisCancellationDetails
import com.microsoft.cognitiveservices.speech.SpeechSynthesizer
import kotlinx.coroutines.suspendCancellableCoroutine
import java.util.Locale
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException

// BaseTextToSpeech, TextSpeech, and AzurePlatformTextToSpeechException are
// abstractions from my app (modeled on Aimybox).
class AzureCloudTextToSpeech(context: Context, azureKey: String) : BaseTextToSpeech(context) {

    private val client = SpeechConfig.fromSubscription(azureKey, "eastus").apply {
        speechSynthesisLanguage = Locale.getDefault().toLanguageTag()
    }
    private val synthesizer = SpeechSynthesizer(client)

    override suspend fun speak(speech: TextSpeech) {
        suspendCancellableCoroutine<Unit> { continuation ->
            // If the coroutine is cancelled, stop any in-progress synthesis.
            continuation.invokeOnCancellation { synthesizer.StopSpeakingAsync() }
            try {
                // SpeakText blocks until synthesis completes or is cancelled.
                val result = synthesizer.SpeakText(speech.text)
                when (result.reason) {
                    ResultReason.SynthesizingAudioCompleted -> continuation.resume(Unit)
                    ResultReason.Canceled -> {
                        val message =
                            SpeechSynthesisCancellationDetails.fromResult(result).toString()
                        continuation.resumeWithException(
                            AzurePlatformTextToSpeechException(null, message)
                        )
                    }
                    else -> continuation.resume(Unit)
                }
            } catch (ex: Exception) {
                continuation.resumeWithException(
                    AzurePlatformTextToSpeechException(null, "unexpected error: ${ex.message}", ex)
                )
            }
        }
    }

    override suspend fun stop() {
        super.stop()
        synthesizer.StopSpeakingAsync()
    }
}

This code is similar to what I found in the Aimybox open source project. I liked the idea that I can choose between different components, and I could do something similar with Spokestack, but it would be complicated and would probably lead to a memory leak at some point.
It would be great to have this.
Also, I noticed that the Spokestack TextToSpeech is using OkHttp. It would be great to be able to supply a custom HTTP client. Why? Because I saw, for example, a 5-second timeout. When we wait that long for a voice assistant, it can feel like a lot to the user; sometimes I might want to set up 1 second, for other clients 3 seconds, and so on. Right now I can’t do this.

Having an independent Spokestack TTS would help me clean up resources and manage things better in the app without affecting the “pipeline”.
I hope I explained this clearly; I can help if you need further explanation. Thanks a lot, and great work!
Best

Modularity

I have good news for you :slight_smile: Spokestack already works like this!

The Spokestack wrapper is a convenience class that’s designed to make it easy to experiment with voice and prototype without a bunch of complicated multi-step setup. It makes a “quickstart guide” possible. The underlying system, though, was designed to be modular from the start.

If you examine the setup code in Spokestack, you’ll see that under the hood, it’s just wiring together three separate modules: SpeechPipeline, NLUManager, and TTSManager. Each module has its own builder, and they all work in a similar fashion, with a SpeechConfig and a listener that receives events relevant to that module. You can completely ignore the Spokestack class and use these three modules as independent systems.
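
Here’s a rough sketch of that wiring in Kotlin. Treat it as approximate: the paths, credentials, and listeners are placeholders, and each builder’s documentation has the full option list.

import io.spokestack.spokestack.SpeechPipeline
import io.spokestack.spokestack.nlu.tensorflow.TensorflowNLU
import io.spokestack.spokestack.tts.TTSManager

// Wake word + ASR on its own, with no NLU or TTS involved.
val pipeline = SpeechPipeline.Builder()
    .useProfile("io.spokestack.spokestack.profile.TFWakewordAndroidASR")
    .setProperty("wake-filter-path", filterPath)
    .setProperty("wake-encode-path", encodePath)
    .setProperty("wake-detect-path", detectPath)
    .addOnSpeechEventListener(speechListener)
    .build()

// NLU on its own (TensorflowNLU here, or the more general NLUManager);
// initialize it whenever its model files are ready.
val nlu = TensorflowNLU.Builder()
    .setProperty("nlu-model-path", nluModelPath)
    .setProperty("nlu-metadata-path", nluMetadataPath)
    .setProperty("wordpiece-vocab-path", vocabPath)
    .build()

// TTS on its own, usable before any NLU files exist.
val tts = TTSManager.Builder()
    .setTTSServiceClass("io.spokestack.spokestack.tts.SpokestackTTSService")
    .setOutputClass("io.spokestack.spokestack.tts.SpokestackTTSOutput")
    .setProperty("spokestack-id", clientId)
    .setProperty("spokestack-secret", clientSecret)
    .setAndroidContext(applicationContext)
    .addTTSListener(ttsListener)
    .build()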

Another thing you’ll notice about those modules, though, is that they use class names for setup just like the speech pipeline profiles you’ve encountered. This is done so that it’s straightforward to add your own classes as plugins and still use the Spokestack API to manage your voice interaction rather than, say, using Spokestack’s TTS and its API to manage English TTS, but having completely different code to handle a different language. If you wrap a third-party TTS system in classes that implement the Spokestack TTS interfaces, you can provide those classes to TTSManager and keep everything using the same API.
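
For instance, if you wrapped your Azure code above in a class that implements Spokestack’s TTS service contract, registering it would just be a matter of passing its class name (the wrapper class name below is hypothetical):

// "com.example.tts.AzureSpokestackTTSService" is a hypothetical wrapper that
// would implement the TTSService contract in io.spokestack.spokestack.tts.
val tts = TTSManager.Builder()
    .setTTSServiceClass("com.example.tts.AzureSpokestackTTSService")
    .setOutputClass("io.spokestack.spokestack.tts.SpokestackTTSOutput")
    .setAndroidContext(applicationContext)
    .addTTSListener(ttsListener)
    .build()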

As a final note, you might also experiment with including certain models (the wake word/keyword models are especially lightweight) in your app’s APK so you don’t have to wait for those downloads at all. I do understand wanting to keep APK size down, though.
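
Bundling just means shipping the files in src/main/assets and copying them somewhere readable by path at startup. A minimal sketch using plain Android APIs (the asset name is whatever you ship):

import android.content.Context
import java.io.File

// Copies a bundled asset to the cache directory (once) and returns its path,
// suitable for file-path properties like "wake-detect-path".
fun Context.cachedAssetPath(name: String): String {
    val target = File(cacheDir, name)
    if (!target.exists()) {
        assets.open(name).use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return target.absolutePath
}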

TTS client issues

Having a configurable timeout in the TTS client is perfectly reasonable; I’ll add a GitHub issue for making that an option. Contributions/PRs are welcome! I don’t think it’d require using an http client other than OKHttp, though; can you elaborate on that a bit?
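
For context, OkHttp already supports per-client timeouts, so the change is mostly about exposing a knob. A 1-second setup would look something like this standard OkHttp configuration (the values are illustrative, not the actual Spokestack API):

import okhttp3.OkHttpClient
import java.util.concurrent.TimeUnit

// Standard OkHttp configuration: a hard 1-second cap on the whole call.
val client = OkHttpClient.Builder()
    .connectTimeout(1, TimeUnit.SECONDS)
    .callTimeout(1, TimeUnit.SECONDS)
    .build()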

To add a bit of info to my section on modularity: since the Spokestack class is a relatively recent addition, a lot of our documentation references the individual modules. Here are links to the speech pipeline, NLU (though it still mentions the first incarnation of NLU, TensorflowNLU, instead of the more general NLUManager; we’ll update that soon), and TTS.