Voice assistant example - the "command" tool #171

@ggerganov

There seems to be significant interest in a voice assistant application of Whisper, similar to "Ok, Google", "Hey Siri", "Alexa", etc. The existing stream tool is not well suited to this use case, because voice assistant commands are usually short (e.g. play some music, turn on the TV, kill all humans, feed the baby), while stream expects a continuous stream of speech.

Therefore, implement a basic command-line tool called command that does the following:

  • Upon start, asks the person to say a "key phrase". The phrase should be an average sentence that normally takes 2-3 seconds to pronounce. We want to have enough "training" data of the person's voice.
  • If the transcribed text matches the expected phrase, we "remember" this audio and use it later. Otherwise, we ask the person to say it again until we succeed.
  • We start listening continuously for voice activity using the VAD detector that I implemented for talk.wasm. I think it works very well given its simplicity.
  • When we detect speech, we prepend the recorded key phrase to the last 2-3 seconds of the live audio and transcribe.
  • The result should be: [key phrase][command], so by knowing the key phrase we can extract only the [command].
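
The text-side steps above can be sketched roughly as follows. This is only an illustration in Python, not the proposed tool: the helper names (normalize, matches_key_phrase, build_input, extract_command), the 0.8 similarity threshold and the fuzzy matching via difflib are all assumptions made for the example. The transcription step itself is left out; the functions only operate on the text and sample buffers around it.

```python
import re
from difflib import SequenceMatcher

SAMPLE_RATE = 16000  # Whisper models consume 16 kHz mono audio


def normalize(text):
    """Lowercase, drop punctuation and split into words (hypothetical helper)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def similarity(a_words, b_words):
    """Character-level similarity in [0, 1] between two word lists."""
    return SequenceMatcher(None, " ".join(a_words), " ".join(b_words)).ratio()


def matches_key_phrase(transcript, key_phrase, threshold=0.8):
    """Enrollment check: accept the recording if it is close enough to the phrase.

    The threshold is an assumed value, not one taken from the proposal.
    """
    return similarity(normalize(transcript), normalize(key_phrase)) >= threshold


def build_input(key_audio, live_audio, tail_seconds=3.0):
    """Prepend the remembered key-phrase audio to the tail of the live audio."""
    tail = int(tail_seconds * SAMPLE_RATE)
    recent = list(live_audio[-tail:]) if tail > 0 else []
    return list(key_audio) + recent


def extract_command(transcript, key_phrase):
    """Strip the [key phrase] prefix from a "[key phrase][command]" transcript.

    Tries every word boundary and keeps the split whose prefix looks most
    like the key phrase, so small transcription differences are tolerated.
    """
    words = normalize(transcript)
    key = normalize(key_phrase)
    best_split = max(range(len(words) + 1),
                     key=lambda i: similarity(words[:i], key))
    return " ".join(words[best_split:])
```

For example, with the key phrase "ok whisper start listening for commands", calling extract_command on "Ok Whisper, start listening for commands. Turn on the TV!" leaves only the words after the key phrase. Matching on the best prefix rather than on an exact string is a design choice for this sketch, since the transcription of the key phrase may vary slightly between runs.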

This should work on the Web and on Raspberry Pi and, thanks to the VAD, it will be energy efficient.
It should be a good starting example for creating a voice assistant.
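
To illustrate why a VAD helps with energy use, here is a minimal sketch of gating the expensive transcription behind a cheap loudness check. It is a generic energy-threshold stand-in, not the detector from talk.wasm; the function names and the last_ms, ratio and floor parameters are assumptions chosen for the example.

```python
def mean_energy(samples):
    """Average absolute amplitude of a buffer of float samples."""
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def speech_started(window, sample_rate=16000, last_ms=500, ratio=2.0, floor=1e-3):
    """Return True when the newest `last_ms` of audio is clearly louder
    than the audio that came before it in the same window.

    All parameter values here are assumed defaults for illustration.
    """
    n_last = sample_rate * last_ms // 1000
    if len(window) <= n_last:
        return False  # not enough history to estimate the background level
    recent = mean_energy(window[-n_last:])
    background = mean_energy(window[:-n_last])
    return recent > floor and recent > ratio * background


def run_gated(windows, transcribe):
    """Call the (expensive) `transcribe` callable only for windows with speech."""
    results = []
    for window in windows:
        if speech_started(window):
            results.append(transcribe(window))
    return results
```

With this kind of gate, silent or steadily noisy windows cost only a pass over the samples, and the model runs only when the recent audio stands out from the background.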

Labels: ideas (Interesting ideas for experimentation)
