Description
The talk "Wreck a Nice Beach in the Browser: Getting the Browser to Recognize Speech" by @kdavis-mozilla articulates the standardization struggle around the Web Speech API, with a focus on its speech recognition part.
My interpretation is that this API's speech recognition issues fall into two broad categories:
- API design issues, for example:
The current Web Speech API reflects the times in which it was originally written, about 10 years ago.
In particular, it does not take advantage of subsequent platform advances such as the Web Audio API.
- Privacy issues:
Questions of privacy that were present in the original API, and new ones that have arisen since it was written, nip at the heels of standardization. For example:
If speech recognition happens server side, as it does in the vast majority of cases, and your speech is retained to help train future speech recognition engines, as is now standard in the industry, how is the GDPR right of erasure implemented?
How does the Web Speech API handle the issues of consent that arise when speech data is stored and reused server side?
Slide 10 of the talk summarizes the pros and cons of placing the speech recognition engine on the client vs. the server.
The industry at large still seems undecided on whether the speech recognition engine should sit on the client or on the server, and the Web Speech API spec reflects that indecision by remaining agnostic. While the API design issues are generally easier to resolve, the privacy issues, with their regulatory dimension, are multifaceted and complex.
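For context, here is a minimal sketch of how a page drives recognition through the current API. Note that the constructor is vendor-prefixed as webkitSpeechRecognition in Chromium-based browsers, and the API says nothing about whether the audio leaves the device:

```javascript
// Minimal sketch of the current Web Speech API recognition flow.
// Returns null where the API is unavailable (e.g. outside a browser).
function getRecognizer() {
  // Chromium exposes the constructor only under the webkit prefix.
  const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  if (!SR) return null;

  const rec = new SR();
  rec.lang = "en-US";          // BCP 47 language tag for recognition
  rec.interimResults = false;  // deliver only final results
  rec.onresult = (event) => {
    // First alternative of the first result: the recognized transcript.
    console.log(event.results[0][0].transcript);
  };
  return rec; // caller invokes rec.start() on a user gesture
}
```

Nothing in this surface tells the page, or the user, whether the engine behind `rec` is embedded or a remote service.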
Questions:
- I'm wondering whether it'd be reasonable to revisit this API design consideration in the spec:
The API itself is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based/embedded recognition and synthesis.
What if users could set a preference to allow web sites to use the speech recognition feature only when they can be confident their privacy is preserved? With advances in both DNN-based models and hardware accelerators for speech recognition embedded in modern clients, could a client-side engine be a pragmatic solution to the privacy issues?
How does a modern client-side engine perform on key UX metrics (latency, quality) compared with widely used server-based recognition solutions?
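To make the preference idea concrete, here is a hypothetical sketch of the gating logic. Both the `onDeviceOnly` preference and the `runsOnDevice` capability flag are assumptions for illustration; neither exists in the current Web Speech API:

```javascript
// Hypothetical sketch, not part of the spec: gate recognition on a
// user privacy preference. "onDeviceOnly" (user preference) and
// "runsOnDevice" (engine capability flag) are assumed names.
function shouldAllowRecognition(prefs, engine) {
  if (prefs.onDeviceOnly) {
    // User demands on-device processing: proceed only when the
    // engine advertises a client-side (embedded) recognizer.
    return engine.runsOnDevice === true;
  }
  // No restriction set: server-based engines are also acceptable.
  return true;
}
```

A user agent enforcing something like this would need the spec to expose where recognition runs, which the current agnostic design deliberately avoids.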