Description
I'll list here some of my findings so that you can challenge them if you've got better ideas. Most of them break the architecture we initially agreed on during the dedicated call.
- In continuous playback, you cannot queue consecutive chunks of the same audio file in your audio engine's playlist: timestamps are not precise enough, so some frames are skipped and you get audible noise at each boundary.
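To illustrate why millisecond timestamps can't be frame-accurate, here is a small sketch (all names and numbers are my own, not from any player API): at common sample rates, a millisecond boundary usually falls between two frames, so cutting a chunk there either drops or repeats part of a frame.

```python
# Hypothetical illustration: millisecond chunk boundaries vs. frame boundaries.

SAMPLE_RATE = 44_100  # frames per second

def boundary_frame(timestamp_ms: int) -> float:
    """Exact (fractional) frame index for a millisecond timestamp."""
    return timestamp_ms * SAMPLE_RATE / 1000

def frames_lost_at_boundary(timestamp_ms: int) -> float:
    """Fractional frame dropped if each chunk rounds its start down."""
    exact = boundary_frame(timestamp_ms)
    return exact - int(exact)

# A chunk boundary at 333 ms lands between frames, so splitting the file
# there cannot be clean: part of a frame is skipped -> audible click.
print(boundary_frame(333))  # 14685.3
print(frames_lost_at_boundary(333))
```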
- Automatically pausing after a chunk of audio prevents pausing at the orchestration level. The player can only report progress at discrete intervals, so the pause point cannot be precise. You could measure elapsed time before pausing instead, but I'm afraid that won't be very reliable: at least on Android, there's a lot of thread switching, in addition to native-JVM communication, before the player actually executes a command or your callbacks get called.
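A rough way to quantify the point above (polling interval and latency figures are illustrative assumptions, not measurements): if the orchestration pauses at the first progress callback at or past the target, the overshoot is bounded below by the callback granularity plus command latency.

```python
# Hypothetical sketch: where playback actually stops when the orchestration
# pauses based on periodic progress callbacks.

def pause_position_ms(target_ms: int, poll_interval_ms: int, latency_ms: int) -> int:
    """Simulate: pause is issued at the first poll at or after the target,
    and playback keeps running during the command latency."""
    polls = range(0, target_ms + 2 * poll_interval_ms, poll_interval_ms)
    first_poll_past_target = next(t for t in polls if t >= target_ms)
    return first_poll_past_target + latency_ms

# Target: pause at 333 ms, with 250 ms progress callbacks and ~50 ms of
# thread-hopping before the player executes the command.
print(pause_position_ms(333, 250, 50))  # 550 -> 217 ms overshoot
```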
- Because of the same latency issues between the orchestration and the player, fine-grained silence control (200 ms up to 2 s) is not possible at the orchestration level.
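One way around this, consistent with the "bake everything into the content" direction below (names and parameters here are my own sketch): render the silence as zeroed PCM frames inside the audio stream itself, so the player reproduces it with sample accuracy and no orchestration round-trip.

```python
# Hypothetical sketch: sample-accurate silence baked into the stream
# as zeroed 16-bit PCM frames, instead of a timed pause command.

SAMPLE_RATE = 44_100
CHANNELS = 2
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def silence_frames(duration_ms: int) -> bytes:
    """Zeroed PCM covering exactly the requested duration."""
    n_frames = SAMPLE_RATE * duration_ms // 1000
    return b"\x00" * (BYTES_PER_SAMPLE * CHANNELS * n_frames)

# 200 ms of stereo 16-bit silence at 44.1 kHz:
print(len(silence_frames(200)))  # 35280 bytes
```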
- General feeling induced by the three previous points: to limit latency, playback should be handled by as few components as possible, and we should use advanced lower-latency features when available instead of emulating them at the orchestration level. Ideally, a single audio player handles the whole content, and it knows ahead of time when to pause and when to insert silences.
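The "single player that knows everything ahead of time" idea could look like this: the orchestration compiles one flat playlist of audio segments and silences before playback starts, so no runtime decision adds latency. All type and field names are assumptions for illustration.

```python
# Hypothetical sketch: a precompiled playlist mixing audio segments and
# explicit silences, handed to a single player up front.

from dataclasses import dataclass
from typing import Union

@dataclass
class Audio:
    uri: str
    start_ms: int
    end_ms: int

@dataclass
class Silence:
    duration_ms: int

Entry = Union[Audio, Silence]

def total_duration_ms(playlist: list) -> int:
    """Because everything is known ahead of time, the full timeline
    (and every pause point) is computable before playback starts."""
    return sum(
        e.duration_ms if isinstance(e, Silence) else e.end_ms - e.start_ms
        for e in playlist
    )

playlist = [
    Audio("chapter1.mp3", 0, 12_000),
    Silence(500),
    Audio("chapter1.mp3", 12_000, 30_000),
]
print(total_duration_ms(playlist))  # 30500
```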
- For all the reasons above, we should play presynthesized audio content when possible instead of relying on "live" TTS engines. This is possible on mobile. Additional advantages include a clear separation between the TTS accessibility queue and the publication content, which will behave like any other media, and the ability to skip a given number of seconds.
- The engine adapter should not be responsible for the presynthesis orchestration. Let's treat the concepts of engine and voice as equivalent, as I always do, because an engine always generates content with a single specific voice. Since we must be able to change the voice anywhere in the content (for instance, for a single word), we need presynthesis orchestration at the global level. Besides, a voice can give the orchestration a rough idea of how fast it synthesizes content. Having presynthesis orchestration both inside and outside the engine would be redundant.
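A minimal sketch of how the global orchestration could use a voice's speed hint (every name and number below is an illustrative assumption): each voice exposes a rough real-time factor, and the scheduler derives how many utterances to presynthesize ahead to keep the buffer from draining.

```python
# Hypothetical sketch: one global presynthesis scheduler sizing its
# prefetch from a per-voice speed hint.

import math
from dataclasses import dataclass

@dataclass
class Voice:
    name: str
    # Seconds of synthesis work per second of produced audio (lower = faster).
    real_time_factor: float

def prefetch_depth(voice: Voice, utterance_audio_s: float,
                   min_buffer_s: float = 5.0) -> int:
    """How many utterances to synthesize ahead so that finished audio
    covers the next synthesis run plus a safety buffer."""
    synth_time_s = utterance_audio_s * voice.real_time_factor
    needed_audio_s = synth_time_s + min_buffer_s
    return max(1, math.ceil(needed_audio_s / utterance_audio_s))

# A fast voice needs a shallow prefetch; a slow one needs a deeper queue.
print(prefetch_depth(Voice("fast", 0.2), 3.0))  # 2
print(prefetch_depth(Voice("slow", 1.5), 3.0))  # 4
```

The point is that the speed hint crosses the engine boundary, while the scheduling decision stays global, so the logic isn't duplicated inside each engine adapter.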