So it seems that if you train the vocoder on the mel-spectrograms predicted by the text-to-mel model (Tacotron 2), you get better results, right?
Using this:
https://github.com/jik876/hifi-gan
The mel dataset creator returns the following:
(mel.squeeze(), audio.squeeze(0), filename, mel_loss.squeeze())
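For reference, here is a rough sketch of what those four items seem to correspond to. This is my own simplification; the helper names and shapes are placeholders, not the repo's actual functions:

```python
from typing import Optional
import numpy as np

def compute_mel(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the STFT + mel filterbank step; returns a dummy [n_mels, frames] array."""
    n_mels, hop = 80, 256
    return np.zeros((n_mels, max(1, len(audio) // hop)), dtype=np.float32)

def get_item(audio: np.ndarray, predicted_mel: Optional[np.ndarray], fine_tuning: bool):
    if fine_tuning:
        # Input mel (x) comes from the acoustic model's predictions, e.g. Tacotron 2.
        mel = predicted_mel
    else:
        # Otherwise the input mel is computed from the ground-truth audio itself.
        mel = compute_mel(audio)
    # The loss mel (y_mel) is always computed from the ground-truth audio,
    # so it serves as the reconstruction target whether or not we fine-tune.
    mel_loss = compute_mel(audio)
    return mel, audio, "filename.wav", mel_loss
```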
In the training it looks as follows:
x, y, _, y_mel = batch
But when not fine-tuning, x and y_mel are the same. Where in the paper can I look to better understand this?
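As far as I can tell, the two mels play different roles in the generator update; here is a minimal sketch of my understanding (generator and mel_spectrogram stand in for the repo's actual objects, and 45 is the mel-loss weight from the paper):

```python
import torch.nn.functional as F

def generator_mel_loss(generator, mel_spectrogram, x, y_mel):
    # x: the input mel, which is the Tacotron 2 prediction when fine-tuning.
    y_g_hat = generator(x)
    # Mel of the waveform the generator produced.
    y_g_hat_mel = mel_spectrogram(y_g_hat.squeeze(1))
    # y_mel: mel of the ground-truth audio, used only as the L1 reconstruction target.
    return F.l1_loss(y_mel, y_g_hat_mel) * 45
```

If I understand correctly, this is why fine-tuning on predicted mels helps: the generator learns to map the (imperfect) predicted mels back to audio whose mel matches the real recording.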