
Section 4.4 End-to-End Speech Synthesis #169

@freedomtowin


So it seems that if you train the vocoder on the mel-spectrograms predicted by the text-to-spectrogram model (Tacotron 2), you get better results, right?

Using this:
https://github.com/jik876/hifi-gan

The mel dataset class returns the following:


```python
(mel.squeeze(), audio.squeeze(0), filename, mel_loss.squeeze())
```

In the training loop it is unpacked as follows:

```python
x, y, _, y_mel = batch
```

But when not fine-tuning, x and y_mel seem to be the same (see my sketch below). Where in the paper can I look to better understand this?
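
For reference, this is how I currently read the data flow. It is a simplified, self-contained sketch I wrote myself, not the repo's code: `mel_spectrogram`, `get_item`, `predicted_mel`, `fmax`, and `fmax_loss` are stand-in names for what `meldataset.py` does, and the placeholder spectrogram is random just so the snippet runs.

```python
import numpy as np

def mel_spectrogram(audio, fmax=None):
    # Placeholder: the real repo computes an STFT + mel filterbank here.
    return np.random.rand(80, max(1, len(audio) // 256))

def get_item(audio, fine_tuning=False, predicted_mel=None,
             fmax=8000, fmax_loss=None):
    if fine_tuning:
        # x is a Tacotron 2 prediction loaded from disk.
        x = predicted_mel
    else:
        # x is recomputed from the ground-truth waveform.
        x = mel_spectrogram(audio, fmax=fmax)

    # y_mel is always computed from the ground-truth waveform,
    # so without fine-tuning x and y_mel come from the same audio.
    y_mel = mel_spectrogram(audio, fmax=fmax_loss)
    y = audio
    return x, y, "filename.wav", y_mel

audio = np.random.randn(22050).astype(np.float32)
x, y, _, y_mel = get_item(audio)  # not fine-tuning: x and y_mel both from ground truth
```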
