Commit cb79fb9

lym0302 authored and luotao1 committed
[TTS] add opencpop PWGAN example (PaddlePaddle#3031)
* add opencpop voc, test=tts * soft link
1 parent 4b17e83 commit cb79fb9

12 files changed (+573, -2 lines changed)

examples/opencpop/voc1/README.md

Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# Parallel WaveGAN with Opencpop
2+
This example contains code used to train a [Parallel WaveGAN](http://arxiv.org/abs/1910.11480) model on [Opencpop](https://wenet.org.cn/opencpop/), a Mandarin singing corpus.
3+
4+
## Dataset
5+
### Download and Extract
6+
Download Opencpop from its [Official Website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. The dataset is then located in `~/datasets/Opencpop`.
7+
8+
## Get Started
9+
Assume the path to the dataset is `~/datasets/Opencpop`.
10+
Run the command below to
11+
1. **source path**.
12+
2. preprocess the dataset.
13+
3. train the model.
14+
4. synthesize wavs.
15+
- synthesize waveform from `metadata.jsonl`.
16+
```bash
17+
./run.sh
18+
```
19+
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run a single stage. For example, the following command only preprocesses the dataset.
20+
```bash
21+
./run.sh --stage 0 --stop-stage 0
22+
```
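The stage selection can be pictured as a simple range test: a stage numbered `i` runs when `stage <= i <= stop_stage`, which mirrors the `-le` / `-ge` checks in `run.sh`. The sketch below is only an illustration in Python; the helper `stages_to_run` and the stage labels are hypothetical names, not part of the repository.

```python
from typing import Dict, List


def stages_to_run(stage: int, stop_stage: int, all_stages: Dict[int, str]) -> List[str]:
    """Return the labels of every stage i with stage <= i <= stop_stage."""
    return [name for i, name in sorted(all_stages.items()) if stage <= i <= stop_stage]


# Stage numbers follow run.sh in this example; the labels are illustrative.
STAGES = {
    0: "preprocess",
    1: "train",
    2: "synthesize",
    3: "dygraph_to_static",
    4: "PTQ_static",
}

only_preprocess = stages_to_run(0, 0, STAGES)   # like --stage 0 --stop-stage 0
everything = stages_to_run(0, 100, STAGES)      # like the defaults in run.sh
print(only_preprocess)
print(everything)
```

Because the default `stop_stage` is 100, running `./run.sh` without options selects every stage.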
23+
### Data Preprocessing
24+
```bash
25+
./local/preprocess.sh ${conf_path}
26+
```
27+
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
28+
29+
```text
30+
dump
├── dev
│   ├── norm
│   └── raw
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── feats_stats.npy
41+
```
42+
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log-magnitude mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
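The idea of normalizing every split with statistics from the training set can be sketched in a few lines. This is a minimal pure-Python illustration, assuming per-dimension mean and (population) standard deviation; the helper names `fit_stats` and `normalize`, the small `eps` guard, and the toy data are assumptions, and the actual layout of `feats_stats.npy` is not shown here.

```python
from statistics import fmean, pstdev


def fit_stats(frames):
    """Per-dimension mean and standard deviation over a list of feature frames."""
    columns = list(zip(*frames))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return means, stds


def normalize(frames, means, stds, eps=1e-8):
    """Apply (x - mean) / std per dimension; eps (an assumption) avoids division by zero."""
    return [[(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
            for frame in frames]


# Toy 2-dimensional "spectrogram frames" standing in for real features.
train = [[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]]
dev = [[2.0, 14.0]]

# Statistics come from the training set only ...
means, stds = fit_stats(train)
# ... and are reused unchanged for dev (and test).
dev_norm = normalize(dev, means, stds)
print(means, dev_norm)
```

Reusing the training statistics keeps the `dev` and `test` features on the same scale the model sees during training.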
43+
44+
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id of each utterance and the path to its spectrogram.
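A JSON Lines file stores one JSON object per line, so it can be read with the standard library alone. The sketch below writes a tiny sample file and reads it back; the field names `utt_id` and `feats` and the sample paths are assumptions made for illustration, and the real records may carry other fields.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records: field names and paths are illustrative only.
sample = [
    {"utt_id": "utt_0001", "feats": "dump/train/raw/utt_0001_feats.npy"},
    {"utt_id": "utt_0002", "feats": "dump/train/raw/utt_0002_feats.npy"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    records = read_jsonl(path)

for record in records:
    print(record["utt_id"], record["feats"])
```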
45+
46+
### Model Training
47+
```bash
48+
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
49+
```
50+
`./local/train.sh` calls `${BIN_DIR}/train.py`.
51+
Here's the complete help message.
52+
53+
```text
54+
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
55+
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
56+
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
57+
[--run-benchmark RUN_BENCHMARK]
58+
[--profiler_options PROFILER_OPTIONS]
59+
60+
Train a ParallelWaveGAN model.
61+
62+
optional arguments:
63+
-h, --help show this help message and exit
64+
--config CONFIG ParallelWaveGAN config file.
65+
--train-metadata TRAIN_METADATA
66+
training data.
67+
--dev-metadata DEV_METADATA
68+
dev data.
69+
--output-dir OUTPUT_DIR
70+
output dir.
71+
--ngpu NGPU if ngpu == 0, use cpu.
72+
73+
benchmark:
74+
arguments related to benchmark.
75+
76+
--batch-size BATCH_SIZE
77+
batch size.
78+
--max-iter MAX_ITER train max steps.
79+
--run-benchmark RUN_BENCHMARK
80+
runing benchmark or not, if True, use the --batch-size
81+
and --max-iter.
82+
--profiler_options PROFILER_OPTIONS
83+
The option of profiler, which should be in format
84+
"key1=value1;key2=value2;key3=value3".
85+
```
86+
87+
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
88+
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the `norm` subfolders of `train` and `dev` in the `dump` folder.
89+
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
90+
4. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
91+
92+
### Synthesizing
93+
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
94+
```bash
95+
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
96+
```
97+
```text
98+
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
99+
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
100+
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
101+
102+
Synthesize with GANVocoder.
103+
104+
optional arguments:
105+
-h, --help show this help message and exit
106+
--generator-type GENERATOR_TYPE
107+
type of GANVocoder, should in {pwgan, mb_melgan,
108+
style_melgan, } now
109+
--config CONFIG GANVocoder config file.
110+
--checkpoint CHECKPOINT
111+
snapshot to load.
112+
--test-metadata TEST_METADATA
113+
dev data.
114+
--output-dir OUTPUT_DIR
115+
output dir.
116+
--ngpu NGPU if ngpu == 0, use cpu.
117+
```
118+
119+
1. `--config` is the Parallel WaveGAN config file. Use the same config the model was trained with.
120+
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
121+
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `test/norm` subfolder from the processed directory.
122+
4. `--output-dir` is the directory to save the synthesized audio files.
123+
5. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
124+
125+
## Pretrained Models
126+
The pretrained model can be downloaded here:
127+
- [pwgan_opencpop_ckpt_1.4.0](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip)
128+
129+
130+
Parallel WaveGAN checkpoint contains files listed below.
131+
132+
```text
133+
pwgan_opencpop_ckpt_1.4.0
134+
├── default.yaml # default config used to train parallel wavegan
135+
├── snapshot_iter_100000.pdz # generator parameters of parallel wavegan
136+
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
137+
```
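After extracting the archive, it can be useful to confirm that the three files listed above are present before pointing the scripts at them. The sketch below is a hypothetical helper (`missing_files` is not part of the repository) and demonstrates itself on a temporary directory that is deliberately incomplete.

```python
import tempfile
from pathlib import Path

# File names taken from the listing above.
EXPECTED = ("default.yaml", "snapshot_iter_100000.pdz", "feats_stats.npy")


def missing_files(ckpt_dir, expected=EXPECTED):
    """Return the expected file names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demo on a temporary directory that only contains the config file.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "pwgan_opencpop_ckpt_1.4.0"
    ckpt.mkdir()
    (ckpt / "default.yaml").touch()
    demo_missing = missing_files(ckpt)

print(demo_missing)
```

An empty list means the checkpoint directory has everything the listing names.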
138+
## Acknowledgement
139+
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
# This is the hyperparameter configuration file for Parallel WaveGAN.
2+
# Please make sure this is adjusted for the CSMSC dataset. If you want to
3+
# apply to the other dataset, you might need to carefully change some parameters.
4+
# This configuration requires 12 GB GPU memory and takes ~3 days on RTX TITAN.
5+
6+
###########################################################
7+
# FEATURE EXTRACTION SETTING #
8+
###########################################################
9+
fs: 24000 # Sampling rate.
10+
n_fft: 512 # FFT size (samples).
11+
n_shift: 128 # Hop size (samples). ~5.3ms at fs=24000
12+
win_length: 512 # Window length (samples). ~21.3ms at fs=24000
13+
# If set to null, it will be the same as n_fft.
14+
window: "hann" # Window function.
15+
n_mels: 80 # Number of mel basis.
16+
fmin: 30 # Minimum freq in mel basis calculation. (Hz)
17+
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
18+
19+
20+
###########################################################
21+
# GENERATOR NETWORK ARCHITECTURE SETTING #
22+
###########################################################
23+
generator_params:
24+
in_channels: 1 # Number of input channels.
25+
out_channels: 1 # Number of output channels.
26+
kernel_size: 3 # Kernel size of dilated convolution.
27+
layers: 30 # Number of residual block layers.
28+
stacks: 3 # Number of stacks i.e., dilation cycles.
29+
residual_channels: 64 # Number of channels in residual conv.
30+
gate_channels: 128 # Number of channels in gated conv.
31+
skip_channels: 64 # Number of channels in skip conv.
32+
aux_channels: 80 # Number of channels for auxiliary feature conv.
33+
# Must be the same as n_mels.
34+
aux_context_window: 2 # Context window size for auxiliary feature.
35+
# If set to 2, previous 2 and future 2 frames will be considered.
36+
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
37+
bias: True # use bias in residual blocks
38+
use_weight_norm: True # Whether to use weight norm.
39+
# If set to true, it will be applied to all of the conv layers.
40+
use_causal_conv: False # use causal conv in residual blocks and upsample layers
41+
upsample_scales: [8, 4, 2, 2] # Upsampling scales. Product of these must be the same as hop size.
42+
interpolate_mode: "nearest" # upsample net interpolate mode
43+
freq_axis_kernel_size: 1 # upsampling net: convolution kernel size in frequency axis
44+
nonlinear_activation: null
45+
nonlinear_activation_params: {}
46+
47+
###########################################################
48+
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
49+
###########################################################
50+
discriminator_params:
51+
in_channels: 1 # Number of input channels.
52+
out_channels: 1 # Number of output channels.
53+
kernel_size: 3 # Kernel size of conv layers.
54+
layers: 10 # Number of conv layers.
55+
conv_channels: 64 # Number of channels in conv layers.
56+
bias: True # Whether to use bias parameter in conv.
57+
use_weight_norm: True # Whether to use weight norm.
58+
# If set to true, it will be applied to all of the conv layers.
59+
nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
60+
nonlinear_activation_params: # Nonlinear function parameters
61+
negative_slope: 0.2 # Alpha in leakyrelu.
62+
63+
###########################################################
64+
# STFT LOSS SETTING #
65+
###########################################################
66+
stft_loss_params:
67+
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
68+
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
69+
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
70+
window: "hann" # Window function for STFT-based loss
71+
72+
###########################################################
73+
# ADVERSARIAL LOSS SETTING #
74+
###########################################################
75+
lambda_adv: 4.0 # Loss balancing coefficient.
76+
77+
###########################################################
78+
# DATA LOADER SETTING #
79+
###########################################################
80+
batch_size: 8 # Batch size.
81+
batch_max_steps: 25500 # Length of each audio in batch. Make sure it is divisible by n_shift.
82+
num_workers: 1 # Number of workers in DataLoader.
83+
84+
###########################################################
85+
# OPTIMIZER & SCHEDULER SETTING #
86+
###########################################################
87+
generator_optimizer_params:
88+
epsilon: 1.0e-6 # Generator's epsilon.
89+
weight_decay: 0.0 # Generator's weight decay coefficient.
90+
generator_scheduler_params:
91+
learning_rate: 0.0001 # Generator's learning rate.
92+
step_size: 200000 # Generator's scheduler step size.
93+
gamma: 0.5 # Generator's scheduler gamma.
94+
# At each step size, lr will be multiplied by this parameter.
95+
generator_grad_norm: 10 # Generator's gradient norm.
96+
discriminator_optimizer_params:
97+
epsilon: 1.0e-6 # Discriminator's epsilon.
98+
weight_decay: 0.0 # Discriminator's weight decay coefficient.
99+
discriminator_scheduler_params:
100+
learning_rate: 0.00005 # Discriminator's learning rate.
101+
step_size: 200000 # Discriminator's scheduler step size.
102+
gamma: 0.5 # Discriminator's scheduler gamma.
103+
# At each step size, lr will be multiplied by this parameter.
104+
discriminator_grad_norm: 1 # Discriminator's gradient norm.
105+
106+
###########################################################
107+
# INTERVAL SETTING #
108+
###########################################################
109+
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
110+
train_max_steps: 400000 # Number of training steps.
111+
save_interval_steps: 5000 # Interval steps to save checkpoint.
112+
eval_interval_steps: 1000 # Interval steps to evaluate the network.
113+
114+
###########################################################
115+
# OTHER SETTING #
116+
###########################################################
117+
num_save_intermediate_results: 4 # Number of results to be saved as intermediate results.
118+
num_snapshots: 10 # max number of snapshots to keep while training
119+
seed: 42 # random seed for paddle, random, and np.random
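Several comments in this config state constraints that can be checked with plain arithmetic: the product of `upsample_scales` should equal the hop size, `aux_channels` should equal `n_mels`, and `batch_max_steps` should be a multiple of `n_shift`. The sketch below copies the relevant values into a dictionary by hand; the helper `check_config` and its report keys are illustrative names, not part of PaddleSpeech.

```python
from math import prod

# Values copied by hand from the config above.
config = {
    "fs": 24000,
    "n_shift": 128,
    "win_length": 512,
    "n_mels": 80,
    "aux_channels": 80,
    "upsample_scales": [8, 4, 2, 2],
    "batch_max_steps": 25500,
}


def check_config(cfg):
    """Evaluate the constraints mentioned in the config comments."""
    return {
        # Durations in milliseconds implied by the sampling rate.
        "hop_ms": 1000 * cfg["n_shift"] / cfg["fs"],
        "win_ms": 1000 * cfg["win_length"] / cfg["fs"],
        # The generator upsamples each frame back to n_shift samples.
        "upsample_matches_hop": prod(cfg["upsample_scales"]) == cfg["n_shift"],
        "aux_matches_mels": cfg["aux_channels"] == cfg["n_mels"],
        # Zero would mean batch_max_steps is a multiple of n_shift.
        "batch_remainder": cfg["batch_max_steps"] % cfg["n_shift"],
    }


report = check_config(config)
for key, value in report.items():
    print(f"{key}: {value}")
```

With the values listed here, the upsampling product is 128 and matches `n_shift`, the hop is about 5.3 ms, and `batch_max_steps` leaves a remainder of 28 when divided by `n_shift`. Whether the training code tolerates a non-zero remainder is not stated in this commit.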
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
../../../csmsc/voc1/local/PTQ_static.sh
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
#!/bin/bash
2+
3+
config_path=$1
4+
train_output_path=$2
5+
ckpt_name=$3
6+
7+
FLAGS_allocator_strategy=naive_best_fit \
8+
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
9+
python3 ${BIN_DIR}/../../dygraph_to_static.py \
10+
--type=voc \
11+
--voc=pwgan_opencpop \
12+
--voc_config=${config_path} \
13+
--voc_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
14+
--voc_stat=dump/train/feats_stats.npy \
15+
--inference_dir=exp/default/inference/
Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
#!/bin/bash
2+
3+
stage=0
4+
stop_stage=100
5+
6+
config_path=$1
7+
8+
9+
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
10+
# extract features
11+
echo "Extract features ..."
12+
python3 ${BIN_DIR}/../preprocess.py \
13+
--rootdir=~/datasets/Opencpop/segments/ \
14+
--dataset=opencpop \
15+
--dumpdir=dump \
16+
--dur-file=~/datasets/Opencpop/segments/transcriptions.txt \
17+
--config=${config_path} \
18+
--cut-sil=False \
19+
--num-cpu=20
20+
fi
21+
22+
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
23+
# get features' stats(mean and std)
24+
echo "Get features' stats ..."
25+
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
26+
--metadata=dump/train/raw/metadata.jsonl \
27+
--field-name="feats"
28+
fi
29+
30+
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
31+
# normalize, dev and test should use train's stats
32+
echo "Normalize ..."
33+
34+
python3 ${BIN_DIR}/../normalize.py \
35+
--metadata=dump/train/raw/metadata.jsonl \
36+
--dumpdir=dump/train/norm \
37+
--stats=dump/train/feats_stats.npy
38+
python3 ${BIN_DIR}/../normalize.py \
39+
--metadata=dump/dev/raw/metadata.jsonl \
40+
--dumpdir=dump/dev/norm \
41+
--stats=dump/train/feats_stats.npy
42+
43+
python3 ${BIN_DIR}/../normalize.py \
44+
--metadata=dump/test/raw/metadata.jsonl \
45+
--dumpdir=dump/test/norm \
46+
--stats=dump/train/feats_stats.npy
47+
fi
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
../../../csmsc/voc1/local/synthesize.sh
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
../../../csmsc/voc1/local/train.sh

examples/opencpop/voc1/path.sh

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
../../csmsc/voc1/path.sh

examples/opencpop/voc1/run.sh

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
#!/bin/bash
2+
3+
set -e
4+
source path.sh
5+
6+
gpus=0
7+
stage=0
8+
stop_stage=100
9+
10+
conf_path=conf/default.yaml
11+
train_output_path=exp/default
12+
ckpt_name=snapshot_iter_100000.pdz
13+
14+
# with the following command, you can choose the stage range you want to run
15+
# such as `./run.sh --stage 0 --stop-stage 0`
16+
# this cannot be mixed with positional arguments such as `$1`, `$2` ...
17+
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
18+
19+
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
20+
# prepare data
21+
./local/preprocess.sh ${conf_path} || exit -1
22+
fi
23+
24+
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
25+
# train model; all checkpoints are saved under the `train_output_path/checkpoints/` dir
26+
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
27+
fi
28+
29+
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
30+
# synthesize
31+
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
32+
fi
33+
34+
# dygraph to static
35+
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
36+
CUDA_VISIBLE_DEVICES=${gpus} ./local/dygraph_to_static.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
37+
fi
38+
39+
# PTQ_static
40+
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
41+
CUDA_VISIBLE_DEVICES=${gpus} ./local/PTQ_static.sh ${train_output_path} pwgan_opencpop || exit -1
42+
fi

paddlespeech/t2s/exps/PTQ_static.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,7 @@ def parse_args():
4242
'hifigan_aishell3',
4343
'hifigan_ljspeech',
4444
'hifigan_vctk',
45+
'pwgan_opencpop',
4546
],
4647
help='Choose model type of tts task.')
4748
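The one-line change above extends a list of accepted values, so `pwgan_opencpop` passes argument validation. A minimal sketch of how an `argparse` option with `choices` behaves is shown below; the flag name `--voc`, the default, and the shortened choices list are assumptions for illustration, since the hunk shows neither the option name nor the full list.

```python
import argparse


def build_parser():
    """Build a parser with one choices-restricted option (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of a choices-restricted option.")
    parser.add_argument(
        "--voc",  # hypothetical flag name; not visible in the hunk
        type=str,
        default="pwgan_opencpop",
        choices=[
            "hifigan_aishell3",
            "hifigan_ljspeech",
            "hifigan_vctk",
            "pwgan_opencpop",  # the value added by this commit
        ],
        help="Choose model type of tts task.")
    return parser


args = build_parser().parse_args(["--voc", "pwgan_opencpop"])
print(args.voc)
```

A value outside `choices` makes `argparse` print an error and exit, which is why the new model name has to be registered in the list before `PTQ_static.sh` can pass it.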
