Theano implementation of the models described in the paper Fully Character-Level Neural Machine Translation without Explicit Segmentation.
We present code for training and decoding four different models:
- bilingual bpe2char (from Chung et al., 2016).
- bilingual char2char
- multilingual bpe2char
- multilingual char2char
The code requires the following dependencies:
- Theano
- Numpy
- NLTK
- CUDA (we recommend the latest version; version 8.0 was used in all our experiments)
- For preprocessing and evaluation, we used scripts from MOSES.
- This code is based on Subword-NMT and dl4mt-cdec.
The original WMT'15 corpora can be downloaded from here. For the preprocessed corpora used in our experiments, see below.
- WMT'15 preprocessed corpora
To obtain the pre-trained top-performing models, see below.
- Pre-trained models (6.0GB): Tarball updated on Nov 21st 2016. The CS-EN bi-char2char model in the previous tarball was not the best-performing model.
Do the following before executing train*.py.
$ export THEANO_FLAGS=device=gpu,floatX=float32
With space permitting on your GPU, it may speed up training to use cnmem:
$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
On a pre-2016 Titan X GPU with 12GB RAM, our bpe2char models were trained with cnmem. Our char2char models (both bilingual and multilingual) were trained without cnmem (due to lack of RAM).
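As a quick optional sanity check before launching a long run, you can confirm that Theano picked up the flags by printing its configuration (it should report gpu and float32):
$ python -c "import theano; print(theano.config.device, theano.config.floatX)"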
Before executing the following, modify train*.py such that the correct directory containing WMT15 corpora is referenced.
$ python bpe2char/train_bi_bpe2char.py -translate <LANGUAGE_PAIR>
$ python char2char/train_bi_char2char.py -translate <LANGUAGE_PAIR>
$ python bpe2char/train_multi_bpe2char.py
$ python char2char/train_multi_char2char.py
To resume training a model from a checkpoint, simply append -re_load and -re_load_old_setting to the commands above. Make sure the checkpoint resides in the correct directory (.../dl4mt-c2c/models).
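For example, to train a bilingual char2char model on German-English (assuming the language pair is named de_en; check the pair names expected by train*.py for your setup):
$ python char2char/train_bi_char2char.py -translate de_en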
To train your models using your own dataset (rather than the WMT'15 corpus), you first need to learn your vocabulary using build_dictionary_char.py (for char2char models) or build_dictionary_word.py (for bpe2char models); a hypothetical invocation is sketched below. For the bpe2char model, you additionally need to learn your BPE segmentation rules on the source corpus using the Subword-NMT repository (see below).
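The following is only a hypothetical invocation, assuming build_dictionary_char.py follows the usual dl4mt convention of taking the training corpus as a positional argument and writing a pickled vocabulary next to it; check the script's argument parsing before relying on this:
$ python build_dictionary_char.py your_train_corpus.txt   # assumed to write your_train_corpus.txt.pkl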
Before executing the following, modify translate*.py such that the correct directory containing WMT15 corpora is referenced.
$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
$ python translate/translate_bpe2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for bpe2char models
$ python translate/translate_char2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for char2char models
When choosing which pre-trained model to give to -model, make sure to choose one with .grads in its name, e.g. .grads.123000.npz. The models with .grads in their names are the optimal models; you should decode from those.
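For example, decoding a bilingual char2char model might look like the following (the model filename, language pair name, and output path are illustrative only; substitute the actual names from the pre-trained tarball or your own training run):
$ python translate/translate_char2char.py -model models/de_en.grads.123000.npz -translate de_en -saveto deen_decoded.txt -which <VALID/TEST_SET>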
To decode your own source file instead of the WMT'15 validation/test sets, remove -which <VALID/TEST_SET> and append -source <PATH_TO_SOURCE> (see the example after the list below).
If you choose to decode your own source file, make sure it is:
- properly tokenized (using preprocess/preprocess.sh).
- bpe-tokenized for bpe2char models.
- converted from Cyrillic to Latin characters for multilingual models.
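For instance, a hypothetical run on your own tokenized source file (file names are placeholders):
$ python translate/translate_char2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto my_output.txt -source my_tokenized_source.txt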
To decode with a multilingual model, append -many (and, of course, provide a path to a multilingual model for -model).
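In other words, the decoding command stays the same as above, with -many appended and a multilingual model passed to -model, e.g.:
$ python translate/translate_char2char.py -model <PATH_TO_MULTILINGUAL_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> -many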
We use the multi-bleu.perl script from MOSES to compute the BLEU score. The reference translations can be found in .../wmt15.
$ perl preprocess/multi-bleu.perl reference.txt < model_output.txt
Clone the Subword-NMT repository.
$ git clone https://github.com/rsennrich/subword-nmt
Use the following commands (more information can be found in the Subword-NMT repository):
./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
./apply_bpe.py -c {codes_file} < {test_file}
To convert Cyrillic characters to Latin for the multilingual models, running
$ python preprocess/iso.py russian_source.txt
will produce an output at russian_source.txt.iso9.
@article{Lee:16,
author = {Jason Lee and Kyunghyun Cho and Thomas Hofmann},
title = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
year = {2016},
journal = {arXiv preprint arXiv:1610.03017},
}