This repository was archived by the owner on Mar 20, 2026. It is now read-only.
Hello, in the multilingual translation example, a joined dictionary is created for de-en, and the resulting dictionary is then reused for fr-en. That is probably fine here because the vocabularies of these languages overlap substantially. But what about three very different languages with little overlap, for instance English-Korean-Chinese, each with its own writing system? If I create a joined dictionary for English-Korean first, many Chinese subwords may be missing from the final dictionary.
One workaround I used is to concatenate the training data from all languages and call fairseq-preprocess once to generate a joined dictionary. After that, I run fairseq-preprocess separately on each language pair, reusing the joined dictionary from the first step.
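The two-step workaround above could look roughly like the following. This is only a sketch: the file names (train.en, train.ko, train.zh, the dict-only/ and data-bin/ directories) are hypothetical placeholders for your own corpus layout, and it assumes the usual fairseq-preprocess flags (--joined-dictionary, --srcdict, --tgtdict).

```shell
# Step 1: build ONE joined dictionary over all languages.
# fairseq-preprocess expects a source/target pair, so we concatenate
# every language's training text into a single dummy "pair".
# (File names here are hypothetical -- adapt to your data.)
cat train.en train.ko train.zh > train.all.src
cp train.all.src train.all.tgt

fairseq-preprocess \
    --source-lang src --target-lang tgt \
    --trainpref train.all \
    --joined-dictionary \
    --destdir dict-only

# Step 2: binarize each real language pair separately, reusing the
# joined dictionary from step 1 for both sides of every pair.
for pair in en-ko en-zh ko-zh; do
    src=${pair%-*}
    tgt=${pair#*-}
    fairseq-preprocess \
        --source-lang "$src" --target-lang "$tgt" \
        --trainpref "train.$pair" \
        --srcdict dict-only/dict.src.txt \
        --tgtdict dict-only/dict.src.txt \
        --destdir data-bin
done
```

With --joined-dictionary, dict.src.txt and dict.tgt.txt in dict-only/ are identical, so either file can be passed to --srcdict/--tgtdict in step 2; every pair then shares one vocabulary that covers all three scripts.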