This repository was archived by the owner on Mar 20, 2026. It is now read-only.
Hello, in the multilingual translation example, a joined dictionary is created for de-en, and the resulting dictionary is then reused for fr-en. That is probably fine here because the vocabularies of these languages overlap substantially. But what about three very different languages with little overlap, for instance English-Korean-Chinese, each with its own writing system? If I create a joined dictionary for English-Korean first, many Chinese subwords may be missing from the final dictionary.
One workaround I used is to concatenate the training data from all languages and call fairseq-preprocess once to generate a joined dictionary. After that, I run fairseq-preprocess separately on each language pair, reusing the joined dictionary from the first step.
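The two-step workaround above could look roughly like the following. This is only a sketch: the file names (train.en, train.ko, train.zh, the dict-only/ and data-bin/ directories) are hypothetical placeholders for your own corpus layout, and it assumes the usual fairseq-preprocess flags (--joined-dictionary, --srcdict, --tgtdict).

```shell
# Step 1: build ONE joined dictionary over all languages.
# fairseq-preprocess expects a source/target pair, so we concatenate
# every language's training text into a single dummy "pair".
# (File names here are hypothetical -- adapt to your data.)
cat train.en train.ko train.zh > train.all.src
cp train.all.src train.all.tgt

fairseq-preprocess \
    --source-lang src --target-lang tgt \
    --trainpref train.all \
    --joined-dictionary \
    --destdir dict-only

# Step 2: binarize each real language pair separately, reusing the
# joined dictionary from step 1 for both sides of every pair.
for pair in en-ko en-zh ko-zh; do
    src=${pair%-*}
    tgt=${pair#*-}
    fairseq-preprocess \
        --source-lang "$src" --target-lang "$tgt" \
        --trainpref "train.$pair" \
        --srcdict dict-only/dict.src.txt \
        --tgtdict dict-only/dict.src.txt \
        --destdir data-bin
done
```

With --joined-dictionary, dict.src.txt and dict.tgt.txt in dict-only/ are identical, so either file can be passed to --srcdict/--tgtdict in step 2; every pair then shares one vocabulary that covers all three scripts.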