chuuhtetnaing/myanmar-language-dataset-collection

Myanmar Language Dataset Collection

This repository collects Myanmar language datasets, covering both speech and text resources. Because Myanmar language datasets are scarce and hard to find, our goal is to provide a centralized reference point for researchers, developers, and language enthusiasts. Contributions from the community are encouraged.

If you know of or have access to additional Myanmar language datasets not listed here, please consider contributing by submitting a pull request or opening an issue. Let's collaborate to build a comprehensive inventory of Myanmar language resources.

Myanmar Language Speech Dataset

  1. Myanmar Speech Dataset for ASR

    • This is a collection of available Myanmar speech datasets for training ASR models.
    • Datasets in this collection:
      • OpenSLR (See No.2)
      • Google Fleurs (See No.4)
    • HuggingFace Dataset
  2. Crowdsourced high-quality Burmese speech dataset (SLR80)

  3. BloomSpeech

    • HuggingFace Dataset
    • Notebook (Loading Myanmar Language)
    • Notes: Although the dataset labels this language as Burmese, the data under language='mya' is actually in the Palaung (De'ang / Ta'ang / Riang) language.
  4. Google Fleurs

Myanmar Language Text Dataset

  1. Asian Language Treebank (ALT)
    • Download Page
    • HuggingFace Dataset
    • It supports translation between the following language pairs:
      • Myanmar (Burmese) To Bengali
      • Myanmar (Burmese) To English
      • Myanmar (Burmese) To Filipino
      • Myanmar (Burmese) To Hindi
      • Myanmar (Burmese) To Bahasa Indonesia
      • Myanmar (Burmese) To Japanese
      • Myanmar (Burmese) To Khmer
      • Myanmar (Burmese) To Lao
      • Myanmar (Burmese) To Malay
      • Myanmar (Burmese) To Thai
      • Myanmar (Burmese) To Vietnamese
      • Myanmar (Burmese) To Chinese (Simplified Chinese)
  2. A Corpus of Modern Burmese
  3. Myanmar Spoken and Written Language Dataset
  4. Myanmar NRC Format Dataset
  5. Myanmar Wikipedia Dataset
  6. Myanmar Book Corpus Dataset (MM-Lib)
  7. Myanmar C4 Dataset (Converted Zawgyi to Unicode)
  8. Myanmar CulturaX Dataset (Converted Zawgyi to Unicode)
  9. Myanmar CC100 Dataset (Converted Zawgyi to Unicode)
  10. ChannelMyanmar Movie Summary Dataset
  11. Myanmar Fineweb2 Dataset (Converted Zawgyi to Unicode)
  12. Myanmar Dhamma Article Dataset (Converted Zawgyi to Unicode)
  13. Myanmar Dhamma Question and Answer Dataset
  14. Myanmar Aya Dataset
  15. Burmese Microbiology 1K
  16. Mpox Myanmar
  17. Myanmar Agriculture 1K
  18. Myanmar Instruction Tuning Dataset
    • This is a collection of available Myanmar question-and-answer datasets for instruction fine-tuning of LLMs.
    • Datasets in this collection:
      • Burmese Microbiology 1K (See No.15)
      • Mpox Myanmar (See No.16)
      • Myanmar Agriculture 1K (See No.17)
      • Myanmar Aya Dataset (See No.14)
      • Myanmar Dhamma Question and Answer Dataset (See No.13)
      • Myanmar Football Dataset (See No.21)
    • HuggingFace Dataset
    • Dataset Generating Notebook
  19. Myanmar Social Media Sentiment Analysis Dataset
  20. myXNLI - Myanmar Natural Language Inference Corpus
  21. Myanmar Football Dataset
  22. Myanmar Facebook Flores Dataset
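
The Myanmar Instruction Tuning Dataset (No.18) combines several question-and-answer datasets into one collection. The sketch below illustrates one way such a merge could work in plain Python. It is not the code from the dataset generating notebook: the column names, the unified `instruction` / `output` / `source` schema, and the helper names `to_instruction_records` and `merge_sources` are all assumptions made for illustration, and the real datasets may use different fields.

```python
from typing import Dict, Iterable, List, Mapping, Tuple

Record = Dict[str, str]


def _clean(value: object) -> str:
    """Return a stripped string, or an empty string for non-string values."""
    return value.strip() if isinstance(value, str) else ""


def to_instruction_records(
    source: str,
    rows: Iterable[Mapping[str, object]],
    question_key: str,
    answer_key: str,
) -> List[Record]:
    """Map rows from one source into a shared (hypothetical) schema.

    Rows with a missing or empty question or answer are skipped.
    """
    records: List[Record] = []
    for row in rows:
        question = _clean(row.get(question_key))
        answer = _clean(row.get(answer_key))
        if not question or not answer:
            continue
        records.append(
            {"instruction": question, "output": answer, "source": source}
        )
    return records


def merge_sources(
    sources: Mapping[str, Tuple[Iterable[Mapping[str, object]], str, str]],
) -> List[Record]:
    """Merge several sources and drop exact duplicate question/answer pairs.

    `sources` maps a source name to (rows, question_key, answer_key).
    The first occurrence of a duplicate pair is kept.
    """
    merged: List[Record] = []
    seen = set()
    for name, (rows, question_key, answer_key) in sources.items():
        for record in to_instruction_records(name, rows, question_key, answer_key):
            key = (record["instruction"], record["output"])
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged
```

Keeping a `source` field on every record makes it possible to trace each example back to its original dataset, which helps when one source later needs to be filtered out or reweighted.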