Quick converter for the CHMK free XML subcorpus.
- Source ZIP: data/XML-CHMK_v2.2_free_subcorpus.zip
- Unpacked XML files: data/unpacked/XML-CHMK_v2.2_free_subcorpus/
- Export script: main.py

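
If the ZIP has not been unpacked yet, something like the following could do it. This is only a sketch: the helper name `unpack_corpus` is hypothetical and not part of main.py, and the archive's internal layout is an assumption (if the ZIP already contains a top-level folder, the destination may need adjusting).

```python
import zipfile
from pathlib import Path


def unpack_corpus(zip_path: Path, dest_dir: Path) -> list[Path]:
    """Extract zip_path into dest_dir and return the XML files found there.

    Hypothetical helper for illustration; not taken from main.py.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    # Search recursively, because the archive's folder layout is not known here.
    return sorted(dest_dir.rglob("*.xml"))
```

Calling it with `Path("data/XML-CHMK_v2.2_free_subcorpus.zip")` and `Path("data/unpacked")` would be one way to end up with the unpacked folder listed above, assuming the archive carries its own top-level directory.
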
Processed CSV output is excluded from Git with:
data/processed/*
This way you can regenerate the output locally while keeping the repository small.
uv run main.py

This creates:
- data/processed/chmk_sentences.csv: readable sentence-level rows (source, year, inferred canton, sentence text, etc.)
- data/processed/chmk_overview.csv: one row per XML source with sentence counts, sorted by year
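
To illustrate how the two files relate, here is a minimal sketch that derives overview-style rows from sentence-level rows. The column names `source`, `year`, and `sentence` are assumptions (check the real CSV header), the sample rows are made up, and `overview_rows` is a hypothetical helper, not the logic in main.py.

```python
import csv
import io
from collections import Counter


def overview_rows(sentences_csv) -> list[dict]:
    """Count sentences per source and return the rows sorted by year.

    Assumes 'source' and 'year' columns; adjust to the actual header.
    """
    counts = Counter()
    years = {}
    for row in csv.DictReader(sentences_csv):
        counts[row["source"]] += 1
        years[row["source"]] = int(row["year"])
    rows = [
        {"source": source, "year": years[source], "sentences": n}
        for source, n in counts.items()
    ]
    # Sort by year first, then by source name for a stable order.
    return sorted(rows, key=lambda r: (r["year"], r["source"]))


# Tiny in-memory stand-in for chmk_sentences.csv (invented rows).
sample = io.StringIO(
    "source,year,sentence\n"
    "doc_b,1850,Second text.\n"
    "doc_a,1790,First sentence.\n"
    "doc_a,1790,Another one.\n"
)
for row in overview_rows(sample):
    print(row)
```

For the real file, the same function could be applied to `open("data/processed/chmk_sentences.csv", newline="", encoding="utf-8")`, assuming the export is UTF-8 encoded.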