What's Changed
- fix by @kylematoba in #314
- Changed FTFY defaults by @guipenedo in #319
- Adding Megatron Tokenization pipeline by @TJ-Solergibert in #304
- Add job_id_position Parameter to launch_slurm_job Method by @StephenRebel in #282
- load_tokenizer can now load local hf folder by @ceferisbarov in #306
- Add glob pattern for hash index by @jordane95 in #313
- fix(utils): Enhance the dependencies check to include pip distribution by @aiqwe in #317
- Update README.md by @saforem2 in #323
- Fix issues with URL Deduplication when using the Index by @muzzynine in #327
- Add customization for fetching SLURM job id by @BramVanroy in #320
- fixes stopwords implementation by @guipenedo in #329
- Allow custom parquet schema by @BramVanroy in #330
- [draft] Add chunking option to DocumentTokenizer by @craffel in #342
- Revert "[draft] Add chunking option to DocumentTokenizer" by @guipenedo in #343
- fix: root condition for SENTINEL by @jordane95 in #349
- correct metadata parsing for finemath by @VivienCabannes in #355
- add oom score + shorter polling by @hynky1999 in #361
- Resolve issue 308 by @habanoz in #309
- [draft] Add chunking option to DocumentTokenizer by @craffel in #344
- Add RayPipelineExecutor by @nelson-liu in #331
- Bump ring from 0.17.8 to 0.17.14 in /src/datatrove/tools/fast_mh3 by @dependabot in #363
- Bump tokio from 1.41.1 to 1.43.1 in /src/datatrove/tools/fast_mh3 by @dependabot in #362
- Fix signatures priority queue initialization in MinhashBuildIndex by @nelson-liu in #334
- Shuffle by chunks support in DocumentTokenizerMerger by @guipenedo in #364
- return positions based on .index if return_positions=True in the data… by @guipenedo in #356
New Contributors
- @kylematoba made their first contribution in #314
- @StephenRebel made their first contribution in #282
- @ceferisbarov made their first contribution in #306
- @saforem2 made their first contribution in #323
- @muzzynine made their first contribution in #327
- @craffel made their first contribution in #342
- @VivienCabannes made their first contribution in #355
- @habanoz made their first contribution in #309
- @nelson-liu made their first contribution in #331
- @dependabot made their first contribution in #363
Full Changelog: v0.4.0...v0.5.0