
Excessive Memory Usage and Crash When Tokenizing Large JSONL Datasets #20

@fcltd

Description


Summary:
When processing a large JSONL dataset (2GB+), the script exhibits excessive memory usage, eventually consuming all available RAM (even on a system with 240GB). Dataset loading takes 2+ hours, and memory usage climbs indefinitely until the system becomes unresponsive or crashes; training never starts.

Steps to Reproduce:
Load a large JSONL dataset (2GB+).
The script attempts to tokenize and load the entire dataset into memory at once.
Memory usage steadily increases, reaching extreme levels (99.9% of total RAM).
The process remains stuck for hours and eventually crashes due to out-of-memory conditions.

Expected Behavior:
The dataset should be processed efficiently without excessive RAM consumption.
Tokenization should use a streaming or chunk-based approach instead of loading the entire dataset into memory (see the sketch after this list).
Training should start within a reasonable timeframe without requiring extreme amounts of RAM.
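For reference, a minimal sketch of what such a streaming approach could look like with the Hugging Face `datasets` library; the file path (`train.jsonl`) and the `"text"` field are placeholders, not the actual Training_PRO code:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: records are read lazily from
# disk instead of being materialized in RAM all at once.
dataset = load_dataset(
    "json",
    data_files="train.jsonl",  # placeholder path
    split="train",
    streaming=True,
)

# Iterate over a few records without loading the whole file.
for example in dataset.take(3):
    print(example["text"][:80])
```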

Proposed Solution:
Implement streaming-based tokenization to process JSONL data in smaller batches instead of loading everything into memory at once.
Use Hugging Face’s datasets library (Dataset.from_generator()) or another efficient method to handle large datasets without excessive memory usage, as sketched below.
Introduce memory limits or chunking strategies to prevent out-of-memory crashes.
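A minimal sketch of the generator-based approach, assuming the Hugging Face `datasets` and `transformers` libraries; the file path, model name, and `"text"` field are placeholders and would need to match the actual training setup:

```python
import json

from datasets import Dataset
from transformers import AutoTokenizer

def jsonl_generator(path="train.jsonl"):  # placeholder path
    # Yield one record at a time so the raw file is never fully held in RAM.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Dataset.from_generator streams the examples into an on-disk Arrow cache
# instead of keeping everything in memory.
dataset = Dataset.from_generator(jsonl_generator)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Batched map tokenizes the dataset in chunks of batch_size rows, also
# backed by the Arrow cache, which keeps peak RAM bounded.
tokenized = dataset.map(
    tokenize,
    batched=True,
    batch_size=1000,
    remove_columns=["text"],
)
```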

System Information:
Oobabooga WebUI version: v2.5 (latest)
Training_PRO version: latest
GPU: RTX 3090 x 2 / A100-80G x4
Total RAM: 64GB / 240GB
Operating System: Windows 10 / Ubuntu 22.04

