Description
Summary:
When processing large JSONL datasets (2GB+), the script exhibits excessive memory usage, eventually consuming all available RAM (even on a system with 240GB). Dataset loading takes more than two hours, memory usage climbs indefinitely until the system becomes unresponsive or crashes, and training never starts.
Steps to Reproduce:
Load a large JSONL dataset (2GB+).
The script attempts to tokenize and load the entire dataset into memory at once.
Memory usage steadily increases, reaching extreme levels (99.9% of total RAM).
The process remains stuck for hours and eventually crashes due to out-of-memory conditions.
Expected Behavior:
The dataset should be processed efficiently without excessive RAM consumption.
Tokenization should use a streaming or chunk-based approach instead of loading the entire dataset into memory (a minimal sketch follows this list).
Training should start within a reasonable timeframe without requiring extreme amounts of RAM.
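For reference, here is a minimal sketch (not code from Training_PRO) of what a streaming, chunk-based tokenization path could look like with the Hugging Face datasets library. The file path, the "gpt2" tokenizer, and the assumption that each JSONL record has a "text" field are placeholders for illustration.

```python
# Sketch only: stream the JSONL file instead of materializing it in RAM.
# Assumed placeholders: "train.jsonl" path, "gpt2" tokenizer, a "text" field.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# streaming=True returns an IterableDataset: records are read lazily,
# so the 2GB+ file is never loaded into memory in one piece.
raw = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Batched map tokenizes one small chunk at a time as the stream is consumed,
# so peak RAM stays roughly constant regardless of file size.
tokenized = raw.map(tokenize, batched=True, batch_size=1000, remove_columns=["text"])

# A trainer can then pull examples on demand instead of waiting hours for
# the whole file to be tokenized up front.
for example in tokenized.take(3):
    print(len(example["input_ids"]))
```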
Proposed Solution:
Implement streaming-based tokenization to process JSONL data in smaller batches instead of loading everything into memory at once.
Utilize Hugging Face’s datasets library (e.g., Dataset.from_generator()) or another memory-efficient method to handle large datasets without excessive memory usage (see the sketch after this list).
Introduce memory limits or chunking strategies to prevent out-of-memory crashes.
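Below is a sketch of the Dataset.from_generator() route: the generator yields one parsed JSONL record at a time, and the resulting Arrow dataset is written to and memory-mapped from disk, so neither loading nor tokenization has to hold the full dataset in RAM. Again, the path, tokenizer, and "text" field are assumptions for illustration, not details of the current Training_PRO loader.

```python
# Sketch only: build the dataset from a line-by-line generator so the raw
# JSONL is never held in memory at once. Assumed placeholders as above.
import json

from datasets import Dataset
from transformers import AutoTokenizer

JSONL_PATH = "train.jsonl"                          # placeholder path
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer

def jsonl_records():
    # Yield one parsed record at a time; only a single line is in RAM.
    with open(JSONL_PATH, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# from_generator() writes an Arrow file to the cache and memory-maps it,
# so the materialized dataset lives on disk rather than in RAM.
dataset = Dataset.from_generator(jsonl_records)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Chunked tokenization: results go to the on-disk cache, not into memory.
tokenized = dataset.map(
    tokenize,
    batched=True,
    batch_size=1000,
    remove_columns=dataset.column_names,
)
print(tokenized)
```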
System Information:
Oobabooga WebUI version: v2.5 (latest); Training_PRO version: latest
GPU: RTX 3090 x 2 / A100-80G x4
Total RAM: 64GB / 240GB
Operating System: Windows 10 / Ubuntu 22.04

