
Excessive Memory Usage and Crash When Tokenizing Large JSONL Datasets #20

@fcltd

Description


Summary:
When processing a large JSONL dataset (2GB+), the script exhibits excessive memory usage, eventually consuming all available RAM (even on a system with 240GB). Dataset loading takes 2+ hours, and memory usage climbs indefinitely until the system becomes unresponsive or crashes; training never starts.

Steps to Reproduce:
Load a large JSONL dataset (2GB+).
The script attempts to tokenize and load the entire dataset into memory at once.
Memory usage steadily increases, reaching extreme levels (99.9% of total RAM).
The process remains stuck for hours and eventually crashes due to out-of-memory conditions.

Expected Behavior:
The dataset should be processed efficiently without excessive RAM consumption.
Tokenization should use a streaming or chunk-based approach instead of loading the entire dataset into memory (see the sketch after this list).
Training should start within a reasonable timeframe without requiring extreme amounts of RAM.
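For reference, a minimal sketch of what such a streaming approach could look like with the Hugging Face `datasets` library; the file path (`train.jsonl`) and the `"text"` field are placeholders, not the actual Training_PRO code:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: records are read lazily from
# disk instead of being materialized in RAM all at once.
dataset = load_dataset(
    "json",
    data_files="train.jsonl",  # placeholder path
    split="train",
    streaming=True,
)

# Iterate over a few records without loading the whole file.
for example in dataset.take(3):
    print(example["text"][:80])
```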

Proposed Solution:
Implement streaming-based tokenization to process JSONL data in smaller batches instead of loading everything into memory at once.
Use Hugging Face’s datasets library (Dataset.from_generator()) or another efficient method to handle large datasets without excessive memory usage, as sketched below.
Introduce memory limits or chunking strategies to prevent out-of-memory crashes.
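A minimal sketch of the generator-based approach, assuming the Hugging Face `datasets` and `transformers` libraries; the file path, model name, and `"text"` field are placeholders and would need to match the actual training setup:

```python
import json

from datasets import Dataset
from transformers import AutoTokenizer

def jsonl_generator(path="train.jsonl"):  # placeholder path
    # Yield one record at a time so the raw file is never fully held in RAM.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Dataset.from_generator streams the examples into an on-disk Arrow cache
# instead of keeping everything in memory.
dataset = Dataset.from_generator(jsonl_generator)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Batched map tokenizes the dataset in chunks of batch_size rows, also
# backed by the Arrow cache, which keeps peak RAM bounded.
tokenized = dataset.map(
    tokenize,
    batched=True,
    batch_size=1000,
    remove_columns=["text"],
)
```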

System Information:
Oobabooga WebUI version: v2.5 (latest)
Training_PRO version: latest
GPU: RTX 3090 x 2 / A100-80G x4
Total RAM: 64GB / 240GB
Operating System: Windows 10 / Ubuntu 22.04

