Performance Issue: CLI Hangs When Processing Large Document Collections #137

@ekropotin

Description

Summary

The QuickMark CLI hangs indefinitely when processing large collections of markdown files, particularly when the collection contains very large files (>2MB). This was discovered when attempting to lint the GitLab documentation corpus (2659 files).

Environment

  • QuickMark Version: 1.0.0
  • Platform: Linux
  • Test Case: GitLab docs directory with 2659 markdown files
  • Command: ./target/debug/qmark scripts/benchmarks/data/gitlab/doc

Problem Description

Symptoms

  • CLI process hangs indefinitely with no output
  • High CPU usage across all cores
  • Eventually leads to system unresponsiveness
  • No error messages or progress indicators

Root Cause Analysis

After adding comprehensive debug logging, the issue was traced to:

  1. Very Large File Processing: A single 3.3MB markdown file (api/graphql/reference/_index.md) takes:

    • 1.86 seconds for tree-sitter parsing
    • 545ms for context creation (building node cache)
  2. Uncontrolled Parallelism: With 2659 files processed in parallel via Rayon (the presumed current pattern is sketched after this list):

    • Multiple large files being parsed simultaneously
    • Excessive memory allocation for AST trees and node caches
    • System memory exhaustion leading to thrashing
  3. No File Size Awareness: Current implementation treats all files equally regardless of size
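
For illustration, the uncontrolled pattern described in point 2 presumably looks something like the sketch below; the function names are assumptions, not QuickMark's actual API.

    use rayon::prelude::*;
    use std::path::{Path, PathBuf};

    // Assumed shape of the current behavior: every discovered file, regardless
    // of size, is handed to Rayon's default global pool at once, so several
    // multi-megabyte files can be parsed and cached simultaneously.
    fn lint_all(files: Vec<PathBuf>) {
        files.par_iter().for_each(|path| {
            let _ = lint_file(path);
        });
    }

    // Stand-in for the real per-file pipeline (read, tree-sitter parse,
    // context creation, rule walk).
    fn lint_file(path: &Path) -> std::io::Result<()> {
        let _source = std::fs::read_to_string(path)?;
        Ok(())
    }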

Reproduction Steps

  1. Download a large markdown corpus (e.g., GitLab docs)
  2. Run: ./target/debug/qmark /path/to/large/corpus
  3. Process hangs with no progress indication

Minimal reproduction with the problematic file:

time ./target/debug/qmark "/path/to/api/graphql/reference/_index.md"

This single 3.3MB file takes more than 3 seconds to lint.

Suggested Workaround

A potential workaround could involve the following (a code sketch follows the list):

  • Classifying files by size (very large: >2MB, large: 500KB-2MB, small: <500KB)
  • Skipping very large files with warning messages
  • Processing large files sequentially to avoid memory pressure
  • Processing small files in parallel with controlled thread limits
  • Expected result: the GitLab docs should be processed in a reasonable time instead of hanging
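
A minimal sketch of that classification strategy, using the thresholds above; the function and constant names are illustrative, not QuickMark's actual API.

    use rayon::prelude::*;
    use std::path::{Path, PathBuf};

    const VERY_LARGE_BYTES: u64 = 2 * 1024 * 1024; // >2MB: skip with a warning
    const LARGE_BYTES: u64 = 500 * 1024;           // >500KB: lint sequentially

    fn lint_with_size_classes(files: Vec<PathBuf>) {
        let mut large = Vec::new();
        let mut small = Vec::new();

        for path in files {
            // Unreadable files fall into the small bucket and fail later with a
            // normal I/O error.
            let len = std::fs::metadata(&path).map(|m| m.len()).unwrap_or(0);
            if len > VERY_LARGE_BYTES {
                eprintln!("warning: skipping very large file {}", path.display());
            } else if len > LARGE_BYTES {
                large.push(path);
            } else {
                small.push(path);
            }
        }

        // Large files one at a time to bound peak memory.
        for path in &large {
            let _ = lint_file(path);
        }

        // Small files in parallel on the (optionally capped) Rayon pool.
        small.par_iter().for_each(|path| {
            let _ = lint_file(path);
        });
    }

    // Stand-in for the real per-file pipeline.
    fn lint_file(path: &Path) -> std::io::Result<()> {
        let _source = std::fs::read_to_string(path)?;
        Ok(())
    }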

Impact

This affects:

  • Performance benchmarking: Cannot accurately measure against real-world document sets
  • User experience: CLI appears broken when processing large repositories
  • Adoption: Users with large documentation sets will experience failures

Proposed Solutions

Short-term (Immediate)

  1. Implement file size classification (already prototyped)
  2. Add progress indicators for user feedback (see the sketch after this list)
  3. Add memory usage monitoring and warnings
  4. Document limitations in README/help
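
For item 2, one option is the indicatif crate (the kind of extra dependency mentioned under "Areas Requiring Changes"); a hedged sketch, with the template string and integration point as assumptions:

    use indicatif::{ProgressBar, ProgressStyle};
    use std::path::PathBuf;

    // A progress bar keeps long runs visibly alive instead of appearing hung.
    fn lint_with_progress(files: &[PathBuf]) {
        let bar = ProgressBar::new(files.len() as u64);
        bar.set_style(
            ProgressStyle::with_template("{pos}/{len} files [{elapsed_precise}] {wide_bar}")
                .expect("valid template"),
        );

        for _path in files {
            // ... lint the file ...
            bar.inc(1);
        }
        bar.finish();
    }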

Medium-term (Performance)

  1. Streaming AST processing: Process documents in chunks rather than loading entire AST
  2. Configurable parallelism: Allow users to control the thread count based on system resources (items 2 and 3 are sketched after this list)
  3. Memory-mapped file reading: Reduce memory footprint for large files
  4. Rule-specific optimizations: Skip expensive rules for very large files
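
Items 2 and 3 map onto existing crates; a hedged sketch using rayon's ThreadPoolBuilder and memmap2, where the flag wiring and function names are assumptions:

    use memmap2::Mmap;
    use rayon::ThreadPoolBuilder;
    use std::fs::File;
    use std::path::Path;

    // Cap Rayon's global pool before any par_iter runs; `threads` could come
    // from a hypothetical --threads CLI flag or a config value.
    fn configure_parallelism(threads: usize) {
        ThreadPoolBuilder::new()
            .num_threads(threads)
            .build_global()
            .expect("global thread pool already initialized");
    }

    // Memory-map a large file instead of reading it into an owned String;
    // tree-sitter can parse from a byte slice, avoiding an extra copy.
    fn map_file(path: &Path) -> std::io::Result<Mmap> {
        let file = File::open(path)?;
        // Safety: the file must not be truncated or modified while mapped.
        unsafe { Mmap::map(&file) }
    }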

Long-term (Architecture)

  1. Incremental parsing: Only parse changed sections on repeat runs (see the sketch after this list)
  2. External AST caching: Persist parsed trees to disk
  3. Distributed processing: Split large jobs across multiple processes
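
For item 1, tree-sitter already exposes an incremental-parsing path that a repeat-run mode could build on; a minimal sketch, with the surrounding change detection and caching assumed to exist elsewhere:

    use tree_sitter::{InputEdit, Parser, Tree};

    // Describe the edit to the old tree, then reparse the new source while
    // passing the old tree so unchanged sections are reused.
    fn reparse_after_edit(
        parser: &mut Parser,
        old_tree: &mut Tree,
        new_source: &str,
        edit: &InputEdit,
    ) -> Option<Tree> {
        old_tree.edit(edit);
        parser.parse(new_source, Some(&*old_tree))
    }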

Testing Requirements

Any fix should be tested against:

  • ✅ Small file sets (< 100 files)
  • ✅ Medium file sets (100-1000 files)
  • ✅ Large file sets (1000+ files)
  • ✅ Very large individual files (>1MB)
  • ❌ Mixed size collections (current failure case)

Performance Baseline

Target performance characteristics:

  • Small files: >1000 files/second
  • Medium files: >100 files/second
  • Large files: >10 files/second
  • Memory usage: Should not exceed 2GB for any reasonable document set
  • Progress feedback: Updates at least every 5 seconds for long-running operations

Additional Context

  • This issue was discovered during benchmarking against markdownlint
  • The GitLab docs represent a realistic real-world test case
  • Performance degradation is non-linear with file count due to parallel processing overhead
  • Debug logging infrastructure would be helpful for future performance analysis

Areas Requiring Changes

Potential areas that would need modification for a fix:

  • crates/quickmark-cli/src/main.rs: File processing logic and parallelism control
  • crates/quickmark-cli/Cargo.toml: May need additional dependencies for progress reporting
  • crates/quickmark-core/src/linter.rs: Core linting performance optimizations
  • crates/quickmark-core/Cargo.toml: May need logging dependencies for debugging

Debug Information

To reproduce the analysis, enable debug logging:

RUST_LOG=debug ./target/debug/qmark /path/to/files

Key metrics to monitor (a per-stage timing sketch follows the list):

  • File discovery time
  • Per-file parsing time
  • Context creation time
  • AST walk time
  • Total memory usage
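
A sketch of the per-stage timing this implies, assuming the log facade (with an env_logger or tracing subscriber honoring RUST_LOG) is wired into the CLI; the function name is illustrative:

    use std::path::Path;
    use std::time::Instant;

    // Time each stage of the per-file pipeline and emit it at debug level so
    // the metrics above can be collected with RUST_LOG=debug.
    fn lint_file_timed(path: &Path) -> std::io::Result<()> {
        let t = Instant::now();
        let _source = std::fs::read_to_string(path)?;
        log::debug!("{}: read in {:?}", path.display(), t.elapsed());

        let t = Instant::now();
        // ... tree-sitter parse ...
        log::debug!("{}: parsed in {:?}", path.display(), t.elapsed());

        let t = Instant::now();
        // ... context creation (node cache) and rule walk ...
        log::debug!("{}: context + rules in {:?}", path.display(), t.elapsed());

        Ok(())
    }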
