Performance Issue: CLI Hangs When Processing Large Document Collections
Summary
The QuickMark CLI hangs indefinitely when processing large collections of markdown files, particularly when the collection contains very large files (>2MB). This was discovered when attempting to lint the GitLab documentation corpus (2659 files).
Environment
- QuickMark Version: 1.0.0
- Platform: Linux
- Test Case: GitLab docs directory with 2659 markdown files
- Command:
  ```
  ./target/debug/qmark scripts/benchmarks/data/gitlab/doc
  ```
Problem Description
Symptoms
- CLI process hangs indefinitely with no output
- High CPU usage across all cores
- Eventually leads to system unresponsiveness
- No error messages or progress indicators
Root Cause Analysis
After adding comprehensive debug logging, the issue was traced to three compounding factors:
1. Very large file processing: A single 3.3MB markdown file (api/graphql/reference/_index.md) that takes:
   - 1.86 seconds for tree-sitter parsing
   - 545ms for context creation (building the node cache)
2. Uncontrolled parallelism: With 2659 files processed in parallel via Rayon (see the sketch below):
   - Multiple large files are parsed simultaneously
   - AST trees and node caches drive excessive memory allocation
   - System memory exhaustion leads to thrashing
3. No file size awareness: The current implementation treats all files equally regardless of size
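For illustration, here is a hypothetical reconstruction of the hot path described above; `parse`, `build_context`, and `run_rules` are illustrative stand-ins, not actual quickmark-core function names:

```rust
use rayon::prelude::*;
use std::path::PathBuf;

struct Tree;
struct Context;
fn parse(_source: &str) -> Tree { Tree }              // stand-in for tree-sitter parsing
fn build_context(_tree: &Tree) -> Context { Context } // stand-in for node-cache construction
fn run_rules(_context: &Context) {}                   // stand-in for rule execution

// All 2659 files enter the default Rayon pool at once, so peak memory is
// roughly (worker count) x (largest source + AST + node cache).
fn lint_all(paths: &[PathBuf]) {
    paths.par_iter().for_each(|path| {
        let source = std::fs::read_to_string(path).expect("read failed");
        let tree = parse(&source);          // ~1.86s for the 3.3MB file
        let context = build_context(&tree); // ~545ms for the node cache
        run_rules(&context);
    });
}
```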
Reproduction Steps
- Download a large markdown corpus (e.g., the GitLab docs)
- Run:
  ```
  ./target/debug/qmark /path/to/large/corpus
  ```
- The process hangs with no progress indication

Minimal reproduction with the problematic file:
```
time ./target/debug/qmark "/path/to/api/graphql/reference/_index.md"
```
Takes over 3 seconds for a single 3.3MB file.
Suggested Workaround
A potential workaround, sketched below, could involve:
- Classifying files by size (very large >2MB, large >500KB, small <500KB)
- Skipping very large files with warning messages
- Processing large files sequentially to avoid memory pressure
- Processing small files in parallel with controlled thread limits

Expected result: the GitLab docs process in a reasonable time instead of hanging.
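A minimal sketch of that classification strategy, assuming the thresholds above; `lint_by_size` is a hypothetical name and `lint_file` stands in for the existing per-file linting entry point:

```rust
use rayon::prelude::*;
use std::fs;
use std::path::{Path, PathBuf};

const VERY_LARGE: u64 = 2 * 1024 * 1024; // >2MB: skip with a warning
const LARGE: u64 = 500 * 1024;           // >500KB: lint sequentially

fn lint_file(_path: &Path) { /* hypothetical hook into quickmark-core */ }

fn lint_by_size(paths: Vec<PathBuf>) {
    let (mut small, mut large) = (Vec::new(), Vec::new());
    for path in paths {
        match fs::metadata(&path).map(|m| m.len()) {
            Ok(len) if len > VERY_LARGE => {
                eprintln!("warning: skipping very large file {} ({len} bytes)", path.display());
            }
            Ok(len) if len > LARGE => large.push(path),
            Ok(_) => small.push(path),
            Err(e) => eprintln!("warning: cannot stat {}: {e}", path.display()),
        }
    }
    // Large files one at a time, so at most one big AST is alive at once.
    large.iter().for_each(|p| lint_file(p));
    // Small files in parallel; the pool size can be capped separately.
    small.par_iter().for_each(|p| lint_file(p));
}
```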
Impact
This affects:
- Performance benchmarking: Cannot accurately measure against real-world document sets
- User experience: CLI appears broken when processing large repositories
- Adoption: Users with large documentation sets will experience failures
Proposed Solutions
Short-term (Immediate)
- Implement file size classification (already prototyped)
- Add progress indicators for user feedback (sketched below)
- Add memory usage monitoring and warnings
- Document limitations in README/help
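For the progress-indicator item, one minimal sketch, assuming the indicatif crate were added as one of the extra dependencies mentioned under Areas Requiring Changes; `lint_file` is again a hypothetical stand-in:

```rust
use indicatif::ProgressBar;
use rayon::prelude::*;
use std::path::{Path, PathBuf};

fn lint_file(_path: &Path) { /* hypothetical hook into quickmark-core */ }

fn lint_with_progress(paths: &[PathBuf]) {
    // indicatif's ProgressBar is thread-safe, so Rayon workers can tick it directly.
    let bar = ProgressBar::new(paths.len() as u64);
    paths.par_iter().for_each(|path| {
        lint_file(path);
        bar.inc(1);
    });
    bar.finish_with_message("linting complete");
}
```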
Medium-term (Performance)
- Streaming AST processing: Process documents in chunks rather than loading the entire AST
- Configurable parallelism: Allow users to control the thread count based on system resources (see the sketch after this list)
- Memory-mapped file reading: Reduce the memory footprint for large files
- Rule-specific optimizations: Skip expensive rules for very large files
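A minimal sketch of the configurable-parallelism item, assuming a hypothetical `--threads` CLI flag; Rayon's global pool can only be configured once, before the first parallel iterator runs:

```rust
use rayon::ThreadPoolBuilder;
use std::thread;

// `requested` would come from a hypothetical `--threads` flag; the default
// stays at the machine's logical core count. Must run before any par_iter.
fn init_thread_pool(requested: Option<usize>) {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let threads = requested.unwrap_or(cores).clamp(1, cores);
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("Rayon global pool was already initialized");
}
```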
Long-term (Architecture)
- Incremental parsing: Only parse changed sections on repeat runs (see the tree-sitter sketch below)
- External AST caching: Persist parsed trees to disk
- Distributed processing: Split large jobs across multiple processes
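On the incremental-parsing item, tree-sitter's Rust bindings already support reusing a previous tree across edits, which a repeat-run cache could build on. A minimal sketch, eliding how the edit is computed and how trees would be persisted:

```rust
use tree_sitter::{InputEdit, Parser, Tree};

// Reparse after an edit, reusing unchanged subtrees from the previous run.
// Computing `edit` (byte offsets and points of the change) is elided here.
fn reparse(parser: &mut Parser, mut old_tree: Tree, new_source: &str, edit: InputEdit) -> Option<Tree> {
    old_tree.edit(&edit);                     // mark what changed in the old tree
    parser.parse(new_source, Some(&old_tree)) // only affected ranges are re-parsed
}
```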
Testing Requirements
Any fix should be tested against:
- ✅ Small file sets (< 100 files)
- ✅ Medium file sets (100-1000 files)
- ✅ Large file sets (1000+ files)
- ✅ Very large individual files (>1MB)
- ❌ Mixed size collections (current failure case)
Performance Baseline
Target performance characteristics:
- Small files: >1000 files/second
- Medium files: >100 files/second
- Large files: >10 files/second
- Memory usage: Should not exceed 2GB for any reasonable document set
- Progress feedback: Updates at least every 5 seconds for long-running operations
Additional Context
- This issue was discovered during benchmarking against markdownlint
- The GitLab docs represent a realistic real-world test case
- Performance degradation is non-linear with file count due to parallel processing overhead
- Debug logging infrastructure would be helpful for future performance analysis
Areas Requiring Changes
Potential areas that would need modification for a fix:
- crates/quickmark-cli/src/main.rs: File processing logic and parallelism control
- crates/quickmark-cli/Cargo.toml: May need additional dependencies for progress reporting
- crates/quickmark-core/src/linter.rs: Core linting performance optimizations
- crates/quickmark-core/Cargo.toml: May need logging dependencies for debugging
Debug Information
To reproduce the analysis, enable debug logging:
```
RUST_LOG=debug ./target/debug/qmark /path/to/files
```
Key metrics to monitor:
- File discovery time
- Per-file parsing time
- Context creation time
- AST walk time
- Total memory usage
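A minimal sketch of capturing those per-stage timings with std::time::Instant and the log crate; `parse` and `build_context` are again illustrative stand-ins for the internals being measured:

```rust
use std::path::Path;
use std::time::Instant;

struct Tree;
struct Context;
fn parse(_source: &str) -> Tree { Tree }              // stand-in for tree-sitter parsing
fn build_context(_tree: &Tree) -> Context { Context } // stand-in for node-cache construction

fn lint_file_timed(path: &Path) {
    let t = Instant::now();
    let source = std::fs::read_to_string(path).expect("read failed");
    let read_ms = t.elapsed().as_millis();

    let t = Instant::now();
    let tree = parse(&source);
    let parse_ms = t.elapsed().as_millis();

    let t = Instant::now();
    let _context = build_context(&tree);
    let context_ms = t.elapsed().as_millis();

    log::debug!(
        "{}: read={read_ms}ms parse={parse_ms}ms context={context_ms}ms",
        path.display()
    );
}
```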