Performance Issue: CLI Hangs When Processing Large Document Collections #137

@ekropotin

Description

Summary

The QuickMark CLI hangs indefinitely when processing large collections of markdown files, particularly when the collection contains very large files (>2MB). This was discovered when attempting to lint the GitLab documentation corpus (2659 files).

Environment

  • QuickMark Version: 1.0.0
  • Platform: Linux
  • Test Case: GitLab docs directory with 2659 markdown files
  • Command: ./target/debug/qmark scripts/benchmarks/data/gitlab/doc

Problem Description

Symptoms

  • CLI process hangs indefinitely with no output
  • High CPU usage across all cores
  • Eventually leads to system unresponsiveness
  • No error messages or progress indicators

Root Cause Analysis

After adding comprehensive debug logging, the issue was traced to:

  1. Very Large File Processing: A single 3.3MB markdown file (api/graphql/reference/_index.md) takes:

    • 1.86 seconds for tree-sitter parsing
    • 545ms for context creation (building node cache)
  2. Uncontrolled Parallelism: With 2659 files processed in parallel via Rayon (the presumed current pattern is sketched after this list):

    • Multiple large files being parsed simultaneously
    • Excessive memory allocation for AST trees and node caches
    • System memory exhaustion leading to thrashing
  3. No File Size Awareness: Current implementation treats all files equally regardless of size
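
For illustration, the uncontrolled pattern described in point 2 presumably looks something like the sketch below; the function names are assumptions, not QuickMark's actual API.

    use rayon::prelude::*;
    use std::path::{Path, PathBuf};

    // Assumed shape of the current behavior: every discovered file, regardless
    // of size, is handed to Rayon's default global pool at once, so several
    // multi-megabyte files can be parsed and cached simultaneously.
    fn lint_all(files: Vec<PathBuf>) {
        files.par_iter().for_each(|path| {
            let _ = lint_file(path);
        });
    }

    // Stand-in for the real per-file pipeline (read, tree-sitter parse,
    // context creation, rule walk).
    fn lint_file(path: &Path) -> std::io::Result<()> {
        let _source = std::fs::read_to_string(path)?;
        Ok(())
    }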

Reproduction Steps

  1. Download a large markdown corpus (e.g., GitLab docs)
  2. Run: ./target/debug/qmark /path/to/large/corpus
  3. Process hangs with no progress indication

Minimal reproduction with the problematic file:

time ./target/debug/qmark "/path/to/api/graphql/reference/_index.md"

This single 3.3MB file takes more than 3 seconds to lint.

Suggested Workaround

A potential workaround could involve the following (a code sketch follows the list):

  • Classifying files by size (very large: >2MB, large: 500KB-2MB, small: <500KB)
  • Skipping very large files with warning messages
  • Processing large files sequentially to avoid memory pressure
  • Processing small files in parallel with controlled thread limits
  • Expected result: the GitLab docs should be processed in a reasonable time instead of hanging
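
A minimal sketch of that classification strategy, using the thresholds above; the function and constant names are illustrative, not QuickMark's actual API.

    use rayon::prelude::*;
    use std::path::{Path, PathBuf};

    const VERY_LARGE_BYTES: u64 = 2 * 1024 * 1024; // >2MB: skip with a warning
    const LARGE_BYTES: u64 = 500 * 1024;           // >500KB: lint sequentially

    fn lint_with_size_classes(files: Vec<PathBuf>) {
        let mut large = Vec::new();
        let mut small = Vec::new();

        for path in files {
            // Unreadable files fall into the small bucket and fail later with a
            // normal I/O error.
            let len = std::fs::metadata(&path).map(|m| m.len()).unwrap_or(0);
            if len > VERY_LARGE_BYTES {
                eprintln!("warning: skipping very large file {}", path.display());
            } else if len > LARGE_BYTES {
                large.push(path);
            } else {
                small.push(path);
            }
        }

        // Large files one at a time to bound peak memory.
        for path in &large {
            let _ = lint_file(path);
        }

        // Small files in parallel on the (optionally capped) Rayon pool.
        small.par_iter().for_each(|path| {
            let _ = lint_file(path);
        });
    }

    // Stand-in for the real per-file pipeline.
    fn lint_file(path: &Path) -> std::io::Result<()> {
        let _source = std::fs::read_to_string(path)?;
        Ok(())
    }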

Impact

This affects:

  • Performance benchmarking: Cannot accurately measure against real-world document sets
  • User experience: CLI appears broken when processing large repositories
  • Adoption: Users with large documentation sets will experience failures

Proposed Solutions

Short-term (Immediate)

  1. Implement file size classification (already prototyped)
  2. Add progress indicators for user feedback (see the sketch after this list)
  3. Add memory usage monitoring and warnings
  4. Document limitations in README/help
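
For item 2, one option is the indicatif crate (the kind of extra dependency mentioned under "Areas Requiring Changes"); a hedged sketch, with the template string and integration point as assumptions:

    use indicatif::{ProgressBar, ProgressStyle};
    use std::path::PathBuf;

    // A progress bar keeps long runs visibly alive instead of appearing hung.
    fn lint_with_progress(files: &[PathBuf]) {
        let bar = ProgressBar::new(files.len() as u64);
        bar.set_style(
            ProgressStyle::with_template("{pos}/{len} files [{elapsed_precise}] {wide_bar}")
                .expect("valid template"),
        );

        for _path in files {
            // ... lint the file ...
            bar.inc(1);
        }
        bar.finish();
    }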

Medium-term (Performance)

  1. Streaming AST processing: Process documents in chunks rather than loading entire AST
  2. Configurable parallelism: Allow users to control the thread count based on system resources (items 2 and 3 are sketched after this list)
  3. Memory-mapped file reading: Reduce memory footprint for large files
  4. Rule-specific optimizations: Skip expensive rules for very large files
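
Items 2 and 3 map onto existing crates; a hedged sketch using rayon's ThreadPoolBuilder and memmap2, where the flag wiring and function names are assumptions:

    use memmap2::Mmap;
    use rayon::ThreadPoolBuilder;
    use std::fs::File;
    use std::path::Path;

    // Cap Rayon's global pool before any par_iter runs; `threads` could come
    // from a hypothetical --threads CLI flag or a config value.
    fn configure_parallelism(threads: usize) {
        ThreadPoolBuilder::new()
            .num_threads(threads)
            .build_global()
            .expect("global thread pool already initialized");
    }

    // Memory-map a large file instead of reading it into an owned String;
    // tree-sitter can parse from a byte slice, avoiding an extra copy.
    fn map_file(path: &Path) -> std::io::Result<Mmap> {
        let file = File::open(path)?;
        // Safety: the file must not be truncated or modified while mapped.
        unsafe { Mmap::map(&file) }
    }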

Long-term (Architecture)

  1. Incremental parsing: Only parse changed sections on repeat runs (see the sketch after this list)
  2. External AST caching: Persist parsed trees to disk
  3. Distributed processing: Split large jobs across multiple processes
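
For item 1, tree-sitter already exposes an incremental-parsing path that a repeat-run mode could build on; a minimal sketch, with the surrounding change detection and caching assumed to exist elsewhere:

    use tree_sitter::{InputEdit, Parser, Tree};

    // Describe the edit to the old tree, then reparse the new source while
    // passing the old tree so unchanged sections are reused.
    fn reparse_after_edit(
        parser: &mut Parser,
        old_tree: &mut Tree,
        new_source: &str,
        edit: &InputEdit,
    ) -> Option<Tree> {
        old_tree.edit(edit);
        parser.parse(new_source, Some(&*old_tree))
    }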

Testing Requirements

Any fix should be tested against:

  • ✅ Small file sets (< 100 files)
  • ✅ Medium file sets (100-1000 files)
  • ✅ Large file sets (1000+ files)
  • ✅ Very large individual files (>1MB)
  • ❌ Mixed size collections (current failure case)

Performance Baseline

Target performance characteristics:

  • Small files: >1000 files/second
  • Medium files: >100 files/second
  • Large files: >10 files/second
  • Memory usage: Should not exceed 2GB for any reasonable document set
  • Progress feedback: Updates at least every 5 seconds for long-running operations

Additional Context

  • This issue was discovered during benchmarking against markdownlint
  • The GitLab docs represent a realistic real-world test case
  • Performance degradation is non-linear with file count due to parallel processing overhead
  • Debug logging infrastructure would be helpful for future performance analysis

Areas Requiring Changes

Potential areas that would need modification for a fix:

  • crates/quickmark-cli/src/main.rs: File processing logic and parallelism control
  • crates/quickmark-cli/Cargo.toml: May need additional dependencies for progress reporting
  • crates/quickmark-core/src/linter.rs: Core linting performance optimizations
  • crates/quickmark-core/Cargo.toml: May need logging dependencies for debugging

Debug Information

To reproduce the analysis, enable debug logging:

RUST_LOG=debug ./target/debug/qmark /path/to/files

Key metrics to monitor (a per-stage timing sketch follows the list):

  • File discovery time
  • Per-file parsing time
  • Context creation time
  • AST walk time
  • Total memory usage
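
A sketch of the per-stage timing this implies, assuming the log facade (with an env_logger or tracing subscriber honoring RUST_LOG) is wired into the CLI; the function name is illustrative:

    use std::path::Path;
    use std::time::Instant;

    // Time each stage of the per-file pipeline and emit it at debug level so
    // the metrics above can be collected with RUST_LOG=debug.
    fn lint_file_timed(path: &Path) -> std::io::Result<()> {
        let t = Instant::now();
        let _source = std::fs::read_to_string(path)?;
        log::debug!("{}: read in {:?}", path.display(), t.elapsed());

        let t = Instant::now();
        // ... tree-sitter parse ...
        log::debug!("{}: parsed in {:?}", path.display(), t.elapsed());

        let t = Instant::now();
        // ... context creation (node cache) and rule walk ...
        log::debug!("{}: context + rules in {:?}", path.display(), t.elapsed());

        Ok(())
    }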
