This document describes the heuristics used for automatic region detection and classification in spreadsheets. These are best-effort guesses based on structural patterns; always verify them against the actual data.
Important: These heuristics were developed and tested primarily against a single complex financial spreadsheet. They may not generalize well to:
- Simple flat data tables
- Pivot tables or crosstab layouts
- Heavily styled/formatted sheets where structure comes from formatting rather than data
- Non-English spreadsheets with different labeling conventions
- Spreadsheets with merged cells (not yet handled)
All classifications except unknown are prefixed with likely_ to indicate uncertainty:
- likely_data: Tabular data with headers
- likely_parameters: Key-value configuration/input cells
- likely_calculator: Formula-heavy computation regions (>55% formulas)
- likely_outputs: Mixed formula regions (25-55% formulas)
- likely_metadata: Labels, titles, or sparse informational content
- unknown: Could not classify with confidence
Purpose: Identify vertical parameter/config regions (label in col A, value in col B pattern)
Assumptions:
- Key-value layouts have exactly 2 "dense" columns (≥40% fill rate in sampled rows)
- Labels are short text (≤25 chars), contain letters, no digits
- Values are numbers, dates, or text longer than 2 chars
- At least 3 valid label-value pairs in first 15 rows
- At least 30% of sampled rows are valid pairs
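The label and value rules above can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: the function names are hypothetical, cells are assumed to arrive as plain Python values, booleans are assumed not to count as values, and the 30% ratio is assumed to be measured over the same first 15 rows used for the pair count.

```python
import datetime as dt


def is_label(cell):
    """Short text (<= 25 chars) that contains letters and no digits."""
    if not isinstance(cell, str):
        return False
    text = cell.strip()
    return (
        0 < len(text) <= 25
        and any(ch.isalpha() for ch in text)
        and not any(ch.isdigit() for ch in text)
    )


def is_value(cell):
    """A number, a date, or text longer than 2 characters."""
    if isinstance(cell, bool):
        return False  # assumption: booleans are not treated as values
    if isinstance(cell, (int, float, dt.date)):
        return True
    return isinstance(cell, str) and len(cell.strip()) > 2


def looks_like_key_value(pairs, sample_rows=15, min_pairs=3, min_ratio=0.30):
    """Check (label, value) pairs taken from the two dense columns."""
    sample = pairs[:sample_rows]
    if not sample:
        return False
    valid = sum(1 for label, value in sample if is_label(label) and is_value(value))
    return valid >= min_pairs and valid / len(sample) >= min_ratio
```

With these rules, a label such as "Rate 1" fails `is_label` because it contains a digit, so that row does not count toward the valid pairs.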
Known Issues:
- Sparse columns (like occasional notes in col D) can trip detection if they happen to have values in the sampled rows
- Doesn't handle horizontal key-value layouts (row 1 = labels, row 2 = values)
- Labels with numbers (e.g., "Rate 1", "Tier 2") are rejected as keys
Purpose: Find the row containing column headers for tabular data
Assumptions:
- Headers are in one of the first 3 rows of a region
- Header rows have more text cells than numeric cells
- Header values are relatively unique (not repeated)
- Data-like values (proper nouns >5 chars, strings with digits >3 chars, very long strings >40 chars) are penalized
- Date values in a potential header row reduce its score
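One way these signals could be combined into a score is sketched below. The source does not state the actual weights, so the weights here (0.5 per data-like value, 1.0 per date) are made up for illustration, and the function names are hypothetical. The proper-noun check is deliberately as naive as described: a leading capital letter plus a length over 5.

```python
import datetime as dt


def looks_like_data(text):
    """Naive data-likeness check on a text cell."""
    if len(text) > 40:
        return True  # very long string
    if len(text) > 3 and any(ch.isdigit() for ch in text):
        return True  # string containing digits
    # Naive proper-noun check: capitalized and longer than 5 chars
    return len(text) > 5 and text[0].isupper()


def score_header_row(row):
    """Higher scores mean the row looks more like a header (assumed weights)."""
    cells = [c for c in row if c is not None and c != ""]
    if not cells:
        return float("-inf")
    texts = [c.strip() for c in cells if isinstance(c, str)]
    numbers = [c for c in cells
               if isinstance(c, (int, float)) and not isinstance(c, bool)]
    dates = [c for c in cells if isinstance(c, dt.date)]
    score = len(texts) - len(numbers)            # more text than numeric cells
    if texts:
        score += len(set(texts)) / len(texts)    # reward unique values
    score -= 0.5 * sum(looks_like_data(t) for t in texts)
    score -= 1.0 * len(dates)                    # dates suggest a data row
    return score


def find_header_row(rows, max_rows=3):
    """Return the index of the best-scoring row among the first max_rows, or None."""
    candidates = rows[:max_rows]
    if not candidates:
        return None
    scores = [score_header_row(r) for r in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > 0 else None
```

Because the proper-noun check only looks at capitalization and length, a legitimate header such as "Customer" is penalized the same way a name like "Johnson" is.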
Known Issues:
- Proper noun detection is naive (just checks capitalization + length)
- Doesn't handle multi-row headers well
- English-centric assumptions about what "looks like" a header
Formula Ratio Thresholds:
- likely_calculator: >55% formula cells
- likely_outputs: 25-55% formula cells
- likely_parameters: <25% formulas AND (key-value layout OR narrow ≤3 cols)
- likely_metadata: Few non-empty cells, mostly text
- likely_data: Default fallback
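As a rough sketch, the thresholds could be applied in the order listed. The order of precedence, the function names, and the boolean inputs `is_key_value` and `is_sparse_text` are assumptions for illustration. The formula ratio is assumed to be formula cells over non-empty cells, with formulas represented as strings starting with "=".

```python
def formula_ratio(rows):
    """Fraction of non-empty cells that hold a formula (string starting with '=')."""
    cells = [c for row in rows for c in row if c is not None and c != ""]
    if not cells:
        return 0.0
    formulas = sum(1 for c in cells if isinstance(c, str) and c.startswith("="))
    return formulas / len(cells)


def classify_region(ratio, n_cols, is_key_value=False, is_sparse_text=False):
    """Map a formula ratio and layout hints to a likely_* label."""
    if ratio > 0.55:
        return "likely_calculator"
    if ratio >= 0.25:
        return "likely_outputs"
    # Below 25% formulas: fall back to layout-based hints
    if is_key_value or n_cols <= 3:
        return "likely_parameters"
    if is_sparse_text:
        return "likely_metadata"
    return "likely_data"
```

In this sketch a region at exactly 55% or exactly 25% formulas falls into likely_outputs, which matches the ranges as written above.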
Confidence Scoring:
- Based on formula consistency, cell density, header quality
- Confidence <0.5 indicates uncertain classification
- Always check the confidence field before trusting a classification
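A caller might guard on the confidence value along these lines. The result shape (a dict with `classification` and `confidence` keys) and the choice to fall back to "unknown" are assumptions made for this sketch. Only the 0.5 cutoff comes from the notes above.

```python
def trusted_classification(result, min_confidence=0.5):
    """Return the classification only when confidence is high enough."""
    confidence = result.get("confidence")
    if confidence is None or confidence < min_confidence:
        return "unknown"  # assumption: treat low confidence as unclassified
    return result["classification"]


# Hypothetical result records, for illustration only
print(trusted_classification({"classification": "likely_data", "confidence": 0.82}))
print(trusted_classification({"classification": "likely_outputs", "confidence": 0.31}))
```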
When detecting key-value layouts in regions wider than 2 columns:
- Sample first 20 rows (or all rows if fewer)
- Count non-null cells per column
- Columns with ≥40% fill rate are "dense"
- If exactly 2 dense columns exist, treat as potential key-value layout
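The steps above can be sketched as follows, assuming a region is given as a list of rows and that empty cells are represented as `None`. The function names are hypothetical.

```python
def dense_columns(rows, sample_size=20, min_fill=0.40):
    """Return indices of columns whose fill rate in the sample is >= min_fill."""
    sample = rows[:sample_size]  # all rows if the region has fewer than 20
    if not sample:
        return []
    n_cols = max(len(row) for row in sample)
    dense = []
    for col in range(n_cols):
        # Count non-null cells in this column; short rows count as empty
        filled = sum(1 for row in sample if col < len(row) and row[col] is not None)
        if filled / len(sample) >= min_fill:
            dense.append(col)
    return dense


def is_potential_key_value(rows):
    """Exactly two dense columns suggests a key-value layout."""
    return len(dense_columns(rows)) == 2
```

Note that a notes column reaching the 40% fill rate within the sampled rows counts as a third dense column, so the region no longer qualifies as a key-value candidate.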
Future Improvements:
- Diverse test corpus: Current tests use synthetic micro-sheets; need real-world variety
- Merged cell handling: Currently ignored, can break region detection
- Style-based hints: Bold headers, borders, background colors carry semantic meaning
- Horizontal key-value: Support label-row / value-row patterns
- Multi-region awareness: Adjacent regions with different structures (e.g., config block next to data table)
- Confidence calibration: Current confidence scores are not well-calibrated to actual accuracy