Region Detection Heuristics

This document describes the heuristics used for automatic region detection and classification in spreadsheets. These are best-effort guesses based on structural patterns - always verify with actual data.

Current Limitations

Important: These heuristics were developed and tested primarily against a single complex financial spreadsheet. They may not generalize well to:

Simple flat data tables
Pivot tables or crosstab layouts
Heavily styled/formatted sheets where structure comes from formatting not data
Non-English spreadsheets with different labeling conventions
Spreadsheets with merged cells (not yet handled)

Classification Labels

All classifications are prefixed with likely_ to indicate uncertainty:

likely_data - Tabular data with headers
likely_parameters - Key-value configuration/input cells
likely_calculator - Formula-heavy computation regions (>55% formulas)
likely_outputs - Mixed formula regions (25-55% formulas)
likely_metadata - Labels, titles, or sparse informational content
unknown - Could not classify with confidence

Key-Value Layout Detection

Purpose: Identify vertical parameter/config regions (label in col A, value in col B pattern)

Assumptions:

Key-value layouts have exactly 2 "dense" columns (≥40% fill rate in sampled rows)
Labels are short text (≤25 chars), contain letters, no digits
Values are numbers, dates, or text longer than 2 chars
At least 3 valid label-value pairs in first 15 rows
At least 30% of sampled rows are valid pairs

Known Issues:

Sparse columns (like occasional notes in col D) can trip detection if they happen to have values in the sampled rows
Doesn't handle horizontal key-value layouts (row 1 = labels, row 2 = values)
Labels with numbers (e.g., "Rate 1", "Tier 2") are rejected as keys

Header Detection

Purpose: Find the row containing column headers for tabular data

Assumptions:

Headers are in one of the first 3 rows of a region
Header rows have more text cells than numeric cells
Header values are relatively unique (not repeated)
Data-like values (proper nouns >5 chars, strings with digits >3 chars, very long strings >40 chars) are penalized
Date values in a potential header row reduce its score

Known Issues:

Proper noun detection is naive (just checks capitalization + length)
Doesn't handle multi-row headers well
English-centric assumptions about what "looks like" a header

Region Classification

Formula Ratio Thresholds:

likely_calculator: >55% formula cells
likely_outputs: 25-55% formula cells
likely_parameters: <25% formulas AND (key-value layout OR narrow ≤3 cols)
likely_metadata: Few non-empty cells, mostly text
likely_data: Default fallback

Confidence Scoring:

Based on formula consistency, cell density, header quality
Confidence <0.5 indicates uncertain classification
Always check confidence field before trusting classification

Column Density Calculation

When detecting key-value layouts in regions wider than 2 columns:

Sample first 20 rows (or all rows if fewer)
Count non-null cells per column
Columns with ≥40% fill rate are "dense"
If exactly 2 dense columns exist, treat as potential key-value layout

Future Improvements Needed

Diverse test corpus: Current tests use synthetic micro-sheets; need real-world variety
Merged cell handling: Currently ignored, can break region detection
Style-based hints: Bold headers, borders, background colors carry semantic meaning
Horizontal key-value: Support label-row / value-row patterns
Multi-region awareness: Adjacent regions with different structures (e.g., config block next to data table)
Confidence calibration: Current confidence scores are not well-calibrated to actual accuracy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Region Detection Heuristics

Current Limitations

Classification Labels

Key-Value Layout Detection

Header Detection

Region Classification

Column Density Calculation

Future Improvements Needed

FilesExpand file tree

HEURISTICS.md

Latest commit

History

HEURISTICS.md

File metadata and controls

Region Detection Heuristics

Current Limitations

Classification Labels

Key-Value Layout Detection

Header Detection

Region Classification

Column Density Calculation

Future Improvements Needed