Summary
Add an xlsx_read tool that extracts plain text and numeric data from XLSX files, completing the Office document family alongside docx_read and pptx_read.
Problem statement
Agents encounter Excel spreadsheets in real-world workflows — financial reports, inventory lists, configuration
tables, data exports. Currently the agent has no way to read XLSX files: it can read PDF, DOCX, and PPTX, but asking it to
inspect a spreadsheet fails silently or requires the user to manually convert to CSV first. This gap breaks otherwise seamless
document-processing pipelines.
Proposed solution
Implement xlsx_read as a new Tool trait implementation in src/tools/xlsx_read.rs, following the same security pipeline and ZIP
+ quick-xml parsing approach used by docx_read and pptx_read. The tool should:
- Parse xl/sharedStrings.xml for the string pool, xl/workbook.xml + rels for sheet ordering, and xl/worksheets/sheet*.xml for
cell data
- Output tab-separated values per row, newline-separated rows, with --- Sheet: --- headers for multi-sheet workbooks
- Support path (required) and max_chars (optional, default 50000, max 200000) parameters
- Enforce the same security checks as sibling tools: rate limit, path allowlist, canonicalization, resolved-path check, 50 MB
file size cap, 16 MB cumulative XML guard
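The output and max_chars rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Sheet struct, the function names, the exact header text (the sheet name placed between "--- Sheet:" and "---"), and truncation by character count are all hypothetical and not taken from the actual implementation.

```rust
/// One parsed worksheet: its display name plus rows of already-resolved cell text.
/// Hypothetical type for illustration only.
pub struct Sheet {
    pub name: String,
    pub rows: Vec<Vec<String>>,
}

const DEFAULT_MAX_CHARS: usize = 50_000;
const MAX_MAX_CHARS: usize = 200_000;

/// Apply the documented default (50000) and ceiling (200000) to `max_chars`.
pub fn effective_max_chars(requested: Option<usize>) -> usize {
    requested.unwrap_or(DEFAULT_MAX_CHARS).min(MAX_MAX_CHARS)
}

/// Render sheets as tab-separated cells, one row per line.
/// A header line is emitted only when the workbook has more than one sheet.
pub fn render(sheets: &[Sheet], max_chars: usize) -> String {
    let multi = sheets.len() > 1;
    let mut out = String::new();
    for sheet in sheets {
        if multi {
            // Assumed header shape: the sheet name goes between the markers.
            out.push_str(&format!("--- Sheet: {} ---\n", sheet.name));
        }
        for row in &sheet.rows {
            out.push_str(&row.join("\t"));
            out.push('\n');
        }
    }
    // Cut at a character boundary so multi-byte UTF-8 text is never split.
    let cut = out.char_indices().nth(max_chars).map(|(i, _)| i);
    if let Some(byte_idx) = cut {
        out.truncate(byte_idx);
    }
    out
}
```

Emitting the header only for multi-sheet workbooks keeps single-sheet output identical to plain TSV, which is one possible reading of the bullet above.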
Non-goals / out of scope
- No .xls (legacy binary format) support — requires a completely different parser
- No formula evaluation — returns raw cell values only
- No chart, style, image, or merged-cell awareness
- No new Cargo dependencies — reuses existing zip + quick-xml
- No feature flag — consistent with docx_read / pptx_read registration
Alternatives considered
- Add a third-party crate like calamine — adds a new dependency for a task achievable with existing zip + quick-xml, violating
the project's minimal-dependency principle
- Require users to pre-convert XLSX to CSV — poor UX and breaks autonomous agent workflows
- Do nothing — leaves a clear gap in the document extraction family
Acceptance criteria
- xlsx_read registered in all_tools_with_runtime and callable by the agent
- Shared-string, numeric, boolean, and inline-string cell types correctly extracted
- Multi-sheet workbooks produce labeled, ordered output
- Fallback extraction works when workbook.xml is absent
- Full security pipeline (rate limit, path policy, symlink escape, size cap, XML bomb guard) tested
- cargo fmt, cargo clippy -D warnings, and all unit tests pass
- Tested against a real Excel-generated XLSX file
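For the cell-type criterion, a minimal sketch of how a cell's raw value might be resolved is shown below. It relies on the SpreadsheetML convention that a cell's t attribute selects the type ("s" for a zero-based shared-string index, "b" for a boolean stored as 1 or 0, "inlineStr" for inline text, absent or "n" for numbers). The function name, the TRUE/FALSE rendering, and the empty-string fallback for an out-of-range index are assumptions, not the tool's confirmed behavior.

```rust
/// Resolve one worksheet cell to display text.
/// `cell_type` is the `t` attribute of `<c>`; `raw` is the text of `<v>`
/// (or of `<is><t>` for inline strings); `shared` is the pool parsed from
/// xl/sharedStrings.xml.
pub fn resolve_cell(cell_type: Option<&str>, raw: &str, shared: &[String]) -> String {
    match cell_type {
        // Shared string: `raw` is a zero-based index into the string pool.
        // Assumption: an unparsable or out-of-range index yields an empty cell.
        Some("s") => raw
            .trim()
            .parse::<usize>()
            .ok()
            .and_then(|i| shared.get(i))
            .cloned()
            .unwrap_or_default(),
        // Boolean: stored as "1" or "0"; the TRUE/FALSE spelling is an assumption.
        Some("b") => {
            if raw.trim() == "1" {
                "TRUE".to_string()
            } else {
                "FALSE".to_string()
            }
        }
        // Numbers, inline strings, and anything else pass through unchanged,
        // matching the "raw cell values only" non-goal.
        _ => raw.to_string(),
    }
}
```

Unit tests for the shared-string, numeric, boolean, and inline-string cases can then target this one function without building a full XLSX archive.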
Architecture impact
- src/tools/xlsx_read.rs — new file (~480 lines including tests)
- src/tools/mod.rs — module declaration, re-export, registration (3 insertions)
- No other subsystems affected
Risk and rollback
- Risk: Low — additive-only change, no existing behavior modified, zero new dependencies
- Rollback: Revert the single commit, or remove the tool_arcs.push(Arc::new(XlsxReadTool::...)) line to disable the tool without
removing code
Breaking change?
No
Data hygiene checks