[Feature]: Add xlsx_read tool for Excel spreadsheet extraction #2317

@reidliu41

Summary

Add an xlsx_read tool that extracts plain text and numeric data from XLSX files, completing the Office document family alongside docx_read and pptx_read.

Problem statement

Agents encounter Excel spreadsheets in real-world workflows — financial reports, inventory lists, configuration
tables, data exports. Currently the agent has no way to read XLSX files: it can read PDF, DOCX, and PPTX, but asking it to
inspect a spreadsheet fails silently or requires the user to manually convert to CSV first. This gap breaks otherwise seamless
document-processing pipelines.

Proposed solution

Implement xlsx_read as a new Tool trait implementation in src/tools/xlsx_read.rs, following the same security pipeline and
ZIP + quick-xml parsing approach used by docx_read and pptx_read. The tool should:

  • Parse xl/sharedStrings.xml for the string pool, xl/workbook.xml + rels for sheet ordering, and xl/worksheets/sheet*.xml for
    cell data
  • Output tab-separated values per row, newline-separated rows, with --- Sheet: --- headers for multi-sheet workbooks
  • Support path (required) and max_chars (optional, default 50000, max 200000) parameters
  • Enforce the same security checks as sibling tools: rate limit, path allowlist, canonicalization, resolved-path check, 50 MB
    file size cap, 16 MB cumulative XML guard
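
To make the output format and the max_chars limits above concrete, here is a minimal std-only sketch. The Sheet struct, resolve_max_chars, and render are hypothetical names used for illustration, not the project's API. Two things are assumptions: that an over-limit max_chars is clamped rather than rejected, and that the sheet name goes inside the header line.

```rust
/// Default and upper bound for `max_chars`, as stated in the proposal.
const DEFAULT_MAX_CHARS: usize = 50_000;
const MAX_MAX_CHARS: usize = 200_000;

/// Hypothetical in-memory form of one parsed worksheet.
struct Sheet {
    name: String,
    rows: Vec<Vec<String>>,
}

/// Resolve the optional `max_chars` argument.
/// Assumption: values above the cap are clamped, not rejected.
fn resolve_max_chars(requested: Option<usize>) -> usize {
    requested.unwrap_or(DEFAULT_MAX_CHARS).min(MAX_MAX_CHARS)
}

/// Render sheets as tab-separated cells, one row per line.
/// A header line is emitted only for multi-sheet workbooks.
/// Returns the text and whether it was truncated.
fn render(sheets: &[Sheet], max_chars: usize) -> (String, bool) {
    let mut out = String::new();
    let multi = sheets.len() > 1;
    for sheet in sheets {
        if multi {
            // Assumption: the sheet name is placed inside the header.
            out.push_str(&format!("--- Sheet: {} ---\n", sheet.name));
        }
        for row in &sheet.rows {
            out.push_str(&row.join("\t"));
            out.push('\n');
        }
    }
    // Truncate on a character boundary rather than a raw byte index.
    match out.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => {
            out.truncate(byte_idx);
            (out, true)
        }
        None => (out, false),
    }
}
```

Counting characters instead of bytes keeps truncation from splitting a multi-byte UTF-8 sequence, which matters for spreadsheets with non-ASCII cell text.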

Non-goals / out of scope

  • No .xls (legacy binary format) support — requires a completely different parser
  • No formula evaluation — returns raw cell values only
  • No chart, style, image, or merged-cell awareness
  • No new Cargo dependencies — reuses existing zip + quick-xml
  • No feature flag — consistent with docx_read / pptx_read registration

Alternatives considered

  • Add a third-party crate like calamine — adds a new dependency for a task achievable with existing zip + quick-xml, violating
    the project's minimal-dependency principle
  • Require users to pre-convert XLSX to CSV — poor UX and breaks autonomous agent workflows
  • Do nothing — leaves a clear gap in the document extraction family

Acceptance criteria

  • xlsx_read registered in all_tools_with_runtime and callable by the agent
  • Shared-string, numeric, boolean, and inline-string cell types correctly extracted
  • Multi-sheet workbooks produce labeled, ordered output
  • Fallback extraction works when workbook.xml is absent
  • Full security pipeline (rate limit, path policy, symlink escape, size cap, XML bomb guard) tested
  • cargo fmt, cargo clippy -D warnings, and all unit tests pass
  • Tested against a real Excel-generated XLSX file
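
For the cell-type criterion above, the sketch below shows one way to turn a parsed cell into text once the XML has been read. In SpreadsheetML the t attribute on a c element selects the type: "s" is an index into the shared-string pool, "b" is a boolean stored as 0 or 1, "inlineStr" carries its text in a nested is/t element, and a missing t means numeric. The function name resolve_cell, its signature, and the TRUE/FALSE rendering are assumptions for illustration; the actual tool may differ.

```rust
/// Resolve one worksheet cell to display text.
/// `cell_type` is the `t` attribute of `<c>`, `value` is the text of `<v>`,
/// and `inline` is the text of `<is><t>` for inline strings.
fn resolve_cell(
    cell_type: Option<&str>,
    value: Option<&str>,
    inline: Option<&str>,
    shared: &[String],
) -> String {
    match cell_type {
        // Shared string: `<v>` holds a zero-based index into the pool.
        // An unparseable or out-of-range index yields an empty string.
        Some("s") => value
            .and_then(|v| v.trim().parse::<usize>().ok())
            .and_then(|i| shared.get(i))
            .cloned()
            .unwrap_or_default(),
        // Boolean: stored as 1 or 0 (TRUE/FALSE rendering is an assumption).
        Some("b") => match value.map(str::trim) {
            Some("1") => "TRUE".to_string(),
            Some("0") => "FALSE".to_string(),
            other => other.unwrap_or_default().to_string(),
        },
        // Inline string: text lives in `<is><t>`, not in `<v>`.
        Some("inlineStr") => inline.unwrap_or_default().to_string(),
        // Numeric (explicit "n" or no `t`) and anything else: raw `<v>` text.
        _ => value.unwrap_or_default().to_string(),
    }
}
```

Returning the raw `<v>` text for every other type matches the stated non-goal of no formula evaluation: cached values are passed through unchanged.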

Architecture impact

  • src/tools/xlsx_read.rs — new file (~480 lines including tests)
  • src/tools/mod.rs — module declaration, re-export, registration (3 insertions)
  • No other subsystems affected

Risk and rollback

  • Risk: Low — additive-only change, no existing behavior modified, zero new dependencies
  • Rollback: Revert the single commit; or remove the tool_arcs.push(Arc::new(XlsxReadTool::...)) line to disable without
    removing code

Breaking change?

No

Data hygiene checks

  • I removed personal/sensitive data from examples, payloads, and logs.
  • I used neutral, project-focused wording and placeholders.
