Skip to content

Implement lazy loading for XSSFSheet — defer sheet.xml parsing to first content access#3

Draft
Copilot wants to merge 6 commits intomasterfrom
copilot/implement-lazy-loading-xssf-sheets
Draft

Implement lazy loading for XSSFSheet — defer sheet.xml parsing to first content access#3
Copilot wants to merge 6 commits intomasterfrom
copilot/implement-lazy-loading-xssf-sheets

Conversation

Copy link

Copilot AI commented Mar 20, 2026

XSSFWorkbook currently parses every sheet#.xml eagerly during open via OnDocumentRead() → Read(). This PR defers that parse to the first meaningful sheet access, reducing open-time memory and CPU cost for workbooks where only a subset of sheets are accessed.

Core mechanism (XSSFSheet.cs)

  • OnDocumentRead() becomes a lightweight no-op — pivot table relationships are still registered eagerly (workbook indexing requires them at open time)
  • OnDocumentCreate() sets _worksheetLoaded = true so programmatically-created sheets skip the lazy path entirely
  • New thread-safe EnsureWorksheetLoaded() uses double-checked locking; the flag is set before Read() to prevent re-entrant recursion from row/cell constructors that call back into sheet APIs:
_worksheetLoaded = true;        // guard against re-entrant calls inside Read()
bool success = false;
try {
    Read(GetPackagePart().GetInputStream());
    _parseCount++;
    success = true;
} catch (IOException e) {
    throw new POIXMLException(e);
} finally {
    if (!success) _worksheetLoaded = false;   // allow retry on failure
}
  • ~190 public/internal entry points that touch worksheet, _rows, _columns, hyperlinks, tables, etc. now call EnsureWorksheetLoaded() first
  • PrepareForCommit() / Commit() skip clearing/rewriting the package part when the sheet was never accessed — the original bytes are preserved in-place

Supporting fixes

  • XSSFChartSheet — overridden GetCTDrawing(), GetCTLegacyDrawing(), GetCTChartsheet(), and Write() all call EnsureWorksheetLoaded() since chartsheet is populated by the deferred Read()
  • Pivot table double-addRead() now guards with Contains() before adding to GetWorkbook().PivotTables, since OnDocumentRead() already registers them
  • Write() / GetTables() / GetDataValidations() — added EnsureWorksheetLoaded() (called directly by CloneSheet() and formula evaluation paths, bypassing Commit())

Tests (TestXSSFSheetLazyLoad.cs)

Nine new NUnit tests verify the lazy contract: GetSheetAt() / GetSheet() do not trigger a parse; _parseCount increments exactly once on first row/cell/iterator access; subsequent accesses do not re-parse; unaccessed sheets in a multi-sheet workbook remain unparsed.

TestReadFails updated: OnDocumentRead() is now intentionally a no-op, so the previous assertion that it throws POIXMLException no longer applies.

Original prompt

Implement Variant A lazy loading for XSSF worksheets in tonyqus/npoi so that sheet#.xml is not parsed during XSSFWorkbook open, and is only parsed the first time sheet content is accessed (e.g., rows/cells/iterators/properties that require worksheet).

Goal

Optimize workbook open time and memory by deferring sheet.xml parsing until first meaningful interaction with an XSSFSheet instance.

Current behavior (eager parse)

XSSFWorkbook(OPCPackage) calls Load(XSSFFactory.GetInstance()), which builds parts and invokes OnDocumentRead().
XSSFSheet.OnDocumentRead() currently calls Read(GetPackagePart().GetInputStream()), which parses XML into worksheet and initializes rows/columns.

Required behavior (Variant A)

  • Calling workbook.GetSheetAt(i) / GetSheet(name) / enumerating sheets must NOT parse sheet.xml.
  • Accessing sheet content/structure (rows/cells/iteration/row & column-dependent properties) must trigger parsing exactly once.

Implementation plan

  1. Update ooxml/XSSF/UserModel/XSSFSheet.cs:

    • Add private fields to track load state and for synchronization (e.g., _worksheetLoaded, _loadLock).
    • Change OnDocumentRead() to not call Read(...).
    • Add an internal/private EnsureWorksheetLoaded() method that performs the previous Read(...) logic once (including InitRows/InitColumns) and is safe against re-entrancy.
    • Ensure that existing Read(Stream) logic is reused.
  2. Add EnsureWorksheetLoaded() calls to all relevant XSSFSheet entry points that require worksheet or row/column initialization, including but not limited to:

    • row access/creation (GetRow, CreateRow, etc.)
    • enumerators/iteration over rows and cells
    • properties like FirstRowNum, LastRowNum, PhysicalNumberOfRows, etc.
    • merged regions APIs if they rely on worksheet
    • any methods that touch worksheet, rows, columns, hyperlinks, validations, drawings if dependent on worksheet DOM.
  3. Keep XSSFChartSheet and other non-worksheet ISheet implementations unaffected.

  4. Add tests (NUnit) in testcases/ooxml/XSSF/UserModel/:

    • Verify that new XSSFWorkbook(stream) does not parse any sheet.xml for normal sheets before sheet content is accessed.
    • Verify that GetSheetAt(0) alone does not trigger parse.
    • Verify that the first call to a row-dependent API (e.g., sheet.LastRowNum, sheet.GetRow(0), or enumerating rows) triggers parse and subsequent calls do not reparse.
    • Consider using a hook/counter in XSSFSheet.Read(Stream) or EnsureWorksheetLoaded() (internal visibility + InternalsVisibleTo for tests, or conditional compilation) to assert parse counts.

Files likely to change

  • ooxml/XSSF/UserModel/XSSFSheet.cs
  • possibly other XSSF usermodel files if they directly access internal worksheet field without going through methods
  • new/updated tests under testcases/ooxml/XSSF/UserModel/

Acceptance criteria

  • Existing tests pass.
  • New tests demonstrate lazy behavior.
  • No functional regression in sheet operations.
  • sheet.xml is parsed only on first content access (Variant A).

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Optimize logic to implement lazy sheet.xml loading. Only when XSSFSheet is visit via Workbook API, NPOI will really parse the sheet.xml.

Assistant: To implement lazy sheet.xml loading for XSSFSheet, the key is to stop NPOI from eagerly calling XSSFSheet.OnDocumentRead() (which parses sheet.xml) during workbook open, and instead parse only on first “real” sheet access via the Workbook/Sheet APIs.

What I found so far (tool results may be incomplete; you can browse more in GitHub code search UI):

  • XSSFSheet.OnDocumentRead() currently parses sheet.xml immediately by calling Read(GetPackagePart().GetInputStream()), which DOM-parses the entire sheet and then initializes rows/columns.
    internal override void OnDocumentRead()
    {
        try
        {
            Read(GetPackagePart().GetInputStream());
        }
        catch(IOException e)
        {
            throw new POIXMLException(e);
        }
    }
    
    internal virtual void Read(Stream is1)
    {
        XmlDocument doc = ConvertStreamToXml(is1);
        worksheet = WorksheetDocument.Parse(doc, NamespaceManager).GetWorksheet();
        InitRows(worksheet);
        InitColumns(worksheet);
    }
  • XSSFWorkbook(OPCPackage pkg) calls Load(XSSFFactory.GetInstance()) which builds the POIXMLDocumentPart tree. That load step is likely where OnDocumentRead() gets invoked for each part, causing eager sheet parsing.

This pull request was created from Copilot chat.


📱 Kick off Copilot coding agent tasks wherever you are with GitHub Mobile, available on iOS and Android.

Copilot AI and others added 4 commits March 20, 2026 12:39
- Add _worksheetLoaded, _loadLock, and _parseCount fields
- OnDocumentRead() defers parsing; lazy-loads on first access
- OnDocumentCreate() marks sheet as loaded immediately
- Add EnsureWorksheetLoaded() with double-checked locking
- Override PrepareForCommit()/Commit() to skip unloaded sheets
- Add EnsureWorksheetLoaded() call to all public/internal methods
  and property getters/setters that access worksheet data

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Set _worksheetLoaded = true before calling Read() to prevent
infinite recursion when Read() triggers callbacks (XSSFRow
constructor calls OnReadCell(), LastRowNum, PhysicalNumberOfRows)
that themselves call EnsureWorksheetLoaded().

Reset _worksheetLoaded = false on exception to allow retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests verify that:
- Opening a workbook does not parse sheet XML
- GetSheetAt/GetSheet alone do not trigger parsing
- First content access triggers parse exactly once
- Subsequent accesses do not re-parse (_parseCount stays at 1)
- Newly created in-memory sheets work without parsing
- Data correctness after lazy load
- Multiple sheets are independently lazy-loaded

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
Copilot AI changed the title [WIP] Implement lazy loading for XSSF worksheets parsing Implement Variant A lazy loading for XSSFSheet — defer sheet.xml parsing to first content access Mar 20, 2026
Copilot AI requested a review from tonyqus March 20, 2026 13:09
@tonyqus tonyqus changed the title Implement Variant A lazy loading for XSSFSheet — defer sheet.xml parsing to first content access Implement lazy loading for XSSFSheet — defer sheet.xml parsing to first content access Mar 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants