Implement lazy loading for XSSFSheet: defer sheet.xml parsing to first content access #3
Draft
- Add _worksheetLoaded, _loadLock, and _parseCount fields
- OnDocumentRead() defers parsing; lazy-loads on first access
- OnDocumentCreate() marks sheet as loaded immediately
- Add EnsureWorksheetLoaded() with double-checked locking
- Override PrepareForCommit()/Commit() to skip unloaded sheets
- Add EnsureWorksheetLoaded() call to all public/internal methods and property getters/setters that access worksheet data

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
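The "skip unloaded sheets" step in the commit above can be illustrated with a minimal sketch. Everything here is hypothetical: PartSketch and LazyCommitSketch are stand-ins invented for illustration and are not NPOI's PackagePart or XSSFSheet, and plain text replaces sheet XML. The sketch only shows the idea that Commit() rewrites the part when the content was loaded and otherwise leaves the original bytes alone.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a package part that stores raw bytes (not NPOI's PackagePart).
public class PartSketch
{
    public byte[] Bytes;
    public int WriteCount;

    public PartSketch(byte[] bytes) { Bytes = bytes; }

    public void Overwrite(byte[] bytes)
    {
        Bytes = bytes;
        WriteCount++;
    }
}

// Hypothetical stand-in for a lazily loaded sheet (not NPOI's XSSFSheet).
public class LazyCommitSketch
{
    private readonly PartSketch _part;
    private bool _loaded;
    private string _content;

    public LazyCommitSketch(PartSketch part) { _part = part; }

    public void SetContent(string value)
    {
        EnsureLoaded();          // any content access forces the load first
        _content = value;
    }

    private void EnsureLoaded()
    {
        if (_loaded) return;
        _content = Encoding.UTF8.GetString(_part.Bytes);
        _loaded = true;
    }

    public void Commit()
    {
        // Never loaded: the bytes already in the part are still authoritative.
        if (!_loaded) return;
        _part.Overwrite(Encoding.UTF8.GetBytes(_content));
    }
}
```

In this sketch, calling Commit() on an untouched instance leaves WriteCount at 0, while calling it after SetContent() rewrites the part once.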
Set _worksheetLoaded = true before calling Read() to prevent infinite recursion when Read() triggers callbacks (XSSFRow constructor calls OnReadCell(), LastRowNum, PhysicalNumberOfRows) that themselves call EnsureWorksheetLoaded(). Reset _worksheetLoaded = false on exception to allow retry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
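The ordering described in this commit (flag set before Read(), reset on failure) can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: LazySheetSketch, its constructor, and the ParseCount property are hypothetical names, and a list of strings stands in for parsed rows. The field names _worksheetLoaded and _loadLock are taken from the commit messages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a lazily parsed sheet; not NPOI's actual XSSFSheet.
public class LazySheetSketch
{
    private readonly object _loadLock = new object();
    private volatile bool _worksheetLoaded;
    private readonly List<string> _rows = new List<string>();
    private readonly Func<IEnumerable<string>> _source;

    // Counts parse attempts so callers can check that loading happens once.
    public int ParseCount { get; private set; }

    public LazySheetSketch(Func<IEnumerable<string>> source)
    {
        _source = source;
    }

    public int PhysicalNumberOfRows
    {
        get
        {
            EnsureWorksheetLoaded();
            return _rows.Count;
        }
    }

    private void EnsureWorksheetLoaded()
    {
        if (_worksheetLoaded) return;        // fast path without taking the lock
        lock (_loadLock)
        {
            if (_worksheetLoaded) return;    // second check under the lock
            _worksheetLoaded = true;         // set first so re-entrant calls return early
            try
            {
                Read();
            }
            catch
            {
                _rows.Clear();
                _worksheetLoaded = false;    // allow a retry after a failed parse
                throw;
            }
        }
    }

    private void Read()
    {
        ParseCount++;
        foreach (string row in _source())
        {
            _rows.Add(row);
            // Simulates a row constructor calling back into the sheet mid-parse.
            _ = PhysicalNumberOfRows;
        }
    }
}
```

One consequence of setting the flag before the parse finishes is that a second thread taking the lock-free fast path could observe the flag as set while Read() is still running. The sketch does not address that case; a separate "loading in progress" marker would be one possible way to distinguish the two states.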
Tests verify that:
- Opening a workbook does not parse sheet XML
- GetSheetAt/GetSheet alone do not trigger parsing
- First content access triggers parse exactly once
- Subsequent accesses do not re-parse (_parseCount stays at 1)
- Newly created in-memory sheets work without parsing
- Data correctness after lazy load
- Multiple sheets are independently lazy-loaded

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement lazy loading for XSSF worksheets parsing" to "Implement Variant A lazy loading for XSSFSheet: defer sheet.xml parsing to first content access" on Mar 20, 2026.
XSSFWorkbook currently parses every sheet#.xml eagerly during open via OnDocumentRead() → Read(). This PR defers that parse to the first meaningful sheet access, reducing open-time memory and CPU cost for workbooks where only a subset of sheets is accessed.

Core mechanism (XSSFSheet.cs)
- OnDocumentRead() no longer parses the sheet. Pivot table relationships are still registered eagerly, because workbook indexing requires them at open time.
- OnDocumentCreate() sets _worksheetLoaded = true so programmatically created sheets skip the lazy path entirely.
- EnsureWorksheetLoaded() uses double-checked locking; the flag is set before Read() to prevent re-entrant recursion from row/cell constructors that call back into sheet APIs.
- Members that touch worksheet, _rows, _columns, hyperlinks, tables, etc. now call EnsureWorksheetLoaded() first.
- PrepareForCommit()/Commit() skip clearing/rewriting the package part when the sheet was never accessed; the original bytes are preserved in place.

Supporting fixes
- XSSFChartSheet: the overridden GetCTDrawing(), GetCTLegacyDrawing(), GetCTChartsheet(), and Write() all call EnsureWorksheetLoaded(), since chartsheet is populated by the deferred Read().
- Read() now guards with Contains() before adding to GetWorkbook().PivotTables, since OnDocumentRead() already registers them.
- Write()/GetTables()/GetDataValidations(): added EnsureWorksheetLoaded(), because these are called directly by CloneSheet() and formula evaluation paths, bypassing Commit().

Tests (TestXSSFSheetLazyLoad.cs)
Nine new NUnit tests verify the lazy contract: GetSheetAt()/GetSheet() do not trigger a parse; _parseCount increments exactly once on first row/cell/iterator access; subsequent accesses do not re-parse; unaccessed sheets in a multi-sheet workbook remain unparsed.

TestReadFails updated: OnDocumentRead() no longer parses the sheet, so the previous assertion that it throws POIXMLException no longer applies.

Original prompt
Implement Variant A lazy loading for XSSF worksheets in tonyqus/npoi so that sheet#.xml is not parsed during XSSFWorkbook open, and is only parsed the first time sheet content is accessed (e.g., rows/cells/iterators/properties that require worksheet).

Goal
Optimize workbook open time and memory by deferring sheet.xml parsing until first meaningful interaction with an XSSFSheet instance.

Current behavior (eager parse)
- XSSFWorkbook(OPCPackage) calls Load(XSSFFactory.GetInstance()), which builds parts and invokes OnDocumentRead().
- XSSFSheet.OnDocumentRead() currently calls Read(GetPackagePart().GetInputStream()), which parses XML into worksheet and initializes rows/columns.

Required behavior (Variant A)
- workbook.GetSheetAt(i) / GetSheet(name) / enumerating sheets must NOT parse sheet.xml.

Implementation plan
Update ooxml/XSSF/UserModel/XSSFSheet.cs:
- Add state fields (_worksheetLoaded, _loadLock).
- Change OnDocumentRead() to not call Read(...).
- Add an EnsureWorksheetLoaded() method that performs the previous Read(...) logic once (including InitRows/InitColumns) and is safe against re-entrancy.
- Ensure the existing Read(Stream) logic is reused.

Add EnsureWorksheetLoaded() calls to all relevant XSSFSheet entry points that require worksheet or row/column initialization, including but not limited to:
- Row access (GetRow, CreateRow, etc.)
- FirstRowNum, LastRowNum, PhysicalNumberOfRows, etc.
- [...] worksheet
- [...] worksheet, rows, columns, hyperlinks, validations, drawings if dependent on worksheet DOM.

Keep XSSFChartSheet and other non-worksheet ISheet implementations unaffected.

Add tests (NUnit) in testcases/ooxml/XSSF/UserModel/:
- new XSSFWorkbook(stream) does not parse any sheet.xml for normal sheets before sheet content is accessed.
- GetSheetAt(0) alone does not trigger parse.
- First content access (sheet.LastRowNum, sheet.GetRow(0), or enumerating rows) triggers parse, and subsequent calls do not reparse.
- [...] XSSFSheet.Read(Stream) or EnsureWorksheetLoaded() (internal visibility + InternalsVisibleTo for tests, or conditional compilation) to assert parse counts.

Files likely to change
- ooxml/XSSF/UserModel/XSSFSheet.cs
- [...] worksheet field without going through methods
- testcases/ooxml/XSSF/UserModel/

Acceptance criteria
- sheet.xml is parsed only on first content access (Variant A).

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Optimize the logic to implement lazy sheet.xml loading. NPOI should parse sheet.xml only when the XSSFSheet is actually visited via the Workbook API.

A: To implement lazy sheet.xml loading for XSSFSheet, the key is to stop NPOI from eagerly calling XSSFSheet.OnDocumentRead() (which parses sheet.xml) during workbook open, and instead parse only on first “real” sheet access via the Workbook/Sheet APIs.

What I found so far (tool results may be incomplete; you can browse more in GitHub code search UI):
- XSSFSheet.OnDocumentRead() currently parses sheet.xml immediately by calling Read(GetPackagePart().GetInputStream()), which DOM-parses the entire sheet and then initializes rows/columns.
- XSSFWorkbook(OPCPackage pkg) calls Load(XSSFFactory.GetInstance()), which builds the POIXMLDocumentPart tree. That load step is likely where OnDocumentRead() gets invoked for each part, causing eager sheet parsing.

This pull request was created from Copilot chat.