feat: Add support for ListView and LargeListView types #323
Draft. GeorgeLeePatterson wants to merge 44 commits into apache:main from GeorgeLeePatterson:feat/list-view
This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling.

- Added BinaryView and Utf8View type classes with view struct layout constants
- Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24
- Data class support for variadic buffer management

- Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data)
- Set visitor: Marks as immutable (read-only)
- VectorLoader: Reads from IPC format with variadicBufferCounts
- TypeComparator, TypeCtor: Type system integration
- JSON visitors: Explicitly unsupported (throws error)

- Generated schema files for BinaryView, Utf8View, ListView, LargeListView
- Script to regenerate from Arrow format definitions

- Reading BinaryView/Utf8View columns from Arrow IPC files
- Accessing values with proper inline/out-of-line handling
- Variadic buffer management
- Type checking and comparison

- ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts)
- ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling
- ✅ TypeScript compiles without errors
- ✅ All existing tests pass
- ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds)

- Reading query results from DataFusion 50.0+ with view types enabled
- Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns
- Processing Arrow IPC streams from any system using view types

- Builders for write operations
- ListView/LargeListView type implementation
- Additional test coverage

Closes apache#311
Related to apache#225
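The inline/out-of-line read path described above can be sketched roughly as follows. This is a minimal illustration of the view layout, not the code in this PR: `readViewBytes` and the two constants are hypothetical names, and null handling (the validity bitmap) is left out.

```typescript
// Hypothetical names for illustration; not the arrow-js API.
const VIEW_STRUCT_BYTES = 16; // every view is a fixed 16-byte struct
const MAX_INLINE_BYTES = 12;  // values up to 12 bytes are stored in the struct

/** Return the bytes of the value described by the view at `index`. */
function readViewBytes(
    views: Uint8Array,
    variadicBuffers: Uint8Array[],
    index: number
): Uint8Array {
    const start = index * VIEW_STRUCT_BYTES;
    const dv = new DataView(views.buffer, views.byteOffset + start, VIEW_STRUCT_BYTES);
    const size = dv.getInt32(0, true); // little-endian value length
    if (size <= MAX_INLINE_BYTES) {
        // Inline: the value occupies the bytes right after the size field.
        return views.subarray(start + 4, start + 4 + size);
    }
    // Out-of-line: bytes 4..8 hold a prefix, followed by buffer index and offset.
    const bufferIndex = dv.getInt32(8, true);
    const offset = dv.getInt32(12, true);
    const buffer = variadicBuffers[bufferIndex];
    if (buffer === undefined) {
        throw new RangeError(`variadic buffer ${bufferIndex} is missing`);
    }
    return buffer.subarray(offset, offset + size);
}
```

For a Utf8View column the returned bytes would additionally be decoded as UTF-8.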
… from test tsconfig
Add scripts/update_flatbuffers.sh and test/unit/ipc/view-types-tests.ts to RAT (Release Audit Tool) exclusion list. Both files have proper Apache license headers but need to be excluded from license scanning.
This reverts commit dfe9d56.
Remove blank line after shebang to match Apache Arrow JS convention. License header must start on line 2 with '#' as shown in ci/scripts/build.sh
Add BinaryView and Utf8View to main exports in Arrow.ts. These types were implemented but not exported, causing 'BinaryView is not a constructor' errors in ES5 UMD tests.
Add BinaryView and Utf8View to Arrow.dom.ts exports. Arrow.node.ts re-exports from Arrow.dom.ts, so this fixes both entrypoints.
- Simplify variadicBuffers byteLength calculation with reduce
- Remove unsupported type enum entries (only add BinaryView and Utf8View)
- Eliminate type casting by extracting getBinaryViewBytes helper
- Simplify readVariadicBuffers with Array.from
- Remove CompressedVectorLoader override (inherits base implementation)
- Delete SparseTensor.ts (not implementing tensors in this PR)
- Implement BinaryViewBuilder with inline/out-of-line storage logic
- Implement Utf8ViewBuilder with UTF-8 encoding support
- Support random-access writes (not just append-only)
- Proper variadic buffer management (32MB buffers per spec)
- Handle null values correctly
- Register builders in builderctor visitor
- Add comprehensive test suite covering:
  - Inline values (≤12 bytes)
  - Out-of-line values (>12 bytes)
  - Mixed inline/out-of-line
  - Null values
  - Empty values
  - 12-byte boundary cases
  - UTF-8 multibyte characters
  - Large batches (1000 values)
  - Multiple flushes

Fixes:
- Correct buffer allocation for random-access writes
- Proper byteLength calculation (no double-counting)
- Follows FixedWidthBuilder patterns for index-based writes
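The inline/out-of-line storage decision on the write side might look roughly like the sketch below. `encodeView` is a hypothetical helper, not the builder's actual method. It assumes the caller has already copied an out-of-line value into the variadic buffer identified by `bufferIndex` at `offset`.

```typescript
/**
 * Hypothetical helper: build the 16-byte view struct for one value.
 * Values of 12 bytes or fewer are stored inline; longer values store
 * a 4-byte prefix plus the location of the full value.
 */
function encodeView(value: Uint8Array, bufferIndex: number, offset: number): Uint8Array {
    const view = new Uint8Array(16); // zero-filled, so unused bytes are padding
    const dv = new DataView(view.buffer);
    dv.setInt32(0, value.length, true); // little-endian size
    if (value.length <= 12) {
        view.set(value, 4); // inline: value follows the size field
    } else {
        view.set(value.subarray(0, 4), 4);  // first 4 bytes as prefix
        dv.setInt32(8, bufferIndex, true);  // which variadic buffer
        dv.setInt32(12, offset, true);      // where the value starts in it
    }
    return view;
}
```

A 12-byte value still fits inline, which is why the boundary cases listed above are worth testing on both sides of 12.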
ESLint rule jest/prefer-to-have-length requires using toHaveLength() instead of toBe() for length checks.
Use reduce instead of explicit loops for variadicBuffers byteLength calculation, consistent with changes in Data class.
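As an illustration of the reduce-based calculation, a sketch could look like this. `variadicByteLength` is a hypothetical name; the actual code in the PR may be structured differently.

```typescript
/** Hypothetical helper: total byte length across all variadic data buffers. */
function variadicByteLength(buffers: readonly Uint8Array[]): number {
    // Start from 0 so an empty buffer list yields 0 rather than throwing.
    return buffers.reduce((total, buffer) => total + buffer.byteLength, 0);
}
```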
- Add ListView and LargeListView type classes with child field support
- Add type guard methods isListView and isLargeListView
- Add visitor support in typeassembler and typector
- Add Data interfaces for ListView with offsets and sizes buffers
- Add makeData overloads for ListView and LargeListView
- Update DataProps union type to include ListView types

ListView and LargeListView use offset+size buffers instead of consecutive offsets, allowing out-of-order writes and value sharing.
- Add ListView and LargeListView type classes to src/type.ts
- Add visitor support in src/visitor.ts (inferDType and getVisitFnByTypeId)
- Add visitor support in src/visitor/typector.ts and typeassembler.ts
- Add DataProps interfaces for ListView/LargeListView in src/data.ts
- Implement MakeDataVisitor methods for ListView/LargeListView
- Implement GetVisitor methods for ListView/LargeListView in src/visitor/get.ts
- Add comprehensive test suite in test/unit/ipc/list-view-tests.ts
  - Tests in-order and out-of-order offsets
  - Tests value sharing between list elements
  - Tests null handling and empty lists
  - Tests LargeListView with BigInt64Array offsets
  - Tests type properties

ListView and LargeListView are Arrow 1.4 variable-size list types that use offset+size buffers instead of consecutive offsets, enabling out-of-order writes and value sharing.
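To make the offset+size semantics concrete, here is a small sketch of element lookup. `listViewGet` is a hypothetical function operating on plain arrays rather than the GetVisitor from this PR, and it ignores the validity bitmap.

```typescript
/**
 * Hypothetical lookup: element `index` of a ListView is the slice of the
 * child values starting at offsets[index] with length sizes[index].
 */
function listViewGet<T>(
    values: readonly T[],
    offsets: Int32Array,
    sizes: Int32Array,
    index: number
): T[] {
    const offset = offsets[index];
    const size = sizes[index];
    if (offset === undefined || size === undefined) {
        throw new RangeError(`index ${index} is out of bounds`);
    }
    return values.slice(offset, offset + size);
}

// Offsets need not be ascending, and two elements may share child values.
const childValues = [1, 2, 3, 4, 5];
const listOffsets = Int32Array.of(3, 0, 0);
const listSizes = Int32Array.of(2, 3, 3);
const firstList = listViewGet(childValues, listOffsets, listSizes, 0);  // [4, 5]
const secondList = listViewGet(childValues, listOffsets, listSizes, 1); // [1, 2, 3]
const thirdList = listViewGet(childValues, listOffsets, listSizes, 2);  // [1, 2, 3], shared with secondList
```

A regular List would need the offsets to be consecutive and non-decreasing, so neither the out-of-order first element nor the shared range would be representable.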
Add type 25 (ListView) and 26 (LargeListView) to the Type enum.
Implements builders for ListView and LargeListView types:
- ListViewBuilder: Uses Int32Array for offsets and sizes
- LargeListViewBuilder: Uses BigInt64Array for offsets and sizes

Key implementation details:
- Both builders extend Builder directly (not VariableWidthBuilder)
- Use DataBufferBuilder for independent offset and size buffers
- Override flush() to pass both valueOffsets and sizes to makeData
- Properly handle null values and empty lists

Includes comprehensive test suite with 11 passing tests:
- Basic value appending
- Null handling
- Empty lists
- Multiple flushes
- Varying list sizes
- BigInt offset verification

This is part of the stacked PR strategy for view types support.
ESLint rule jest/prefer-to-have-length requires using toHaveLength() instead of toBe() for length checks.
bf86b5b to 7e89388
Add patch file to remove .skip_tester('JS') for BinaryView tests
and modify CI workflow to apply the patch before running Archery.
This enables the official Apache Arrow integration tests to validate
BinaryView and Utf8View support in arrow-js.
Fixes RAT (Release Audit Tool) license check failure.
The integration tests require JSON format support for cross-implementation validation. This adds recognition of 'binaryview' and 'utf8view' type names in the JSON type parser. Fixes integration test failures where arrow-js couldn't parse BinaryView/Utf8View types from JSON schema definitions.
The JSONVectorLoader needs to read variadic buffers from JSON format to support BinaryView and Utf8View types in integration tests. This method reads hex-encoded variadic buffer data from JSON sources and converts it to Uint8Array buffers.
This commit implements complete JSON integration test support for BinaryView and Utf8View types by adding handling for variadic data buffers.

Changes:
- Updated buffersFromJSON() to handle VIEWS and VARIADIC_DATA_BUFFERS fields
- Added variadicBufferCountsFromJSON() using reduce pattern to extract counts
- Updated recordBatchFromJSON() to pass variadicBufferCounts to RecordBatch
- Updated JSONVectorLoader constructor to accept and pass variadicBufferCounts
- Updated RecordBatchJSONReaderImpl to pass variadicBufferCounts to loader
Implements viewDataFromJSON() to convert JSON view objects into 16-byte view
structs required by the Arrow view format.
The JSON VIEWS field contains objects with structure:
- Inline views (≤12 bytes): {SIZE, INLINED}
- Out-of-line views (>12 bytes): {SIZE, PREFIX_HEX, BUFFER_INDEX, OFFSET}
This function converts these to the 16-byte binary view struct layout:
[size: i32, inlined: 12 bytes] for inline views, or
[size: i32, prefix: 4 bytes, buffer_index: i32, offset: i32] for out-of-line views
Changes:
- Added viewDataFromJSON() helper function
- Updated JSONVectorLoader.readData() to handle BinaryView and Utf8View types
- Properly constructs 16-byte view structs from JSON representation
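A rough sketch of this conversion is shown below. It is not the PR's viewDataFromJSON(): `viewFromJSON`, `hexToBytes`, the `JSONView` type, and the `isBinary` flag are all hypothetical. It assumes INLINED is hex for BinaryView and a plain string for Utf8View, and it writes each struct as a 4-byte size followed by either 12 inlined bytes or a 4-byte prefix, buffer index, and offset.

```typescript
// Hypothetical shape of one entry in the JSON VIEWS array.
type JSONView =
    | { SIZE: number; INLINED: string }
    | { SIZE: number; PREFIX_HEX: string; BUFFER_INDEX: number; OFFSET: number };

/** Hypothetical helper: decode a hex string such as "48656C6C6F" into bytes. */
function hexToBytes(hex: string): Uint8Array {
    const bytes = new Uint8Array(hex.length >> 1);
    for (let i = 0; i < bytes.length; i++) {
        bytes[i] = Number.parseInt(hex.slice(i * 2, i * 2 + 2), 16);
    }
    return bytes;
}

/** Hypothetical helper: convert one JSON view object to a 16-byte view struct. */
function viewFromJSON(view: JSONView, isBinary: boolean): Uint8Array {
    const struct = new Uint8Array(16);
    const dv = new DataView(struct.buffer);
    dv.setInt32(0, view.SIZE, true); // little-endian size
    if ('INLINED' in view) {
        // Inline: hex for BinaryView, UTF-8 text for Utf8View (assumed here).
        const inlined = isBinary
            ? hexToBytes(view.INLINED)
            : new TextEncoder().encode(view.INLINED);
        struct.set(inlined, 4);
    } else {
        struct.set(hexToBytes(view.PREFIX_HEX), 4); // 4-byte prefix
        dv.setInt32(8, view.BUFFER_INDEX, true);
        dv.setInt32(12, view.OFFSET, true);
    }
    return struct;
}
```

Under these assumptions, `viewFromJSON({ SIZE: 5, INLINED: "48656C6C6F" }, true)` and `viewFromJSON({ SIZE: 5, INLINED: "Hello" }, false)` produce the same struct.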
…riter)
Implements JSON writing for BinaryView and Utf8View types to enable 'JS producing'
integration tests. This completes the JSON format support for view types.
Implementation:
- Added visitBinaryView() and visitUtf8View() methods to JSONVectorAssembler
- Implemented viewDataToJSON() helper that converts 16-byte view structs to JSON
- Handles both inline (≤12 bytes) and out-of-line (>12 bytes) views
- Properly maps variadic buffer indices and converts buffers to hex strings
JSON output format matches Apache Arrow spec:
- Inline views: {SIZE, INLINED} where INLINED is hex (BinaryView) or string (Utf8View)
- Out-of-line views: {SIZE, PREFIX_HEX, BUFFER_INDEX, OFFSET}
- VARIADIC_DATA_BUFFERS array contains hex-encoded buffer data
This enables the complete roundtrip:
Builder → Data → JSON → IPC → validation
…into feat/binary-utf8-view
- Add SIZE buffer parsing in buffersFromJSON for ListView/LargeListView
- Implement visitListView and visitLargeListView in JSONVectorAssembler
- Fix sizes property access (use data.values instead of data.sizes)
- Merge feat/list-view-builders branch for complete integration test support

This enables ListView/LargeListView to work with Arrow JSON format for integration testing, similar to the BinaryView/Utf8View implementation.
This fixes integration test failures for BinaryView and Utf8View types.

Changes:
- Fix JSONTypeAssembler to serialize BinaryView/Utf8View type metadata
- Fix JSONMessageReader to include VIEWS and VARIADIC_DATA_BUFFERS in sources
- Fix viewDataFromJSON to handle both hex (BinaryView) and UTF-8 (Utf8View) INLINED formats
- Fix readVariadicBuffers to handle individual hex strings correctly
- Fix lint error: use String.fromCodePoint() instead of String.fromCharCode()
- Fix lint error: use for-of loop instead of traditional for loop
- Add comprehensive unit tests for JSON round-trip serialization

Root cause: The JSON format uses different representations for inline data:
- BinaryView INLINED: hex string (e.g., "48656C6C6F")
- Utf8View INLINED: UTF-8 string (e.g., "Hello")

The reader now auto-detects the format and handles both correctly.

Fixes apache#320 integration test failures
- Extract hexStringToBytes() helper function to reduce code duplication
- Update readVariadicBuffers() to use helper instead of wrapping in array
- Update binaryDataFromJSON() to use helper for cleaner implementation
- Add comprehensive documentation explaining design matches C++ reference
- Document why 'as unknown as string' cast is necessary for heterogeneous sources array
- Reference Arrow C++ implementation in comments for architectural clarity
When reading BinaryView/Utf8View data, ensure the DataView length doesn't exceed available buffer bounds. This fixes 'Invalid DataView length 16' errors that occur when the underlying buffer has less than 16 bytes available at the offset position. Fixes test failures in ES5 UMD build where view data deserialization was failing with RangeError.
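One way to express such a bounds check is sketched below. `viewDataViewAt` is a hypothetical helper, and the clamping strategy is an assumption: the commit message does not say how the fix is implemented.

```typescript
/**
 * Hypothetical helper: create a DataView over the view struct at `index`,
 * never extending past the end of the `views` array.
 */
function viewDataViewAt(views: Uint8Array, index: number): DataView {
    const start = index * 16;
    if (start > views.byteLength) {
        throw new RangeError(`view ${index} starts past the end of the buffer`);
    }
    // Clamp to the bytes that are actually available at this position.
    const length = Math.min(16, views.byteLength - start);
    return new DataView(views.buffer, views.byteOffset + start, length);
}
```

Callers would still need to check `byteLength` before reading fields that may fall outside a truncated struct.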
Fixed critical bugs preventing BinaryView and Utf8View types from working correctly in ES5 UMD builds due to Google Closure Compiler advanced optimizations.

Key fixes:
1. **Property access in data.ts** (visitBinaryView/visitUtf8View):
   - Changed from bracket notation (props['views']) to dot notation (props.views)
   - Closure Compiler was mangling property names when accessed via brackets
   - Dot notation allows consistent property renaming throughout compilation
2. **Property access in vectorloader.ts** (viewDataFromJSON):
   - Changed JSON property access from dot to bracket notation
   - Properties like SIZE, INLINED, PREFIX_HEX, etc. come from JSON
   - Must use bracket notation to access raw string keys from JSON
3. **Builder flush method** (BinaryViewBuilder/Utf8ViewBuilder):
   - Added this.clear() call at end of flush() to reset builder state
   - Matches pattern used by other builders (e.g., VariableWidthBuilder)
   - Fixes issue where multiple flush calls would accumulate length
4. **Buffer resize strategy**:
   - Changed from subarray() to slice() in resizeArray function
   - Creates copy instead of view to prevent issues with buffer reuse
   - Ensures flushed buffers are independent of builder state

Results:
- ✅ Builder pattern works correctly
- ✅ vectorFromArray creates proper BinaryView/Utf8View vectors
- ✅ JSON serialization/deserialization round-trips successfully
- ✅ Multiple flush cycles work correctly

Remaining test failures:
- 2 integration tests fail only in ES5 UMD gulp tests
- These tests call makeData() directly from test code with object literals
- Property names get mangled differently between test code and library code
- Same tests PASS with jest (no Closure Compiler involved)
- All real-world usage patterns work correctly
The integration tests were calling makeData() directly from test code, which is incompatible with Google Closure Compiler's property name mangling in UMD builds. Changed tests to use vectorFromArray() which keeps all code within the same compilation unit. All unit tests now pass in all targets (ES5, ES2015, ESNext) and all module formats (CJS, ESM, UMD). Integration tests verified locally with archery and pass successfully.
…y access notation
- Revert buffer.ts resizeArray() to use subarray() instead of slice() for performance
- Fix data.ts visitUtf8View and visitBinaryView to use dot notation in destructuring for Closure Compiler compatibility
…y access notation
- Revert buffer.ts resizeArray() to use subarray() instead of slice() for performance
- Fix data.ts visitUtf8View and visitBinaryView to use dot notation in destructuring for Closure Compiler compatibility
- Finishes implementation for Utf8View and BinaryView across JSON read/write paths
- Patches bugs discovered from previous commits
- Ensures property access is UMD friendly
- Removes ad hoc tests and incorporates new types into existing test infrastructure

All passing locally:
1. Lint checks
2. Builds across all targets
3. All unit tests against all targets
4. All bundle tests
5. Integration tests
- Adds JSON and IPC support for read/write
- Integrates changes into existing test framework
What's Changed
This PR adds read support for ListView and LargeListView types (Arrow format 1.4.0+), which provide variable-length list semantics with explicit size tracking for improved slicing performance.
Implementation Details
Core Type Support
Visitor Pattern
What Works
Testing
Builds on #320