iceberg: serialize all data_file fields in manifests#29680
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request enhances Iceberg manifest serialization to include all data_file fields from the v2 specification, ensuring full compatibility with manifests produced by Spark and other Iceberg engines. The PR introduces a reusable Avro comparison utility and updates test infrastructure to verify correct roundtrip serialization of all fields.
Changes:
- Added comprehensive serialization support for all data_file fields including lower_bounds, upper_bounds, key_metadata, split_offsets, equality_ids, sort_order_id, and referenced_data_file
- Introduced a reusable avro_comparator test utility with support for schema evolution (subset matching)
- Updated field types from size_t to int64_t to match Iceberg specification
- Enhanced test coverage with Spark-generated manifest data and improved test data generation
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/v/serde/avro/tests/avro_comparator.h | New utility for deep comparison of Avro GenericDatum trees with schema evolution support |
| src/v/serde/avro/tests/avro_comparator_test.cc | Test suite for the avro_comparator utility |
| src/v/serde/avro/tests/parser_test.cc | Refactored to use avro_comparator, removing duplicate comparison logic |
| src/v/iceberg/manifest_entry.h | Updated data_file struct to use optional fields with int64_t values |
| src/v/iceberg/manifest_entry.cc | Updated copy methods to handle new optional fields |
| src/v/iceberg/manifest_entry_values.cc | Implemented serialization/deserialization for all data_file fields |
| src/v/iceberg/manifest_entry_type.cc | Added referenced_data_file field to schema |
| src/v/iceberg/avroschemas/manifest_entry.schema.json | Added referenced_data_file field definition |
| src/v/iceberg/tests/manifest_serialization_test.cc | Enhanced tests to verify null/non-null field handling and Spark compatibility |
| src/v/iceberg/tests/gen_test_iceberg_manifest.py | Updated to use uv with PEP 723 metadata and generate test data with varied field values |
| src/v/iceberg/tests/testdata/* | Updated test data files and README with new generation instructions |
| src/v/serde/avro/tests/BUILD | Added avro_comparator library and test targets |
| src/v/iceberg/tests/BUILD | Added Spark manifest test data dependency |
894190a to
d5619d1
Compare
d5619d1 to
825bcb2
Compare
Extract the GenericDatum comparison logic from parser_test into a reusable avro_comparator.h with unit tests. - Fix copy-paste bugs in AVRO_ENUM and AVRO_FIXED that compared expected against itself instead of actual. - Check union branch index before comparing inner values; the old AVRO_UNION switch arm was dead code since type() unwraps unions. - Path-based error messages for easier debugging of nested mismatches. - Named test parameters for schema-based roundtrip tests.
Add a GenericDatum-based Avro comparison path for manifest serialization tests. This catches field loss that parse/serialize roundtrips can hide. Wire manifest tests to the shared avro comparator helper and emit mismatch diagnostics from the roundtrip loop for faster debugging.
6422e5b to
4c00355
Compare
Make count maps (column_sizes, value_counts, null_value_counts, nan_value_counts) optional in data_file to preserve the distinction between null and empty map through Avro roundtrip. Add lower_bounds, upper_bounds, key_metadata, split_offsets, equality_ids, and sort_order_id to data_file and implement their serialization/deserialization. Update test data to cover both null and empty map cases.
Add the referenced_data_file field (field-id 143) to the data_file schema, struct, and serde to match the full Iceberg v2 spec. Add a roundtrip test using a Spark 3.5.5 / Iceberg 1.8.1 generated manifest. Make the avro comparator's extra-null-fields tolerance opt-in via an allow_extra_null_fields flag for schema evolution scenarios where the writer has more fields than the original data.
4c00355 to
d455088
Compare
oleiman
left a comment
There was a problem hiding this comment.
looking good on first pass, some nits and a question about a field that got dropped from the TODO
| size_t record_count; | ||
| size_t file_size_bytes; |
There was a problem hiding this comment.
should these also be int64 as well? I think they will get assigned to one in the snapshot
There was a problem hiding this comment.
Yes. Began changing them but then didn't want to get too distracted from the main task. To be done.
| if (fs.size() < 16) { | ||
| throw std::invalid_argument("Expected more values"); |
There was a problem hiding this comment.
nitpick: is it worth naming this constant and/or sticking it in the exception content? not sure the path this exception takes or whether it's at all actionable 🤷
| for (size_t i = 0; i < expected_map.size(); ++i) { | ||
| if (expected_map[i].first != actual_map[i].first) { |
There was a problem hiding this comment.
micro-nitpick: I think these are populated by iterating unordered maps, so maybe we should compare them in an order independent way.
There was a problem hiding this comment.
Added as a separate commit.
Should have been earlier in the commit chain but some of these rebases take too long...
oleiman
left a comment
There was a problem hiding this comment.
"missing" field is deprecated
Avro maps are stored as ordered vectors of key-value pairs, but the Avro specification treats them as unordered. Different serializers (e.g. our writer vs Spark/pyiceberg) may emit the same logical map with keys in different order. Add a by_key_unique map matching mode that compares entries by key lookup instead of position, rejecting duplicate keys. This is now the default since most callers compare logically equivalent maps. The positional mode is retained for the parser roundtrip test where the random data generator can produce duplicate map keys and the test verifies exact binary-level fidelity rather than logical equivalence. Comparison options (extra_fields_policy, map_matching_policy) are grouped into a compare_options struct with sensible defaults.
|
Change since last review round:
|
| inline std:: | ||
| expected<std::unordered_map<std::string, size_t>, ::testing::AssertionResult> |

Serialize all
data_filefields in Iceberg manifest entries. Previously,optional fields like
sort_order_id,split_offsets,equality_ids,and the column size/count maps were stubbed out as nulls. Iceberg
implementations in the wild often do not handle missing optional fields
well, and third-party metadata rewriters that compute these fields would
have their work discarded on the next manifest rewrite by us.
https://redpandadata.atlassian.net/browse/CORE-13459
Backports Required
Release Notes
Improvements
merge_append_actionmanifest rewriting.