iceberg: serialize all data_file fields in manifests by nvartolomei · Pull Request #29680 · redpanda-data/redpanda

nvartolomei · 2026-02-23T19:39:56Z

Serialize all data_file fields in Iceberg manifest entries. Previously,
optional fields like sort_order_id, split_offsets, equality_ids,
and the column size/count maps were stubbed out as nulls. Iceberg
implementations in the wild often do not handle missing optional fields
well, and third-party metadata rewriters that compute these fields would
have their work discarded on the next manifest rewrite by us.

https://redpandadata.atlassian.net/browse/CORE-13459

Backports Required

Release Notes

Improvements

Iceberg manifest serialization now handles all data_file fields from the v2 spec, so optional metadata is no longer lost when merge_append_action rewrites manifests.

Copilot

Pull request overview

This pull request enhances Iceberg manifest serialization to include all data_file fields from the v2 specification, ensuring full compatibility with manifests produced by Spark and other Iceberg engines. The PR introduces a reusable Avro comparison utility and updates test infrastructure to verify correct roundtrip serialization of all fields.

Changes:

Added comprehensive serialization support for all data_file fields including lower_bounds, upper_bounds, key_metadata, split_offsets, equality_ids, sort_order_id, and referenced_data_file
Introduced a reusable avro_comparator test utility with support for schema evolution (subset matching)
Updated field types from size_t to int64_t to match Iceberg specification
Enhanced test coverage with Spark-generated manifest data and improved test data generation

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.

File	Description
src/v/serde/avro/tests/avro_comparator.h	New utility for deep comparison of Avro GenericDatum trees with schema evolution support
src/v/serde/avro/tests/avro_comparator_test.cc	Test suite for the avro_comparator utility
src/v/serde/avro/tests/parser_test.cc	Refactored to use avro_comparator, removing duplicate comparison logic
src/v/iceberg/manifest_entry.h	Updated data_file struct to use optional fields with int64_t values
src/v/iceberg/manifest_entry.cc	Updated copy methods to handle new optional fields
src/v/iceberg/manifest_entry_values.cc	Implemented serialization/deserialization for all data_file fields
src/v/iceberg/manifest_entry_type.cc	Added referenced_data_file field to schema
src/v/iceberg/avroschemas/manifest_entry.schema.json	Added referenced_data_file field definition
src/v/iceberg/tests/manifest_serialization_test.cc	Enhanced tests to verify null/non-null field handling and Spark compatibility
src/v/iceberg/tests/gen_test_iceberg_manifest.py	Updated to use uv with PEP 723 metadata and generate test data with varied field values
src/v/iceberg/tests/testdata/*	Updated test data files and README with new generation instructions
src/v/serde/avro/tests/BUILD	Added avro_comparator library and test targets
src/v/iceberg/tests/BUILD	Added Spark manifest test data dependency

vbotbuildovich · 2026-02-23T21:25:18Z

CI test results

test results on build#80934

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ShadowLinkTopicFailoverTests	test_link_topic_failover	{"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": true}	integration	https://buildkite.com/redpanda/redpanda/builds/80934#019c8c1d-c6a2-4f3f-b4c7-7f2c88a127df	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0014, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkTopicFailoverTests&test_method=test_link_topic_failover
ControllerLogLimitMirrorMakerTests	test_mirror_maker_with_limits	null	integration	https://buildkite.com/redpanda/redpanda/builds/80934#019c8c1d-c69e-40b5-96f4-dc0cef3934a7	FLAKY	28/31	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0439, p0=0.3817, reject_threshold=0.0100. adj_baseline=0.1260, p1=0.2527, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits
NodesDecommissioningTest	test_decommission_status	null	integration	https://buildkite.com/redpanda/redpanda/builds/80934#019c8c1d-c6a2-4f3f-b4c7-7f2c88a127df	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0485, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1387, p1=0.2247, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommission_status
RpkRedpandaStartTest	test_rpc_tls_start	null	integration	https://buildkite.com/redpanda/redpanda/builds/80934#019c8c1d-c6a1-40aa-b9e3-8a969d47f745	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0080, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkRedpandaStartTest&test_method=test_rpc_tls_start

test results on build#80986

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": true}	integration	https://buildkite.com/redpanda/redpanda/builds/80986#019c8fd5-3602-4c2a-ad8f-5e37b3d659e2	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0183, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
ScalingUpTest	test_fast_node_addition	null	integration	https://buildkite.com/redpanda/redpanda/builds/80986#019c8fd7-2a29-4d96-8a4c-34c26f9ddd00	FLAKY	28/31	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0180, p0=0.1009, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.4114, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_fast_node_addition

test results on build#81047

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ControllerLogLimitMirrorMakerTests	test_mirror_maker_with_limits	null	integration	https://buildkite.com/redpanda/redpanda/builds/81047#019c9547-b1fd-4e22-a064-d624c454031f	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0433, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1243, p1=0.2651, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits

Extract the GenericDatum comparison logic from parser_test into a reusable avro_comparator.h with unit tests.

- Fix copy-paste bugs in AVRO_ENUM and AVRO_FIXED that compared expected against itself instead of actual.
- Check union branch index before comparing inner values; the old AVRO_UNION switch arm was dead code since type() unwraps unions.
- Path-based error messages for easier debugging of nested mismatches.
- Named test parameters for schema-based roundtrip tests.
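The union-branch check and path-based messages described in this commit can be illustrated with a toy comparator. This is a minimal sketch, not the code from avro_comparator.h: `datum`, `field`, and `compare` are hypothetical names, and the toy tree stands in for avro::GenericDatum so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct field;

// Toy stand-in for an Avro datum; this is NOT avro::GenericDatum.
struct datum {
    std::size_t union_branch = 0; // selected branch when the schema is a union
    int64_t value = 0;            // leaf payload
    std::vector<field> fields;    // children when the datum is a record
};

struct field {
    std::string name;
    datum value;
};

// Returns "" on match, otherwise a message prefixed with the path to the
// first mismatch found.
inline std::string compare(
  const datum& expected, const datum& actual, const std::string& path = "$") {
    // Compare the union branch before the inner value, so a null-vs-long
    // difference is reported as a branch mismatch rather than a value one.
    if (expected.union_branch != actual.union_branch) {
        return path + ": union branch " + std::to_string(expected.union_branch)
               + " != " + std::to_string(actual.union_branch);
    }
    if (expected.value != actual.value) {
        return path + ": value " + std::to_string(expected.value) + " != "
               + std::to_string(actual.value);
    }
    if (expected.fields.size() != actual.fields.size()) {
        return path + ": field count differs";
    }
    for (std::size_t i = 0; i < expected.fields.size(); ++i) {
        const field& e = expected.fields[i];
        const field& a = actual.fields[i];
        if (e.name != a.name) {
            return path + ": field name " + e.name + " != " + a.name;
        }
        // Extend the path on the way down so nested mismatches are locatable.
        std::string err = compare(e.value, a.value, path + "." + e.name);
        if (!err.empty()) {
            return err;
        }
    }
    return "";
}
```

With a tree like this, a mismatch deep inside a record is reported with its full path (for example `$.data_file.sort_order_id`) instead of a bare "values differ".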

Add a GenericDatum-based Avro comparison path for manifest serialization tests. This catches field loss that parse/serialize roundtrips can hide. Wire manifest tests to the shared avro comparator helper and emit mismatch diagnostics from the roundtrip loop for faster debugging.

Make count maps (column_sizes, value_counts, null_value_counts, nan_value_counts) optional in data_file to preserve the distinction between null and empty map through Avro roundtrip. Add lower_bounds, upper_bounds, key_metadata, split_offsets, equality_ids, and sort_order_id to data_file and implement their serialization/deserialization. Update test data to cover both null and empty map cases.
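The null-versus-empty distinction this commit preserves could be modeled roughly as follows. This is a sketch under assumptions: `count_map`, `data_file_counts`, and `describe` are hypothetical names for illustration, and the real data_file struct and its Avro encoding are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical alias: field id -> count. Not the PR's actual type.
using count_map = std::map<int32_t, int64_t>;

// Hypothetical stand-in for a slice of data_file.
struct data_file_counts {
    // nullopt models an Avro null; an engaged but empty map models {}.
    std::optional<count_map> value_counts;
};

// Report which of the three states the field is in.
inline std::string describe(const std::optional<count_map>& m) {
    if (!m.has_value()) {
        return "null";
    }
    if (m->empty()) {
        return "empty";
    }
    return "entries=" + std::to_string(m->size());
}
```

A plain (non-optional) map would collapse the first two states into one, which is the information loss the commit message is concerned with.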

Add the referenced_data_file field (field-id 143) to the data_file schema, struct, and serde to match the full Iceberg v2 spec. Add a roundtrip test using a Spark 3.5.5 / Iceberg 1.8.1 generated manifest. Make the avro comparator's extra-null-fields tolerance opt-in via an allow_extra_null_fields flag for schema evolution scenarios where the writer has more fields than the original data.
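One way to picture the opt-in tolerance for extra null fields is the toy comparison below. It is only a sketch: `record`, `compare_opts`, and `records_match` are hypothetical, and a flat name-to-nullable-value map stands in for the real Avro datum tree.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Toy record: field name -> nullable value. Not the real Avro datum type.
using record = std::map<std::string, std::optional<int64_t>>;

struct compare_opts {
    // Hypothetical flag mirroring the opt-in described in the commit message.
    bool allow_extra_null_fields = false;
};

// True if every expected field matches, and any field present only in
// `actual` is tolerated solely when the flag is set and its value is null.
inline bool records_match(
  const record& expected, const record& actual, compare_opts opts = {}) {
    for (const auto& [name, value] : expected) {
        auto it = actual.find(name);
        if (it == actual.end() || it->second != value) {
            return false;
        }
    }
    for (const auto& [name, value] : actual) {
        if (expected.count(name) != 0) {
            continue;
        }
        if (!opts.allow_extra_null_fields || value.has_value()) {
            return false;
        }
    }
    return true;
}
```

Keeping the flag off by default means a writer that silently adds a populated field still fails the comparison; only fields that are both new and null are ignored, and only when the caller asks for it.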

oleiman

looking good on first pass, some nits and a question about a field that got dropped from the TODO

oleiman · 2026-02-24T19:34:52Z

    size_t record_count;
    size_t file_size_bytes;


should these be int64 as well? I think they will get assigned to one in the snapshot

Yes. Began changing them but then didn't want to get too distracted from the main task. To be done.

oleiman · 2026-02-24T19:37:07Z

+    if (fs.size() < 16) {
        throw std::invalid_argument("Expected more values");


nitpick: is it worth naming this constant and/or sticking it in the exception content? not sure the path this exception takes or whether it's at all actionable 🤷
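For illustration, the suggestion could look something like the sketch below. The constant name `min_data_file_fields` and the helper `check_field_count` are hypothetical; only the threshold of 16 and the exception type come from the snippet under review.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical name; 16 is the threshold from the snippet under review.
constexpr std::size_t min_data_file_fields = 16;

// Throws if fewer fields are present than expected, and reports both the
// expected minimum and the actual count in the message.
template<typename T>
void check_field_count(const std::vector<T>& fs) {
    if (fs.size() < min_data_file_fields) {
        throw std::invalid_argument(
          "Expected at least " + std::to_string(min_data_file_fields)
          + " data_file fields, got " + std::to_string(fs.size()));
    }
}
```

Whether the extra detail is actionable depends on where the exception surfaces, which is the open question in the comment above.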

oleiman · 2026-02-24T19:48:11Z

+        for (size_t i = 0; i < expected_map.size(); ++i) {
+            if (expected_map[i].first != actual_map[i].first) {


micro-nitpick: I think these are populated by iterating unordered maps, so maybe we should compare them in an order independent way.

Added as a separate commit.

Should have been earlier in the commit chain but some of these rebases take too long...

d14bd40

andrwng

Awesome work!

oleiman

"missing" field is deprecated

Avro maps are stored as ordered vectors of key-value pairs, but the Avro specification treats them as unordered. Different serializers (e.g. our writer vs Spark/pyiceberg) may emit the same logical map with keys in different order. Add a by_key_unique map matching mode that compares entries by key lookup instead of position, rejecting duplicate keys. This is now the default since most callers compare logically equivalent maps. The positional mode is retained for the parser roundtrip test where the random data generator can produce duplicate map keys and the test verifies exact binary-level fidelity rather than logical equivalence. Comparison options (extra_fields_policy, map_matching_policy) are grouped into a compare_options struct with sensible defaults.
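A by-key comparison of this kind might be sketched as follows. The names `avro_map`, `index_unique`, and `compare_by_key` are made up for the example, values are simplified to int64_t, and the string return type replaces whatever assertion type the real helper uses.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// An Avro map as decoded: an ordered list of key/value pairs.
using avro_map = std::vector<std::pair<std::string, int64_t>>;

// Index entries by key; returns nullopt if any key repeats.
inline std::optional<std::unordered_map<std::string, int64_t>>
index_unique(const avro_map& m) {
    std::unordered_map<std::string, int64_t> out;
    for (const auto& [k, v] : m) {
        if (!out.emplace(k, v).second) {
            return std::nullopt;
        }
    }
    return out;
}

// Order-independent comparison. Returns "" on match, otherwise a short
// description of a problem that was found.
inline std::string
compare_by_key(const avro_map& expected, const avro_map& actual) {
    auto e = index_unique(expected);
    auto a = index_unique(actual);
    if (!e) {
        return "duplicate key in expected";
    }
    if (!a) {
        return "duplicate key in actual";
    }
    if (e->size() != a->size()) {
        return "size mismatch";
    }
    for (const auto& [k, v] : *e) {
        auto it = a->find(k);
        if (it == a->end()) {
            return "missing key: " + k;
        }
        if (it->second != v) {
            return "value mismatch at key: " + k;
        }
    }
    return "";
}
```

Rejecting duplicates up front is what makes key lookup well defined; inputs that may legitimately contain duplicate keys, such as the random data in the parser roundtrip test, need the positional mode instead.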

nvartolomei · 2026-02-25T14:34:40Z

Change since last review round:

+++ d14bd40 (testing only)

oleiman

lgtm

oleiman · 2026-02-25T19:06:17Z

+inline std::
+  expected<std::unordered_map<std::string, size_t>, ::testing::AssertionResult>


Copilot AI review requested due to automatic review settings February 23, 2026 19:39

github-actions Bot added area/build area/redpanda labels Feb 23, 2026

Copilot started reviewing on behalf of nvartolomei February 23, 2026 19:41

Copilot AI reviewed Feb 23, 2026

Comment thread src/v/iceberg/manifest_entry_type.cc Outdated

nvartolomei force-pushed the nv/iceberg-manifest-serde branch from 894190a to d5619d1 Compare February 23, 2026 19:54

nvartolomei force-pushed the nv/iceberg-manifest-serde branch from d5619d1 to 825bcb2 Compare February 24, 2026 12:32

nvartolomei marked this pull request as draft February 24, 2026 12:32

nvartolomei added 2 commits February 24, 2026 12:34

nvartolomei force-pushed the nv/iceberg-manifest-serde branch 4 times, most recently from 6422e5b to 4c00355 Compare February 24, 2026 13:06

nvartolomei marked this pull request as ready for review February 24, 2026 13:06

nvartolomei requested a review from Copilot February 24, 2026 13:06

Copilot started reviewing on behalf of nvartolomei February 24, 2026 13:06

Copilot AI reviewed Feb 24, 2026

Comment thread src/v/serde/avro/tests/parser_test.cc

nvartolomei added 2 commits February 24, 2026 13:16

nvartolomei force-pushed the nv/iceberg-manifest-serde branch from 4c00355 to d455088 Compare February 24, 2026 13:16

nvartolomei requested review from andrwng, mmaslankaprv and oleiman February 24, 2026 13:16

oleiman reviewed Feb 24, 2026

andrwng previously approved these changes Feb 25, 2026

Comment thread src/v/iceberg/manifest_entry.cc

Comment thread src/v/iceberg/manifest_entry.h

oleiman previously approved these changes Feb 25, 2026

nvartolomei dismissed oleiman’s stale review via d14bd40 February 25, 2026 14:33

nvartolomei dismissed andrwng’s stale review via d14bd40 February 25, 2026 14:33

nvartolomei requested review from andrwng and oleiman February 25, 2026 14:33

oleiman approved these changes Feb 25, 2026

andrwng approved these changes Feb 25, 2026

nvartolomei merged commit 67901de into redpanda-data:dev Feb 25, 2026
20 checks passed

nvartolomei deleted the nv/iceberg-manifest-serde branch February 26, 2026 11:33

5 participants