Commit b450bf7
Implementation[OpenhouseCommitEventTablePartitions]: Add partition-level commit event collection and publishing in TableStatsCollectionSparkApp (#402)
## Summary
I extended the existing TableStatsCollectionSparkApp to implement the
logic for populating the openhouseTableCommitEventsPartitions table.
This new table will serve as the partition-level source of truth for
commit-related metadata across all OpenHouse datasets, including:
1. Commit ID (snapshot_id)
2. Commit timestamp (committed_at)
3. Commit operation (APPEND, DELETE, OVERWRITE, REPLACE)
4. Partition data (typed column values for all partition columns)
5. Spark App ID and Spark App Name
6. Table identifier (database, table, cluster, location, partition spec)
This enables granular tracking of which partitions were affected by each
commit, providing:
1. Partition-level lineage - Track exactly which partitions changed in
each commit
2. Fine-grained auditing - Monitor data changes at partition granularity
3. Optimized queries - Query only relevant partitions for specific time
ranges
4. Incremental processing - Identify changed partitions for downstream
pipelines
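The per-event fields listed above can be sketched as a plain value class. This is illustrative only: the field names mirror the list, but the actual event schema in the PR may group them differently (e.g. into nested dataset/commitMetadata objects, as the log output below suggests).

```java
// Hypothetical sketch of a partition-level commit event record; the field
// names mirror the list above but are NOT the real schema from the PR.
public class CommitEventSketch {
  public final long commitId;          // snapshot_id
  public final long commitTimestampMs; // committed_at
  public final String commitOperation; // APPEND, DELETE, OVERWRITE, REPLACE
  public final String partitionData;   // e.g. "event_time_day=2024-01-01"
  public final String commitAppId;
  public final String commitAppName;
  public final String tableIdentifier; // database.table plus cluster/location

  public CommitEventSketch(
      long commitId, long commitTimestampMs, String commitOperation,
      String partitionData, String commitAppId, String commitAppName,
      String tableIdentifier) {
    this.commitId = commitId;
    this.commitTimestampMs = commitTimestampMs;
    this.commitOperation = commitOperation;
    this.partitionData = partitionData;
    this.commitAppId = commitAppId;
    this.commitAppName = commitAppName;
    this.tableIdentifier = tableIdentifier;
  }
}
```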
## Output
This PR populates the openhouseTableCommitEventsPartitions table by
querying the Iceberg all_entries and snapshots metadata tables for all
OpenHouse datasets.
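A query of roughly this shape joins the two metadata tables to get one row per (snapshot, partition) pair. This is a hedged sketch of the kind of SQL involved, expressed as a string builder so it is self-contained; the actual query in the app may select different columns and apply different filters.

```java
// Illustrative only: builds the kind of SQL one might run against the
// Iceberg all_entries and snapshots metadata tables. The real query in
// TableStatsCollectionSparkApp may differ.
public class CommitEventQuerySketch {
  public static String partitionCommitEventsSql(String fqtn) {
    return "SELECT e.data_file.partition, s.snapshot_id, s.committed_at, "
        + "s.operation, s.summary['spark.app.id'] AS commit_app_id "
        + "FROM " + fqtn + ".all_entries e "
        + "JOIN " + fqtn + ".snapshots s ON e.snapshot_id = s.snapshot_id";
  }
}
```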
**End-to-End Verification (Docker)**
1. publishCommitEvents Log Output
```
25/12/01 12:11:22 INFO spark.TableStatsCollectionSparkApp: Publishing commit events for table:testdb.partition_stats_test
25/12/01 12:11:22 INFO spark.TableStatsCollectionSparkApp: [{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-6f34c382-ced3-4834-813b-b40acf74b0c4","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":9208708835032390256,"commitTimestampMs":1764590975624,"commitAppId":"local-1764590946735","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1764591082431},{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-6f34c382-ced3-4834-813b-b40acf74b0c4","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5567407446452786456,"commitTimestampMs":1764590978518,"commitAppId":"local-1764590946735","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1764591082431},{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-6f34c382-ced3-4834-813b-b40acf74b0c4","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":1894327931030053191,"commitTimestampMs":1764590980739,"commitAppId":"local-1764590946735","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1764591082431}]
```
Key Points:
- ✅ All 3 commit events published successfully
- ✅ commitAppId: "local-1764590946735" (populated)
- ✅ commitAppName: "Spark shell" (populated)
- ✅ commitOperation: "APPEND" (properly parsed)
2. publishPartitionEvents Log Output
```
25/12/01 20:45:54 INFO spark.TableStatsCollectionSparkApp: Publishing partition events for table: testdb.partition_stats_test
25/12/01 20:45:54 INFO spark.TableStatsCollectionSparkApp: [{"partitionData":[{"columnName":"event_time_day","value":"2024-01-03"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-9091e45f-fd03-4f7e-9a95-a051a5e5e10f","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":4471081304043344222,"commitTimestampMs":1764621880000,"commitAppId":"local-1764621844757","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1764621954954},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-01"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-9091e45f-fd03-4f7e-9a95-a051a5e5e10f","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5214137394193985715,"commitTimestampMs":1764621875000,"commitAppId":"local-1764621844757","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1764621954954},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-02"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-9091e45f-fd03-4f7e-9a95-a051a5e5e10f","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7033261685039461134,"commitTimestampMs":1764621878000,"commitAppId":"local-1764621844757","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1764621954954}]
```
Key Points:
- ✅ All 3 partition events published successfully
- ✅ partitionData: Contains partition column name and values
(event_time_day: 2024-01-01, 2024-01-02, 2024-01-03)
- ✅ commitAppId: "local-1764621844757" (populated)
- ✅ commitAppName: "Spark shell" (populated)
- ✅ commitOperation: "APPEND" (properly parsed)
- ✅ Each event represents a different partition with correct metadata
3. executeWithTimingAsync (Parallel Execution) Log Output
```
25/12/01 20:45:40 INFO spark.TableStatsCollectionSparkApp: Starting table stats collection for table: testdb.partition_stats_test
25/12/01 20:45:40 INFO spark.TableStatsCollectionSparkApp: Starting commit events collection for table: testdb.partition_stats_test
25/12/01 20:45:40 INFO spark.TableStatsCollectionSparkApp: Starting partition events collection for table: testdb.partition_stats_test
25/12/01 20:45:40 INFO util.TableStatsCollectorUtil: Collecting commit events for table: openhouse.testdb.partition_stats_test (all non-expired snapshots)
25/12/01 20:45:40 INFO util.TableStatsCollectorUtil: Collecting partition-level commit events for table: openhouse.testdb.partition_stats_test
25/12/01 20:45:54 INFO util.TableStatsCollectorUtil: Collected 3 commit events for table: openhouse.testdb.partition_stats_test
25/12/01 20:45:54 INFO spark.TableStatsCollectionSparkApp: Completed commit events collection for table: testdb.partition_stats_test (3 events) in 13518 ms
25/12/01 20:45:54 INFO util.TableStatsCollectorUtil: Collected 3 partition-level commit events for table: openhouse.testdb.partition_stats_test
25/12/01 20:45:54 INFO spark.TableStatsCollectionSparkApp: Completed partition events collection for table: testdb.partition_stats_test (3 partition events) in 14109 ms
25/12/01 20:45:54 INFO spark.TableStatsCollectionSparkApp: Completed table stats collection for table: testdb.partition_stats_test in 14262 ms
25/12/01 20:45:54 INFO spark.TableStatsCollectionSparkApp: Total collection time for table: testdb.partition_stats_test in 14268 ms (parallel execution)
```
Key Points:
- ✅ All three collections started in parallel (stats, commit events,
partition events)
- ✅ Commit events collection: 13.5 seconds (3 events collected)
- ✅ Partition events collection: 14.1 seconds (3 partition events
collected)
- ✅ Table stats collection: 14.3 seconds
- ✅ Total parallel execution time: 14.3 seconds (vs ~41s if sequential)
- ✅ Job completed successfully: state SUCCEEDED
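The parallel pattern above can be sketched with CompletableFuture. The `executeWithTimingAsync` name comes from the log output; its signature here and the placeholder suppliers are assumptions standing in for the real stats/event collectors.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Minimal sketch of running the three collections in parallel and timing
// each one; the suppliers stand in for the real collector calls.
public class ParallelCollectSketch {
  public static <T> CompletableFuture<T> executeWithTimingAsync(
      String label, Supplier<T> task) {
    return CompletableFuture.supplyAsync(() -> {
      long start = System.currentTimeMillis();
      T result = task.get();
      System.out.println(
          label + " took " + (System.currentTimeMillis() - start) + " ms");
      return result;
    });
  }

  public static int runAll() {
    CompletableFuture<Integer> stats =
        executeWithTimingAsync("stats", () -> 1);
    CompletableFuture<Integer> commits =
        executeWithTimingAsync("commit events", () -> 3);
    CompletableFuture<Integer> partitions =
        executeWithTimingAsync("partition events", () -> 3);
    // Block until all three complete; wall time ~= the slowest task,
    // not the sum of all three.
    CompletableFuture.allOf(stats, commits, partitions).join();
    return stats.join() + commits.join() + partitions.join();
  }
}
```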
## Key Features
1. One Row Per (Commit, Partition) Pair
- Creates one CommitEventTablePartitions record for each unique
(snapshot_id, partition) combination
- Example: 1 commit affecting 3 partitions → 3 records
2. Parallel Execution
- Runs simultaneously with table stats and commit events collection
- ~3x faster than sequential execution in the verification run (14.3 s vs ~41 s)
- Uses CompletableFuture for non-blocking parallel processing
3. Type-Safe Partition Data
- Partition values stored as typed ColumnData objects:
- LongColumnData for Integer/Long values (e.g., year=2024)
- DoubleColumnData for Float/Double values
- StringColumnData for String/Date/Timestamp values
- Runtime type detection using instanceof checks
4. Robust Error Handling
- ✅ Unpartitioned tables return an empty list (no errors)
- ✅ Null values are logged and skipped
- ✅ Unknown commit operations are set to null with a warning
- ✅ Invalid partition values are logged and skipped
- ✅ Timestamp conversion handles both seconds and milliseconds
5. Stateless Design
- Processes all active (non-expired) commit-partition pairs on every job run
- No state tracking between runs (matches the existing
openhouseTableCommitEvents behavior)
- The same commit-partition pair can therefore appear in multiple
event_timestamp partitions
- Deduplication is handled at query time in downstream consumers (use
DISTINCT or GROUP BY)
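The runtime type detection behind the typed partition data feature can be sketched like this. The ColumnData class names come from the list above; the dispatch logic and the tag-string return are assumptions made to keep the example self-contained.

```java
// Sketch of classifying a raw partition value into one of the typed
// column-data categories described above (LongColumnData, DoubleColumnData,
// StringColumnData). Returns a tag string instead of constructing the real
// ColumnData objects, which are not defined here.
public class PartitionValueTyper {
  public static String columnDataType(Object value) {
    if (value == null) {
      return "SKIP"; // null values are logged and skipped
    } else if (value instanceof Integer || value instanceof Long) {
      return "LongColumnData";
    } else if (value instanceof Float || value instanceof Double) {
      return "DoubleColumnData";
    } else {
      // Strings, dates, and timestamps fall back to a string representation.
      return "StringColumnData";
    }
  }
}
```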
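The seconds-vs-milliseconds handling noted under Robust Error Handling might look like the heuristic below. The cutoff constant is an assumption for illustration, not taken from the PR.

```java
// Heuristic sketch: commit timestamps are expected in milliseconds, but
// some sources emit seconds. Values below this cutoff (~Sep 2001 if read
// as ms) are treated as seconds and scaled up. The PR's actual rule may
// differ.
public class TimestampNormalizer {
  private static final long MIN_PLAUSIBLE_MS = 1_000_000_000_000L;

  public static long toMillis(long ts) {
    return ts < MIN_PLAUSIBLE_MS ? ts * 1000L : ts;
  }
}
```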
## Changes
- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests
For all the boxes checked, please include additional details of the
changes made in this pull request.
## Testing Done
- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.
For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.
## Additional Information
- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.
For all the boxes checked, include additional details of the changes
made in this pull request.
---------
Co-authored-by: srawat <[email protected]>
File tree: 6 files changed (+882, −3 lines) under:
- apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark
- apps/spark/src/main/java/com/linkedin/openhouse/jobs/util
- apps/spark/src/test/java/com/linkedin/openhouse/jobs/spark
- apps/spark/src/test/java/com/linkedin/openhouse/jobs/util