
Conversation


@Bento007 Bento007 commented Aug 13, 2025

Reason for Change

  • https://czi.atlassian.net/browse/VC-3784
  • When reading chunks of the parquet file back into bgzip, duplicate rows were created.

Changes

  • To avoid the duplicated rows, the bgzip output is now written directly from the source file rather than from the parquet conversion. The original file is still converted to parquet to speed up the validation checks.
  • The complexity of converting to bgzip has been offloaded to common Linux tools (see the sketch after this list).
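The actual commands live in atac_seq.prepare_fragment and are not shown in this PR description; as a rough illustration, a pipeline over those tools might look like the following sketch, assuming gunzip, bgzip, and tabix (htslib) are on PATH (bgzip_fragment is a hypothetical helper, not the real implementation):

import subprocess


def bgzip_fragment(source_tsv_gz: str, out_bgz: str) -> None:
    # Recompress the source fragment file with bgzip (block gzip, required by
    # tabix) directly, without round-tripping through parquet.
    with open(out_bgz, "wb") as out:
        gunzip = subprocess.Popen(["gunzip", "-c", source_tsv_gz], stdout=subprocess.PIPE)
        subprocess.run(["bgzip", "-c"], stdin=gunzip.stdout, stdout=out, check=True)
        gunzip.stdout.close()
        if gunzip.wait() != 0:
            raise RuntimeError("gunzip failed")
    # Index by genomic position so regions (e.g. chr20) can be fetched with tabix.
    subprocess.run(["tabix", "-p", "bed", out_bgz], check=True)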

Testing

Copied an ATAC fragment file with known duplicates:

tabix https://datasets.cellxgene.cziscience.com/48826a24-569b-4fd4-ae3a-d1f1b6c2af89-fragment.tsv.bgz chr20 | gzip > chr20_dup.tsv.gz
tabix https://datasets.cellxgene.cziscience.com/48826a24-569b-4fd4-ae3a-d1f1b6c2af89-fragment.tsv.bgz chr20 | uniq | gzip > chr20.tsv.gz

and then ran this test on it:

import gzip
from collections import Counter
import ibis
from cellxgene_schema import atac_seq


def test_duplication_in_parquet_error(tmpdir):
    original_file = "/Users/trentsmith/workspace/single-cell-curation/chr20.tsv.gz"
    parquet_file = tmpdir / "parquet_fragment"
    atac_seq.convert_to_parquet(original_file, parquet_file)
    t = ibis.read_parquet(f"{parquet_file}/**", hive_partitioning=True)
    row_count = t.count().execute()
    unique_count = t.distinct().count().execute()
    assert row_count == unique_count, "The parquet file has duplicate rows."

    bgzip_file = tmpdir / "fragment.tsv.bgz"
    atac_seq.prepare_fragment(parquet_file, bgzip_file, tempdir=tmpdir, write_algorithm="cli")
    with gzip.open(original_file, 'rt') as f:
        # Counter(f) consumes the file iterator, so derive the line count from
        # the counter rather than re-reading the exhausted file handle.
        original_counter = Counter(f)
    original_duplicates = len([k for k, v in original_counter.items() if v > 1])
    original_count = sum(original_counter.values())
    with gzip.open(bgzip_file, 'rt') as f:
        bgzip_counter = Counter(f)
    bgzip_duplicates = len([k for k, v in bgzip_counter.items() if v > 1])
    bgzip_count = sum(bgzip_counter.values())
    assert original_count == bgzip_count
    assert not original_duplicates
    assert not bgzip_duplicates

  • Because of the size of the data required, this is not being added as a unit test.
  • Tests have been updated to reflect the changes.

Notes for Reviewer


codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 86.36364% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.94%. Comparing base (eb6765d) to head (e166318).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1424      +/-   ##
==========================================
- Coverage   89.17%   88.94%   -0.23%     
==========================================
  Files          23       23              
  Lines        2587     2561      -26     
==========================================
- Hits         2307     2278      -29     
- Misses        280      283       +3     
Components                      Coverage Δ
cellxgene_schema_cli            89.43% <86.36%> (-0.30%) ⬇️
migration_assistant             91.26% <ø> (ø)
schema_bump_dry_run_genes       79.74% <ø> (ø)
schema_bump_dry_run_ontologies  99.51% <ø> (ø)


@ivirshup ivirshup left a comment


Do we have any reproducible cases of the bug that we can add as a test case?

And do we know that there are no duplicate rows in the converted parquet file?

@Bento007

> Do we have any reproducible cases of the bug that we can add as a test case?

I wasn't able to reproduce it, but I narrowed down the area of code where it could be caused.

> And do we know that there are no duplicate rows in the converted parquet file?

We know there are no duplicates in the parquet file because we have an explicit check for that as part of our validation. It's only after we write it back out to bgzip that we end up with duplicates.
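That explicit check is not shown in this thread, but it presumably mirrors the ibis row-count comparison used in the test above. A minimal standalone sketch (assert_no_duplicate_rows is a hypothetical name, not the validator's actual API):

import ibis


def assert_no_duplicate_rows(parquet_dir: str) -> None:
    # Hypothetical standalone version of the duplicate-row validation check:
    # a table has duplicates iff its total row count exceeds its distinct row count.
    t = ibis.read_parquet(f"{parquet_dir}/**", hive_partitioning=True)
    total = t.count().execute()
    distinct = t.distinct().count().execute()
    if total != distinct:
        raise ValueError(f"parquet has {total - distinct} duplicate rows")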

@Bento007 Bento007 requested review from ivirshup and joyceyan August 19, 2025 16:26
@Bento007 Bento007 enabled auto-merge (squash) August 19, 2025 20:58
@Bento007 Bento007 disabled auto-merge August 19, 2025 20:59
@Bento007 Bento007 enabled auto-merge (squash) August 19, 2025 20:59
@Bento007 Bento007 merged commit 0130ace into main Aug 19, 2025
10 of 11 checks passed
@Bento007 Bento007 deleted the tsmith/vc-3784 branch August 19, 2025 21:28
Bento007 added a commit that referenced this pull request Aug 19, 2025
Bento007 added a commit that referenced this pull request Aug 19, 2025