
Conversation


@Bento007 Bento007 commented Aug 13, 2025

Reason for Change

  • https://czi.atlassian.net/browse/VC-3784
  • When reading chunks of the parquet file back into bgzip, duplicate rows were created.

Changes

  • To avoid the duplicated rows, the bgzip output is now written directly from the source file rather than from the parquet conversion. The original file is still converted to parquet to speed up the validation checks.
  • The complexity of converting to bgzip has been offloaded to common Linux tools (see the sketch after this list).
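The actual commands live in atac_seq.prepare_fragment and are not shown in this PR description; as a rough illustration, a pipeline over those tools might look like the following sketch, assuming gunzip, bgzip, and tabix (htslib) are on PATH (bgzip_fragment is a hypothetical helper, not the real implementation):

import subprocess


def bgzip_fragment(source_tsv_gz: str, out_bgz: str) -> None:
    # Recompress the source fragment file with bgzip (block gzip, required by
    # tabix) directly, without round-tripping through parquet.
    with open(out_bgz, "wb") as out:
        gunzip = subprocess.Popen(["gunzip", "-c", source_tsv_gz], stdout=subprocess.PIPE)
        subprocess.run(["bgzip", "-c"], stdin=gunzip.stdout, stdout=out, check=True)
        gunzip.stdout.close()
        if gunzip.wait() != 0:
            raise RuntimeError("gunzip failed")
    # Index by genomic position so regions (e.g. chr20) can be fetched with tabix.
    subprocess.run(["tabix", "-p", "bed", out_bgz], check=True)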

Testing

Copied an ATAC fragment file with known duplicates:

tabix https://datasets.cellxgene.cziscience.com/48826a24-569b-4fd4-ae3a-d1f1b6c2af89-fragment.tsv.bgz chr20 | gzip > chr20_dup.tsv.gz
tabix https://datasets.cellxgene.cziscience.com/48826a24-569b-4fd4-ae3a-d1f1b6c2af89-fragment.tsv.bgz chr20 | uniq | gzip > chr20.tsv.gz

and then ran this test on it:

import gzip
from collections import Counter
import ibis
from cellxgene_schema import atac_seq


def test_duplication_in_parquet_error(tmpdir):
    original_file = "/Users/trentsmith/workspace/single-cell-curation/chr20.tsv.gz"
    parquet_file = tmpdir / "parquet_fragment"
    atac_seq.convert_to_parquet(original_file, parquet_file)
    t = ibis.read_parquet(f"{parquet_file}/**", hive_partitioning=True)
    row_count = t.count().execute()
    unique_count = t.distinct().count().execute()
    assert row_count == unique_count, "The parquet file has duplicate rows."

    bgzip_file = tmpdir / "fragment.tsv.bgz"
    atac_seq.prepare_fragment(parquet_file, bgzip_file, tempdir=tmpdir, write_algorithm="cli")
    with gzip.open(original_file, 'rt') as f:
        # Counter(f) consumes the file iterator, so derive the line count from
        # the counter rather than re-reading the exhausted file handle.
        original_counter = Counter(f)
    original_duplicates = len([k for k, v in original_counter.items() if v > 1])
    original_count = sum(original_counter.values())
    with gzip.open(bgzip_file, 'rt') as f:
        bgzip_counter = Counter(f)
    bgzip_duplicates = len([k for k, v in bgzip_counter.items() if v > 1])
    bgzip_count = sum(bgzip_counter.values())
    assert original_count == bgzip_count
    assert not original_duplicates
    assert not bgzip_duplicates

  • Because of the size of the data required, this is not being added as a unit test.
  • Tests have been updated to reflect the changes.

Notes for Reviewer


codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 86.36364% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.94%. Comparing base (eb6765d) to head (e166318).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1424      +/-   ##
==========================================
- Coverage   89.17%   88.94%   -0.23%     
==========================================
  Files          23       23              
  Lines        2587     2561      -26     
==========================================
- Hits         2307     2278      -29     
- Misses        280      283       +3     
Components                      Coverage Δ
cellxgene_schema_cli            89.43% <86.36%> (-0.30%) ⬇️
migration_assistant             91.26% <ø> (ø)
schema_bump_dry_run_genes       79.74% <ø> (ø)
schema_bump_dry_run_ontologies  99.51% <ø> (ø)


@ivirshup ivirshup left a comment


Do we have any reproducible cases of the bug that we can add as a test case?

And do we know that there are no duplicate rows in the converted parquet file?

@Bento007

> Do we have any reproducible cases of the bug that we can add as a test case?

I wasn't able to reproduce it, but I narrowed down the area of code where it could be caused.

> And do we know that there are no duplicate rows in the converted parquet file?

We know there are no duplicates in the parquet file because we have an explicit check for that as part of our validation. It's only after we write it back out to bgzip that we end up with duplicates.
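That explicit check is not shown in this thread, but it presumably mirrors the ibis row-count comparison used in the test above. A minimal standalone sketch (assert_no_duplicate_rows is a hypothetical name, not the validator's actual API):

import ibis


def assert_no_duplicate_rows(parquet_dir: str) -> None:
    # Hypothetical standalone version of the duplicate-row validation check:
    # a table has duplicates iff its total row count exceeds its distinct row count.
    t = ibis.read_parquet(f"{parquet_dir}/**", hive_partitioning=True)
    total = t.count().execute()
    distinct = t.distinct().count().execute()
    if total != distinct:
        raise ValueError(f"parquet has {total - distinct} duplicate rows")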

@Bento007 Bento007 requested review from ivirshup and joyceyan August 19, 2025 16:26
@Bento007 Bento007 enabled auto-merge (squash) August 19, 2025 20:58
@Bento007 Bento007 disabled auto-merge August 19, 2025 20:59
@Bento007 Bento007 enabled auto-merge (squash) August 19, 2025 20:59
@Bento007 Bento007 merged commit 0130ace into main Aug 19, 2025
10 of 11 checks passed
@Bento007 Bento007 deleted the tsmith/vc-3784 branch August 19, 2025 21:28
Bento007 added a commit that referenced this pull request Aug 19, 2025
Bento007 added a commit that referenced this pull request Aug 19, 2025