@mlin (Contributor) commented Jan 31, 2024

Sorry for the heavyweight PR -- numerous changes to our Geneformer API and workflows accumulated for the new LTS:

  • Run tokenization/fine-tuning/forward-pass WDLs on AWS HealthOmics instead of Batch
  • Update the upstream Geneformer version
    • Add new special_token flag
    • Use a gene ID consolidation mapping, with modifications to the sparse math to implement it
    • Add WDL inputs for slight variations on embeddings we want to try
  • Update ontologies for cell subclasses
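
To illustrate the gene ID consolidation item above, here is a minimal sketch of the general idea: counts recorded under several gene IDs that map to the same consolidated ID are summed per cell. This is not the code from this PR. The function name `consolidate_gene_ids`, the triplet representation, and the placeholder gene IDs are all assumptions made for illustration; the actual tokenizer works on sparse matrices.

```python
from collections import defaultdict


def consolidate_gene_ids(entries, gene_id_map):
    """Sum sparse (cell, gene_id, count) triplets after remapping gene IDs.

    entries:     iterable of (cell_index, gene_id, count) triplets,
                 i.e. a sparse matrix in coordinate form.
    gene_id_map: dict mapping a gene ID to its consolidated ID; IDs that
                 are absent from the map are kept unchanged.
    """
    merged = defaultdict(float)
    for cell, gene_id, count in entries:
        target = gene_id_map.get(gene_id, gene_id)
        # Entries that land on the same (cell, gene) pair are added together.
        merged[(cell, target)] += count
    return dict(merged)


# Hypothetical data: two IDs for the same gene appear in cell 0.
entries = [
    (0, "GENE_A_OLD", 3.0),
    (0, "GENE_A", 2.0),
    (1, "GENE_B", 5.0),
]
gene_id_map = {"GENE_A_OLD": "GENE_A"}

consolidated = consolidate_gene_ids(entries, gene_id_map)
print(consolidated)
```

With a real sparse matrix, the same effect is usually obtained by multiplying the cell-by-gene matrix by a sparse indicator matrix that maps original columns to consolidated columns.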

Noting loose ends for potential future cleanup:

  • Use published version of new Geneformer model, once available
  • Replace legacy cell subclass mapper with cellxgene-ontology-guide

codecov bot commented Jan 31, 2024

Codecov Report

Attention: Patch coverage is 72.54902% with 14 lines in your changes missing coverage. Please review.

Project coverage is 91.12%. Comparing base (f775282) to head (e12a102).
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...xperimental/ml/huggingface/geneformer_tokenizer.py | 72.34% | 13 Missing ⚠️ |
| ...sts/experimental/ml/huggingface/test_geneformer.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #961      +/-   ##
==========================================
- Coverage   91.19%   91.12%   -0.07%     
==========================================
  Files          77       79       +2     
  Lines        5971     6173     +202     
==========================================
+ Hits         5445     5625     +180     
- Misses        526      548      +22     
| Flag | Coverage Δ |
|---|---|
| unittests | 91.12% <72.54%> (-0.07%) ⬇️ |
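
The percentages in this report can be reproduced from the hit and line counts in the coverage diff. The sketch below is only a worked check of that arithmetic; the helper name `coverage_pct` is made up here, and the patch total of 51 lines is an inference from the reported 72.54902% with 14 lines missing, not a figure stated in the report.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return line coverage as a percentage."""
    return 100.0 * hits / lines


# Hits and Lines columns from the coverage diff.
base = coverage_pct(5445, 5971)  # main (f775282)
head = coverage_pct(5625, 6173)  # this PR (e12a102)
delta = head - base

print(f"base  {base:.2f}%")
print(f"head  {head:.2f}%")
print(f"delta {delta:+.2f}%")

# Assumption: the patch touches 51 coverable lines, 14 of them uncovered.
patch_lines = 51
patch_missing = 14
patch = coverage_pct(patch_lines - patch_missing, patch_lines)
print(f"patch {patch:.5f}%")
```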

Flags with carried forward coverage won't be shown.

@mlin changed the title from "[python] run Geneformer WDLs on HealthOmics managed service instead of AWS Batch" to "[python] Geneformer updates for July 2024 LTS" Jul 3, 2024
@mlin marked this pull request as ready for review July 3, 2024 08:06
@mlin requested review from ebezzi, ivirshup and pablo-gar July 3, 2024 08:07
@mlin merged commit 1b24d78 into main Jul 5, 2024
@mlin deleted the mlin/geneformer-healthomics branch July 5, 2024 22:24
3 participants