3954 re2g dataset sync#177

Merged
javfg merged 8 commits into main from
3954-re2g-dataset-sync
Jan 28, 2026
@project-defiant (Contributor) commented Jan 20, 2026

Context

This PR adds a task that downloads ENCODE datasets from the HTTP server directly to the requested Google Cloud Storage bucket.

Implementations

  • Added new dependencies for async file transfer (aiofiles, gcloud-aio-storage and aiohttp)
  • Introduced a uniform way to download any ENCODE files (those relevant to the API usage)
  • Added a task to fetch the enhancerToGene source data

The task currently has to fetch ~1.5k small files (~3MB) and upload them to GCS, so it can effectively leverage an asynchronous approach with a reusable connection pool.
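As a rough illustration of the bounded-concurrency pattern described here (not the code in this PR), the sketch below copies a list of URLs to a destination with a cap on in-flight transfers. The names `transfer_all`, `fetch`, and `store` are hypothetical; the two callables are stand-ins for what would presumably be a download over a shared `aiohttp` session and an upload through `gcloud-aio-storage` in the real task.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signatures: fetch(url) -> bytes, store(name, data) -> None.
Fetch = Callable[[str], Awaitable[bytes]]
Store = Callable[[str, bytes], Awaitable[None]]


async def transfer_all(
    urls: list[str], fetch: Fetch, store: Store, max_concurrency: int = 32
) -> list[str]:
    """Copy every URL to the destination, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def transfer_one(url: str) -> str:
        async with semaphore:  # cap the number of in-flight transfers
            data = await fetch(url)
            name = url.rsplit("/", 1)[-1]  # last path segment as the file name
            await store(name, data)
            return name

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(transfer_one(url) for url in urls))
```

Passing the download and upload steps in as callables keeps the concurrency logic independent of the network clients, which may also make it easier to test with in-memory fakes instead of mocking GCS and the ENCODE server.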

Missing

A bunch of tests that need heavy mocking of GCS and the ENCODE server.

Testing

Tested the approach on the entire dataset: 1460 files transferred in ~20 min from ENCODE to GCS; for the local filesystem it takes half that time. The data from the test run is in gs://ot_orchestration/

See opentargets/issues#3954 for reference

Copilot AI left a comment

Pull request overview

This PR adds functionality for crawling and downloading ENCODE experiment files with async HTTP and GCS support. The implementation includes a new CrawlEncode task that reads manifest files, downloads files asynchronously using aiohttp, and uploads them to either local storage or Google Cloud Storage.

Changes:

  • Added new CrawlEncode task for downloading ENCODE experiment files based on manifest files
  • Integrated async HTTP download with GCS upload support using aiohttp and gcloud-aio-storage
  • Added new dependencies (aiohttp, aiofiles, asyncio, gcloud-aio-storage, tqdm, etc.)
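To make the "local storage or Google Cloud Storage" distinction concrete, here is a minimal sketch of how a destination string could be split into a bucket and an object path. It is an assumption for illustration only: `Destination` and `resolve_destination` are hypothetical names and do not come from `crawl_encode.py`.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class Destination(NamedTuple):
    bucket: Optional[str]  # None means the local filesystem
    path: str              # object name in the bucket, or a local file path


def resolve_destination(base: str, file_name: str) -> Destination:
    """Return where `file_name` should be written under `base`."""
    if base.startswith("gs://"):
        # "gs://bucket/some/prefix" -> bucket="bucket", prefix="some/prefix"
        bucket, _, prefix = base[len("gs://"):].partition("/")
        return Destination(bucket, str(PurePosixPath(prefix) / file_name))
    # Anything else is treated as a local directory.
    return Destination(None, str(PurePosixPath(base) / file_name))
```

A caller could then branch on `bucket is None` to choose between a local file write and a GCS upload.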

Reviewed changes

Copilot reviewed 5 out of 8 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| pyproject.toml | Version bump to 26.1.0, added async/HTTP/GCS dependencies |
| uv.lock | Lock file updates for new dependencies |
| src/pis/tasks/crawl_encode.py | Core implementation of CrawlEncode task with async download/upload |
| src/pis/validators/crawl_encode.py | Validation functions for file existence (local and GCS) |
| src/test/tasks/test_crawl_encode.py | Test coverage for CrawlEncode functionality |
| config.yaml | Added encode_rE2G step configuration, updated release_uri |

@project-defiant project-defiant marked this pull request as ready for review January 20, 2026 17:26
@project-defiant project-defiant requested a review from javfg January 20, 2026 17:26
@project-defiant (Contributor, Author) commented

Rebase PR before merging

@javfg (Member) left a comment

Accepting as is, since we're refactoring some parts of Otter that will make this easier.

@javfg javfg merged commit 4526c68 into main Jan 28, 2026
3 checks passed
@project-defiant project-defiant linked an issue Jan 28, 2026 that may be closed by this pull request
2 tasks
@javfg javfg deleted the 3954-re2g-dataset-sync branch February 24, 2026 14:24
Development

Successfully merging this pull request may close these issues.

rE2G dataset sync