Conversation
Pull request overview
This PR adds functionality for crawling and downloading ENCODE experiment files with async HTTP and GCS support. The implementation includes a new CrawlEncode task that reads manifest files, downloads files asynchronously using aiohttp, and uploads them to either local storage or Google Cloud Storage.
Changes:
- Added new `CrawlEncode` task for downloading ENCODE experiment files based on manifest files
- Integrated async HTTP download with GCS upload support using aiohttp and gcloud-aio-storage
- Added new dependencies (aiohttp, aiofiles, asyncio, gcloud-aio-storage, tqdm, etc.)
Reviewed changes
Copilot reviewed 5 out of 8 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| pyproject.toml | Version bump to 26.1.0, added async/HTTP/GCS dependencies |
| uv.lock | Lock file updates for new dependencies |
| src/pis/tasks/crawl_encode.py | Core implementation of CrawlEncode task with async download/upload |
| src/pis/validators/crawl_encode.py | Validation functions for file existence (local and GCS) |
| src/test/tasks/test_crawl_encode.py | Test coverage for CrawlEncode functionality |
| config.yaml | Added encode_rE2G step configuration, updated release_uri |
Contributor (Author):
Rebase PR before merging
javfg (Member) approved these changes on Jan 28, 2026 and left a comment:
Accepting as-is, as we're refactoring some parts of Otter that will make this easier.
Context
This PR adds a task to download ENCODE datasets from the HTTP server directly to the requested Google Cloud Storage bucket.
Implementation
The download-and-upload task currently needs to fetch ~1.5k small files (~3 MB each) and push them to GCS; this workload can effectively leverage an asynchronous approach with a reusable connection pool.
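As a sketch of the bounded-concurrency pattern this relies on (names are hypothetical, not the task's actual API; the real implementation uses a shared aiohttp `ClientSession` for connection reuse and gcloud-aio-storage for uploads, replaced here by a stdlib placeholder so the sketch runs standalone):

```python
import asyncio

MAX_CONCURRENCY = 32  # assumption: cap simultaneous transfers to be polite to the server

async def transfer_one(url: str, sem: asyncio.Semaphore) -> str:
    # Placeholder for: GET via a shared aiohttp session, then upload to GCS or local fs.
    async with sem:
        await asyncio.sleep(0)  # stands in for the network I/O
        return url

async def transfer_all(urls: list[str]) -> list[str]:
    # One semaphore bounds concurrency; a single session (not shown) would
    # reuse TCP connections across all ~1.5k small requests.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(transfer_one(u, sem) for u in urls))

results = asyncio.run(transfer_all([f"https://example.org/f{i}" for i in range(5)]))
print(len(results))  # → 5
```

The win over a sequential loop comes from overlapping many small transfers, which is exactly the shape of this dataset (many ~3 MB files rather than a few large ones).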
Missing
A number of tests that require extensive mocking of GCS and the ENCODE server.
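One way such tests could be structured without touching the real servers (a sketch using only the standard library's `AsyncMock`; `crawl` and `fetch` are hypothetical stand-ins, not the task's actual functions):

```python
import asyncio
from unittest.mock import AsyncMock

async def crawl(fetch, urls):
    # Hypothetical stand-in for the task loop: fetch each file, collect sizes.
    payloads = await asyncio.gather(*(fetch(u) for u in urls))
    return [len(p) for p in payloads]

def test_crawl_with_mocked_server():
    # AsyncMock replaces the HTTP call, so no ENCODE server or GCS is needed.
    fetch = AsyncMock(return_value=b"\x00" * 3)
    sizes = asyncio.run(crawl(fetch, ["u1", "u2"]))
    assert sizes == [3, 3]
    assert fetch.await_count == 2

test_crawl_with_mocked_server()
print("ok")
```

Mocking at the fetch boundary keeps the tests fast; fully exercising the GCS upload path would still need a fake storage client or a library such as aioresponses.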
Testing
Tested the approach on the entire dataset: 1460 files transferred in ~20 min from ENCODE to GCS; for the local filesystem it takes about half that time. Data from the test run is in gs://ot_orchestration/.
See opentargets/issues#3954 for reference.