Add a converter from PDB to Zarr to the DatasetFactory #171
Merged
Changes from 16 commits
Commits
38 commits
77c04af
wip
cdced54
prototype
7d1e5ea
Merge branch 'main' into feat/pdb
zhu0619 c60c8c0
add fastpdb prototype
zhu0619 d7c9fd9
wip
zhu0619 2e47d70
wip
b61dfdd
wip
977b095
add pdb converter
zhu0619 3261d2d
missing imports
zhu0619 d9bce7f
minor changes
zhu0619 876138d
add tutorial
zhu0619 7e0c127
remove dev files
zhu0619 0f2c6e9
add dep
zhu0619 ce52de0
env
zhu0619 b3eafc6
update api
zhu0619 6962a31
update docs
zhu0619 6eb4bcc
add opt dep
zhu0619 0377289
update adaptor name
zhu0619 474244c
wip
zhu0619 a847d3e
update import
zhu0619 50d0b80
refactor to add_from_files
zhu0619 8edf177
refactor
zhu0619 779b91b
add tests
zhu0619 9ecae2b
update deps
zhu0619 0e7fb6e
update load_to_memeory
zhu0619 ec337c2
refactor pdb pointer
zhu0619 19ebae7
add create_dataset_from_files
zhu0619 d5c1242
add info
zhu0619 46cacb6
rename tutorials
zhu0619 f7d9dcb
fix mkdocs
zhu0619 24580fb
ruff
zhu0619 d0de43a
Merge branch 'main' into feat/pdb
zhu0619 6e8c66d
format notebooks
zhu0619 8b588d9
Revert some formatting changes
cwognum 8eda8f5
Revert some more formatting changes
cwognum b026bba
Addressed minor feedback
cwognum 569bb0b
Comment
cwognum 5222948
Merge branch 'main' into feat/pdb
zhu0619
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,3 +16,9 @@ | |
filters: ["!^_"] | ||
|
||
--- | ||
|
||
::: polaris.dataset.converters.PDBConverter | ||
options: | ||
filters: ["!^_"] | ||
|
||
--- |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,345 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "217690be-9836-4e06-930e-ba7efbb37d91", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [ | ||
"remove_cell" | ||
] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Note: Cell is tagged to not show up in the mkdocs build\n", | ||
"%load_ext autoreload\n", | ||
"%autoreload 2" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "39b58e71", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"<div class=\"admonition abstract highlight\">\n", | ||
" <p class=\"admonition-title\">In short</p>\n", | ||
" <p>This tutorial shows how to create a dataset from PDB files by converting them to the .zarr format.</p>\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e154bb54", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"### Dummy PDB example" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"id": "5e201379", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import zarr\n", | ||
"import platformdirs\n", | ||
"\n", | ||
"import numpy as np\n", | ||
"import datamol as dm\n", | ||
"import pandas as pd\n", | ||
"\n", | ||
"from polaris.dataset import DatasetFactory\n", | ||
"from polaris.dataset.converters import SDFConverter, PDBConverter\n", | ||
"\n", | ||
"SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname=\"polaris-tutorials\"), \"002\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"id": "14b6c3a5", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"PDB file '/Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial.pdb' created successfully.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"pdb_content = \"\"\"\\\n", | ||
"ATOM      1  N   ASN A   1      38.267  13.340  12.748  1.00 18.15           N  \n", | ||
"ATOM      2  CA  ASN A   1      37.251  14.218  12.226  1.00 16.56           C  \n", | ||
"ATOM      3  C   ASN A   1      36.022  13.500  11.637  1.00 16.50           C  \n", | ||
"ATOM      4  O   ASN A   1      35.023  14.079  11.216  1.00 16.60           O  \n", | ||
"ATOM      5  CB  ASN A   1      37.767  15.426  11.473  1.00 16.60           C  \n", | ||
"TER\n", | ||
"END\n", | ||
"\"\"\"\n", | ||
"\n", | ||
"# Make sure the target directory exists, then specify the file name\n", | ||
"dm.fs.mkdir(SAVE_DIR, exist_ok=True)\n", | ||
"pdb_filename = dm.fs.join(SAVE_DIR, \"tutorial.pdb\")\n", | ||
"\n", | ||
"# Write the string to a PDB file\n", | ||
"with open(pdb_filename, \"w\") as pdb_file:\n", | ||
" pdb_file.write(pdb_content)\n", | ||
"\n", | ||
"print(f\"PDB file '{pdb_filename}' created successfully.\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8a47ae20", | ||
"metadata": {}, | ||
"source": [ | ||
"### Create dataset from PDB file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"id": "07442028", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"save_dst = dm.fs.join(SAVE_DIR, \"tutorial_pdb.zarr\")\n", | ||
"\n", | ||
"factory = DatasetFactory(zarr_root_path=save_dst)\n", | ||
"factory.reset(save_dst)\n", | ||
"\n", | ||
"factory.register_converter(\"pdb\", PDBConverter(pdb_column=\"pdb\"))\n", | ||
"factory.add_from_file([pdb_filename])\n", | ||
"\n", | ||
"# Build the dataset\n", | ||
"dataset = factory.build()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "35bb183e", | ||
"metadata": {}, | ||
"source": [ | ||
"### Check the dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 17, | ||
"id": "05712cbd", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<table border=\"1\"><tr><th>name</th><td>None</td></tr><tr><th>description</th><td></td></tr><tr><th>tags</th><td></td></tr><tr><th>user_attributes</th><td></td></tr><tr><th>owner</th><td>None</td></tr><tr><th>polaris_version</th><td>0.7.10.dev7+gb61dfdd.d20240809</td></tr><tr><th>default_adapters</th><td><table border=\"1\"><tr><th>pdb</th><td>PDB_TO_ARRAY</td></tr></table></td></tr><tr><th>zarr_root_path</th><td>/Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial_pdb.zarr</td></tr><tr><th>readme</th><td></td></tr><tr><th>annotations</th><td><table border=\"1\"><tr><th>pdb</th><td><table border=\"1\"><tr><th>is_pointer</th><td>True</td></tr><tr><th>modality</th><td>PROTEIN_3D</td></tr><tr><th>description</th><td>None</td></tr><tr><th>user_attributes</th><td></td></tr><tr><th>dtype</th><td>object</td></tr></table></td></tr></table></td></tr><tr><th>source</th><td>None</td></tr><tr><th>license</th><td>None</td></tr><tr><th>curation_reference</th><td>None</td></tr><tr><th>cache_dir</th><td>/Users/lu.zhu/Library/Caches/polaris/datasets/46c15ea7-d397-478e-a3e7-bb81752133f6</td></tr><tr><th>md5sum</th><td>9851ac3224382ee99ca8998d813d7421</td></tr><tr><th>artifact_id</th><td>None</td></tr><tr><th>n_rows</th><td>1</td></tr><tr><th>n_columns</th><td>1</td></tr></table>" | ||
], | ||
"text/plain": [ | ||
"{\n", | ||
" \"name\": null,\n", | ||
" \"description\": \"\",\n", | ||
" \"tags\": [],\n", | ||
" \"user_attributes\": {},\n", | ||
" \"owner\": null,\n", | ||
" \"polaris_version\": \"0.7.10.dev7+gb61dfdd.d20240809\",\n", | ||
" \"default_adapters\": {\n", | ||
" \"pdb\": \"PDB_TO_ARRAY\"\n", | ||
" },\n", | ||
" \"zarr_root_path\": \"/Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial_pdb.zarr\",\n", | ||
" \"readme\": \"\",\n", | ||
" \"annotations\": {\n", | ||
" \"pdb\": {\n", | ||
" \"is_pointer\": true,\n", | ||
" \"modality\": \"PROTEIN_3D\",\n", | ||
" \"description\": null,\n", | ||
" \"user_attributes\": {},\n", | ||
" \"dtype\": \"object\"\n", | ||
" }\n", | ||
" },\n", | ||
" \"source\": null,\n", | ||
" \"license\": null,\n", | ||
" \"curation_reference\": null,\n", | ||
" \"cache_dir\": \"/Users/lu.zhu/Library/Caches/polaris/datasets/46c15ea7-d397-478e-a3e7-bb81752133f6\",\n", | ||
" \"md5sum\": \"9851ac3224382ee99ca8998d813d7421\",\n", | ||
" \"artifact_id\": null,\n", | ||
" \"n_rows\": 1,\n", | ||
" \"n_columns\": 1\n", | ||
"}" | ||
] | ||
}, | ||
"execution_count": 17, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e5f904bc", | ||
"metadata": {}, | ||
"source": [ | ||
"### Check data table" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 18, | ||
"id": "6b7017ad", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<div>\n", | ||
"<style scoped>\n", | ||
" .dataframe tbody tr th:only-of-type {\n", | ||
" vertical-align: middle;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe tbody tr th {\n", | ||
" vertical-align: top;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe thead th {\n", | ||
" text-align: right;\n", | ||
" }\n", | ||
"</style>\n", | ||
"<table border=\"1\" class=\"dataframe\">\n", | ||
" <thead>\n", | ||
" <tr style=\"text-align: right;\">\n", | ||
" <th></th>\n", | ||
" <th>pdb</th>\n", | ||
" </tr>\n", | ||
" </thead>\n", | ||
" <tbody>\n", | ||
" <tr>\n", | ||
" <th>0</th>\n", | ||
" <td>pdb#tutorial</td>\n", | ||
" </tr>\n", | ||
" </tbody>\n", | ||
"</table>\n", | ||
"</div>" | ||
], | ||
"text/plain": [ | ||
" pdb\n", | ||
"0 pdb#tutorial" | ||
] | ||
}, | ||
"execution_count": 18, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"dataset.table" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a89953b8", | ||
"metadata": {}, | ||
"source": [ | ||
"### Get PDB data from a specific row\n", | ||
"An array of `biotite.Atom` objects is returned.\n", | ||
"For more details, see [fastpdb](https://github.com/biotite-dev/fastpdb) and [Atom](https://github.com/biotite-dev/biotite/blob/main/src/biotite/structure/atoms.py)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 19, | ||
"id": "f2583c8d", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"array([\n", | ||
"\tAtom(np.array([38.267, 13.34 , 12.748], dtype=float32), chain_id=\"A\", res_id=1, ins_code=\"\", res_name=\"ASN\", hetero=False, atom_name=\"N\", element=\"N\", b_factor=18.15, charge=0, occupancy=1.0),\n", | ||
"\tAtom(np.array([37.251, 14.218, 12.226], dtype=float32), chain_id=\"A\", res_id=1, ins_code=\"\", res_name=\"ASN\", hetero=False, atom_name=\"CA\", element=\"C\", b_factor=16.56, charge=0, occupancy=1.0),\n", | ||
"\tAtom(np.array([36.022, 13.5 , 11.637], dtype=float32), chain_id=\"A\", res_id=1, ins_code=\"\", res_name=\"ASN\", hetero=False, atom_name=\"C\", element=\"C\", b_factor=16.5, charge=0, occupancy=1.0),\n", | ||
"\tAtom(np.array([35.023, 14.079, 11.216], dtype=float32), chain_id=\"A\", res_id=1, ins_code=\"\", res_name=\"ASN\", hetero=False, atom_name=\"O\", element=\"O\", b_factor=16.6, charge=0, occupancy=1.0),\n", | ||
"\tAtom(np.array([37.767, 15.426, 11.473], dtype=float32), chain_id=\"A\", res_id=1, ins_code=\"\", res_name=\"ASN\", hetero=False, atom_name=\"CB\", element=\"C\", b_factor=16.6, charge=0, occupancy=1.0)\n", | ||
"])" | ||
] | ||
}, | ||
"execution_count": 19, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"dataset.get_data(0, \"pdb\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "72767ef2", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"Completing the dataset's metadata and uploading it to the hub follows the same steps as outlined in the tutorial [dataset_zarr.ipynb](docs/tutorials/dataset_zarr.ipynb).\n", | ||
"\n", | ||
"The End." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
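The dummy PDB string in the tutorial only parses correctly if each `ATOM` record keeps the fixed-column layout of the PDB format (serial number in columns 7-11, atom name in 13-16, coordinates in 31-54, and so on). The sketch below is not part of this PR and is not how `fastpdb` or `PDBConverter` is implemented. It is a stdlib-only illustration of that column layout, and `AtomRecord`, `format_atom_line` and `parse_atom_line` are hypothetical names introduced here. Building the line with explicit field widths avoids depending on hand-typed whitespace.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """Hypothetical container for the fields of one ATOM record."""

    serial: int
    atom_name: str
    res_name: str
    chain_id: str
    res_id: int
    coord: tuple
    occupancy: float
    b_factor: float
    element: str


def format_atom_line(serial, name, res, chain, resid, x, y, z, occ, b, el):
    """Build one ATOM record with explicit field widths (78 characters)."""
    return (
        f"{'ATOM':<6s}{serial:>5d} {name:<4s} {res:>3s} {chain}{resid:>4d}"
        + " " * 4  # insertion code (column 27) plus three blank columns
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
        + " " * 10  # columns 67-76 are unused here
        + f"{el:>2s}"
    )


def parse_atom_line(line):
    """Parse one ATOM/HETATM record by slicing its fixed columns."""
    line = line.ljust(80)  # pad so every slice below is in range
    return AtomRecord(
        serial=int(line[6:11]),
        atom_name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_id=int(line[22:26]),
        coord=(float(line[30:38]), float(line[38:46]), float(line[46:54])),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


# First atom of the tutorial's dummy residue. The leading space in " N"
# places a one-letter element's atom name in column 14, as is conventional.
first_line = format_atom_line(
    1, " N", "ASN", "A", 1, 38.267, 13.340, 12.748, 1.00, 18.15, "N"
)
first_atom = parse_atom_line(first_line)
print(first_atom)
```

If whitespace in a record is collapsed (for example by copy-pasting from a rendered page), the slices above no longer line up with their fields, which is why the tutorial's PDB string should be kept column-exact.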