This project presents an automated algorithm for collecting information about molecular binding measurements—KD, Ki, IC50, and EC50. The tool is designed to generate large volumes of high-quality data for machine learning model training, supporting further breakthroughs in pharmaceutical and biological research.
Access to large bioactivity databases, such as BindingDB and ChEMBL, has already made a significant impact on the development of new drug technologies. These databases are widely used to train modern artificial intelligence models that predict interactions between small molecules and proteins.
However, data collection and annotation for these resources are still largely performed manually, which is challenging and time-consuming. Despite these limitations, such databases have enabled the creation of powerful AI models and numerous scientific discoveries.
- Automate the extraction and annotation of molecular binding data (KD, Ki, IC50, EC50) from scientific literature and databases
- Accelerate the creation of large, diverse datasets for machine learning and AI applications
- Support ongoing research by making curated data more accessible and scalable
- Automated parsing and extraction of affinity measurements (KD, Ki, IC50, EC50)
- Structured output for downstream machine learning workflows
- Scalable and extensible design for integration with existing bioinformatics pipelines
Automating bioactivity data extraction will:
- Greatly increase the volume and quality of training data for AI models
- Lower the barrier for researchers to access and use up-to-date datasets
- Enable faster and more efficient drug discovery processes
- downloader - module for downloading patent text from SureChEMBL.
- get_measures_from_patent - module for extracting binding measures (KD, Ki, IC50, EC50) from patent text.
- alias_to_name - module for replacing aliases with molecule names.
- data_normalization - module that removes null values, converts units to nM, and parses and transforms non-scalar values such as intervals.
- bindingdb - module for transforming data into BindingDB format and calculating correlation.
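To make the data_normalization step more concrete, here is a minimal sketch of that kind of cleanup. It is not the module's actual code: the names `parse_value`, `to_nanomolar`, and `normalize`, the record keys, the list of accepted unit spellings, and the choice to collapse an interval to its geometric mean are all assumptions made for illustration.

```python
import math
import re

# Conversion factors to nM. The accepted spellings are an assumption.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "µM": 1e3, "μM": 1e3, "mM": 1e6, "M": 1e9}

NUM = r"\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
INTERVAL = re.compile(rf"^({NUM})\s*(?:-|–|to)\s*({NUM})$")  # e.g. "10-100"
SINGLE = re.compile(rf"^([<>]=?|~)?\s*({NUM})$")              # e.g. "<50", "2.5"


def parse_value(text):
    """Return (relation, number), or None if the text holds no usable value."""
    if text is None:
        return None
    text = text.strip()
    match = INTERVAL.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        # Assumption: an interval is represented by its geometric mean.
        return "=", math.sqrt(low * high)
    match = SINGLE.match(text)
    if match:
        return match.group(1) or "=", float(match.group(2))
    return None


def to_nanomolar(value, unit):
    """Convert a concentration to nM; unknown units raise ValueError."""
    try:
        return value * UNIT_TO_NM[unit.strip()]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None


def normalize(records):
    """Drop rows without a parsable value and express the rest in nM."""
    cleaned = []
    for rec in records:
        parsed = parse_value(rec.get("value"))
        if parsed is None:
            continue  # null or unparsable measurement
        relation, number = parsed
        cleaned.append(
            {**rec, "relation": relation, "value_nM": to_nanomolar(number, rec["unit"])}
        )
    return cleaned
```

Keeping the relation (`<`, `>`, `~`) next to the converted number preserves censored measurements instead of silently treating them as exact values.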
- Set up your environment (Python 3.10). Install dependencies:
    pip install -r requirements.txt
- Manual preparation of the patent list. (For demo purposes, you can skip this step and use example_small_ids.txt or example_big_ids.txt in the next step.)
  - Download the SureChEMBL database and load it into DuckDB.
  - Then run:
      python surechembl_db_scripts/read_from_duckdb.py
  - Then run:
      python surechembl_db_scripts/create_patent_lists.py
- Filter and download patents (you can use example_small_ids.txt or example_big_ids.txt for demonstration):
    python downloader.py --input_file <patent_list> --output_dir <download_dir>
- Run the agent that extracts binding metrics from patents. Under the hood, this agent also replaces aliases with molecule names (the alias_to_name module), which the next step uses:
    python get_measure.py --patent_dirs <patent_dir> --output_dir <binding_data_dir>
- Look up the InChIKey and the target sequence from the molecule and target names, normalize the data, and write the final table:
    python bindingdb.py <binding_data_dir> <final_table>
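The download, extraction, and final-table commands above can also be chained from one script. The sketch below is a convenience wrapper and not part of the repository: `build_commands` and `run_pipeline` are hypothetical names, the directory and file names in the usage example are placeholders, and it assumes that the directory written by downloader.py is the one passed to `--patent_dirs`. The script names and flags are taken exactly from the steps above.

```python
import subprocess
import sys


def build_commands(patent_list, download_dir, binding_data_dir, final_table):
    """Return the three pipeline commands as argument lists, in run order."""
    py = sys.executable  # run each step with the current interpreter
    return [
        [py, "downloader.py",
         "--input_file", str(patent_list), "--output_dir", str(download_dir)],
        # Assumption: the extraction step reads the directory filled by the downloader.
        [py, "get_measure.py",
         "--patent_dirs", str(download_dir), "--output_dir", str(binding_data_dir)],
        [py, "bindingdb.py", str(binding_data_dir), str(final_table)],
    ]


def run_pipeline(commands, dry_run=False):
    """Run the commands one by one and stop at the first failure."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


if __name__ == "__main__":
    # Placeholder paths; dry_run=True only prints the commands.
    demo = build_commands("example_small_ids.txt", "patents", "binding_data", "final_table.csv")
    run_pipeline(demo, dry_run=True)
```

Passing `check=True` keeps a failed download from feeding an empty directory into the extraction step.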
Contributions and feedback are welcome!