Automated Extraction of Bioactivity Data (KD, Ki, IC50, EC50)

Overview

This project provides an automated pipeline for extracting molecular binding measurements (KD, Ki, IC50, and EC50) from patent text. It is designed to produce large volumes of structured, high-quality data for training machine learning models in pharmaceutical and biological research.

Background

Access to large bioactivity databases, such as BindingDB and ChEMBL, has already made a significant impact on the development of new drug technologies. These databases are widely used to train modern artificial intelligence models that predict interactions between small molecules and proteins.

However, data collection and annotation for these resources are still largely performed manually, which is slow and labor-intensive. Automating this work would let such databases grow faster than manual curation allows.

Project Goals

- Automate the extraction and annotation of molecular binding data (KD, Ki, IC50, EC50) from scientific literature and databases
- Accelerate the creation of large, diverse datasets for machine learning and AI applications
- Support ongoing research by making curated data more accessible and scalable

Key Features

- Automated parsing and extraction of affinity measurements (KD, Ki, IC50, EC50)
- Structured output for downstream machine learning workflows
- Scalable and extensible design for integration with existing bioinformatics pipelines
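
To make "parsing and extraction" and "structured output" concrete, here is a minimal sketch of turning free text into structured records. It is an illustration only: this README does not describe how the project's extraction is implemented (the workflow refers to an agent), so the regex approach, the `Measurement` record, and the `extract_measurements` function are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass


# Hypothetical record type; the project's real output schema is not shown in this README.
@dataclass
class Measurement:
    kind: str      # "KD", "Ki", "IC50" or "EC50"
    relation: str  # "=", "<", ">", "<=", ">=" or "~"
    value: float   # numeric value as written in the text
    unit: str      # unit as written in the text, e.g. "nM"


# Illustrative pattern only: matches phrases such as "IC50 = 12.5 nM" or "Ki of 3 uM".
PATTERN = re.compile(
    r"\b(?P<kind>IC50|EC50|Ki|K[dD])\s*"
    r"(?:(?P<rel><=|>=|[=<>~])|of|is|was)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[pnu\u00b5\u03bcm]?M)\b"
)


def extract_measurements(text: str) -> list:
    """Return every measurement phrase found in the text as a Measurement."""
    results = []
    for match in PATTERN.finditer(text):
        kind = match.group("kind")
        if kind.lower() == "kd":
            kind = "KD"  # use one spelling for the dissociation constant
        results.append(
            Measurement(
                kind=kind,
                relation=match.group("rel") or "=",  # "of"/"is"/"was" read as equality
                value=float(match.group("value")),
                unit=match.group("unit"),
            )
        )
    return results


if __name__ == "__main__":
    sample = "Compound 7 showed IC50 = 12.5 nM and a Ki of 3 uM; Kd<100 pM."
    for record in extract_measurements(sample):
        print(record)
```

A pattern like this misses most real patent phrasing (tables, ranges, values separated from their compound names), which is presumably why the project relies on an agent rather than fixed rules.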

Impact

Automating bioactivity data extraction will:

- Greatly increase the volume and quality of training data for AI models
- Lower the barrier for researchers to access and use up-to-date datasets
- Enable faster and more efficient drug discovery processes

Module Descriptions

- downloader: module for downloading patent text from SureChEMBL.
- get_measures_from_patent: module for extracting binding measures (KD, Ki, IC50, EC50) from patent text.
- alias_to_name: module for replacing aliases with molecule names.
- data_normalization: module that removes null values, standardizes units to nM, and parses and transforms values such as intervals.
- bindingdb: module for transforming data into BindingDB format and calculating correlation.
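
The normalization described for data_normalization can be sketched roughly as follows. This is not the module's actual code: the unit spellings in `TO_NM`, the choice to represent every value as a `(low, high)` pair, and the function names are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Conversion factors from molar units to nanomolar (assumed unit spellings).
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "\u00b5M": 1e3, "nM": 1.0, "pM": 1e-3}

# Matches simple intervals such as "10-50", "10 \u2013 50" or "10 to 50".
INTERVAL = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(?:-|\u2013|to)\s*(\d+(?:\.\d+)?)\s*")


def parse_value(raw: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return (low, high) bounds; a single number gives low == high; nulls give None."""
    if raw is None or not raw.strip():
        return None
    match = INTERVAL.fullmatch(raw)
    if match:
        return float(match.group(1)), float(match.group(2))
    try:
        value = float(raw)
    except ValueError:
        return None  # unparseable text is treated like a null
    return value, value


def normalize(rows: List[Tuple[Optional[str], str]]) -> List[Tuple[float, float]]:
    """Drop nulls and unknown units, then convert the bounds to nM."""
    normalized = []
    for raw, unit in rows:
        bounds = parse_value(raw)
        if bounds is None or unit not in TO_NM:
            continue
        factor = TO_NM[unit]
        normalized.append((bounds[0] * factor, bounds[1] * factor))
    return normalized


if __name__ == "__main__":
    rows = [("3", "uM"), (None, "nM"), ("10-50", "nM"), ("n/a", "nM"), ("2", "mM")]
    print(normalize(rows))
```

Keeping both bounds avoids deciding how an interval should collapse to a single number; how the real module transforms intervals is not stated in this README.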

Typical Workflow

Set up your environment (Python 3.10). Install dependencies:
```
pip install -r requirements.txt
```
Prepare the patent list manually. (For a demo, you can skip this step and use example_small_ids.txt or example_big_ids.txt in the next step.)
- Download the SureChEMBL database and load it into DuckDB.
- Then run:
```
python surechembl_db_scripts/read_from_duckdb.py
```
- Then run:
```
python surechembl_db_scripts/create_patent_lists.py
```
Filter and download patents (you can use example_small_ids.txt or example_big_ids.txt for demonstration):
```
python downloader.py --input_file <patent_list> --output_dir <download_dir>
```
Run the agent that extracts binding measurements from patents.
- Under the hood, this agent also replaces molecule aliases with molecule names (see next step):
```
python get_measure.py --patent_dirs <patent_dir> --output_dir <binding_data_dir>
```
Add InChIKeys and sequences by looking up molecule and target names, normalize the data, and write the final table:
```
python bindingdb.py <binding_data_dir> <final_table>
```
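
The three command-line steps above can be chained from a small driver script. The sketch below only reuses the commands and flags shown in this workflow. It assumes that the directory written by downloader.py is the one passed to get_measure.py as `--patent_dirs`, and the directory and file names in the example are hypothetical.

```python
import subprocess
import sys


def build_commands(patent_list, download_dir, binding_data_dir, final_table):
    """Build the download, extraction, and final-table commands in order."""
    python = sys.executable  # run the scripts with the current interpreter
    return [
        [python, "downloader.py", "--input_file", patent_list, "--output_dir", download_dir],
        # Assumption: the downloader's output directory is the extractor's input.
        [python, "get_measure.py", "--patent_dirs", download_dir, "--output_dir", binding_data_dir],
        [python, "bindingdb.py", binding_data_dir, final_table],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own locations.
    commands = build_commands("example_small_ids.txt", "patents", "bindings", "final_table.csv")
    run_pipeline(commands, dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check the paths before starting a long download.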

Contributions and feedback are welcome!

demonolock/surechembl_bindings
