OpenPlasmid #23

wconnell · 2024-10-24T22:13:37Z

Polaris Link

https://polarishub.io/datasets/wconnell/openplasmid-v1

README

Overview

OpenPlasmid is a comprehensive dataset containing detailed information on approximately 150,000 plasmids originally deposited on Addgene, a nonprofit plasmid repository. This dataset includes textual descriptions, depositor and study information, annotated GenBank sequences, and more, facilitating research in molecular biology, genetics, and drug discovery. You can find rich educational resources about plasmids and their use in molecular biology and therapeutics on Addgene's blog webpage.

Applications in Drug Discovery

Target Identification: Identifying plasmids expressing proteins of interest for therapeutic targeting.
Gene Therapy Research: Analyzing plasmids used in gene therapy vectors.
Protein Production: Studying plasmids for recombinant protein expression in drug development.
Synthetic Biology: Exploring custom plasmid constructs for novel biological systems and pathways.
CRISPR and Gene Editing: Accessing plasmids containing CRISPR/Cas components for genome editing.
Generative Design: Designing plasmid sequences and motifs that improve function.

Data Source and Generation

Source: Exclusively sourced from Addgene for consistency and reliability.
Collection Method: Programmatically extracted; see the Polaris recipe.

Data Curation and Quality Assurance

Consistency and Standardization

Schema Definition: Established a consistent schema with predefined fields and subfields to ensure uniformity across all entries.
Data Cleaning: Implemented text cleaning functions to remove excessive whitespace, HTML tags, and unwanted content.
Type Conversion: Converted all data values to appropriate data types, primarily strings, to maintain consistency.
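The cleaning and type-conversion steps above could look roughly like the following. This is a minimal sketch, not the recipe's actual code: the function name `clean_text` and the exact rules (regex-based tag stripping, whitespace collapsing) are assumptions made for illustration.

```python
import html
import re

# Hypothetical helper; the real recipe may clean text differently.
_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_text(value) -> str:
    """Convert a value to a string, drop HTML tags, and collapse whitespace."""
    if value is None:
        return ""
    text = _TAG_RE.sub(" ", str(value))  # replace tags with a space
    text = html.unescape(text)           # decode entities such as &nbsp;
    text = _WS_RE.sub(" ", text)         # collapse runs of whitespace
    return text.strip()


print(clean_text("  <b>pUC19</b>\n  cloning&nbsp;vector "))
```

Regex-based tag stripping is adequate for short annotation strings; a full HTML parser would be the safer choice for arbitrary page content.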

Error Checking

Duplicate Removal: Checked for and removed any duplicate plasmid entries based on unique Addgene IDs.
Validation of Data Fields: Ensured all required fields are present. Missing fields were populated with empty strings to maintain schema integrity.
Data Verification: Cross-referenced extracted data with the original Addgene webpages to verify accuracy.
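The duplicate-removal and field-validation checks can be sketched as below. The helper name `dedupe_and_fill` and the three-field `REQUIRED_FIELDS` subset are illustrative assumptions; the actual recipe may implement these checks differently.

```python
# Illustrative subset of the schema; the real list of required fields is longer.
REQUIRED_FIELDS = ("Name", "ID", "Purpose")


def dedupe_and_fill(records, required=REQUIRED_FIELDS):
    """Keep the first record per Addgene ID and fill absent fields with ''."""
    seen = set()
    result = []
    for record in records:
        plasmid_id = record.get("ID", "")
        # Only records with a non-empty ID take part in deduplication.
        if plasmid_id and plasmid_id in seen:
            continue
        if plasmid_id:
            seen.add(plasmid_id)
        filled = dict(record)  # copy so the input is not mutated
        for field in required:
            filled.setdefault(field, "")
        result.append(filled)
    return result


rows = [{"ID": "1", "Name": "a"}, {"ID": "1", "Name": "b"}, {"ID": "2"}]
print(dedupe_and_fill(rows))
```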

Dataset Schema

Each field of a plasmid sample is stored either as a string or as a dictionary object with consistent nested fields for simple parsing. The metadata is gathered from webpage annotations; see, for example, Plasmid #48138. GenBank Raw is the one field that can be parsed to obtain rich sequence feature annotations.

Core Fields

Name (string): Plasmid's name.
ID (string): Unique Addgene identifier.
Purpose (string): Function of the plasmid.
Depositing Lab (string): Lab or researcher who deposited the plasmid.
Flame (string): Popularity status. Options: "High", "Medium", "Low".
GenBank File (string): URL to download GenBank file.
Sequence Type (string): Sequence completeness. Options: "full", "partial".
GenBank Raw (string): Raw GenBank sequence as text.

Nested Fields

Backbone (object): Information about the plasmid backbone.
- Vector Backbone (string): Name of the vector backbone.
- Backbone Size w/o Insert (bp) (string): Size of the backbone without inserts.
- Vector Type (string): Type of vector.
- Selectable Markers (string): Resistance or selection markers.
Gene/Insert X (object): Information about each gene or insert, where X is 1, 2, or 3. Few plasmids have more than three gene inserts; for those that do, information on the additional inserts is not collected. However, full gene/insert information can be accessed by directly parsing the GenBank Raw column.
- Gene/Insert Name (string): Name of the gene/insert.
- Species (string): Origin species.
- Insert Size (bp) (string): Size of the insert.
- Mutation (string): Mutation details.
- Entrez Gene (string): Link or ID to Entrez Gene.
- Promoter (string): Promoter sequence.
- Tag / Fusion Protein (string): Tags or fusion proteins.
Cloning Information for Gene/Insert X (object): Cloning details for each insert.
- Cloning Method (string): Method used for cloning.
- 5′ Cloning Site (string): 5' restriction site.
- 3′ Cloning Site (string): 3' restriction site.
- 5′ Sequencing Primer (string): 5' sequencing primer.
- 3′ Sequencing Primer (string): 3' sequencing primer.
Growth in Bacteria (object): Information about bacterial growth.
- Bacterial Resistance(s) (string): Resistance markers.
- Growth Temperature (string): Optimal temperature for growth.
- Growth Strain(s) (string): Recommended strains.
- Copy Number (string): Plasmid copy number.
Terms and Licenses (object): Licensing information. Only sequences available for academic and nonprofit use were scraped. No sequences were obtained for plasmids that require acceptance of Addgene's Affinity Reagent Sequence Policy. Please see Addgene's detailed Terms of Use.
References (object): Bibliographic information.
- Title (string): Title of the publication.
- Authors (string): List of authors.
- Publication (string): Journal or source.
- DOI (string): Digital Object Identifier.
- PubMed Link (string): URL to PubMed entry.
- PubMed ID (string): Unique PubMed identifier.

Additional Fields

Depositor Comments (string): Notes or comments from the depositor.
Notes (string): Any other relevant information.

How to extract richly-annotated GenBank sequences

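from io import StringIO

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import polaris as po


def read_genbank(record: str) -> SeqRecord:
    """Parse one raw GenBank record (as text) into a Biopython SeqRecord."""
    return SeqIO.read(StringIO(record), "genbank")


# Load the dataset from the Polaris Hub and parse the first row's GenBank text.
dataset = po.load_dataset("wconnell/openplasmid-v1")
gb_raw = dataset.get_data(row=0, col="GenBank Raw")
gb_parsed = read_genbank(gb_raw)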

Dataset Source

Addgene

Dataset Curation

https://github.com/polaris-hub/polaris-recipes/org-OpenPlasmid/

Dataset Completeness

I confirm that I filled out at least the readme, source and curation_reference fields for my Polaris dataset.

Anything else we should know?

No response

wconnell (Author) · 2024-10-24T22:17:34Z

I opened a PR for the dataset build logic recipe: #24

zhu0619 (Maintainer) · 2024-11-04T15:55:33Z

@wconnell Thank you for the submission.
We are in the process of reviewing the dataset to ensure it follows the Polaris criteria. We will update you on any significant findings or necessary adjustments once the review is complete.

stwhitfield · 2024-11-05T13:56:00Z

Thanks for your submission! I've been tagged as a domain expert to provide feedback.

Applications in real-world drug discovery:
Plasmids are circular pieces of DNA that we can use to tell bacteria, yeast, or mammalian cells to make specific proteins of interest. As such, they can be used to manufacture monoclonal antibodies, cell-based therapeutics, mRNA-based vaccines, and more, and so are generally relevant to drug discovery. This dataset is a potentially useful resource for generative design and plasmid optimization, but it does not contain functional outputs/outcomes (these are hard to get without going into the papers referenced in the deposition) and is therefore of limited use in a machine-learning setting.
Many of the applications outlined in the readme above could be addressed more effectively with database searching techniques; machine learning is not desperately needed. It is also unclear whether the data are rich enough to contribute meaningfully to gene therapy research, protein production, or synthetic biology, since they lack useful endpoints/measurements.

Stems from consistent, original source:
True. Addgene is a well-known resource that is heavily leveraged by the molecular biology research community.

Dataset does not contain obvious errors/ambiguous data:
Unclear. Were the data deduplicated at the sequence level, not just the accession level? There may not be much diversity in the dataset, since many plasmids are built on the same components. This deposition would benefit from some minimal exploratory data analysis to assess how useful the dataset would really be for machine learning.
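A sequence-level duplicate check of the kind asked about here could start from something like the sketch below. The function names are hypothetical, and the approach is deliberately simple: it finds only exact matches after removing whitespace and case differences, so it would miss plasmids that are the same circular sequence written from a different start position or as the reverse complement.

```python
import hashlib
from collections import defaultdict


def sequence_key(seq: str) -> str:
    """Hash a sequence after removing whitespace and normalizing case."""
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_sequence(id_to_seq: dict) -> list:
    """Return lists of plasmid IDs that share an identical sequence."""
    groups = defaultdict(list)
    for plasmid_id, seq in id_to_seq.items():
        groups[sequence_key(seq)].append(plasmid_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Toy input; real sequences would come from the parsed GenBank records.
print(group_by_sequence({"1": "ATGC", "2": "atg c", "3": "GGCC"}))
```

The fraction of IDs that fall into such groups would give a first, rough estimate of redundancy in the dataset.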

The nested nature of the dataset would make it tricky to work with. It may be more useful if it were broken up into more columns (e.g. Backbone → Vector backbone, backbone_size, vector_type, selectable_markers). At minimum, I would expect to see separate columns for DNA sequence and copy number.
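Flattening one level of nesting into separate columns could be sketched as follows. The helper names and the snake_case naming rule are assumptions for illustration (the resulting names differ from the shorter ones suggested above), and the record values are made-up placeholders rather than real dataset rows.

```python
import re


def to_column_name(*parts: str) -> str:
    """Join name parts and normalize them to a snake_case column name."""
    joined = "_".join(parts).lower()
    return re.sub(r"[^a-z0-9]+", "_", joined).strip("_")


def flatten_record(record: dict) -> dict:
    """Turn one level of nested dictionaries into prefixed flat columns."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[to_column_name(key, sub_key)] = sub_value
        else:
            flat[to_column_name(key)] = value
    return flat


# Placeholder record that follows the README's field names.
example = {
    "ID": "example-id",
    "Backbone": {"Vector Type": "example-type", "Selectable Markers": ""},
}
print(flatten_record(example))
```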

The full sequence is not parsable and would need to be cleaned up for almost any use. See the current state: attagggtg 8041 atggttcacg tagtgggcca tcgccctgat agacggtttt tcgccctttg acgttggagt 8101 ccacgttctt taatagtgga ctcttgttcc aaactggaac aacactcaac cctat.
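The position numbers and spaces in this excerpt are the standard layout of a GenBank ORIGIN block. If the raw text parses with Biopython as in the README, the `seq` attribute of the resulting record should already hold the plain sequence. As a dependency-free alternative, a minimal sketch for pulling a contiguous sequence out of the raw text is shown below; the function name is hypothetical, and it assumes the record's line breaks are intact and that the sequence sits between the ORIGIN line and the closing `//`.

```python
import re


def extract_origin_sequence(genbank_text: str) -> str:
    """Return the sequence from a GenBank ORIGIN block as one lowercase string."""
    parts = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if in_origin and line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Drop position numbers and spaces; keep only the letters.
            parts.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(parts).lower()


# Tiny hand-written record fragment, not a real dataset entry.
sample = (
    "LOCUS       demo\n"
    "ORIGIN\n"
    "        1 atggttcacg tagtgggcca\n"
    "       21 tcgccctgat\n"
    "//\n"
)
print(extract_origin_sequence(sample))
```

Storing this cleaned string in its own column would address the request for a separate DNA sequence column.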

Comments:
With considerable reformatting and simplifying of the dataset, one could envision these possible machine learning tasks:

[Supervised binary classification] Predict the copy number of the plasmid (high, low)
[Supervised multi-task classification] Predict the vector type/use (mammalian, yeast, bacterial expression)
[Supervised multi-task classification] Predict selectable markers
[Generative] Annotate a plasmid’s functional regions (requires cleanup of GenBank Raw column)
[Generative] Create the smallest possible version of a plasmid with certain constraints.
However, other than the last item, all of these tasks are essentially 'solved' in the drug-discovery context and machine learning would provide minimal impact.
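For the first task, the free-text Copy Number field would need to be mapped to a binary label. A minimal sketch is given below; the function name is hypothetical, and it assumes that values contain the word "high" or "low", which should be checked against the actual column contents before use.

```python
from typing import Optional


def copy_number_label(raw: str) -> Optional[str]:
    """Map a free-text copy number value to 'high', 'low', or None."""
    text = raw.strip().lower()
    has_high = "high" in text
    has_low = "low" in text
    # Treat missing or ambiguous values (neither or both words) as unlabeled.
    if has_high == has_low:
        return None
    return "high" if has_high else "low"


for value in ("High Copy", "Low Copy", "Unknown", ""):
    print(repr(value), "->", copy_number_label(value))
```

Rows that map to None would be excluded from a supervised task rather than assigned a default class.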

Given the above considerations, I do not recommend certification of this dataset in its current form.

2 replies

wconnell (Author) · Nov 7, 2024

Hi, thanks for the review.

"Many of the applications outlined in the readme above could be addressed more effectively with database searching techniques; machine learning is not desperately needed. It is also unclear whether the data are rich enough to contribute meaningfully to gene therapy research, protein production, or synthetic biology, since they lack useful endpoints/measurements."

I’d like to clarify: Are you solely interested in certifying datasets that directly accommodate discriminative machine learning tasks?

cwognum (Maintainer) · Nov 7, 2024

Hi @wconnell,

First of all, thanks for the submission!

To answer your question: The goal of Polaris is to host ML-ready resources, but it doesn't necessarily have to be discriminative ML. Does that answer your question? Happy to elaborate further if not!
