Replies: 3 comments 2 replies
-
I opened a PR for the dataset build logic recipe: #24
-
@wconnell Thank you for the submission.
-
Thanks for your submission! I've been tagged as a domain expert to provide feedback.
Applications in real-world drug discovery:
Stems from consistent, original source:
Dataset does not contain obvious errors/ambiguous data:
The nested nature of the dataset would make working with it tricky. It may be more useful if it were broken up into more columns (e.g. Backbone → Vector backbone, backbone_size, vector_type, selectable_markers). At minimum, I would expect to see separate columns for DNA sequence and copy number. The full sequence is not parsable and would need to be cleaned up for almost any use. See the current state: attagggtg 8041 atggttcacg tagtgggcca tcgccctgat agacggtttt tcgccctttg acgttggagt 8101 ccacgttctt taatagtgga ctcttgttcc aaactggaac aacactcaac cctat.
Comments:
Given the above considerations, I do not recommend certification of this dataset in its current form.
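One way to act on the suggestion of breaking the nested fields into separate columns is sketched below. This is a minimal illustration, not the dataset's actual build logic: the helper name `flatten_record`, the dotted column naming, and the example values are all assumptions. Only the field names are taken from the schema in the README.

```python
from __future__ import annotations

from typing import Any


def flatten_record(record: dict[str, Any], sep: str = ".") -> dict[str, Any]:
    """Flatten nested dictionaries into one level of column names.

    A nested entry such as record["Backbone"]["Vector Type"] becomes the
    single column "Backbone.Vector Type". Non-dict values are copied as-is.
    """
    flat: dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so that deeper nesting is flattened the same way.
            for sub_key, sub_value in flatten_record(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Hypothetical record: field names follow the README schema,
# but every value here is invented for illustration.
example = {
    "Name": "example-plasmid",
    "Backbone": {
        "Vector Backbone": "example-backbone",
        "Vector Type": "example-type",
    },
    "Growth in Bacteria": {"Copy Number": "example-copy-number"},
}

flat = flatten_record(example)
print(sorted(flat))
```

Each flattened dictionary can then be used as one row of a table, so that fields like copy number get their own column.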
-
Polaris Link
https://polarishub.io/datasets/wconnell/openplasmid-v1
README
Overview
OpenPlasmid is a comprehensive dataset containing detailed information on approximately 150,000 plasmids originally deposited on Addgene, a nonprofit plasmid repository. This dataset includes textual descriptions, depositor and study information, annotated GenBank sequences, and more, facilitating research in molecular biology, genetics, and drug discovery. You can find rich educational resources about plasmids and their use in molecular biology and therapeutics on Addgene's blog webpage.
Applications in Drug Discovery
Data Source and Generation
Data Curation and Quality Assurance
Consistency and Standardization
Error Checking
Dataset Schema
Each plasmid sample is structured either as a string or a dictionary object with consistent nested fields for simple parsing. The metadata is gathered from webpage annotations; see, for example, Plasmid #48138. The unique field that can be loaded for rich sequence feature annotations is GenBank Raw.
Core Fields
- Name (string): Plasmid's name.
- ID (string): Unique Addgene identifier.
- Purpose (string): Function of the plasmid.
- Depositing Lab (string): Lab or researcher who deposited the plasmid.
- Flame (string): Popularity status. Options: "High", "Medium", "Low".
- GenBank File (string): URL to download GenBank file.
- Sequence Type (string): Sequence completeness. Options: "full", "partial".
- GenBank Raw (string): Raw GenBank sequence as text.
Nested Fields
- Backbone (object): Information about the plasmid backbone.
  - Vector Backbone (string): Name of the vector backbone.
  - Backbone Size w/o Insert (bp) (string): Size of the backbone without inserts.
  - Vector Type (string): Type of vector.
  - Selectable Markers (string): Resistance or selection markers.
- Gene/Insert X (object): Information about each gene or insert. Entries 1-3 can exist. Few plasmids have more than three gene inserts; for those that do, this information is not collected. However, full gene/insert information can be accessed by directly parsing the GenBank Raw column.
  - Gene/Insert Name (string): Name of the gene/insert.
  - Species (string): Origin species.
  - Insert Size (bp) (string): Size of the insert.
  - Mutation (string): Mutation details.
  - Entrez Gene (string): Link or ID to Entrez Gene.
  - Promoter (string): Promoter sequence.
  - Tag / Fusion Protein (string): Tags or fusion proteins.
- Cloning Information for Gene/Insert X (object): Cloning details for each insert.
  - Cloning Method (string): Method used for cloning.
  - 5′ Cloning Site (string): 5′ restriction site.
  - 3′ Cloning Site (string): 3′ restriction site.
  - 5′ Sequencing Primer (string): 5′ sequencing primer.
  - 3′ Sequencing Primer (string): 3′ sequencing primer.
- Growth in Bacteria (object): Information about bacterial growth.
  - Bacterial Resistance(s) (string): Resistance markers.
  - Growth Temperature (string): Optimal temperature for growth.
  - Growth Strain(s) (string): Recommended strains.
  - Copy Number (string): Plasmid copy number.
- Terms and Licenses (object): Licensing information. Only sequences available for academic and nonprofit use were scraped. No sequences were obtained for plasmids that require acceptance of Addgene's Affinity Reagent Sequence Policy. Please see Addgene's detailed Terms of Use.
- References (object): Bibliographic information.
  - Title (string): Title of the publication.
  - Authors (string): List of authors.
  - Publication (string): Journal or source.
  - DOI (string): Digital Object Identifier.
  - PubMed Link (string): URL to PubMed entry.
  - PubMed ID (string): Unique PubMed identifier.
Additional Fields
- Depositor Comments (string): Notes or comments from the depositor.
- Notes (string): Any other relevant information.
How to extract richly-annotated GenBank sequences
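As a rough sketch of what extraction could look like: if the GenBank Raw text follows the standard GenBank flat-file layout (a FEATURES table, then an ORIGIN block terminated by //), both the bare sequence and the feature annotations can be pulled out with the Python standard library alone. The function names `origin_sequence` and `parse_features` and the tiny example record are hypothetical and only for illustration; this is not necessarily how the dataset was built or how its author loads it.

```python
from __future__ import annotations

import re


def origin_sequence(genbank_text: str) -> str:
    """Return the sequence from the ORIGIN block without numbering or spaces."""
    parts: list[str] = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop position numbers and whitespace, keep only letters.
            parts.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(parts)


def parse_features(genbank_text: str) -> list[dict]:
    """Collect feature keys, locations, and qualifiers from the FEATURES table.

    Assumes the usual layout: feature keys indented 5 columns, qualifiers
    and continuation lines indented 21 columns.
    """
    features: list[dict] = []
    in_features = False
    last_qualifier = None
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features or not line.strip():
            continue
        if not line.startswith(" "):
            break  # next top-level section, e.g. ORIGIN
        indent = len(line) - len(line.lstrip())
        body = line.strip()
        if indent < 21:
            # New feature: "<key> <location>"
            fields = body.split(None, 1)
            location = fields[1].strip() if len(fields) > 1 else ""
            features.append({"key": fields[0], "location": location, "qualifiers": {}})
            last_qualifier = None
        elif not features:
            continue
        elif body.startswith("/"):
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
            last_qualifier = name
        elif last_qualifier is None:
            features[-1]["location"] += body  # wrapped location
        else:
            quals = features[-1]["qualifiers"]  # wrapped qualifier value
            quals[last_qualifier] = (quals[last_qualifier] + " " + body).strip('"')
    return features


# Hypothetical miniature record, invented for illustration only.
PAD = " " * 21
record = "\n".join([
    "LOCUS       EXAMPLE      30 bp    DNA     circular",
    "FEATURES             Location/Qualifiers",
    "     promoter        1..10",
    PAD + '/label="example promoter"',
    "     CDS             11..30",
    PAD + '/gene="exampleGene"',
    PAD + '/note="a note that wraps',
    PAD + 'onto a second line"',
    "ORIGIN",
    "        1 attagggtga tggttcacgt agtgggccat",
    "//",
])

sequence = origin_sequence(record)
features = parse_features(record)
print(len(sequence), [f["key"] for f in features])
```

This hand-rolled parser ignores many details of the format. For real work a dedicated parser is likely the safer choice, for example Biopython's `Bio.SeqIO.read(handle, "genbank")` applied to the GenBank Raw text wrapped in `io.StringIO`.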
Dataset Source
Addgene
Dataset Curation
https://github.com/polaris-hub/polaris-recipes/org-OpenPlasmid/
Dataset Completeness
readme, source and curation_reference fields for my Polaris dataset.
Anything else we should know?
No response