Genestrip-DB - a selection of databases for Genestrip
This project contains some configuration files and two scripts in order to generate databases and indexes for metagenomic analysis via Genestrip.
Genestrip-DB is free for any kind of use. However, the associated software, Genestrip, has a more restrictive license.
Genestrip-DB requires Maven 2 or 3 and JRE 11 or higher.
To build the databases and indexes, `cd` to the installation directory `genestrip-db`. Given a matching Maven and JDK installation, `sh bin/makedbs.sh` will generate 9 databases (and indexes) of different sizes. The generation process is resource intensive and may take several days for all databases.
Generating the bacterial databases is particularly time consuming.
Your machine should have:
- 1.5 TB of free disk space - mainly for downloading genomes from NCBI,
- at least 8 cores - the more the better (some phases of the database generation keep 32 cores 100% busy),
- 48 GB of main memory,
- a high bandwidth Internet connection.
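
The requirements above can be checked before starting a multi-day build. The following is a minimal sketch, not part of Genestrip-DB: the function names are hypothetical, the thresholds are copied from the list above, `getconf _NPROCESSORS_ONLN` is assumed to be available (it is on Linux and macOS), and the memory check assumes Linux because it reads `/proc/meminfo`.

```shell
#!/bin/sh
# Sketch: compare this machine against the requirements listed above.
# All function names are hypothetical; thresholds come from the list.

REQ_DISK_GB=1500   # 1.5 TB of free disk space
REQ_CORES=8        # at least 8 cores
REQ_MEM_GB=48      # 48 GB of main memory

# Free space in GB on the filesystem holding directory $1 (POSIX df format).
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Number of online cores (assumes getconf supports _NPROCESSORS_ONLN).
core_count() {
    getconf _NPROCESSORS_ONLN
}

# Total main memory in GB. Linux only; prints 0 where /proc/meminfo is missing.
# MemTotal is slightly below the installed RAM, so treat a near miss as a pass.
total_mem_gb() {
    if [ -r /proc/meminfo ]; then
        awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
    else
        echo 0
    fi
}

# Print one line per requirement, ending in "ok" or "LOW".
report() {
    label=$1; have=$2; need=$3
    if [ "$have" -ge "$need" ]; then status=ok; else status=LOW; fi
    printf '%-12s have=%s need=%s %s\n' "$label" "$have" "$need" "$status"
}

check_requirements() {
    report "disk (GB)" "$(free_disk_gb .)" "$REQ_DISK_GB"
    report "cores" "$(core_count)" "$REQ_CORES"
    report "memory (GB)" "$(total_mem_gb)" "$REQ_MEM_GB"
}

check_requirements
```

Run it from the `genestrip-db` directory so that the disk check applies to the filesystem the build will write to. Bandwidth is not checked here.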
The databases are based on and compatible with Genestrip v2.2.
All databases are genomic or based on total RNA.
Name | Category | Description | Database disk size | Sources and references |
---|---|---|---|---|
babesia | protozoa | Babesia species from the RefSeq and Genbank which are potentially pathogenic for humans | N/A | General knowledge |
borrelia | bacteria | Borrelia species from the RefSeq and Genbank | N/A | General knowledge |
borrelia_plasmid | plasmid | Borrelia species from the RefSeq and Genbank | N/A | General knowledge |
tick-borne | bacteria | Tick-borne infections from the RefSeq which are potentially pathogenic for humans | N/A | General knowledge, partially collected from Armin Labs |
tick-borne_rna | bacteria | Same as tick-borne but based on total RNA | N/A | |
human_virus2 | viral | Viruses from the RefSeq and Genbank which are potentially pathogenic for humans | N/A | Extracted from the Viral Zone |
parasites | invertebrate | Parasitic invertebrate animals from the RefSeq which are potentially pathogenic for humans | N/A | Collected from the book "Die Parasiten des Menschen" by Heinz Mehlhorn |
protozoa | protozoa | Protozoan parasites from the RefSeq which are potentially pathogenic for humans | N/A | Collected from the German book "Die Parasiten des Menschen" by Heinz Mehlhorn |
protozoa_rna | protozoa | Same as protozoa but based on total RNA | N/A | |
vineyard | fungi | Fungal infections of grapevine taken from the RefSeq and Genbank | N/A | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
plasmopara | plant | Peronosporales as infections of grapevine taken from the RefSeq and Genbank | N/A | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
fungal_infect | fungi | Fungal infections of humans from the RefSeq and Genbank | N/A | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |

Note that Genestrip's `updateddb` phase accounts for unspecific k-mers and largely avoids false positive counts during matches.
To further reduce false positives, all databases except for `vineyard`, `chronicb-rna` and `protozoa-rna` are built such that k-mers also occurring in the human genome are pushed to the least common ancestor.
The script `bin/matchticks.sh` runs the Genestrip goal `match` for 8 fastq files taken from this publication. To do so, the fastq files are streamed from the corresponding NCBI server. As expected, Genestrip finds DNA from Borrelia and other tick-borne infections.
If you don't want to generate them yourself, the databases and indexes can also be downloaded from genestrip.it.hs-heilbronn.de.
The `projects` folder corresponds to the state of this project's `projects` folder after the scripts `bin/makedbs.sh` and `bin/matchticks.sh` have run successfully on RefSeq Release 230.