Reproducible hybrid metagenomics with MAG recovery and strain-level resolution

# StrainMake


## Setup

### Local version

You can simply clone the repository where you will run the analysis:

```shell
git clone https://github.com/UMMISCO/strainmake.git
```

Make sure you have at least Snakemake and Conda installed.

### Using Docker

You can use the Docker image and run everything through a Docker container (Snakemake is the entrypoint):

```shell
docker pull bapt931894/strainmake:latest
# For example, the following command prints the help of Snakemake
# (Snakemake is the workflow management system)
docker run bapt931894/strainmake -h
```

Note that you should mount volumes to persist the generated data. The `-v` options must come before the image name, otherwise Docker passes them as arguments to the entrypoint:

```shell
docker run \
    -v /where/to/keep/results:/opt/strainmake/results \
    -v /where/to/keep/logs:/opt/strainmake/logs \
    -v /where/to/keep/benchmarks:/opt/strainmake/benchmarks \
    bapt931894/strainmake \
    ...
```

Configuration values, paths, etc. should also be consistent with the container's filesystem layout.

## How to run

A step-by-step usage example is available on the wiki.

## Overview of integrated tools

### Quality control and preprocessing

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| fastp | 2018 | Yes | https://github.com/OpenGene/fastp | Yes |
| FastQC | 2010 | Yes | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | Yes |

### Human decontamination

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| bowtie2 | 2012 | Yes | https://github.com/BenLangmead/bowtie2 | Yes |

Reads are mapped against a human genome assembly to identify and remove host-derived sequences.

### Assembly

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| MEGAHIT | 2015 | Yes | https://github.com/voutcn/megahit | Yes |
| (Meta)SPAdes | 2017 | Yes | https://github.com/ablab/spades | Yes |
| (Meta)Flye | 2020 | Yes | https://github.com/mikolmogorov/Flye | Yes |
| HyLight | 2024 | Yes | https://github.com/LuoGroup2023/HyLight | Yes |

### Assembly quality assessment

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| QUAST | 2013 | Yes | https://github.com/ablab/quast | Yes |

### Non-redundant gene catalog

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| Prodigal | 2010 | Yes | https://github.com/hyattpd/Prodigal | Yes |
| CD-HIT | 2012 | Yes | https://github.com/weizhongli/cdhit | Yes |

### Binning

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| MetaBAT 2 | 2019 | Yes | https://bitbucket.org/berkeleylab/metabat/src/master/ | Yes |
| SemiBin2 | 2023 | Yes | https://github.com/BigDataBiology/SemiBin | Yes |
| VAMB | 2019 | Yes | https://github.com/RasmussenLab/vamb | Yes |

### Bin quality assessment

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| CheckM2 | 2023 | Yes | https://github.com/chklovski/CheckM2 | Yes |

### Bin refinement

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| Binette | 2024 | Yes | https://github.com/genotoul-bioinfo/Binette | Yes |

### Bin post-processing (gene prediction, functional annotation, and taxonomic classification)

#### Taxonomic annotation

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| GTDB-Tk | 2022 | Yes | https://github.com/Ecogenomics/GTDBTk | Yes |

#### Dereplication

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| dRep | 2017 | Yes | https://github.com/MrOlm/drep | Yes |

#### Gene prediction

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| Prodigal | 2010 | Yes | https://github.com/hyattpd/Prodigal | Yes |

#### Coverage estimation

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| CheckM | 2015 | Yes | https://github.com/Ecogenomics/CheckM | Yes |

#### Metabolic modeling

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| CarveMe | 2018 | Yes | https://github.com/cdanielmachado/carveme | Yes |

### Taxonomic profiling

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| MetaPhlAn | 2023 | Yes | https://github.com/biobakery/MetaPhlAn | Yes |
| Meteor2 | 2024 | Yes | https://github.com/metagenopolis/meteor | Yes |
| StrainScan | 2023 | Yes | https://github.com/liaoherui/StrainScan | Yes |

### Strain profiling

| Tool | First release | Conda available? | Link | Implemented? |
|------|---------------|------------------|------|--------------|
| inStrain | 2021 | Yes | https://github.com/MrOlm/inStrain | Yes (short reads only) |
| Floria | 2024 | Yes | https://github.com/bluenote-1577/floria | Yes (short reads only) |

## More scripts

The `workflow/scripts/other_scripts` directory contains scripts for processing some of the results produced by this pipeline.

`skani_analysis.py` performs pairwise comparison of bins using skani (https://doi.org/10.1038/s41592-023-02018-3). It can also produce a Venn diagram for results derived from dereplicated bins.

```text
usage: skani_analysis.py compare [-h] --bins {refined,dereplicated} --tmp TMP --output_file OUTPUT_FILE --tsv_output TSV_OUTPUT --ani_threshold ANI_THRESHOLD --json_output JSON_OUTPUT --venn_diagram
                                 VENN_DIAGRAM --cpu CPU

options:
  -h, --help            show this help message and exit
  --bins {refined,dereplicated}
                        Type of bins to analyze
  --tmp TMP             Temporary directory for intermediate files
  --output_file OUTPUT_FILE
                        File to save the output results (Skani matrix)
  --tsv_output TSV_OUTPUT
                        File to save the Skani matrix in TSV format
  --ani_threshold ANI_THRESHOLD
                        Minimal ANI to consider two bins as the same
  --json_output JSON_OUTPUT
                        File to save the bins similarity results according to assembly methods (JSON)
  --venn_diagram VENN_DIAGRAM
                        Where to save the Venn diagram
  --cpu CPU             Number of CPU cores to use
```
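To illustrate how the resulting ANI matrix can be consumed downstream, here is a minimal sketch of thresholding pairwise ANI values. The TSV layout below (bin names in the first row and column, ANI percentages in the cells) is an assumption for illustration, not skani's exact output format:

```python
import csv
import io

# Hypothetical square ANI matrix in TSV form (layout is an assumption)
tsv = """bin\tbin_A\tbin_B\tbin_C
bin_A\t100.0\t96.2\t80.1
bin_B\t96.2\t100.0\t79.5
bin_C\t80.1\t79.5\t100.0
"""

def pairs_above_threshold(tsv_text, ani_threshold=95.0):
    """Return bin pairs whose ANI meets the threshold, i.e. likely the same genome."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    names = rows[0][1:]
    hits = []
    for row in rows[1:]:
        a = row[0]
        for b, ani in zip(names, row[1:]):
            # 'a < b' keeps each unordered pair once and skips self-comparisons
            if a < b and float(ani) >= ani_threshold:
                hits.append((a, b, float(ani)))
    return hits

print(pairs_above_threshold(tsv))  # [('bin_A', 'bin_B', 96.2)]
```

With a 95% ANI threshold (the usual species-level cutoff used for dereplication), only `bin_A` and `bin_B` would be considered the same genome here.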

We can then check which bins were recovered from only one assembly method:

```text
usage: skani_analysis.py check [-h] --json_results JSON_RESULTS --tsv_output TSV_OUTPUT --assembly {unique,megahit,metaflye,metaspades,hybridspades}

options:
  -h, --help            show this help message and exit
  --json_results JSON_RESULTS
                        Path to the JSON produced using "skani_analysis.py compare"
  --tsv_output TSV_OUTPUT
                        File to save the results in TSV format
  --assembly {unique,megahit,metaflye,metaspades,hybridspades}
                        Choose 'unique' to get a list of bins that were not found from at least a second assembly method, at the given ANI threshold you used with the 'compare' subcommand. Choose any other possible assembly method to
                        get a list of bins recovered from the given assembly (it won't return the redundant bins coming from other assemblies)
```
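The 'unique' logic can be pictured with a small sketch. The dictionary below is a hypothetical stand-in for the JSON written by the `compare` subcommand (the real schema may differ):

```python
# Hypothetical mapping from each bin to the assembly methods whose bins it
# matched at the chosen ANI threshold (structure is an assumption, not the
# actual JSON schema written by skani_analysis.py)
similarity = {
    "megahit_bin1":    ["megahit", "metaspades"],
    "metaflye_bin7":   ["metaflye"],
    "metaspades_bin3": ["metaspades", "megahit"],
}

def unique_bins(sim):
    """Bins matched by exactly one assembly method, i.e. not recovered elsewhere."""
    return sorted(b for b, methods in sim.items() if len(set(methods)) == 1)

print(unique_bins(similarity))  # ['metaflye_bin7']
```

Here `metaflye_bin7` only matches bins from its own assembly, so it would be reported as unique to MetaFlye.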

`calculate_binned_contigs.py` computes the binned rate of contigs, i.e. the percentage of contigs from an assembly that end up in at least one bin. The script reports this for each sample and its generated contigs, for a given assembly method.

```text
usage: calculate_binned_contigs.py [-h] --assembler {megahit,metaflye,hybridspades,metaspades} --results-dir RESULTS_DIR --type {binette,dereplicated_and_filtered} --tsv_output_binned_contigs TSV_OUTPUT_BINNED_CONTIGS
                                   --tsv_output_binned_rate TSV_OUTPUT_BINNED_RATE

Count assembly contigs assigned to a bin.

options:
  -h, --help            show this help message and exit
  --assembler {megahit,metaflye,hybridspades,metaspades}
                        The assembly we should use
  --results-dir RESULTS_DIR
                        Folder storing the pipeline results. Typically named 'results': /path/to/pipeline/results
  --type {binette,dereplicated_and_filtered}
                        Type of bins
  --tsv_output_binned_contigs TSV_OUTPUT_BINNED_CONTIGS
                        File to save the list of contigs and their number of assignation in bins in TSV format
  --tsv_output_binned_rate TSV_OUTPUT_BINNED_RATE
                        File to save the binning rate of contigs in TSV format
```
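The binned rate itself is a simple ratio; a minimal sketch (contig and bin names are made up for illustration):

```python
def binned_rate(assembly_contigs, bins):
    """Fraction of assembly contigs present in at least one bin.

    bins maps a bin name to the set of contig names it contains.
    """
    binned = set().union(*bins.values()) if bins else set()
    n_binned = sum(1 for c in assembly_contigs if c in binned)
    return n_binned / len(assembly_contigs)

contigs = ["c1", "c2", "c3", "c4"]
bins = {"bin1": {"c1", "c2"}, "bin2": {"c2", "c3"}}  # c2 appears in two bins
print(binned_rate(contigs, bins))  # 0.75
```

Note that a contig assigned to several bins still counts once toward the rate, which is why the script also reports per-contig assignment counts separately.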

## Using preprocessed reads

If your sequencing reads have already been preprocessed, you can use the `already_preprocessed_seq.py` script to set up the results directory so that the pipeline starts directly from the assembly step, using your preprocessed FASTQ files.

To do this, provide a TSV file formatted like `config_data.tsv`. The script creates symbolic links in the results folder that point to your preprocessed FASTQ files, saving storage space by avoiding unnecessary duplication.

This way, the pipeline can use your preprocessed data without processing the FASTQ files again.
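The symlinking idea can be sketched as follows. The `sample` and `fastq` column names are assumptions for illustration; check `config_data.tsv` for the real layout:

```python
import csv
import os
import tempfile

def link_preprocessed(samples_tsv, results_dir):
    """Create symlinks in results_dir pointing at already-preprocessed FASTQ
    files, so the pipeline can start from the assembly step without copying data.

    Column names 'sample' and 'fastq' are hypothetical stand-ins for the
    actual config_data.tsv layout.
    """
    os.makedirs(results_dir, exist_ok=True)
    with open(samples_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            link = os.path.join(results_dir, row["sample"] + ".fastq.gz")
            if not os.path.islink(link):
                os.symlink(row["fastq"], link)

# Tiny demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    fq = os.path.join(tmp, "reads.fastq.gz")
    open(fq, "w").close()
    tsv = os.path.join(tmp, "samples.tsv")
    with open(tsv, "w") as fh:
        fh.write("sample\tfastq\nS1\t%s\n" % fq)
    link_preprocessed(tsv, os.path.join(tmp, "results"))
    print(os.path.islink(os.path.join(tmp, "results", "S1.fastq.gz")))  # True
```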

## Help with the configuration file

An interactive CLI for generating the YAML configuration file, including only the relevant pipeline sections, is available as `config_generator.py`.

```text
 Usage: config_generator.py [OPTIONS]

 Generate a configuration YAML file for the pipeline.

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --samples                   PATH  Path to sample metadata file needed by the pipeline (TSV)    │
│                                      [default: None] [required]                                   │
│    --lr-seq-format             TEXT  Format of long reads: 'fastq' or 'fasta' [default: fastq]    │
│    --output                    PATH  Path to write the final YAML [default: config.yaml]          │
│    --install-completion              Install completion for the current shell.                    │
│    --show-completion                 Show completion for the current shell, to copy it or         │
│                                      customize the installation.                                  │
│    --help                            Show this message and exit.                                  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
```

Then, based on the generated YAML, use `snakefile_generator.py` to produce the Snakefile describing the data to be generated by the pipeline.

```text
 Usage: snakefile_generator.py [OPTIONS]

 Generate a configuration YAML file for the pipeline.

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --config                    PATH  Path to the generated YAML configuration file                │
│                                      [default: None] [required]                                   │
│    --output                    PATH  Path to write the Snakefile [default: Snakefile]             │
│    --install-completion              Install completion for the current shell.                    │
│    --show-completion                 Show completion for the current shell, to copy it or         │
│                                      customize the installation.                                  │
│    --help                            Show this message and exit.                                  │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
```

## MultiQC report

Use the `generate_multiqc_report.py` CLI:

```text
 Usage: generate_multiqc_report.py [OPTIONS]

 Generates MultiQC report(s) based on the pipeline results collected from the specified directories.

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --results-dir                 TEXT     Directory containing the results of the pipeline.       │
│                                           [default: None] [required]                              │
│ *  --log-dir                     TEXT     Directory containing the logs of the pipeline.          │
│                                           [default: None] [required]                              │
│ *  --output-dir                  TEXT     Directory where the MultiQC report will be generated.   │
│                                           [default: None] [required]                              │
│    --ani                         INTEGER  ANI threshold used for dereplicating MAGs.              │
│                                           [default: 95]                                           │
│    --multiqc-config              TEXT     Path to the MultiQC configuration file.                 │
│                                           [default: multiqc_config.yaml]                          │
│    --dry-run             -d               If set, only prints the MultiQC command without         │
│                                           executing it.                                           │
│    --install-completion                   Install completion for the current shell.               │
│    --show-completion                      Show completion for the current shell, to copy it or    │
│                                           customize the installation.                             │
│    --help                                 Show this message and exit.                             │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
```
