You can clone the repository on the machine where you will run the analysis:

```bash
git clone https://github.com/UMMISCO/strainmake.git
```

Make sure you have at least Snakemake and Conda installed.
Alternatively, you can use the Docker image and run everything via a Docker container (Snakemake is the entrypoint):

```bash
docker pull bapt931894/strainmake:latest

# for example, the following command prints the Snakemake help
# (Snakemake is the workflow management system)
docker run bapt931894/strainmake -h
```

Note that you should mount volumes to keep the generated data:
```bash
docker run \
  -v /where/to/keep/results:/opt/strainmake/results \
  -v /where/to/keep/logs:/opt/strainmake/logs \
  -v /where/to/keep/benchmarks:/opt/strainmake/benchmarks \
  bapt931894/strainmake \
  ...
```

Note that the `-v` options must come before the image name; otherwise Docker passes them as arguments to the Snakemake entrypoint. Configuration, paths, etc., should also be consistent with the container's file context.
A step-by-step usage example is available on the wiki.
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| fastp | 2018 | Yes | https://github.com/OpenGene/fastp | Yes |
| FastQC | 2010 | Yes | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| bowtie2 | 2012 | Yes | https://github.com/BenLangmead/bowtie2 | Yes |
Human assembly for mapping:
- GRCh38. Link
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| MEGAHIT | 2015 | Yes | https://github.com/voutcn/megahit | Yes |
| (Meta)SPAdes | 2017 | Yes | https://github.com/ablab/spades | Yes |
| (Meta)Flye | 2020 | Yes | https://github.com/mikolmogorov/Flye | Yes |
| HyLight | 2024 | Yes | https://github.com/LuoGroup2023/HyLight | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| QUAST | 2013 | Yes | https://github.com/ablab/quast | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| Prodigal | 2010 | Yes | https://github.com/hyattpd/Prodigal | Yes |
| CD-HIT | 2012 | Yes | https://github.com/weizhongli/cdhit | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| MetaBAT 2 | 2019 | Yes | https://bitbucket.org/berkeleylab/metabat/src/master/ | Yes |
| SemiBin2 | 2023 | Yes | https://github.com/BigDataBiology/SemiBin | Yes |
| VAMB | 2019 | Yes | https://github.com/RasmussenLab/vamb | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| CheckM2 | 2023 | Yes | https://github.com/chklovski/CheckM2 | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| Binette | 2024 | Yes | https://github.com/genotoul-bioinfo/Binette | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| GTDB-Tk | 2022 | Yes | https://github.com/Ecogenomics/GTDBTk | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| dRep | 2017 | Yes | https://github.com/MrOlm/drep | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| Prodigal | 2010 | Yes | https://github.com/hyattpd/Prodigal | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| CheckM | 2015 | Yes | https://github.com/Ecogenomics/CheckM | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| CarveMe | 2018 | Yes | https://github.com/cdanielmachado/carveme | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| MetaPhlAn | 2023 | Yes | https://github.com/biobakery/MetaPhlAn | Yes |
| Meteor2 | 2024 | Yes | https://github.com/metagenopolis/meteor | Yes |
| StrainScan | 2023 | Yes | https://github.com/liaoherui/StrainScan | Yes |
| Tool | First release | Conda available? | Link | Implemented? |
|---|---|---|---|---|
| inStrain | 2021 | Yes | https://github.com/MrOlm/inStrain | Yes (for SR only) |
| Floria | 2024 | Yes | https://github.com/bluenote-1577/floria | Yes (for SR only) |
Scripts for post-processing some of the results produced by this pipeline are available in workflow/scripts/other_scripts.
skani_analysis.py performs pairwise comparison of bins using skani (https://doi.org/10.1038/s41592-023-02018-3). It can also produce a Venn diagram for results derived from dereplicated bins.
```text
usage: skani_analysis.py compare [-h] --bins {refined,dereplicated} --tmp TMP --output_file OUTPUT_FILE --tsv_output TSV_OUTPUT --ani_threshold ANI_THRESHOLD --json_output JSON_OUTPUT --venn_diagram VENN_DIAGRAM --cpu CPU

options:
  -h, --help            show this help message and exit
  --bins {refined,dereplicated}
                        Type of bins to analyze
  --tmp TMP             Temporary directory for intermediate files
  --output_file OUTPUT_FILE
                        File to save the output results (skani matrix)
  --tsv_output TSV_OUTPUT
                        File to save the skani matrix in TSV format
  --ani_threshold ANI_THRESHOLD
                        Minimal ANI to consider two bins as the same
  --json_output JSON_OUTPUT
                        File to save the bin similarity results according to assembly methods (JSON)
  --venn_diagram VENN_DIAGRAM
                        Where to save the Venn diagram
  --cpu CPU             Number of CPU cores to use
```
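The `--ani_threshold` option controls when two bins are treated as the same genome. As a conceptual illustration (this is not the actual skani_analysis.py implementation), grouping bins by single-linkage clustering over pairwise ANI values could look like the sketch below; the bin names and ANI values are hypothetical:

```python
# Sketch: single-linkage clustering of bins by pairwise ANI, using union-find.
# Not the actual skani_analysis.py code; bin names and ANI values are made up.

def cluster_bins(pairs, ani_threshold=95.0):
    """pairs: iterable of (bin_a, bin_b, ani). Returns a list of clusters (sets)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b, ani in pairs:
        find(a), find(b)  # register both bins, even if they stay singletons
        if ani >= ani_threshold:
            union(a, b)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

pairs = [
    ("megahit_bin1", "metaspades_bin3", 99.1),  # same genome at 95% ANI
    ("megahit_bin1", "metaflye_bin2", 96.4),
    ("megahit_bin4", "metaspades_bin7", 80.0),  # below threshold: kept apart
]
print(cluster_bins(pairs, ani_threshold=95.0))
```

With these values, the first three bins form one cluster and the two low-ANI bins remain singletons.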
We can then identify bins that were recovered by only one assembly method.
```text
usage: skani_analysis.py check [-h] --json_results JSON_RESULTS --tsv_output TSV_OUTPUT --assembly {unique,megahit,metaflye,metaspades,hybridspades}

options:
  -h, --help            show this help message and exit
  --json_results JSON_RESULTS
                        Path to the JSON produced using "skani_analysis.py compare"
  --tsv_output TSV_OUTPUT
                        File to save the results in TSV format
  --assembly {unique,megahit,metaflye,metaspades,hybridspades}
                        Choose 'unique' to get a list of bins that were not found by at least a second assembly method, at the ANI threshold you used with the 'compare' subcommand. Choose any other assembly method to get a list of bins recovered from that assembly (it will not return the redundant bins coming from other assemblies)
```
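To illustrate the 'unique' mode (this is a simplified sketch, not the script itself): starting from clusters of equivalent bins, keep the bins whose cluster involves a single assembly method. The `<assembler>_<bin>` naming scheme below is an assumption made for this example:

```python
# Sketch of the "unique bins" idea behind `skani_analysis.py check`.
# Not the actual script; the "<assembler>_<bin>" naming is assumed here.

def unique_bins(clusters):
    """clusters: list of sets of bin names prefixed by assembler name.
    Returns the bins found by exactly one assembly method."""
    unique = []
    for cluster in clusters:
        assemblers = {name.split("_", 1)[0] for name in cluster}
        if len(assemblers) == 1:  # every bin in this cluster comes from one assembler
            unique.extend(sorted(cluster))
    return unique

clusters = [
    {"megahit_bin1", "metaspades_bin3"},  # recovered by two assemblers
    {"metaflye_bin5"},                    # recovered only by metaflye
]
print(unique_bins(clusters))  # → ['metaflye_bin5']
```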
calculate_binned_contigs.py computes the binned rate of contigs, i.e. the percentage of contigs from an assembly that end up in at least one bin. The script does this for each sample and its generated contigs, for a given assembly method.
```text
usage: calculate_binned_contigs.py [-h] --assembler {megahit,metaflye,hybridspades,metaspades} --results-dir RESULTS_DIR --type {binette,dereplicated_and_filtered} --tsv_output_binned_contigs TSV_OUTPUT_BINNED_CONTIGS --tsv_output_binned_rate TSV_OUTPUT_BINNED_RATE

Count assembly contigs assigned to a bin.

options:
  -h, --help            show this help message and exit
  --assembler {megahit,metaflye,hybridspades,metaspades}
                        The assembly we should use
  --results-dir RESULTS_DIR
                        Folder storing the pipeline results. Typically named 'results': /path/to/pipeline/results
  --type {binette,dereplicated_and_filtered}
                        Type of bins
  --tsv_output_binned_contigs TSV_OUTPUT_BINNED_CONTIGS
                        File to save the list of contigs and the number of bins each is assigned to, in TSV format
  --tsv_output_binned_rate TSV_OUTPUT_BINNED_RATE
                        File to save the binned rate of contigs in TSV format
```
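The underlying computation is simple and can be sketched as follows (a minimal illustration, not the actual calculate_binned_contigs.py code; contig and bin names are made up):

```python
# Sketch of the binned-rate computation: count how many bins each assembly
# contig appears in, then report the fraction of contigs in at least one bin.

def binned_rate(assembly_contigs, bins):
    """assembly_contigs: iterable of contig IDs; bins: dict bin_name -> set of
    contig IDs. Returns (per-contig bin counts, binned rate in percent)."""
    counts = {c: 0 for c in assembly_contigs}
    for contigs in bins.values():
        for c in contigs:
            if c in counts:  # ignore contigs not from this assembly
                counts[c] += 1
    binned = sum(1 for n in counts.values() if n > 0)
    rate = 100.0 * binned / len(counts) if counts else 0.0
    return counts, rate

contigs = ["c1", "c2", "c3", "c4"]
bins = {"bin1": {"c1", "c2"}, "bin2": {"c2"}}
counts, rate = binned_rate(contigs, bins)
print(counts, rate)  # → {'c1': 1, 'c2': 2, 'c3': 0, 'c4': 0} 50.0
```

Here c2 is counted in two bins, and 2 of the 4 contigs are binned, giving a 50% binned rate.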
If your sequencing reads have already been preprocessed, you can use the already_preprocessed_seq.py script to set up the results directory so that the pipeline starts directly from the assembly step, using your preprocessed FASTQ files.
To do this, provide a TSV file formatted like config_data.tsv. The script will create symbolic links in the results folder that point to your preprocessed FASTQ files, saving storage space by avoiding unnecessary duplication.
This approach ensures that the pipeline can use your preprocessed data without needing to process the FASTQ files again.
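The symlinking idea can be illustrated as below (a simplified sketch, not the actual already_preprocessed_seq.py script; the paths and flat results layout are hypothetical):

```python
# Sketch: link preprocessed FASTQ files into a results directory instead of
# copying them, so no storage is wasted on duplicates. Not the actual script.
import os
import tempfile

def link_preprocessed(fastq_paths, results_dir):
    """Create one symlink per preprocessed FASTQ inside results_dir."""
    os.makedirs(results_dir, exist_ok=True)
    links = []
    for path in fastq_paths:
        link = os.path.join(results_dir, os.path.basename(path))
        if not os.path.lexists(link):  # don't overwrite an existing link
            os.symlink(os.path.abspath(path), link)
        links.append(link)
    return links

# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fq = os.path.join(tmp, "sample1_R1.fastq.gz")
    open(fq, "w").close()  # stand-in for a real preprocessed FASTQ
    links = link_preprocessed([fq], os.path.join(tmp, "results"))
    print([os.path.islink(l) for l in links])  # → [True]
```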
An interactive CLI for generating the YAML configuration, including the relevant pipeline sections, is available as config_generator.py.
```text
Usage: config_generator.py [OPTIONS]

  Generate a configuration YAML file for the pipeline.

Options:
  * --samples PATH         Path to sample metadata file needed by the pipeline (TSV) [default: None] [required]
    --lr-seq-format TEXT   Format of long reads: 'fastq' or 'fasta' [default: fastq]
    --output PATH          Path to write the final YAML [default: config.yaml]
    --install-completion   Install completion for the current shell.
    --show-completion      Show completion for the current shell, to copy it or customize the installation.
    --help                 Show this message and exit.
```
Then, to generate the Snakefile declaring the data to be produced by the pipeline, based on the YAML, use snakefile_generator.py.
```text
Usage: snakefile_generator.py [OPTIONS]

  Generate a Snakefile for the pipeline.

Options:
  * --config PATH          Path to the generated YAML configuration file [default: None] [required]
    --output PATH          Path to write the Snakefile [default: Snakefile]
    --install-completion   Install completion for the current shell.
    --show-completion      Show completion for the current shell, to copy it or customize the installation.
    --help                 Show this message and exit.
```
To generate a MultiQC report from the pipeline outputs, use generate_multiqc_report.py.
```text
Usage: generate_multiqc_report.py [OPTIONS]

  Generates MultiQC report(s) based on the pipeline results collected from the specified directories.

Options:
  * --results-dir TEXT      Directory containing the results of the pipeline. [default: None] [required]
  * --log-dir TEXT          Directory containing the logs of the pipeline. [default: None] [required]
  * --output-dir TEXT       Directory where the MultiQC report will be generated. [default: None] [required]
    --ani INTEGER           ANI threshold used for dereplicating MAGs. [default: 95]
    --multiqc-config TEXT   Path to the MultiQC configuration file. [default: multiqc_config.yaml]
    --dry-run  -d           If set, only prints the MultiQC command without executing it.
    --install-completion    Install completion for the current shell.
    --show-completion       Show completion for the current shell, to copy it or customize the installation.
    --help                  Show this message and exit.
```
