A Snakemake workflow for calling and annotating short variants.
The workflow takes paired-end Illumina short-read data (FASTQ files) as input and outputs annotated variant calls in a VCF file as the final result.
The input directory contains paired-end (PE) Illumina reads from a publicly available SARS-CoV-2 dataset (SRA accession SRR15660643), downsampled to 16000 paired reads (sample.R1.paired.fq.gz and sample.R2.paired.fq.gz).
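As a quick sanity check on the downsampled input, the number of records in each FASTQ file can be counted from the shell. The sketch below assumes standard four-line FASTQ records and that the reads sit in a directory named input/ (the directory name is an assumption; adjust the paths to match the repository).

```shell
# Count records in a gzipped FASTQ file.
# Assumption: every record spans exactly four lines (no wrapped sequences).
count_reads() {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# Assumed location of the bundled reads; adjust if the directory is named differently.
for fq in input/sample.R1.paired.fq.gz input/sample.R2.paired.fq.gz; do
    if [ -f "$fq" ]; then
        echo "$fq: $(count_reads "$fq") reads"
    else
        echo "$fq: not found"
    fi
done
```

For properly paired data, both files should report the same count.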
A FASTA file with the Wuhan-Hu-1 reference genome (GenBank accession MN908947.3) is included in the reference directory (MN908947.3.fasta), along with the VEP cache needed for successful annotation of genomic features.
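Before running the workflow, it can help to confirm that the bundled files are where they are expected. This is a minimal sketch; the directory names input/ and reference/ are assumptions based on the description above, and the VEP cache is not checked because its path is not stated.

```shell
# Report which of the bundled files are missing under a given repository root.
# Assumption: the directories are named input/ and reference/.
missing_inputs() {
    root=$1
    status=0
    for f in \
        input/sample.R1.paired.fq.gz \
        input/sample.R2.paired.fq.gz \
        reference/MN908947.3.fasta
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}

# Check the current directory (run from the repository root).
missing_inputs . || echo "some expected files were not found"
```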
git clone https://github.com/LorenaDerezanin/pipeline_test
Step 1: Install Miniconda
Miniconda is a minimal conda installer. Running the pipeline in an isolated conda environment avoids dependency conflicts and helps ensure reproducibility.
conda install mamba -n base -c conda-forge
Installing mamba is recommended to speed up environment setup. Mamba is a faster package manager than conda: it downloads packages in parallel and resolves dependencies more quickly. If you continue with conda instead, replace mamba with conda in the env create command in Step 3.
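The choice between mamba and conda can also be made automatically. The following is a minimal sketch, not part of the repository: the helper name pick_env_tool is hypothetical, and the final line only prints the command instead of running it.

```shell
# Print the environment tool to use: mamba when it is on PATH, otherwise conda.
# pick_env_tool is a hypothetical helper, not part of the repository.
pick_env_tool() {
    if command -v mamba >/dev/null 2>&1; then
        echo mamba
    else
        echo conda
    fi
}

# Show the environment creation command; remove the leading echo to execute it.
echo "$(pick_env_tool)" env create -n snek -f envs/snek.yml
```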
cd pipeline_test/
mamba env create -n snek -f envs/snek.yml
conda activate snek
snakemake --use-conda --cores 4 --verbose
The value 4 is the suggested number of --cores when running the pipeline locally; increase it when running on a cluster.
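On machines with fewer cores, the value passed to --cores can be capped at what is available. This is a minimal sketch under that assumption; pick_cores is a hypothetical helper, and the final line prints the command rather than running it.

```shell
# Print the smaller of the requested core count and the cores available locally.
# pick_cores is a hypothetical helper, not part of the repository.
pick_cores() {
    requested=$1
    available=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -lt "$available" ]; then
        echo "$requested"
    else
        echo "$available"
    fi
}

# Show the resulting command; remove the leading echo to run the workflow.
echo snakemake --use-conda --cores "$(pick_cores 4)" --verbose
```

Adding -n to the snakemake command performs a dry run, which lists the planned jobs without executing them.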
If conda fails to install snakemake v6.15, install snakemake with mamba instead: mamba install snakemake.
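To see which Snakemake version ended up in the active environment, the installed version can be compared with the 6.15 series mentioned above. This is a sketch only; is_expected_version is a hypothetical helper, and whether other versions also work with this workflow is not stated here.

```shell
# Return success when a version string belongs to the 6.15 series.
# is_expected_version is a hypothetical helper, not part of the repository.
is_expected_version() {
    case "$1" in
        6.15|6.15.*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v snakemake >/dev/null 2>&1; then
    found=$(snakemake --version)
    if is_expected_version "$found"; then
        echo "snakemake $found matches the expected 6.15 series"
    else
        echo "snakemake $found found, but 6.15 was expected"
    fi
else
    echo "snakemake is not on PATH; is the snek environment active?"
fi
```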
The workflow uses the following bioinformatics tools, run as Snakemake wrappers obtained from the Snakemake Wrappers Repository:
- FastQC
- MultiQC
- trim_galore
- bwa
- samtools
- picard
- freebayes
- bcftools
- vep
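Since the workflow's final product is an annotated VCF, a simple check after a run is to count its variant records. This is a minimal sketch; the output path below is hypothetical and should be replaced with the file the workflow actually writes.

```shell
# Count variant records (non-header lines) in a plain or gzipped VCF file.
count_variants() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | grep -vc '^#' || true
}

# Hypothetical output path; replace with the file the workflow actually writes.
vcf=results/sample.annotated.vcf
if [ -f "$vcf" ]; then
    echo "$vcf: $(count_variants "$vcf") variant records"
else
    echo "$vcf: not found"
fi
```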
To do:
- Docker container + conda/mamba
- AWS/Google cloud deployment
- unit tests