WEPP (Wastewater-Based Epidemiology using Phylogenetic Placements) is a pathogen-agnostic pipeline that enhances wastewater surveillance by leveraging the pathogen’s full phylogeny. It reports haplotype and lineage abundances, maps reads parsimoniously to selected haplotypes, and flags Unaccounted Alleles: alleles observed in the sample but unexplained by the selected haplotypes, which may indicate novel variants (Figure 1A). An interactive dashboard enables visualization of haplotypes in the global phylogenetic tree (Figure 1B(i)) and read-level analysis of a selected haplotype (Figure 1B(ii)). Additional information about individual reads or haplotypes can be accessed by clicking on the corresponding objects (Figures 1B(iii) and 1B(iv), respectively).
WEPP performs parsimonious read placement on the mutation-annotated tree (MAT) to select a subset of haplotypes and adds their neighbors to form an initial candidate pool, which is passed to a deconvolution algorithm to estimate their relative abundances. WEPP retains haplotypes above an abundance threshold, and iteratively adds their neighbors and recomputes abundances until convergence or a maximum iteration count. An outlier detection algorithm flags Unaccounted Alleles from the deconvolution residue (Figure 1C).
WEPP offers multiple installation methods. Using Docker is recommended to avoid conflicts with existing packages.
- Docker image from DockerHub
- Dockerfile
- Shell Commands
The Docker image is built for the linux/amd64 platform. While it can run on arm64 systems (e.g., Apple Silicon or Linux aarch64) via emulation, this may lead to reduced performance.
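To see ahead of time whether the image would run natively or under emulation, the host architecture can be inspected. The snippet below is a minimal sketch: `wepp_platform_note` is a hypothetical helper name, not part of WEPP, and the mapping only covers the architectures mentioned above.

```shell
# Report whether the WEPP image (built for linux/amd64) would run natively
# or under emulation on a given machine architecture.
# wepp_platform_note is a hypothetical helper, not part of WEPP.
wepp_platform_note() {
    case "$1" in
        x86_64|amd64) echo "native" ;;
        arm64|aarch64) echo "emulated" ;;
        *) echo "unknown" ;;
    esac
}

# uname -m prints the machine hardware name, e.g. x86_64 or arm64.
wepp_platform_note "$(uname -m)"
```

On an arm64 host, Docker's `--platform linux/amd64` option can be added to `docker pull` and `docker run` to request the amd64 image explicitly.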
The Docker image includes all dependencies required to run WEPP.
Step 1: Get the image from DockerHub
docker pull pranavgangwar/wepp:latest
Step 2: Start and run Docker container. The command below will take you inside the docker container with WEPP already installed.
# -p <host_port>:<container_port> → Maps container port to a port on your host (Accessing Dashboard, NOT needed otherwise)
# Replace <host_port> with your desired local port (e.g., 80 or 8080)
# Use this command if your datasets can be downloaded from the Web
docker run -it -p 80:80 pranavgangwar/wepp:latest
# Use this command if your datasets are present in your current directory
docker run -it -p 80:80 -v "$PWD":/WEPP -w /WEPP pranavgangwar/wepp:latest
Step 3: Confirm that WEPP works by running the following command. This should print WEPP's help menu.
snakemake test --cores 1 --use-conda
All set to try the examples.
The Dockerfile contains all dependencies required to run WEPP.
Step 1: Clone the repository
git clone --recurse-submodules https://github.com/TurakhiaLab/WEPP.git
cd WEPP
Step 2: Build a Docker Image
cd docker
docker build -t wepp .
cd ..
Step 3: Start and run Docker container. The command below will take you inside the docker container with the view of the current directory.
# -p <host_port>:<container_port> → Maps container port to a port on your host (Accessing Dashboard, NOT needed otherwise)
# Replace <host_port> with your desired local port (e.g., 80 or 8080)
docker run -it -p 80:80 -v "$PWD":/workspace -w /workspace wepp
All set to try the examples.
Users without sudo access are advised to install WEPP via the Docker image.
Step 1: Clone the repository
git clone --recurse-submodules https://github.com/TurakhiaLab/WEPP.git
cd WEPP
Step 2: Install dependencies (might require sudo access)
WEPP depends on the following common system libraries, which are typically pre-installed on most development environments:
- wget
- curl
- pip
- build-essential
- python3-pandas
- pkg-config
- zip
- cmake
- libtbb-dev
- libprotobuf-dev
- protobuf-compiler
- snakemake
- conda
- nodejs(v18+)
- nginx
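A quick way to see which of the command-line tools above are not yet on the PATH is sketched below. `missing_tools` is a hypothetical helper, not part of WEPP, and the command names are assumptions (for example, `protoc` for protobuf-compiler and `node` for nodejs). Library packages such as build-essential, libtbb-dev, libprotobuf-dev, and python3-pandas do not provide a command of the same name and are not covered by this check.

```shell
# Print each command from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper, not part of WEPP.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names are assumptions: protoc is the binary shipped by
# protobuf-compiler, and node is the binary shipped by nodejs.
missing_tools wget curl pip zip cmake pkg-config protoc snakemake conda node nginx
```

An empty output means every listed command was found.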
For Ubuntu users with sudo access, if any of the required libraries are missing, you can install them with:
sudo apt-get update
sudo apt-get install -y wget pip curl python3-pip build-essential python3-pandas pkg-config zip cmake libtbb-dev libprotobuf-dev protobuf-compiler snakemake nginx
Note: WEPP expects the python command to be available. If your system only provides python3, you can optionally set up a symlink:
update-alternatives --install /usr/bin/python python /usr/bin/python3 1
If you do not have Node.js v18 or higher installed, follow these steps to install Node.js v22:
# Update and install prerequisites
apt-get install -y curl gnupg ca-certificates
# Add NodeSource Node.js 22 repo
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
# Install Node.js 22
apt-get install -y nodejs
# Install Yarn package manager globally
npm install -g yarn
# Install TaxoniumTools Python package
pip install taxoniumtools
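Whether the Node.js installation steps above are needed can be checked by parsing the output of `node --version`. This is a minimal sketch: `node_version_ok` is a hypothetical helper, not part of WEPP, and it only compares the major version against the v18 minimum stated above.

```shell
# Succeed (exit status 0) when a version string such as "v22.3.0" has a
# major version of at least 18. node_version_ok is a hypothetical helper.
node_version_ok() {
    version=${1#v}          # strip the leading "v"
    major=${version%%.*}    # keep the text before the first dot
    case "$major" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
    esac
    [ "$major" -ge 18 ]
}

# node --version prints a string like "v22.3.0"; an empty string
# (node not installed) is treated as not OK.
if node_version_ok "$(node --version 2>/dev/null)"; then
    echo "Node.js is recent enough"
else
    echo "Node.js v18+ not found"
fi
```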
If your system doesn't have Conda, you can install it with:
wget -O Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/download/24.11.3-2/Miniforge3-24.11.3-2-Linux-x86_64.sh"
bash Miniforge3.sh -b -p "${HOME}/conda"
source "${HOME}/conda/etc/profile.d/conda.sh"
source "${HOME}/conda/etc/profile.d/mamba.sh"
All set to try the examples.
The following steps will download real wastewater datasets and analyze them using WEPP.
Step 1: Download the RSV-A test dataset
mkdir -p data/RSVA_real
cd data/RSVA_real
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR147/011/ERR14763711/ERR14763711_*.fastq.gz https://hgdownload.gi.ucsc.edu/hubs/GCF/002/815/475/GCF_002815475.1/UShER_RSV-A/2025/04/25/rsvA.2025-04-25.pb.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/815/475/GCF_002815475.1_ASM281547v1/GCF_002815475.1_ASM281547v1_genomic.fna.gz
gunzip GCF_002815475.1_ASM281547v1_genomic.fna.gz
mv ERR14763711_1.fastq.gz ERR14763711_R1.fastq.gz
mv ERR14763711_2.fastq.gz ERR14763711_R2.fastq.gz
cd ../../
This will save the datasets in a separate data/RSVA_real folder within the repository.
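The two `mv` commands in Step 1 rename the downloaded reads from the `_1`/`_2` suffix to `_R1`/`_R2`. For directories holding several read pairs, the same renaming can be sketched as a loop. `rename_paired_reads` is a hypothetical helper, not part of WEPP, and it assumes the downloaded files end in `_1.fastq.gz` and `_2.fastq.gz`.

```shell
# Rename every <name>_1.fastq.gz / <name>_2.fastq.gz in a directory to
# <name>_R1.fastq.gz / <name>_R2.fastq.gz.
# rename_paired_reads is a hypothetical helper, not part of WEPP.
rename_paired_reads() {
    dir=$1
    for mate in 1 2; do
        for path in "$dir"/*_"$mate".fastq.gz; do
            [ -e "$path" ] || continue   # glob matched nothing
            mv "$path" "${path%_${mate}.fastq.gz}_R${mate}.fastq.gz"
        done
    done
}
```

For example, `rename_paired_reads data/RSVA_real` would be run from the repository root. Files that already end in `_R1.fastq.gz` or `_R2.fastq.gz` are left untouched.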
Step 2: Run the pipeline
snakemake --config DIR=RSVA_real FILE_PREFIX=test_run PRIMER_BED=RSVA_all_primers_best_hits.bed TREE=rsvA.2025-04-25.pb.gz REF=GCF_002815475.1_ASM281547v1_genomic.fna CLADE_LIST=annotation_1 CLADE_IDX=0 DASHBOARD_ENABLED=True --cores 32 --use-conda
Step 3: Analyze Results
All results generated by WEPP are available in the results/RSVA_real directory. These include haplotype and lineage abundances, associated uncertain haplotypes, and the potential haplotypes corresponding to each detected unaccounted allele.
Step 1: Download the SARS-CoV-2 test dataset
mkdir -p data/SARS_COV_2_real
cd data/SARS_COV_2_real
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR185/041/SRR18541041/SRR18541041_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR185/041/SRR18541041/SRR18541041_2.fastq.gz https://hgdownload.gi.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2021/12/05/public-2021-12-05.all.masked.pb.gz
mv SRR18541041_1.fastq.gz SRR18541041_R1.fastq.gz
mv SRR18541041_2.fastq.gz SRR18541041_R2.fastq.gz
cp ../../NC_045512v2.fa .
cd ../../
This will save the datasets in a separate data/SARS_COV_2_real folder within the repository.
Step 2: Run the pipeline
snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run PRIMER_BED=snap_primers.bed TREE=public-2021-12-05.all.masked.pb.gz REF=NC_045512v2.fa DASHBOARD_ENABLED=True --cores 32 --use-conda
Step 3: Analyze Results
All results generated by WEPP are available in the results/SARS_COV_2_real directory. These include haplotype and lineage abundances, associated uncertain haplotypes, and the potential haplotypes corresponding to each detected unaccounted allele.
We assume that all wastewater samples are organized in the data directory, each within its own subdirectory given by the DIR argument (see Run Command). For each sample, WEPP generates intermediate and output files in corresponding subdirectories under intermediate and results, respectively.
Each created DIR inside data is expected to contain the following files:
- Sequencing Reads: Ending with *_R{1/2}.fastq.gz for paired-ended reads and *.fastq.gz for single-ended.
- Reference Genome in FASTA format
- Mutation-Annotated Tree (MAT)
- [OPTIONAL] Genome Masking File: mask.bed, whose third column specifies sites to be excluded from analysis.
- [OPTIONAL] Taxonium .jsonl file to be used for visualizing results in the WEPP dashboard.
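Before launching a run, the required inputs listed above can be checked with a few file tests. The snippet is a minimal sketch: `check_sample_dir` is a hypothetical helper, not part of WEPP, and the reference and MAT file names are passed in by the caller because WEPP takes them from the REF and TREE arguments.

```shell
# Report which required inputs are missing from a sample directory.
# Usage: check_sample_dir <sample_dir> <reference_file> <mat_file>
# check_sample_dir is a hypothetical helper, not part of WEPP.
check_sample_dir() {
    dir=$1 ref=$2 tree=$3
    rc=0
    # At least one file ending in .fastq.gz must be present.
    set -- "$dir"/*.fastq.gz
    [ -e "$1" ] || { echo "missing: sequencing reads (*.fastq.gz)"; rc=1; }
    [ -e "$dir/$ref" ] || { echo "missing: reference genome $ref"; rc=1; }
    [ -e "$dir/$tree" ] || { echo "missing: MAT $tree"; rc=1; }
    return "$rc"
}
```

For example, `check_sample_dir data/SARS_COV_2_real NC_045512v2.fa public-2021-12-05.all.masked.pb.gz` prints nothing and returns 0 when all three inputs are present. The optional mask.bed and Taxonium files are deliberately not checked.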
Visualization of WEPP's workflow directories
📁 WEPP
└───📁data                                    # [User Created] Contains data to analyze
     ├───📁SARS_COV_2_real                    # SARS-CoV-2 run wastewater samples - 1
          ├───sars_cov_2_reads_R1.fastq.gz    # Paired-ended reads
          ├───sars_cov_2_reads_R2.fastq.gz
          ├───sars_cov_2_reference.fa
          ├───mask.bed                        # OPTIONAL
          ├───sars_cov_2_taxonium.jsonl.gz    # OPTIONAL
          └───sars_cov_2_mat.pb.gz
└───📁intermediate                            # [WEPP Generated] Contains intermediate stage files
     ├───📁SARS_COV_2_real
          ├───file_1
          └───file_2
└───📁results                                 # [WEPP Generated] Contains final WEPP results
     ├───📁SARS_COV_2_real
          ├───file_1
          └───file_2
The WEPP Snakemake pipeline requires the following arguments, which can be provided either via the configuration file (config/config.yaml) or passed directly on the command line using the --config argument. The command line arguments take precedence over the config file.
- DIR - Folder name containing the wastewater reads
- FILE_PREFIX - File Prefix for all intermediate files
- REF - Reference Genome in fasta
- TREE - Mutation-Annotated Tree
- SEQUENCING_TYPE - Sequencing read type (s:Illumina single-ended, d:Illumina double-ended, or n:ONT long reads)
- PRIMER_BED - BED file for primers from the primers folder
- MIN_AF - Alleles with an allele frequency below this threshold in the reads will be masked.
- MIN_Q - Alleles with a Phred score below this threshold in the reads will be masked.
- MAX_READS - Maximum number of reads considered by WEPP from the sample. Helpful for reducing runtime
- CLADE_LIST - List the clade annotation schemes used in the MAT. SARS-CoV-2 MAT uses both nextstrain and pango lineage naming systems, so use "nextstrain,pango" for it.
- CLADE_IDX - Index used for assigning clades to selected haplotypes from MAT. Use '1' for Pango naming and '0' for Nextstrain naming for SARS-CoV-2. Other pathogens usually follow a single lineage annotation system, so work with '0'. In case of NO lineage annotations, use '-1'. Lineage Annotations could be checked by running: "matUtils summary -i {TREE} -C {FILENAME}" -> Use '0' for annotation_1 and '1' for annotation_2.
- DASHBOARD_ENABLED - Set to True to enable the interactive dashboard for viewing WEPP results, or False to disable it.
- TAXONIUM_FILE [Optional] - Name of the user-provided Taxonium .jsonl file for visualization. If specified, this file will be used instead of generating a new one from the given MAT. Ensure that the provided Taxonium file corresponds to the same MAT used for WEPP.
WEPP's snakemake workflow requires DIR and FILE_PREFIX as config arguments through the command line, while the remaining ones can be taken from the config file. It also requires --cores from the command line, which specifies the number of threads used by the workflow.
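The split between required and optional arguments can be made explicit with a small wrapper that assembles the command and prints it for review instead of running it. This is a sketch under stated assumptions: `wepp_cmd` is a hypothetical helper, not part of WEPP, and it simply forwards any extra KEY=VALUE pairs as `--config` overrides.

```shell
# Print the snakemake command for one sample without running it.
# Usage: wepp_cmd DIR FILE_PREFIX CORES [KEY=VALUE ...]
# wepp_cmd is a hypothetical helper, not part of WEPP.
wepp_cmd() {
    if [ "$#" -lt 3 ]; then
        echo "usage: wepp_cmd DIR FILE_PREFIX CORES [KEY=VALUE ...]" >&2
        return 1
    fi
    dir=$1 prefix=$2 cores=$3
    shift 3
    cmd="snakemake --config DIR=$dir FILE_PREFIX=$prefix"
    # Remaining arguments are appended as extra --config overrides.
    for kv in "$@"; do
        cmd="$cmd $kv"
    done
    echo "$cmd --cores $cores --use-conda"
}
```

For instance, `wepp_cmd SARS_COV_2_real test_run 32 MIN_Q=25 CLADE_IDX=1` prints a command that overrides MIN_Q and CLADE_IDX while leaving every other parameter to the config file.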
Examples:
- Using all the parameters from the config file.
snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run TREE=sars_cov_2_mat.pb.gz REF=sars_cov_2_reference.fa --cores 32 --use-conda
- Overriding MIN_Q and CLADE_IDX through command line.
snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run TREE=sars_cov_2_mat.pb.gz REF=sars_cov_2_reference.fa MIN_Q=25 CLADE_IDX=1 --cores 32 --use-conda
- To visualize results from a previous WEPP analysis that was run without the dashboard, set DASHBOARD_ENABLED to True and re-run only the dashboard components, without reanalyzing the dataset.
snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run TREE=sars_cov_2_mat.pb.gz REF=sars_cov_2_reference.fa MIN_Q=25 CLADE_IDX=1 DASHBOARD_ENABLED=True --cores 32 --use-conda --forcerun dashboard_serve
Mutation-annotated trees (MAT) for different pathogens are maintained by the UShER team, which can be found here. You can also create your own MAT for any pathogen from the consensus genome assemblies using viral_usher.
We welcome contributions from the community to enhance the capabilities of WEPP. If you encounter any issues or have suggestions for improvement, please open an issue on the WEPP GitHub page. For general inquiries and support, reach out to our team.
If you use WEPP in your research or publications, please cite the following paper:
- Pranav Gangwar, Pratik Katte, Manu Bhatt, Yatish Turakhia, "WEPP: Phylogenetic Placement Achieves Near-Haplotype Resolution in Wastewater-Based Epidemiology", medRxiv 2025.06.09.25329287; doi: 10.1101/2025.06.09.25329287