A bash-based bioinformatics pipeline for analyzing pooled sequencing data with maximum flexibility for customisation and integration across computing environments.
The pipeline was used to produce results in the following papers
- Genetic mapping of resistance: A QTL and associated polymorphism conferring resistance to alpha-cypermethrin in Anopheles funestus T AL-Yazeedi, G Djuifo, L Mugenzi, A Muhammad, J Hearn, CS Wondji bioRxiv, 2025.04. 16.649184 link
- Overview
- Quick Start
- Installation
- Data Preparation
- Configuration
- Running the Pipeline
- Additional Analyses
- Output Files
- Troubleshooting
- Citation
This pipeline processes paired-end FASTQ files from pooled DNA samples through complete variant calling and population analysis:
Pipeline Steps:
- Quality Control - FastQC
- Alignment - BWA-MEM2
- BAM Processing - GATK/Picard
- Variant Calling - FreeBayes
- Variant Annotation - SnpEff
- Population Genetics - grenedalf
- BSA QTLsqr - qtlseqr
Additional Analyses:
- Bulk Segregant Analysis (BSA) for QTL mapping
- Cochran-Mantel-Haenszel (CMH) test for population differentiation popoolation2 (https://sourceforge.net/p/popoolation2/wiki/Tutorial/)
# Clone and setup
git clone https://github.com/your-username/poolseq-pipeline.git
cd poolseq-pipeline
mkdir 1_data genome annotation input
# Add data and run
# (Add FASTQ files to 1_data/, reference to genome/, annotation to annotation/, pools.txt to input/)
cd pipeline && bash run_all.bash- Linux or macOS
- Conda/Mamba package manager
# Create new environment
conda create -n poolseq python=3.12
# Activate environment
conda activate poolseq
# Install core bioinformatics tools
mamba install -c bioconda bwa-mem2 samtools bcftools freebayes gatk4 snpeff snpsift grenedalf fastqc multiqc qualimap parallel
# Install R packages for BSA analysis (optional)
mamba install -c conda-forge r-base r-essentials r-tidyverse r-ggplot2 r-devtools r-docopt
# Install population genetics tools (optional)
mamba install -c bioconda popoolation2# Test key tools
bwa-mem2 version
samtools --version
freebayes --version
gatk --version
grenedalf --version# Clone or download pipeline scripts to a directory called 'pipeline'
cd /path/to/your/project
mkdir pipeline
# Copy all .sh scripts to pipeline/ directory# Navigate to the cloned repository
cd poolseq-pipeline
# Create data and reference directories (pipeline/ already exists)
mkdir 1_data genome annotation input
# Final directory structure:
# poolseq-pipeline/
# ├── 1_data/ # FASTQ files
# ├── genome/ # Reference genome
# ├── annotation/ # GFF annotation files
# ├── input/ # Pool configuration
# ├── pipeline/ # Pipeline scripts (from GitHub)
# └── README.md # Documentation💡 Tip: File naming is configurable via
config.shparameters
cd 1_data
# Place your paired-end FASTQ files here
# Naming convention is defined in config.sh:
# - fa_ex_1='_1.fq.gz' (forward reads extension)
# - fa_ex_2='_2.fq.gz' (reverse reads extension)
# Files should be named: SAMPLE_NAME + extension
# Examples:
# Pool_Alive_test_1.fq.gz # Forward reads for Pool_Alive_test sample
# Pool_Alive_test_2.fq.gz # Reverse reads for Pool_Alive_test sample
# Pool_dead_test_1.fq.gz # Forward reads for Pool_dead_test sample
# Pool_dead_test_2.fq.gz # Reverse reads for Pool_dead_test sample
# Alternative naming (modify config.sh accordingly):
# Sample.R1.fastq.gz / Sample.R2.fastq.gz -> fa_ex_1='.R1.fastq.gz', fa_ex_2='.R2.fastq.gz'
# Sample_read1.fq.gz / Sample_read2.fq.gz -> fa_ex_1='_read1.fq.gz', fa_ex_2='_read2.fq.gz'
# Important: All samples must follow the same naming convention!cd genome
# Download Anopheles gambiae reference genome
wget http://ftp.ensemblgenomes.org/pub/metazoa/release-57/fasta/anopheles_gambiae/dna/Anopheles_gambiae.AgamP4.dna.toplevel.fa.gz
gunzip Anopheles_gambiae.AgamP4.dna.toplevel.fa.gz
# For testing with smaller data, use single chromosome:
# wget "http://ftp.ensemblgenomes.org/pub/metazoa/release-57/fasta/anopheles_gambiae/dna/Anopheles_gambiae.AgamP4.dna_sm.chromosome.2L.fa.gz"
# gunzip Anopheles_gambiae.AgamP4.dna_sm.chromosome.2L.fa.gz
# mv Anopheles_gambiae.AgamP4.dna_sm.chromosome.2L.fa Anopheles_gambiae.AgamP4.dna.toplevel.facd annotation
# Download GFF annotation
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/metazoa/release-57/gff3/anopheles_gambiae/Anopheles_gambiae.AgamP4.57.gff3.gz
gunzip Anopheles_gambiae.AgamP4.57.gff3.gz
# Extract genes only
awk '{if($3=="gene") print $0}' Anopheles_gambiae.AgamP4.57.gff3 > gene.only.gff
# For single chromosome testing:
# awk '{if($3=="gene" && $1=="2L") print $0}' Anopheles_gambiae.AgamP4.57.gff3 > gene.only.gffcd input
# Create pools.txt file with sample names and pool sizes
# Format: sample_name,pool_size
cat > pools.txt << EOF
Pool_Alive_test,50
Pool_dead_test,50
EOFThe pipeline uses config.sh to define file naming patterns:
# Edit pipeline/config.sh to match your file naming:
# For files named Sample_1.fq.gz / Sample_2.fq.gz (default):
fa_ex_1='_1.fq.gz'
fa_ex_2='_2.fq.gz'
# For files named Sample.R1.fastq.gz / Sample.R2.fastq.gz:
fa_ex_1='.R1.fastq.gz'
fa_ex_2='.R2.fastq.gz'
# For files named Sample_read1.fq.gz / Sample_read2.fq.gz:
fa_ex_1='_read1.fq.gz'
fa_ex_2='_read2.fq.gz'# For different species, update these in config.sh:
# Reference genome location
loc_genome='../genome/YOUR_SPECIES.fa'
# SnpEff database name (check available databases)
genome_DB='YOUR_SPECIES_DATABASE'
# Annotation file
genes_only='../annotation/YOUR_SPECIES_genes.gff'# Population genetics parameters in config.sh:
window_sliding_width='50000' # Window size for sliding window analysis
window_sliding_stride='25000' # Step size between windows
min_cov='10' # Minimum coverage filter
min_count='2' # Minimum allele countconda activate poolseq
cd pipeline
# Run complete pipeline
bash run_all.bashThe pipeline executes these steps automatically:
- QC: FastQC on raw reads
- Indexing: BWA-MEM2 genome indexing
- Alignment: Read alignment to reference
- Processing: BAM file processing (sort, mark duplicates, add read groups)
- QC: BAM file quality control
- Variant Calling: FreeBayes variant calling
- Annotation: SnpEff variant annotation
- Filtering: Variant filtering and separation
- Statistics: Variant statistics
- Conversion: VCF to table format
- Population Genetics: Fst, diversity, and frequency analysis
- MultiQC: Combined quality report
💡 Flexibility: The bash design allows running individual pipeline components
# Run only quality control
bash step_00_qc_fasta_files.sh
# Run only alignment steps
bash step_01_BWA_index_genome.sh
bash step_02_BWA_alignment.sh
# Run only variant calling
bash step_04_freebayes_variantcalling.sh
# Run only population genetics
bash step_09_grenedalf_snps.sh
⚠️ Important: Update the SnpSift oneperline script path for your environment
cd pipeline
# Find the correct path for SnpSift oneperline script
find $CONDA_PREFIX -name "vcfEffOnePerLine.pl" 2>/dev/null
# Edit config.sh and update the oneperline path:
# oneperline='/path/to/vcfEffOnePerLine.pl'Identifies significantly differentiated SNPs between populations using the Cochran-Mantel-Haenszel test.
Requirements: Completed main pipeline, grenedalf, popoolation2
Example:
cd pipeline
bash cmh_analysis.shOutput: 7_population_genetics/population.cmh containing SNPs with p-values
Customization:
# Edit cmh_analysis.sh for multiple population comparisons:
--population 1-2,1-3,2-3
# Adjust filtering:
--min-count 12 --min-coverage 50 --max-coverage 200Maps quantitative trait loci (QTL) by analyzing allele frequency differences between extreme phenotype pools.
Requirements: Completed main pipeline, R packages (tidyverse, ggplot2, docopt, QTLseqr)
Example:
cd pipeline
bash BSA_analysis.shOutputs:
8_BSA_analysis/results/snps/*.csv- QTL analysis results8_BSA_analysis/plots/snps/*.pdf- Statistical plots (SNP distribution, Δ(SNP-index), G' statistic)
Customization:
# Edit BSA_analysis.sh for different samples:
--highBulk=Resistant_pool --highBulk_size=100
--lowBulk=Susceptible_pool --lowBulk_size=100
--windowSize=1e6
# Multiple comparisons:
Rscript scripts/BSA.R --rawData=$snps_Data --highBulk=Pool1 --lowBulk=Pool2 ...
Rscript scripts/BSA.R --rawData=$snps_Data --highBulk=Pool1 --lowBulk=Pool3 ...poolseq-pipeline/
├── 0_QC/ # Quality control reports
│ ├── fasta_qc/ # FastQC reports
│ ├── bams_QC/ # BAM quality reports
│ └── multiqc_report.html # Combined QC report
├── 2_mapping/ # BAM alignment files
├── 3_processing/ # Processed BAM files
├── 5_freebayes_variants/ # Variant files (VCF, tables)
├── 7_population_genetics/ # Population genetics results
├── 8_BSA_analysis/ # BSA results (if run)
│ ├── results/snps/ # CSV result files
│ └── plots/snps/ # PDF plots
└── log/ # Log files for all steps
- Variants:
5_freebayes_variants/pooled.continuous.freebayes.parallel.snpeff.snps.vcf - SNP Table:
5_freebayes_variants/pooled.continuous.freebayes.parallel.snpeff.snps.table.tsv - Population Stats:
7_population_genetics/snps_windowed_*.csv - QC Report:
0_QC/multiqc_report.html
If you use this pipeline, please cite the tools and our papers: