
Commit b003183

A new, improved Acanthophis (#1)
- [x] Full support for DPW2 pipeline
- [x] Hologenomics pipeline
- [x] Kraken2
- [x] Kaiju
- [x] Centrifuge
- [x] Megahit assembly of unmapped reads
- [x] Unify log and data paths
- [x] ~Cloud profiles via DPW2 work~ To be added later
- [x] New config
- [x] New rl2s.tsv
- [x] Fix bug in khmer, and get kwip working again
- [x] fastqc pre & post
- [x] multiqc
- [x] qualimap
- [x] QCtype
- [x] theta_prior
- [x] Define regions based on coverage, not size
2 parents 62ec141 + 63c10f4 commit b003183

59 files changed

Lines changed: 1715 additions & 1575 deletions


.github/workflows/python-publish.yml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ jobs:
         python -m build --sdist --wheel --outdir dist/

     - name: Publish distribution to Test PyPI
-      uses: pypa/gh-action-pypi-publish@master
+      uses: pypa/gh-action-pypi-publish@release/v1
       if: "!startsWith(github.ref, 'refs/tags')"
       with:
         password: ${{ secrets.TEST_PYPI_API_TOKEN }}

.github/workflows/run-snakemake.yml

Lines changed: 23 additions & 6 deletions
@@ -2,30 +2,47 @@ name: Workflow CI

 on:
   push:
-    branches: "**"
+    branches: "main"
   pull_request:
     branches: [ main ]
   workflow_dispatch:

 jobs:
-  runsnkmk:
+  run_workflow:
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3

     - name: setup-mamba
       uses: conda-incubator/setup-miniconda@v2
       with:
         mamba-version: "*"
         channels: bioconda,kdm801,conda-forge,defaults
         channel-priority: true
+
+    - name: Cache generated inputs
+      id: cache-inputs
+      uses: actions/cache@v3
+      with:
+        path: tests/rawdata
+        key: rawdata-${{ hashFiles('tests/Snakefile.generate-rawdata') }}

-    - name: setup and run Snakemake
+    - name: setup data
       shell: bash -el {0}
       run: |
         mamba info
-        pushd example
+        pushd tests
         source setup.sh
-        snakemake --snakefile Snakefile -j 4 --use-conda --conda-frontend mamba
         popd

+    - name: run workflow
+      shell: bash -el {0}
+      run: |
+        pushd tests
+        source test.sh
+        popd
+
+    - name: cat all log files on failure
+      if: ${{ failure() }}
+      run: |
+        find tests/ -name '*.log' -exec tail -n 1000 {} \;
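The new cache step keys the generated `tests/rawdata` directory on `hashFiles('tests/Snakefile.generate-rawdata')`, so the cached inputs are only rebuilt when the generating Snakefile changes. A minimal Python sketch of that content-addressed keying (`cache_key` and `Snakefile.demo` are hypothetical names; GitHub's `hashFiles()` uses SHA-256 over file contents in roughly this way):

```python
import hashlib
from pathlib import Path

def cache_key(prefix: str, *paths: str) -> str:
    """Derive a cache key from file contents: any change to a hashed
    file yields a new key, invalidating the old cache entry."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(Path(path).read_bytes())
    return f"{prefix}-{digest.hexdigest()}"

# A change to the generating Snakefile produces a different key.
Path("Snakefile.demo").write_text("rule all:\n    input: 'rawdata'\n")
key_before = cache_key("rawdata", "Snakefile.demo")
Path("Snakefile.demo").write_text("rule all:\n    input: 'rawdata2'\n")
key_after = cache_key("rawdata", "Snakefile.demo")
```

An unchanged Snakefile reproduces the same key, which is what lets the CI run skip regenerating the test data.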

README.md

Lines changed: 8 additions & 5 deletions
@@ -2,9 +2,6 @@

 A reusable, comprehensive, opinionated plant variant calling pipeline in Snakemake

-Until I write the documentation, please see [the example workflow](example/).
-It should contain a fully working example workflow.
-
 ![Acanthophis, the most beautiful and badass of snakes](.github/logo.jpg)

 ## Installation & Use
@@ -17,10 +14,13 @@ conda activate someproject
 # install acanthophis itself
 pip install acanthophis

-# generate boilerplate
+# generate a workspace. This copies all files the workflow will need to your workspace directory.
 acanthophis-init /path/to/someproject/

-# edit config.yml to suit your project
+# edit config.yml to suit your project. Hopefully this config file documents
+# all options available in an understandable fashion. If not, please raise an
+# issue on github.
+
 vim config.yml

 # run snakemake
@@ -29,6 +29,9 @@ snakemake -j 16 -p --use-conda --conda-frontend mamba --ri
 snakemake --profile ./ebio-cluster/
 ```

+Until I write the documentation, please see [the example workflow](example/).
+It should contain a fully working example workflow.
+

 ## About & Authors

acanthophis/__init__.py

Lines changed: 6 additions & 161 deletions
@@ -1,3 +1,9 @@
+# Copyright 2016-2022 Kevin Murray/Gekkonid Consulting
+#
+# This Source Code Form is subject to the terms of the Mozilla Public License,
+# v. 2.0. If a copy of the MPL was not distributed with this file, You can
+# obtain one at http://mozilla.org/MPL/2.0/.
+
 import csv
 from collections import defaultdict
 from glob import glob
@@ -16,14 +22,6 @@

 HERE = os.path.abspath(os.path.dirname(__file__))

-class __Rules(object):
-    def __init__(self):
-        for rulefile in glob(f"{HERE}/rules/*.rules"):
-            rule = splitext(basename(rulefile))[0]
-            setattr(self, rule, rulefile)
-
-rules = __Rules()
-
 profiles = {}
 for profiledir in glob(f"{HERE}/profiles/*"):
     profile = basename(profiledir)
@@ -32,156 +30,3 @@ def __init__(self):

 def get_resource(file):
     return f"{HERE}/{file}"
-
-def rule_resources(config, rule, **defaults):
-    def resource(wildcards, attempt, value, maxvalue):
-        return int(min(value * 2^(attempt-1), maxvalue))
-    C = config.get("cluster_resources", {})
-    maxes = C.get("max_values", {})
-    global_defaults = C.get("defaults", {})
-    rules = C.get("rules", {})
-
-    values = {}
-    values.update(global_defaults)
-    values.update(defaults)
-    values.update(rules.get(rule, {}))
-    ret = {}
-    for res, val in values.items():
-        if isinstance(val, str):
-            # the logic below allows restarting with increased resources. If
-            # the resource's value is string, you can't double it with each
-            # attempt, so just return it as a constant.
-            # this is used for things like cluster queues etc.
-            ret[res] = val
-            if C.get("DEBUG", False):
-                print(rule, res, val)
-            continue
-        maxval = maxes.get(res, inf)
-        if C.get("DEBUG", False):
-            print(rule, res, val, maxval)
-        ret[res] = partial(resource, value=val, maxvalue=maxval)
-    return ret
-
-
-def populate_metadata(config, runlib2samp=None, sample_meta=None, setfile_glob=None):
-    try:
-        if runlib2samp is None:
-            runlib2samp = config["metadata"]["runlib2samp_file"]
-        if sample_meta is None:
-            sample_meta = config["metadata"]["sample_meta_file"]
-        if setfile_glob is None:
-            setfile_glob = config["metadata"]["setfile_glob"]
-    except KeyError as exc:
-        raise ValueError("ERROR: metadata files must be configured in config, or passed to populate_metadata()")
-    RL2S, S2RL = make_runlib2samp(runlib2samp)
-    config["RUNLIB2SAMP"] = RL2S
-    config["SAMP2RUNLIB"] = S2RL
-    config["SAMPLESETS"] = make_samplesets(runlib2samp, setfile_glob)
-    if "refs" not in config:
-        raise RuntimeError("ERROR: reference(s) must be configured in config file")
-    config["CHROMS"] = make_chroms(config["refs"])
-    if "varcall" in config:
-        config["VARCALL_REGIONS"] = {
-            vc: make_regions(config["refs"], window=config["varcall"]["chunksize"][vc])
-            for vc in config["varcall"]["chunksize"]
-        }
-
-
-def parsefai(fai):
-    with open(fai) as fh:
-        for l in fh:
-            cname, clen, _, _, _ = l.split()
-            clen = int(clen)
-            yield cname, clen
-
-
-def make_regions(rdict, window=1e6, base=1):
-    window = int(window)
-    ret = {}
-    for refname, refbits in rdict.items():
-        fai = refbits['fasta']+".fai"
-        windows = []
-        curwin = []
-        curwinlen = 0
-        for cname, clen in parsefai(fai):
-            for start in range(0, clen, window):
-                wlen = min(clen - start, window)
-                windows.append("{}:{:09d}-{:09d}".format(cname, start + base, start+wlen))
-        ret[refname] = windows
-    return ret
-
-
-def make_chroms(rdict):
-    ret = {}
-    for refname, refbits in rdict.items():
-        fai = refbits['fasta']+".fai"
-        ref = dict()
-        for cname, clen in parsefai(fai):
-            ref[cname] = clen
-        ret[refname] = ref
-    return ret
-
-
-def _iter_metadata(s2rl_file):
-    with open(s2rl_file) as fh:
-        dialect = "excel"
-        if s2rl_file.endswith(".tsv"):
-            dialect = "excel-tab"
-        for samp in csv.DictReader(fh, dialect=dialect):
-            yield {k.lower(): v for k, v in samp.items()}
-
-
-def make_runlib2samp(s2rl_file):
-    rl2s = {}
-    s2rl = defaultdict(list)
-    for run in _iter_metadata(s2rl_file):
-        if not run["library"] or run["library"].lower().startswith("blank"):
-            # Skip blanks
-            continue
-        if run.get("include", "Y") != "Y":
-            # Remove non-sequenced ones
-            continue
-        rl = (run["run"], run["library"])
-        samp = run["sample"]
-        rl2s[rl] = samp
-        s2rl[samp].append(rl)
-    return dict(rl2s), dict(s2rl)
-
-
-def stripext(path, exts=".txt"):
-    if isinstance(exts, str):
-        exts = [exts, ]
-    for ext in exts:
-        if path.endswith(ext):
-            path = path[:-len(ext)]
-    return path
-
-
-def make_samplesets(s2rl_file, setfile_glob):
-    ssets = defaultdict(list)
-    everything = set()
-    for setfile in glob(setfile_glob):
-        setname = stripext(basename(setfile), ".txt")
-        with open(setfile) as fh:
-            samples = [x.strip() for x in fh]
-        ssets[setname] = samples
-        everything.update(samples)
-    ssets["all_samples"] = everything
-
-    if not os.path.exists("data/samplelists"):
-        os.makedirs("data/samplelists", exist_ok=True)
-    with open("data/samplelists/GENERATED_FILES_DO_NOT_EDIT", "w") as fh:
-        print("you're probably looking for", setfile_glob, file=fh)
-    for setname, setsamps in ssets.items():
-        fname = "data/samplelists/{}.txt".format(setname)
-        try:
-            with open(fname) as fh:
-                currsamps = set([l.strip() for l in fh])
-        except IOError:
-            currsamps = set()
-        if set(setsamps) != currsamps:
-            with open(fname, "w") as fh:
-                print("WARNING: updating sample sets, this will trigger reruns", setname, file=stderr)
-                for s in sorted(setsamps):
-                    print(s, file=fh)
-    return {n: list(sorted(set(s))) for n, s in ssets.items()}
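The removed `rule_resources()` helper doubled a rule's resource request on each Snakemake restart attempt (pairing with `restart-times` in the profile), capped at a configured maximum. A minimal sketch of that schedule, with one correction: in Python `^` is bitwise XOR, so the removed code's `value * 2^(attempt-1)` appears to parse as `(value * 2) ^ (attempt - 1)` rather than doubling; `**` is what the schedule needs.

```python
from functools import partial

def scaled(wildcards, attempt, value, maxvalue):
    # Double the resource on each Snakemake restart, capped at maxvalue.
    # The exponent must be ** here; ^ is bitwise XOR in Python.
    return int(min(value * 2 ** (attempt - 1), maxvalue))

# As in rule_resources(), bind per-rule values with functools.partial so
# Snakemake can call the result as a callable resource with
# (wildcards, attempt). The names and values here are illustrative.
mem_mb = partial(scaled, value=4000, maxvalue=16000)
```

So a rule starting at 4000 MB would request 8000 MB on its first retry and plateau at the 16000 MB cap, rather than failing repeatedly with the same allocation.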

acanthophis/cmd.py

Lines changed: 14 additions & 1 deletion
@@ -1,3 +1,9 @@
+# Copyright 2016-2022 Kevin Murray/Gekkonid Consulting
+#
+# This Source Code Form is subject to the terms of the Mozilla Public License,
+# v. 2.0. If a copy of the MPL was not distributed with this file, You can
+# obtain one at http://mozilla.org/MPL/2.0/.
+
 import acanthophis
 import argparse
 import shutil
@@ -32,6 +38,7 @@ def prompt_yn(message, default=False):
         pass
     return res

+
 def init():
     """acanthophis-init command entry point"""
     ap = argparse.ArgumentParser(description="Initialise an Acanthophis analysis directory", epilog=CLIDOC)
@@ -55,10 +62,16 @@ def init():
         exit(0)

     template_dir = acanthophis.get_resource("template/")
+    rules_dir = acanthophis.get_resource("rules/")
+    envs_dir = acanthophis.get_resource("rules/envs/")
     if args.dryrun:
         print(f"cp -r {template_dir} {args.destdir}")
-    elif args.force or args.yes or prompt_yn(f"cp -r {template_dir} -> {args.destdir}?"):
+        print(f"cp -r {rules_dir} {args.destdir}/")
+        print(f"cp -r {envs_dir} {args.destdir}/rules")
+    elif args.force or args.yes or prompt_yn(f"cp -r {template_dir} -> {args.destdir}? (WARNING: make sure you have git add'd all files as they will be overwritten) "):
         shutil.copytree(template_dir, args.destdir, dirs_exist_ok=True)
+        shutil.copytree(rules_dir, args.destdir + "/rules", dirs_exist_ok=True)
+        shutil.copytree(envs_dir, args.destdir + "/rules/envs", dirs_exist_ok=True)

     for profile in args.cluster_profile:
         if profile not in acanthophis.profiles:
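`acanthophis-init` now copies the packaged `rules/` and `rules/envs/` trees into the workspace with `shutil.copytree(..., dirs_exist_ok=True)`, which merges into an existing directory and silently overwrites colliding files; that is why the prompt gained its "git add" warning. A small self-contained demonstration of the merge behaviour (the `tpl/` and `proj/` layout below is hypothetical, standing in for the packaged template and a user's project):

```python
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
# Miniature "packaged template" with one rules file.
(root / "tpl" / "rules").mkdir(parents=True)
(root / "tpl" / "rules" / "align.rules").write_text("# align rules\n")
# Existing project directory with a user's own file.
(root / "proj").mkdir()
(root / "proj" / "config.yml").write_text("samples: []\n")

# dirs_exist_ok=True (Python 3.8+) merges into the existing directory
# instead of raising FileExistsError; files with the same relative path
# would be overwritten, unrelated files are left alone.
shutil.copytree(root / "tpl", root / "proj", dirs_exist_ok=True)
```

After the copy, `proj/` contains both the new `rules/align.rules` and the untouched `config.yml`.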
Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 __default__:
-  output: "data/log/cluster/"
+  output: "output/log/cluster/"
 DEBUG: True
Lines changed: 10 additions & 12 deletions
@@ -1,20 +1,18 @@
-cluster: "ebio-cluster/jobsubmit"
-cluster-config: "ebio-cluster/cluster.yml"
-cluster-status: "ebio-cluster/jobstatus"
-jobscript: "ebio-cluster/jobscript.sh"
-jobs: 100
-#immediate-submit: true
+cluster: "profiles/ebio-cluster/jobsubmit"
+cluster-config: "profiles/ebio-cluster/cluster.yml"
+cluster-status: "profiles/ebio-cluster/jobstatus"
+cluster-cancel: "qdel"
+jobscript: "profiles/ebio-cluster/jobscript.sh"
+jobs: 384
 verbose: false
-#notemp: true
 use-conda: true
 conda-frontend: mamba
 rerun-incomplete: true
 keep-going: true
-nolock: true
 max-jobs-per-second: 5
-max-status-checks-per-second: 1
-latency-wait: 60
-local-cores: 1
+max-status-checks-per-second: 5
+latency-wait: 300
+local-cores: 64
 restart-times: 3
-scheduler: greedy
 printshellcmds: true
+configfiles: "profiles/ebio-cluster/resources.yml"
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/bin/bash -l
 # properties = {properties}
-
+test -f ~/.bash_env && source ~/.bash_env
 set -ueo pipefail
 {exec_job}

acanthophis/profiles/ebio-cluster/jobstatus

Lines changed: 10 additions & 1 deletion
@@ -7,10 +7,18 @@ qstat_status="$(qstat -u '*' 2>/dev/null | awk '$1 == "'"$1"'"{print $5}')"
 if [ -z "$qstat_status" ]
 then
     # job has finished
-    qacct_errcode=$(qacct -j "$1" | awk '$1 == "exit_status"{print $2}')
+    qacct_errcode=$(qacct -j "$1" 2>/dev/null | awk '$1 == "exit_status"{print $2}')
     #echo "qacct status $qacct_errcode" >&2
+    # it can take a while for jobs to be visible to qacct. We wait for valid output from qacct before proceeding.
+    while [ -z "$qacct_errcode" ]
+    do
+        #echo "No post-job data for $1 yet" >&2
+        sleep 10
+        qacct_errcode=$(qacct -j "$1" 2>/dev/null | awk '$1 == "exit_status"{print $2}')
+    done
     if [ "$qacct_errcode" -ne 0 ]
     then
+        echo "job $1 failed: exit status $qacct_errcode" >&2
         echo failed
     else
         echo success
@@ -19,6 +27,7 @@ else
     # running or errored
     if [[ "$qstat_status" =~ E ]]
     then
+        echo "job $1 failed: qstat status $qstat_status" >&2
         echo failed
     else
         echo running
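The updated `jobstatus` script now polls `qacct` in a loop, because SGE's accounting records can lag behind a job's disappearance from `qstat`; reporting "failed" before `qacct` has the record would make Snakemake give up on jobs that actually succeeded. The pattern, sketched in Python with a stubbed `qacct` (the timeout is an addition for safety; the real script loops until `qacct` answers):

```python
import time

def wait_for(fetch, interval=0.01, timeout=2.0):
    """Poll fetch() until it returns a non-empty string, as jobstatus
    does while waiting for `qacct -j <jobid>` to report exit_status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("no post-job data before timeout")

# Hypothetical stand-in for qacct: accounting data only appears on the
# third poll, mimicking the lag after a job leaves qstat.
state = {"polls": 0}
def fake_qacct():
    state["polls"] += 1
    return "0" if state["polls"] >= 3 else ""

exit_status = wait_for(fake_qacct)
```

Here `wait_for` returns the first non-empty exit status ("0", i.e. success) only once the stub starts answering, exactly the behaviour the shell loop provides.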
