David Wheeler, PhD, CCBR, NCI
3/7/2016
Desired Characteristics
All Components Unambiguously Defined
Pipeline State is Determined Automatically
Errors are Not Propagated
Pipeline State Recorded
Parallel Execution Requires no Special Effort
All Elements of Workflow Recorded and Replayable
Pipeline Does Not Break Easily
Snakemake: Python-Based, Inspired by Unix Make
Workflow Defined in a Single Plain Text 'Snakefile'
data=["microbial","creatininase"]
rule final:
input: expand("{x}.counts",x=data)
rule sortuniq:
input: "{x}.out"
output: "{x}.counts"
shell: """
cat {input}|cut -f2|sort|uniq -c > {output}
"""
Dry Runs Allow Logic Testing Prior to Real Run
rule sortuniq:
input: creatininase.out
output: creatininase.counts
rule sortuniq:
input: microbial.out
output: microbial.counts
localrule final:
input: microbial.counts, creatininase.counts
Job counts:
count jobs
1 final
2 sortuniq
3
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 final
2 sortuniq
3
rule sortuniq:
input: microbial.out
output: microbial.counts
1 of 3 steps (33%) done
rule sortuniq:
input: creatininase.out
output: creatininase.counts
2 of 3 steps (67%) done
localrule final:
input: microbial.counts, creatininase.counts
3 of 3 steps (100%) done
Provided cluster nodes: 2
Job counts:
count jobs
1 final
2 sortuniq
3
rule sortuniq:
input: microbial.out
output: microbial.counts
rule sortuniq:
input: creatininase.out
output: creatininase.counts
1 of 3 steps (33%) done
2 of 3 steps (67%) done
localrule final:
input: microbial.counts, creatininase.counts
3 of 3 steps (100%) done
Missing Input Stops the Pipeline: No Silent Errors
MissingInputException in line 6 of /home/dwheeler/exometalk/sortuniq.rl:
Missing input files for rule sortuniq:
creatininase.out
Only the Missing Output File Will be Created
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 final
1 sortuniq
2
rule sortuniq:
input: creatininase.out
output: creatininase.counts
1 of 2 steps (50%) done
localrule final:
input: microbial.counts, creatininase.counts
2 of 2 steps (100%) done
Useful for Publication
Pipeline Parameters Can be Stored in Structured Files
{
"data": ["microbial","creatininase"],
}
The Input Identifiers are Stored in params.json
configfile: "params.json"
rule final:
input: expand("{x}.counts",x=config['data'])
rule sortuniq:
input: "{x}.out"
output: "{x}.counts"
shell: """
cat {input}|cut -f2|sort|uniq -c > {output}
"""
ssh -Y dwheeler@biowulf2
The Master Pipeline Process and Two Pending Jobs are Shown
[dwheeler@biowulf exometalk]$ squeue -u dwheeler
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
15570447 ccr pl:quali dwheeler PD 0:00 1 (Resources)
15570448 ccr pl:quali dwheeler PD 0:00 1 (Priority)
15570258 ccr Pipeline dwheeler R 2:59 1 cn0695
[Mon Mar 7 11:48:31 2016] Provided cluster nodes: 100
[Mon Mar 7 11:48:31 2016] Job counts:
count jobs
1 all_initialqc
2 fastqc_fastq
2 fastqc_trimmed
2 ngsqc
2 novocraft_novoalign
2 novocraft_sort
2 picard_headers
2 picard_markdups
2 qualimap
2 trimmomatic
19
[Mon Mar 7 11:48:31 2016] rule fastqc_fastq:
input: F23_10000.R1.fastq.gz, F23_10000.R2.fastq.gz
output: QC/F23_10000.R1_fastqc.html, QC/F23_10000.R2_fastqc.html
threads: 8
[Mon Mar 7 11:48:31 2016] export JAVA_OPTS='-Djava.io.tmpdir=/scratch'; /usr/local/apps/fastqc/0.11.2/fastqc -o QC -f fastq --threads 8 --contaminants /data/CCBR/dev/Pipeline/Pipeliner/Data/fastqc.adapters F23_10000.R1.fastq.gz F23_10000.R2.fastq.gz
[Mon Mar 7 11:48:32 2016] rule trimmomatic:
input: F22_10000.R1.fastq.gz, F22_10000.R2.fastq.gz
output: F22_10000.R1.trimmed.fastq.gz, F22_10000.R1.trimmed.unpair.fastq.gz, F22_10000.R2.trimmed.fastq.gz, F22_10000.R2.trimmed.unpair.fastq.gz
threads: 4
[Mon Mar 7 11:48:32 2016]
java -Xmx8g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/trimmomatic/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 4 -phred33 F22_10000.R1.fastq.gz F22_10000.R2.fastq.gz F22_10000.R1.trimmed.fastq.gz F22_10000.R1.trimmed.unpair.fastq.gz F22_10000.R2.trimmed.fastq.gz F22_10000.R2.trimmed.unpair.fastq.gz ILLUMINACLIP:/data/CCBR/dev/Pipeline/Pipeliner/Data/adapters2.fa:3:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:20
{"references": {
"BWAGENOME": "/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa",
"GENOME": "/fdb/GATK_resource_bundle/b37/human_g1k_v37.fasta",
"INDELSITES": "/fdb/GATK_resource_bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf",
"NOVOINDEX": "/data/CCBR/local/lib/human_g1k_v37_iupac.nix",
"REFFLAT": "/data/CCBR/local/lib/SS_exome.bed",
"SNPSITES": "/fdb/GATK_resource_bundle/b37/dbsnp_138.b37.vcf",
"INDELSITES2":"/fdb/GATK_resource_bundle/b37/1000G_phase1.indels.b37.vcf",
"ANNDIR": "/usr/local/apps/ANNOVAR/2014-11-12/humandb/",
"tg_GS_INDELS": "/fdb/GATK_resource_bundle/hg19-2.8/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz",
"tg_PHASE_INDELS": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_phase1.indels.hg19.vcf.gz",
"SNP138": "/fdb/GATK_resource_bundle/hg19-2.8/dbsnp_138.hg19.vcf.gz",
"HAPMAP": "/fdb/GATK_resource_bundle/hg19-2.8/hapmap_3.3.hg19.vcf.gz",
"B1K": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_phase1.snps.high_confidence.hg19.vcf.gz",
"OMNI": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_omni2.5.hg19.vcf.gz",
"adapter.file": "/data/CCBR/dev/Pipeline/Pipeliner/Data/TruSeq_and_nextera_adapters.ngsqc.dat",
{
"bin": {
"NOVOALIGN": "/usr/local/apps/novocraft/3.02.10/novoalign",
"NOVOSORT": "/usr/local/apps/novocraft/3.02.10/novosort",
"ANNOVAR1": "/usr/local/apps/ANNOVAR/2014-07-14/convert2annovar.pl",
"ANNOVAR2": "/usr/local/apps/ANNOVAR/2014-07-14/table_annovar.pl",
"INDEXBAM": "java -jar /usr/local/apps/picard/1.129/picard.jar BuildBamIndex",
"COVCALC": "QC/Coverage_calc4.pl",
"COVFREQ": "/data/CCBR/local/lib/Cov_Frequency_targeted.R",
"GATK": "java -Xmx64g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/GATK/3.3-0/GenomeAnalysisTK.jar",
"MARKDUPS": "java -Xmx16g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar MarkDuplicates",
"PICARD1": "java -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar AddOrReplaceReadGroups",
"PICARD2": "java -Xmx4G -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar CollectInsertSizeMetrics",
"PICHIST": "/data/CCBR/local/lib/picardhist.R",
{
"rules": {
"ngsqc": ["initialqc","wgslow"],
"fastqc.fastq": ["initialqc","wgslow"],
"fastqc.trimmed": ["initialqc","wgslow"],
"trimmomatic": ["initialqc","wgslow"],
"samtools.flagstats": ["none"],
"samtools.flagstats.dedup": ["none"],
"qualimap": ["initialqc","wgslow"],
"bwa.pe":["wgslow"],
"novocraft.novoalign":["initialqc"],
"bwa.index.ref":["none"],
"annovar":["exomeseq-pairs"],
"script.checkqc":["exomeseq-pairs","exomeseq-somatic","exomeseq-germline","exomeseq-germline-recal","exomeseq-germline-partial"],
"samtools.sam2bam":["initialqc","wgslow"],
"script.coverage.qc":["none"],
The Snakefile is Built from a Modular Rule Library
rule picard_markdups:
input: "{x}.sorted.bam"
output: out = temp("{x}.dedup.bam"),
metrics = "{x}.sorted.txt"
params: markdups=config['bin']['MARKDUPS']
shell: "{params.markdups} I={input} O={output.out} M={output.metrics} REMOVE_DUPLICATES=TRUE AS=TRUE PG='null'"
import os
configfile: "run.json"
pairs=sorted(list(config['project']['pairs'].keys()))
rule all_exomeseq_germline:
input: "combined.gvcf",
expand("all.{type}.dbnsfp.vcf", type=["snp","indel"])
output:
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.combine.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.genotype.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.haplotype.caller.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.realign.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/picard.headers.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/picard.markdups.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.batchgvcf.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.checkqc.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.split.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/snpeff.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/snpeff.dbnsfp.rl"
{
"project": {
"analyst": "dwheeler",
"annotation": "hg19",
"batchsize": "20",
"binset": "standard-bin",
"bysample": "no",
"cluster": "cluster_medium.json",
"comments": "Enter comments here.\n",
"custom": [],
"datapath": "/data/CCBR/dev/PipelineTestSeqs/exomeseq/human/germline",
"efiletype": "fastq",
"filetype": "fastq.gz",
"id": "Someid",
"organism": "human",