Pipelining and Pipeliner

David Wheeler, PhD, CCBR, NCI
3/7/2016

Data Analysis Pipelines

Desired Characteristics

Reproducible
Graceful Restart
Error Detection
Logging of Progress/Errors
Support for Parallel Execution
Workflow Documentation
Robust

Reproducibility

All Components Unambiguously Defined

Programs and Scripts used are Completely Specified
- Full paths define specific program file
- Program versions are constant
Reference Databases are Completely Specified
- Full paths and versions
Analysis Steps are Well Defined
- Not possible to skip steps

Graceful Restart

Pipeline State is Determined Automatically

Interrupted Analysis Begins where it Left Off
Corrupted Files are Detected and Re-Created
Completed Files are not Recreated
Status of Analysis is Reported
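
A minimal sketch of how a make-style tool can decide what to redo, assuming file modification times are the only evidence of pipeline state. The name needs_rebuild is hypothetical and this is not Snakemake's actual implementation; detecting corrupted files would need an extra check that is not shown here.

```python
import os


def needs_rebuild(output, inputs):
    """Return True if `output` is missing or older than any of `inputs`."""
    if not os.path.exists(output):
        # Nothing has been produced yet, so the step has to run.
        return True
    out_mtime = os.path.getmtime(output)
    # Re-run only when at least one input changed after the output was written.
    return any(os.path.getmtime(path) > out_mtime for path in inputs)
```

A completed, up-to-date output yields False, so the step is skipped on restart.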

Error Detection

Errors are Not Propagated

Errors Stop the Analysis
- Missing files
- Abnormal Program Termination
- Logical Flaws in Workflow
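
A minimal sketch of the first two failure modes listed above, assuming each step is an external command with a known list of input files. The helper run_step is hypothetical; logical flaws in the workflow are not covered by it.

```python
import os
import subprocess


def run_step(cmd, inputs):
    """Run one pipeline step, stopping on missing inputs or a failed command."""
    missing = [path for path in inputs if not os.path.exists(path)]
    if missing:
        # Refuse to start rather than let the program fail later or write bad output.
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))
    # check=True raises CalledProcessError on a non-zero exit status,
    # so an abnormal termination is not passed on to later steps.
    subprocess.run(cmd, check=True)
```

Because both conditions raise an exception, a caller that does not catch it stops at the failing step.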

Logging of Progress/Errors

Pipeline State Recorded

Logs Show Status of Each Completed Step
Informative Error Messages are Generated

Support for Parallel Execution

Parallel Execution Requires no Special Effort

Intelligent Job Submission on Clusters
- Analysis Must be Modular
- Steps that are Independent can be Executed in Parallel
Tracking of Job Status
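
A minimal sketch of why modularity matters for parallel execution, assuming each step is a plain Python callable and the samples do not depend on each other. The helper run_independent is hypothetical, and threads stand in for cluster jobs; Snakemake itself schedules jobs from its dependency graph.

```python
from concurrent.futures import ThreadPoolExecutor


def run_independent(step, samples, max_workers=2):
    """Apply `step` to every sample concurrently and return {sample: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with samples.
        # An exception raised inside any step is re-raised here.
        results = list(pool.map(step, samples))
    return dict(zip(samples, results))
```

The default of two workers mirrors a two-job limit; a dependent step would only start after this call returns.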

Workflow Documentation

All Elements of Workflow Recorded and Replayable

Steps Performed Clearly Documented
All Resources Used Documented
- Data files
- Reference files
- Programs/Scripts
- Parameters used

Robust

Pipeline Does Not Break Easily

Increasing Complexity does not Destabilize
- More Steps
- More Files
- More Parameters
Changing Batch Queuing System does not Destabilize
- Torque
- Slurm
- None

Snakemake

Snakemake: Python-Based, Inspired by Unix Make

Modular Rules Comprise Workflow
Rules Defined in Text With Simple Syntax
Structured Json Files Hold Parameters
Support for Parallel Execution
Logging
Logic Checks
Dependency Tracking

Modular Rules Appear in a 'Snakefile' [snakemake -s 'Snakefile']

Workflow Defined in a Single Plain Text 'Snakefile'

data=["microbial","creatininase"]

rule final:
    input: expand("{x}.counts",x=data)

rule sortuniq:
    input: "{x}.out"
    output: "{x}.counts"
    shell: """
          cat {input}|cut -f2|sort|uniq -c > {output}
           """

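For readers less familiar with the shell pipeline in the sortuniq rule, here is a rough Python equivalent of cut -f2 | sort | uniq -c. The function name count_second_field is hypothetical, and the ordering uses Python's default string sort, which may differ from the locale-dependent order of sort.

```python
from collections import Counter


def count_second_field(lines):
    """Count occurrences of the second tab-separated field, sorted by value."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Like cut -f2, fall back to the whole line when there is no tab.
        key = fields[1] if len(fields) > 1 else line
        counts[key] += 1
    # Returns (value, count) pairs; uniq -c prints the count first.
    return sorted(counts.items())
```

Opening a .out file and passing the file object as lines gives the data that the rule writes to the matching .counts file.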
The Steps [snakemake -s sortuniq.rl --dryrun]

Dry Runs Allow Logic Testing Prior to Real Run

rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
rule sortuniq:
        input: microbial.out
        output: microbial.counts
localrule final:
        input: microbial.counts, creatininase.counts
Job counts:
        count   jobs
        1       final
        2       sortuniq
        3

Running the Pipeline [snakemake -s sortuniq.rl -j 2]

Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       final
        2       sortuniq
        3
rule sortuniq:
        input: microbial.out
        output: microbial.counts
1 of 3 steps (33%) done
rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
2 of 3 steps (67%) done
localrule final:
        input: microbial.counts, creatininase.counts
3 of 3 steps (100%) done

Cluster Execution [snakemake -s sortuniq.rl -j 2 --cluster "sbatch"]

Provided cluster nodes: 2
Job counts:
        count   jobs
        1       final
        2       sortuniq
        3
rule sortuniq:
        input: microbial.out
        output: microbial.counts
rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
1 of 3 steps (33%) done
2 of 3 steps (67%) done
localrule final:
        input: microbial.counts, creatininase.counts
3 of 3 steps (100%) done

Missing Input [snakemake -s sortuniq.rl --dryrun]

Missing Input Stops the Pipeline: No Silent Errors

MissingInputException in line 6 of /home/dwheeler/exometalk/sortuniq.rl:
Missing input files for rule sortuniq:
creatininase.out

Graceful Resume [snakemake -s sortuniq.rl --dryrun]

Only the Missing Output File Will be Created

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       final
        1       sortuniq
        2
rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
1 of 2 steps (50%) done
localrule final:
        input: microbial.counts, creatininase.counts
2 of 2 steps (100%) done

Diagram of a Simple Workflow [snakemake -s sortuniq.rl --dag|dot -Tpng| display]

Useful for Publication

Created by Snakemake as a dot file
Need the dot Program to Convert to PNG
Can Edit the dot File to Add Annotations
dot Usually Available on Unix/Linux
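
A minimal sketch of what a dot file contains, assuming a workflow described as a list of (upstream, downstream) rule pairs. The function to_dot and the graph name workflow are hypothetical; the file Snakemake writes carries additional styling attributes.

```python
def to_dot(edges):
    """Render (source, target) pairs as a Graphviz digraph in dot syntax."""
    lines = ["digraph workflow {"]
    for src, dst in edges:
        # One directed edge per dependency: src must finish before dst.
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("sortuniq", "final")])
```

Because the result is plain text, annotations can be added with any editor before the text is passed to dot -Tpng.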

Structured Parameters

Pipeline Parameters Can be Stored in Structured Files

Snakemake Configuration File
- Structure is JSON (JavaScript Object Notation)
- Referenced within Snakefile

{
    "data": ["microbial","creatininase"]
}

Structured Parameters: Use Within the Snakefile

The Input Identifiers are Stored in params.json

configfile: "params.json"

rule final:
    input: expand("{x}.counts",x=config['data'])

rule sortuniq:
    input: "{x}.out"
    output: "{x}.counts"
    shell: """
          cat {input}|cut -f2|sort|uniq -c > {output}
           """

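A minimal sketch of what the configfile and expand lines accomplish, assuming a single wildcard named x. The helper expand_targets is a hypothetical stand-in; Snakemake's own expand is more general.

```python
import json


def expand_targets(pattern, values):
    """Fill the {x} placeholder in `pattern` once for each value."""
    return [pattern.format(x=value) for value in values]


# Same content as params.json, inlined here so the sketch is self-contained.
config = json.loads('{"data": ["microbial", "creatininase"]}')
targets = expand_targets("{x}.counts", config["data"])
```

Here targets holds the two .counts file names that rule final requests as input.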
Pipeliner: An Interface to Snakemake [/data/CCBR/apps/Pipeliner/runpipe.sh]

ssh -Y dwheeler@biowulf2

GUI Written in Python 3
Tkinter Widget Set
Tabbed Interface
- Project Info, Parameter Selection, Program Json Viewer
- Run Sequence, Job Monitor, Comments Editor, Manual
To Run on Biowulf2 Need X11 Client
- No problem with Linux/Unix: X11 is already there
- For Mac Need to install XQuartz http://www.xquartz.org/
- For Windows Need Cygwin or X-Win32

Pipeliner: Choosing Parameter Sets and the Pipeline to Run

The Annotation Set is Chosen
- A Json file is selected that contains paths to reference files
- References are self-consistent and come as a complete package
- human, mouse, rat
- various builds
The binary set is chosen
- A Json file is selected that contains paths to programs
- The program set is self-consistent and comes in as a complete package
The Pipeline is Chosen
- Initialqc, exomeseq-germline, exomeseq-somatic, wgslow

Pipeliner: Setting up a Run

The Working Directory is Specified and Initialized
- Some subdirectories are created and stocked with scripts
The Data Source is Specified
- Symbolic Links are Made from the Working Directory to the Data
A Dry Run is Made to Detect Errors
The Pipeline is Submitted to the Batch Queueing System

Pipeliner: Results of a Dry Run

Every Job that will be Run is Listed
The Total Number of Jobs to be Run of Each Type is Listed
Missing Files will be Detected Here
Errors in Pipeline Logic will be Detected

Pipeliner: The Initialqc and Exomeseq-germline Pipeline Diagrams

Pipeliner: The Json Containing the Full Specification for the Project

Can be Saved Within the GUI
Can be Loaded Within the GUI to Recreate an Analysis
Can be Edited Within the GUI to Tweak Parameters (hacking)

Slurm Queue: Two Qualimap Jobs Pending [squeue -u dwheeler]

The Master Pipeline Process and Two Pending Jobs are Shown

[dwheeler@biowulf exometalk]$ squeue -u dwheeler
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          15570447       ccr pl:quali dwheeler PD       0:00      1 (Resources)
          15570448       ccr pl:quali dwheeler PD       0:00      1 (Priority)
          15570258       ccr Pipeline dwheeler  R       2:59      1 cn0695

Pipeliner: Partial Log file

[Mon Mar  7 11:48:31 2016] Provided cluster nodes: 100
[Mon Mar  7 11:48:31 2016] Job counts:
    count   jobs
    1   all_initialqc
    2   fastqc_fastq
    2   fastqc_trimmed
    2   ngsqc
    2   novocraft_novoalign
    2   novocraft_sort
    2   picard_headers
    2   picard_markdups
    2   qualimap
    2   trimmomatic
    19
[Mon Mar  7 11:48:31 2016] rule fastqc_fastq:
    input: F23_10000.R1.fastq.gz, F23_10000.R2.fastq.gz
    output: QC/F23_10000.R1_fastqc.html, QC/F23_10000.R2_fastqc.html
    threads: 8
[Mon Mar  7 11:48:31 2016] export JAVA_OPTS='-Djava.io.tmpdir=/scratch'; /usr/local/apps/fastqc/0.11.2/fastqc -o QC -f fastq --threads 8 --contaminants /data/CCBR/dev/Pipeline/Pipeliner/Data/fastqc.adapters F23_10000.R1.fastq.gz F23_10000.R2.fastq.gz
[Mon Mar  7 11:48:32 2016] rule trimmomatic:
    input: F22_10000.R1.fastq.gz, F22_10000.R2.fastq.gz
    output: F22_10000.R1.trimmed.fastq.gz, F22_10000.R1.trimmed.unpair.fastq.gz, F22_10000.R2.trimmed.fastq.gz, F22_10000.R2.trimmed.unpair.fastq.gz
    threads: 4
[Mon Mar  7 11:48:32 2016] 
            java -Xmx8g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/trimmomatic/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 4 -phred33 F22_10000.R1.fastq.gz F22_10000.R2.fastq.gz F22_10000.R1.trimmed.fastq.gz F22_10000.R1.trimmed.unpair.fastq.gz F22_10000.R2.trimmed.fastq.gz F22_10000.R2.trimmed.unpair.fastq.gz ILLUMINACLIP:/data/CCBR/dev/Pipeline/Pipeliner/Data/adapters2.fa:3:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:20

Pipeliner: Partial Annotation Json Defining Reference Files

 {"references": {
    "BWAGENOME": "/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa",
    "GENOME": "/fdb/GATK_resource_bundle/b37/human_g1k_v37.fasta",
    "INDELSITES": "/fdb/GATK_resource_bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf", 
    "NOVOINDEX": "/data/CCBR/local/lib/human_g1k_v37_iupac.nix", 
    "REFFLAT": "/data/CCBR/local/lib/SS_exome.bed", 
    "SNPSITES": "/fdb/GATK_resource_bundle/b37/dbsnp_138.b37.vcf",
    "INDELSITES2":"/fdb/GATK_resource_bundle/b37/1000G_phase1.indels.b37.vcf",
    "ANNDIR": "/usr/local/apps/ANNOVAR/2014-11-12/humandb/",
    "tg_GS_INDELS": "/fdb/GATK_resource_bundle/hg19-2.8/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz",
    "tg_PHASE_INDELS": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_phase1.indels.hg19.vcf.gz",
    "SNP138": "/fdb/GATK_resource_bundle/hg19-2.8/dbsnp_138.hg19.vcf.gz",
    "HAPMAP": "/fdb/GATK_resource_bundle/hg19-2.8/hapmap_3.3.hg19.vcf.gz",
    "B1K": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_phase1.snps.high_confidence.hg19.vcf.gz",
    "OMNI": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_omni2.5.hg19.vcf.gz", 
    "adapter.file": "/data/CCBR/dev/Pipeline/Pipeliner/Data/TruSeq_and_nextera_adapters.ngsqc.dat",

Pipeliner: Partial Binary/Script Json Defining Programs Used in Pipelines

{
     "bin": {
        "NOVOALIGN": "/usr/local/apps/novocraft/3.02.10/novoalign",
        "NOVOSORT": "/usr/local/apps/novocraft/3.02.10/novosort",
        "ANNOVAR1": "/usr/local/apps/ANNOVAR/2014-07-14/convert2annovar.pl", 
        "ANNOVAR2": "/usr/local/apps/ANNOVAR/2014-07-14/table_annovar.pl", 
        "INDEXBAM": "java -jar /usr/local/apps/picard/1.129/picard.jar BuildBamIndex", 
        "COVCALC": "QC/Coverage_calc4.pl", 
        "COVFREQ": "/data/CCBR/local/lib/Cov_Frequency_targeted.R",
        "GATK": "java -Xmx64g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/GATK/3.3-0/GenomeAnalysisTK.jar", 
        "MARKDUPS": "java -Xmx16g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar MarkDuplicates", 
        "PICARD1": "java -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar AddOrReplaceReadGroups", 
        "PICARD2": "java -Xmx4G -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar CollectInsertSizeMetrics", 
        "PICHIST": "/data/CCBR/local/lib/picardhist.R",

Pipeliner: Partial Rules Json Defining Rules Belonging to Pipelines

{
    "rules": {
        "ngsqc": ["initialqc","wgslow"],
        "fastqc.fastq": ["initialqc","wgslow"],
        "fastqc.trimmed": ["initialqc","wgslow"],
        "trimmomatic": ["initialqc","wgslow"],
        "samtools.flagstats": ["none"],
        "samtools.flagstats.dedup": ["none"],
        "qualimap": ["initialqc","wgslow"],
        "bwa.pe":["wgslow"],
        "novocraft.novoalign":["initialqc"],
        "bwa.index.ref":["none"],
        "annovar":["exomeseq-pairs"], 
        "script.checkqc":["exomeseq-pairs","exomeseq-somatic","exomeseq-germline","exomeseq-germline-recal","exomeseq-germline-partial"],
        "samtools.sam2bam":["initialqc","wgslow"], 
        "script.coverage.qc":["none"],

Pipeliner: A Modular Rule

The Snakefile is Built from a Modular Rule Library

rule picard_markdups:
     input:  "{x}.sorted.bam"
     output: out = temp("{x}.dedup.bam"),
             metrics = "{x}.sorted.txt"
     params: markdups=config['bin']['MARKDUPS']
     shell:  "{params.markdups} I={input} O={output.out} M={output.metrics} REMOVE_DUPLICATES=TRUE AS=TRUE PG='null'"

Pipeliner: Partial Snakefile for Exomeseq-germline Pipeline

import os
configfile: "run.json"
pairs=sorted(list(config['project']['pairs'].keys()))
rule all_exomeseq_germline:
    input:  "combined.gvcf",
             expand("all.{type}.dbnsfp.vcf", type=["snp","indel"])
    output: 
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.combine.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.genotype.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.haplotype.caller.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.realign.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/picard.headers.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/picard.markdups.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.batchgvcf.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.checkqc.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.split.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/snpeff.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/snpeff.dbnsfp.rl"
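
A minimal sketch of how include lines like these could be generated from the rules Json, assuming each rule lives in a file named after the rule with a .rl suffix. The helpers rules_for and include_lines are hypothetical and are not Pipeliner's actual code; the sample entries are copied from the rules Json shown earlier.

```python
def rules_for(pipeline, rules):
    """Return the sorted rule names whose pipeline list contains `pipeline`."""
    return sorted(name for name, pipelines in rules.items() if pipeline in pipelines)


def include_lines(pipeline, rules, rules_dir):
    """Build one Snakefile include line per rule belonging to `pipeline`."""
    return [f'include: "{rules_dir}/{name}.rl"' for name in rules_for(pipeline, rules)]


rules = {
    "ngsqc": ["initialqc", "wgslow"],
    "bwa.pe": ["wgslow"],
    "novocraft.novoalign": ["initialqc"],
    "annovar": ["exomeseq-pairs"],
}
lines = include_lines("initialqc", rules, "/data/CCBR/apps/Pipeliner/Rules")
```

Rules tagged "none" match no pipeline name and are therefore never included.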

Pipeliner: Partial Project Json Fully Specifying an Analysis

{
    "project": {
        "analyst": "dwheeler",
        "annotation": "hg19",
        "batchsize": "20",
        "binset": "standard-bin",
        "bysample": "no",
        "cluster": "cluster_medium.json",
        "comments": "Enter comments here.\n",
        "custom": [],
        "datapath": "/data/CCBR/dev/PipelineTestSeqs/exomeseq/human/germline",
        "efiletype": "fastq",
        "filetype": "fastq.gz",
        "id": "Someid",
        "organism": "human",
