Pipelining and Pipeliner

David Wheeler, PhD, CCBR, NCI
3/7/2016

Data Analysis Pipelines

Desired Characteristics

  • Reproducible
  • Graceful Restart
  • Error Detection
  • Logging of Progress/Errors
  • Support for Parallel Execution
  • Workflow Documentation
  • Robust

Reproducibility

All Components Unambiguously Defined

  • Programs and Scripts used are Completely Specified
    • Full paths define specific program file
    • Program versions are constant
  • Reference Databases are Completely Specified
    • Full paths and versions
  • Analysis Steps are Well Defined
    • Not possible to skip steps
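
One way to check that every program entry is pinned to a specific file is to scan a binary-set dictionary for entries that contain no absolute path. The sketch below is illustrative only: `unpinned_entries` is a hypothetical helper, not part of Snakemake or Pipeliner, and the sample entries are copied from the binary Json excerpt in this talk. A relative entry is not necessarily an error (it may be resolved against the working directory), so the result is a list of candidates for review.

```python
import posixpath
import shlex

def unpinned_entries(bin_map):
    """Return the names of entries whose command line contains no absolute path."""
    flagged = []
    for name, command in bin_map.items():
        # Split the command the way a POSIX shell would tokenize it.
        tokens = shlex.split(command)
        if not any(posixpath.isabs(token) for token in tokens):
            flagged.append(name)
    return flagged

# Sample entries taken from the binary Json excerpt.
bin_map = {
    "NOVOALIGN": "/usr/local/apps/novocraft/3.02.10/novoalign",
    "INDEXBAM": "java -jar /usr/local/apps/picard/1.129/picard.jar BuildBamIndex",
    "COVCALC": "QC/Coverage_calc4.pl",
}

flagged = unpinned_entries(bin_map)
print(flagged)
```

Only the entry given as a relative path is reported; the two entries with versioned absolute paths pass.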

Graceful Restart

Pipeline State is Determined Automatically

  • Interrupted Analysis Begins where it Left Off
  • Corrupted Files are Detected and Re-Created
  • Completed Files are not Recreated
  • Status of Analysis is Reported
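
The restart behavior can be illustrated with a make-style freshness test: an output is rebuilt if it is missing, empty, or older than any of its inputs. This is a minimal sketch, not how Snakemake tracks state internally; `needs_rebuild` is a hypothetical helper, and treating an empty file as corrupted is an assumption made for the example.

```python
import os
import tempfile

def needs_rebuild(output_path, input_paths):
    """Return True if the output must be (re)created."""
    # Assumption: a missing or empty output counts as not done.
    if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
        return True
    out_mtime = os.path.getmtime(output_path)
    # Rebuild if any input was modified after the output was written.
    return any(os.path.getmtime(path) > out_mtime for path in input_paths)

with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "microbial.out")
    dst = os.path.join(workdir, "microbial.counts")
    with open(src, "w") as handle:
        handle.write("id\tx\n")

    missing = needs_rebuild(dst, [src])       # output not created yet

    with open(dst, "w") as handle:
        handle.write("1 x\n")
    # Set explicit timestamps so the example does not depend on clock resolution.
    os.utime(src, (1_000_000, 1_000_000))
    os.utime(dst, (2_000_000, 2_000_000))
    complete = needs_rebuild(dst, [src])      # output newer than input

    os.utime(src, (3_000_000, 3_000_000))
    stale = needs_rebuild(dst, [src])         # input changed after the output

print(missing, complete, stale)
```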

Error Detection

Errors are Not Propagated

  • Errors Stop the Analysis
    • Missing files
    • Abnormal Program Termination
    • Logical Flaws in Workflow
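
The "errors stop the analysis" rule can be sketched with Python's `subprocess` module: each step is run with `check=True`, so an abnormal exit raises an exception and no later step is started. The step names and commands are made up for the illustration; this is not Snakemake's own job runner.

```python
import subprocess
import sys

def run_step(command):
    """Run one step; raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(command, check=True, capture_output=True, text=True)

# Hypothetical three-step workflow; the second step terminates abnormally.
steps = [
    ("step1", [sys.executable, "-c", "print('ok')"]),
    ("step2", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("step3", [sys.executable, "-c", "print('never reached')"]),
]

completed = []
failure = None
for name, command in steps:
    try:
        run_step(command)
    except subprocess.CalledProcessError as error:
        # Record the failure and stop: later steps must not run on bad input.
        failure = (name, error.returncode)
        break
    completed.append(name)

print(completed, failure)
```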

Logging of Progress/Errors

Pipeline State Recorded

  • Logs Show Status of Each Completed Step
  • Informative Error Messages are Generated

Support for Parallel Execution

Parallel Execution Requires no Special Effort

  • Intelligent Job Submission on Clusters
    • Analysis Must be Modular
    • Steps that are Independent can be Executed in Parallel
  • Tracking of Job Status
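
Which steps are independent follows from the dependency graph. The sketch below groups jobs into "waves" that could be submitted together, using the standard-library `graphlib` module (Python 3.9 or later). The job names mirror the sortuniq example in this talk; `parallel_waves` is a hypothetical helper and does not reflect how Snakemake schedules jobs internally.

```python
from graphlib import TopologicalSorter

def parallel_waves(dependencies):
    """Group jobs into successive waves; jobs within a wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # get_ready() returns every job whose prerequisites are all done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Each job maps to the set of jobs it depends on.
dependencies = {
    "final": {"sortuniq:microbial", "sortuniq:creatininase"},
    "sortuniq:microbial": set(),
    "sortuniq:creatininase": set(),
}

waves = parallel_waves(dependencies)
print(waves)
```

The two sortuniq jobs land in the first wave and can run side by side; `final` has to wait for both.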

Workflow Documentation

All Elements of Workflow Recorded and Replayable

  • Steps Performed Clearly Documented
  • All Resources Used Documented
    • Data files
    • Reference files
    • Programs/Scripts
    • Parameters used

Robust

Pipeline Does Not Break Easily

  • Increasing Complexity does not Destabilize
    • More Steps
    • More Files
    • More Parameters
  • Changing Batch Queuing System does not Destabilize
    • Torque
    • Slurm
    • None

Snakemake

Snakemake: Python-Based, Inspired by Unix Make

  • Modular Rules Comprise Workflow
  • Rules Defined in Text With Simple Syntax
  • Structured Json Files Hold Parameters
  • Support for Parallel Execution
  • Logging
  • Logic Checks
  • Dependency Tracking

Modular Rules Appear in a 'Snakefile' [snakemake -s 'Snakefile']

Workflow Defined in a Single Plain Text 'Snakefile'

data=["microbial","creatininase"]

rule final:
    input: expand("{x}.counts",x=data)

rule sortuniq:
    input: "{x}.out"
    output: "{x}.counts"
    shell: """
          cat {input}|cut -f2|sort|uniq -c > {output}
           """

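The `expand(...)` call in rule `final` turns one pattern plus a list of values into a list of concrete file names. A plain-Python analog, written here as the hypothetical helper `expand_pattern`, shows the idea; it is a sketch of the concept and not Snakemake's implementation.

```python
from itertools import product

def expand_pattern(pattern, **wildcards):
    """Fill the pattern with every combination of the supplied wildcard values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combinations]

data = ["microbial", "creatininase"]
targets = expand_pattern("{x}.counts", x=data)
print(targets)

# With two wildcards, this sketch produces the full cross product.
fastqs = expand_pattern("{sample}.{read}.fastq.gz",
                        sample=["F22_10000", "F23_10000"], read=["R1", "R2"])
print(fastqs)
```
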
The Steps [snakemake -s sortuniq.rl --dryrun]

Dry Runs Allow Logic Testing Prior to Real Run

rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
rule sortuniq:
        input: microbial.out
        output: microbial.counts
localrule final:
        input: microbial.counts, creatininase.counts
Job counts:
        count   jobs
        1       final
        2       sortuniq
        3

Running the Pipeline [snakemake -s sortuniq.rl -j 2]

Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       final
        2       sortuniq
        3
rule sortuniq:
        input: microbial.out
        output: microbial.counts
1 of 3 steps (33%) done
rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
2 of 3 steps (67%) done
localrule final:
        input: microbial.counts, creatininase.counts
3 of 3 steps (100%) done

Cluster Execution [snakemake -s sortuniq.rl -j 2 --cluster "sbatch"]

Provided cluster nodes: 2
Job counts:
        count   jobs
        1       final
        2       sortuniq
        3
rule sortuniq:
        input: microbial.out
        output: microbial.counts
rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
1 of 3 steps (33%) done
2 of 3 steps (67%) done
localrule final:
        input: microbial.counts, creatininase.counts
3 of 3 steps (100%) done

Missing Input [snakemake -s sortuniq.rl --dryrun]

Missing Input Stops the Pipeline: No Silent Errors

MissingInputException in line 6 of /home/dwheeler/exometalk/sortuniq.rl:
Missing input files for rule sortuniq:
creatininase.out
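
The check behind this message can be sketched in two steps: match each requested output against the rule's output pattern to recover the wildcard value, then test whether the corresponding input exists. The helpers below are hypothetical and assume each wildcard appears only once in a pattern; Snakemake's real matching is more general.

```python
import os
import re
import tempfile

def pattern_to_regex(pattern):
    """Turn a pattern such as '{x}.counts' into a regex with one named group per wildcard."""
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{piece}>.+)" if index % 2 else re.escape(piece)
    return re.compile(regex)

def input_for(target, output_pattern, input_pattern):
    """Derive the input file name that a requested output depends on."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        raise ValueError(f"{target} does not match {output_pattern}")
    return input_pattern.format(**match.groupdict())

def missing_inputs(targets, output_pattern, input_pattern, workdir):
    """List the required inputs that are absent from the working directory."""
    needed = [input_for(t, output_pattern, input_pattern) for t in targets]
    return [name for name in needed
            if not os.path.exists(os.path.join(workdir, name))]

with tempfile.TemporaryDirectory() as workdir:
    # Only one of the two inputs is present.
    open(os.path.join(workdir, "microbial.out"), "w").close()
    missing = missing_inputs(["microbial.counts", "creatininase.counts"],
                             "{x}.counts", "{x}.out", workdir)

print(missing)
```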

Graceful Resume [snakemake -s sortuniq.rl]

Only the Missing Output File Will be Created

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       final
        1       sortuniq
        2
rule sortuniq:
        input: creatininase.out
        output: creatininase.counts
1 of 2 steps (50%) done
localrule final:
        input: microbial.counts, creatininase.counts
2 of 2 steps (100%) done

Diagram of a Simple Workflow [snakemake -s sortuniq.rl --dag | dot -Tpng | display]

Useful for Publication

  • Created by Snakemake as a dot file
  • Need the dot Program to Convert to PNG
  • Can Edit the dot File to Add Annotations
  • dot Usually Available on Unix/Linux

Structured Parameters

Pipeline Parameters Can be Stored in Structured Files

  • Snakemake Configuration File
    • Structure is JSON (JavaScript Object Notation)
    • Referenced within Snakefile
{
    "data": ["microbial","creatininase"]
}
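
A parameter file like this can be checked outside the pipeline with Python's standard `json` module. The sketch below loads the `data` list and builds the target names from it. It also shows that this strict parser rejects a comma after the last entry; whether a given Snakemake version tolerates such a comma depends on how it reads the file, which is not verified here.

```python
import json

# Strictly valid JSON: no comma after the last entry.
strict_text = '{"data": ["microbial", "creatininase"]}'
config = json.loads(strict_text)
targets = [f"{name}.counts" for name in config["data"]]
print(targets)

# The same content with a trailing comma is rejected by the json module.
trailing_comma_text = '{"data": ["microbial", "creatininase"],}'
try:
    json.loads(trailing_comma_text)
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False
print(trailing_comma_ok)
```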

Structured Parameters: Use Within the Snakefile

The Input Identifiers are Stored in params.json

configfile: "params.json"

rule final:
    input: expand("{x}.counts",x=config['data'])

rule sortuniq:
    input: "{x}.out"
    output: "{x}.counts"
    shell: """
          cat {input}|cut -f2|sort|uniq -c > {output}
           """

Pipeliner: An Interface to Snakemake [/data/CCBR/apps/Pipeliner/runpipe.sh]

ssh -Y dwheeler@biowulf2

  • GUI Written in Python 3
  • Tkinter Widget Set
  • Tabbed Interface
    • Project Info, Parameter Selection, Program Json Viewer
    • Run Sequence, Job Monitor, Comments Editor, Manual
  • To Run on Biowulf2 Need X11 Client
    • No Problem with Linux/Unix: X11 is Already There
    • For Mac Need to Install XQuartz http://www.xquartz.org/
    • For Windows Need Cygwin or X-Win32

Pipeliner: Choosing Parameter Sets and the Pipeline to Run

  • The Annotation Set is Chosen
    • A Json file is selected that contains paths to reference files
    • References are self-consistent and come in as a complete package
    • human, mouse, rat
    • various builds
  • The binary set is chosen
    • A Json file is selected that contains paths to programs
    • The program set is self-consistent and comes in as a complete package
  • The Pipeline is Chosen
    • Initialqc, exomeseq-germline, exomeseq-somatic, wgslow

Pipeliner: Setting up a Run

  • The Working Directory is Specified and Initialized
    • Some subdirectories are created and stocked with scripts
  • The Data Source is Specified
    • Symbolic Links are Made from the Working Directory to the Data
  • A Dry Run is Made to Detect Errors
  • The Pipeline is Submitted to the Batch Queueing System

Pipeliner: Results of a Dry Run

  • Every Job that will be Run is Listed
  • The Total Number of Jobs to be Run of Each Type is Listed
  • Missing Files will be Detected Here
  • Errors in Pipeline Logic will be Detected

Pipeliner: The Initialqc and Exomeseq-germline Pipeline Diagrams

[Figures: Initialqc and Exomeseq-germline pipeline diagrams]

Pipeliner: The Json Containing the Full Specification for the Project

  • Can be Saved Within the GUI
  • Can be Loaded Within the GUI to Recreate an Analysis
  • Can be Edited Within the GUI to Tweak Parameters (hacking)

Slurm Queue: Two Qualimap Jobs Pending [squeue -u dwheeler]

The Master Pipeline Process and Two Pending Jobs are Shown

[dwheeler@biowulf exometalk]$ squeue -u dwheeler
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          15570447       ccr pl:quali dwheeler PD       0:00      1 (Resources)
          15570448       ccr pl:quali dwheeler PD       0:00      1 (Priority)
          15570258       ccr Pipeline dwheeler  R       2:59      1 cn0695

Pipeliner: Partial Log file

[Mon Mar  7 11:48:31 2016] Provided cluster nodes: 100
[Mon Mar  7 11:48:31 2016] Job counts:
    count   jobs
    1   all_initialqc
    2   fastqc_fastq
    2   fastqc_trimmed
    2   ngsqc
    2   novocraft_novoalign
    2   novocraft_sort
    2   picard_headers
    2   picard_markdups
    2   qualimap
    2   trimmomatic
    19
[Mon Mar  7 11:48:31 2016] rule fastqc_fastq:
    input: F23_10000.R1.fastq.gz, F23_10000.R2.fastq.gz
    output: QC/F23_10000.R1_fastqc.html, QC/F23_10000.R2_fastqc.html
    threads: 8
[Mon Mar  7 11:48:31 2016] export JAVA_OPTS='-Djava.io.tmpdir=/scratch'; /usr/local/apps/fastqc/0.11.2/fastqc -o QC -f fastq --threads 8 --contaminants /data/CCBR/dev/Pipeline/Pipeliner/Data/fastqc.adapters F23_10000.R1.fastq.gz F23_10000.R2.fastq.gz
[Mon Mar  7 11:48:32 2016] rule trimmomatic:
    input: F22_10000.R1.fastq.gz, F22_10000.R2.fastq.gz
    output: F22_10000.R1.trimmed.fastq.gz, F22_10000.R1.trimmed.unpair.fastq.gz, F22_10000.R2.trimmed.fastq.gz, F22_10000.R2.trimmed.unpair.fastq.gz
    threads: 4
[Mon Mar  7 11:48:32 2016] 
            java -Xmx8g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/trimmomatic/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 4 -phred33 F22_10000.R1.fastq.gz F22_10000.R2.fastq.gz F22_10000.R1.trimmed.fastq.gz F22_10000.R1.trimmed.unpair.fastq.gz F22_10000.R2.trimmed.fastq.gz F22_10000.R2.trimmed.unpair.fastq.gz ILLUMINACLIP:/data/CCBR/dev/Pipeline/Pipeliner/Data/adapters2.fa:3:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:20

Pipeliner: Partial Annotation Json Defining Reference Files

 {"references": {
    "BWAGENOME": "/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa",
    "GENOME": "/fdb/GATK_resource_bundle/b37/human_g1k_v37.fasta",
    "INDELSITES": "/fdb/GATK_resource_bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf", 
    "NOVOINDEX": "/data/CCBR/local/lib/human_g1k_v37_iupac.nix", 
    "REFFLAT": "/data/CCBR/local/lib/SS_exome.bed", 
    "SNPSITES": "/fdb/GATK_resource_bundle/b37/dbsnp_138.b37.vcf",
    "INDELSITES2":"/fdb/GATK_resource_bundle/b37/1000G_phase1.indels.b37.vcf",
    "ANNDIR": "/usr/local/apps/ANNOVAR/2014-11-12/humandb/",
    "tg_GS_INDELS": "/fdb/GATK_resource_bundle/hg19-2.8/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz",
    "tg_PHASE_INDELS": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_phase1.indels.hg19.vcf.gz",
    "SNP138": "/fdb/GATK_resource_bundle/hg19-2.8/dbsnp_138.hg19.vcf.gz",
    "HAPMAP": "/fdb/GATK_resource_bundle/hg19-2.8/hapmap_3.3.hg19.vcf.gz",
    "B1K": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_phase1.snps.high_confidence.hg19.vcf.gz",
    "OMNI": "/fdb/GATK_resource_bundle/hg19-2.8/1000G_omni2.5.hg19.vcf.gz", 
    "adapter.file": "/data/CCBR/dev/Pipeline/Pipeliner/Data/TruSeq_and_nextera_adapters.ngsqc.dat",

Pipeliner: Partial Binary/Script Json Defining Programs Used in Pipelines

{
     "bin": {
        "NOVOALIGN": "/usr/local/apps/novocraft/3.02.10/novoalign",
        "NOVOSORT": "/usr/local/apps/novocraft/3.02.10/novosort",
        "ANNOVAR1": "/usr/local/apps/ANNOVAR/2014-07-14/convert2annovar.pl", 
        "ANNOVAR2": "/usr/local/apps/ANNOVAR/2014-07-14/table_annovar.pl", 
        "INDEXBAM": "java -jar /usr/local/apps/picard/1.129/picard.jar BuildBamIndex", 
        "COVCALC": "QC/Coverage_calc4.pl", 
        "COVFREQ": "/data/CCBR/local/lib/Cov_Frequency_targeted.R",
        "GATK": "java -Xmx64g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/GATK/3.3-0/GenomeAnalysisTK.jar", 
        "MARKDUPS": "java -Xmx16g -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar MarkDuplicates", 
        "PICARD1": "java -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar AddOrReplaceReadGroups", 
        "PICARD2": "java -Xmx4G -Djava.io.tmpdir=/scratch -jar /usr/local/apps/picard/1.129/picard.jar CollectInsertSizeMetrics", 
        "PICHIST": "/data/CCBR/local/lib/picardhist.R", 

Pipeliner: Partial Rules Json Defining Rules Belonging to Pipelines

{
    "rules": {
        "ngsqc": ["initialqc","wgslow"],
        "fastqc.fastq": ["initialqc","wgslow"],
        "fastqc.trimmed": ["initialqc","wgslow"],
        "trimmomatic": ["initialqc","wgslow"],
        "samtools.flagstats": ["none"],
        "samtools.flagstats.dedup": ["none"],
        "qualimap": ["initialqc","wgslow"],
        "bwa.pe":["wgslow"],
        "novocraft.novoalign":["initialqc"],
        "bwa.index.ref":["none"],
        "annovar":["exomeseq-pairs"],
        "script.checkqc":["exomeseq-pairs","exomeseq-somatic","exomeseq-germline","exomeseq-germline-recal","exomeseq-germline-partial"],
        "samtools.sam2bam":["initialqc","wgslow"],
        "script.coverage.qc":["none"],

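A mapping like this makes it straightforward to pick out the rules that belong to one pipeline. The sketch below uses a subset of the entries above and emits `include:` lines in the style of the exomeseq-germline Snakefile in this talk. How Pipeliner actually assembles its Snakefile is not shown in the slides; the helpers and the `<rule name>.rl` file naming are assumptions made for illustration.

```python
import json

# Subset of the rules Json, closed off so that it parses.
rules_text = """
{
    "rules": {
        "ngsqc": ["initialqc", "wgslow"],
        "fastqc.fastq": ["initialqc", "wgslow"],
        "fastqc.trimmed": ["initialqc", "wgslow"],
        "trimmomatic": ["initialqc", "wgslow"],
        "samtools.flagstats": ["none"],
        "qualimap": ["initialqc", "wgslow"],
        "bwa.pe": ["wgslow"],
        "novocraft.novoalign": ["initialqc"]
    }
}
"""

def rules_for(pipeline, rule_map):
    """Return the sorted names of all rules assigned to the given pipeline."""
    return sorted(name for name, pipelines in rule_map.items()
                  if pipeline in pipelines)

def include_lines(pipeline, rule_map, rules_dir):
    """Build Snakefile include statements (assumed naming: <rule name>.rl)."""
    return [f'include: "{rules_dir}/{name}.rl"'
            for name in rules_for(pipeline, rule_map)]

rule_map = json.loads(rules_text)["rules"]
wgslow_rules = rules_for("wgslow", rule_map)
initialqc_includes = include_lines("initialqc", rule_map,
                                   "/data/CCBR/apps/Pipeliner/Rules")
print(wgslow_rules)
print("\n".join(initialqc_includes))
```
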
Pipeliner: A Modular Rule

The Snakefile is Built from a Modular Rule Library

rule picard_markdups:
     input:  "{x}.sorted.bam"
     output: out = temp("{x}.dedup.bam"),
             metrics = "{x}.sorted.txt"
     params: markdups=config['bin']['MARKDUPS']
     shell:  "{params.markdups} I={input} O={output.out} M={output.metrics} REMOVE_DUPLICATES=TRUE AS=TRUE PG='null'"
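
The `shell:` line is a template: the placeholders in braces are filled from the rule's input, output, and params before the command runs. Python's `str.format` supports the same dotted-attribute syntax, so the substitution can be imitated as below. This is a sketch of the idea, not Snakemake's code path; `render_shell` is a hypothetical helper and the sample name `F23_10000` is assumed as the wildcard value.

```python
from types import SimpleNamespace

def render_shell(template, **namespaces):
    """Fill {input}, {output.name}, and {params.name} placeholders in a command template."""
    return template.format(**namespaces)

# Program entry copied from the binary Json excerpt.
config = {"bin": {"MARKDUPS": "java -Xmx16g -Djava.io.tmpdir=/scratch -jar "
                              "/usr/local/apps/picard/1.129/picard.jar MarkDuplicates"}}

template = ("{params.markdups} I={input} O={output.out} M={output.metrics} "
            "REMOVE_DUPLICATES=TRUE AS=TRUE PG='null'")

command = render_shell(
    template,
    params=SimpleNamespace(markdups=config["bin"]["MARKDUPS"]),
    input="F23_10000.sorted.bam",
    output=SimpleNamespace(out="F23_10000.dedup.bam",
                           metrics="F23_10000.sorted.txt"),
)
print(command)
```

Because the program path comes from the configuration rather than from the rule text, switching the binary set changes the command without editing the rule.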

Pipeliner: Partial Snakefile for Exomeseq-germline Pipeline

import os
configfile: "run.json"
pairs=sorted(list(config['project']['pairs'].keys()))
rule all_exomeseq_germline:
    input:  "combined.gvcf",
             expand("all.{type}.dbnsfp.vcf", type=["snp","indel"])
    output: 
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.combine.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.genotype.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.haplotype.caller.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/gatk.realign.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/picard.headers.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/picard.markdups.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.batchgvcf.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.checkqc.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/script.split.gvcfs.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/snpeff.rl"
include: "/data/CCBR/apps/Pipeliner/Rules/snpeff.dbnsfp.rl"

Pipeliner: Partial Project Json Fully Specifying an Analysis

{
    "project": {
        "analyst": "dwheeler",
        "annotation": "hg19",
        "batchsize": "20",
        "binset": "standard-bin",
        "bysample": "no",
        "cluster": "cluster_medium.json",
        "comments": "Enter comments here.\n",
        "custom": [],
        "datapath": "/data/CCBR/dev/PipelineTestSeqs/exomeseq/human/germline",
        "efiletype": "fastq",
        "filetype": "fastq.gz",
        "id": "Someid",
        "organism": "human",