ncibtep@nih.gov

Bioinformatics Training and Education Program

BTEP: Data Science Using Apache Spark for Biomedical Applications

When: Nov. 4, 2019, 11:00 am - 3:30 pm

This class has ended.
To Know
  • Where: Bldg 60, Rathskeller
  • Organized By: BTEP
  • Presented By: James Stratton (Databricks), Frank Nothaft (Databricks)

About this Class

This is a hands-on demo; please bring your laptop, or let us know if you need to borrow one.

The field of genomics has matured to a stage where organizations are sequencing DNA at population scale. However, transforming raw DNA-seq data into a format suitable for analysis has become the new bottleneck to genomic discovery. Typically, teams glue together a series of bioinformatics tools with custom scripts and process data on single-node machines, one sample at a time. Bioinformatics scientists are spending more time building and maintaining pipelines than modeling data. To ease the burden of analyzing population-scale genomic data, a number of open-source bioinformatics tools, such as GATK4, Hail, and ADAM, have moved to Apache Spark™, but mastering these tools is no easy task.

In this workshop, we'll walk through how the Databricks Unified Analytics Platform for Genomics simplifies the end-to-end process of turning raw sequencing data into actionable insights at scale. Introduced by the original creators of Apache Spark, this platform makes it simple to deploy Spark-based bioinformatics tools on cloud computing and accelerates common genomic analyses. Join this half-day technical workshop to learn how to:
  • Call variants, both in a single sample and across multiple samples, using our accelerated GATK4 pipelines
  • Use Spark SQL to characterize the association of variants in a population with phenotypes
  • Use machine learning to model genome-wide disease risk across multiple variants associated with a phenotype of interest
Key technologies employed: GATK4/Variant calling, Genotype-phenotype association tests, population scale risk-modeling via ML, ML model training/deployment

Agenda at a Glance

  • 11:00-11:45: Introduction and Opening Remarks
  • 12:30-1:30: Workshop #1: Accelerating Variant Calls with Apache Spark
  • 1:30-2:30: Workshop #2: Characterizing Genetic Variants with Spark SQL
  • 2:30-3:30: Workshop #3: Disease Risk Scoring with Machine Learning

If you are unable to attend in person, WebEx will be provided.
  • Event address for attendees: https://cbiit.webex.com/cbiit/onstage/g.php?MTID=eeb4e34e8558861862b5a716bb88c6c73
  • WebEx recording available at: https://cbiit.webex.com/recordingservice/sites/cbiit/recording/play/92d05e03bf1a4882b92ec73eab7a85a5

Files

  • NIH-Genomics-Workshop-11_4_2019.pdf