Core Facilities: Data pre-processing and data returning policies

Core Facilities

There are a number of core facilities available to NCI researchers. For more information, see the Office of Science and Technology Resources.

We most commonly see data from the following cores:

CCR Sequencing Facility (CCR-SF) - located at the ATRF in Frederick, MD. This core is dedicated to high throughput sequencing.
- For large scale projects and production ready projects (compare with NCI CCR Genomics Core)
  Summary of Technologies
  - Illumina Short Read Sequencing
    - ChIP-Seq
    - Cut and Run
    - ATAC-Seq (only for pilot projects)
    - RNA-Seq (mRNA, Total RNA and microRNA)
    - Whole Genome Sequencing
    - Whole Exome Sequencing
    - Methylated DNA sequencing (bisulfite)
    - Amplicon Sequencing
  - Long reads / PacBio Sequencing
    - Whole Genome Sequencing
    - RNA Sequencing
    - Targeted Sequencing
    - HLA Typing
    - 16S sequencing
  - Short read and long read protocols for single cell
  - Optical mapping with Bionano Genomics
NCI CCR Single Cell Analysis Facility (SCAF) - located on the NIH Bethesda main campus, building 41, and provides advanced single-cell genomics technologies.
- Primarily for CCR researchers on the Bethesda campus.
  Summary of Technologies
  - 10X Genomics Chromium system
  - Advanced Methods: Plate-based single cell approaches (e.g., Smart-Seq2)
  - See the SCAF webpage for information on emerging technologies
NCI CCR Genomics Core - located on the NIH Bethesda main campus, building 41.
- Rapid turnover for smaller projects (compare with CCR-SF)
  Summary of Technologies
  - Next Generation Sequencing (iSeq 100, MiSeq, NextSeq 550 and the NextSeq 2000)
    - Applications include targeted gene sequencing (amplicon and targeted enrichment), metagenomics, gene expression studies, ChIP-Seq and RNA-Seq
  - Sanger Sequencing
  - Digital Gene Expression
  - Digital droplet PCR
  - Analytical / Preparative electrophoresis
  - Automation
  - NanoString GeoMX DSP
  - Oxford Nanopore MinION

Data from these cores will likely undergo some form of pre-processing. Additionally, cores may return data to the user in different ways. See below for current core protocols.

Core data pre-processing protocols

CCR-SF

For all projects, CCR-SF conducts primary and secondary analyses, including initial base-calling, demultiplexing, data quality control, and reference genome alignment of NGS reads. Tertiary analyses may also be conducted on a project-by-project basis. For more information, refer to the CCR-SF FAQs.

SCAF

For a standard 10x assay against a standard reference, you can expect the raw sequencing data to be processed through to the 10x Genomics cellranger output, including all quality control steps and troubleshooting in between. Otherwise, the degree of bioinformatic support will vary based on the project and individual needs. Non-standard projects generally require the development of a custom data processing workflow. As such, SCAF will conduct base-level analyses to ensure assay performance. In limited cases, the SCAF will also perform secondary analysis steps including bioinformatic analysis, interpretation, figure generation, and dataset submission.

NCI CCR Genomics Core

For NGS data, the NCI CCR Genomics Core will generate fastq files and initial QC metrics (if requested).

In addition,

The Core has a dedicated bioinformatics consultant who advises customers on appropriate experimental design, interpretation of QC data and helps to direct users to the existing bioinformatics tools under CCBR and other available bioinformatic entities. --- NCI CCR Genomics Core

How will my data be returned to me?

CCR-SF

For information on how data is returned from CCR-SF, refer to the sequencing facility FAQs: How are the data files delivered?

SCAF

Data is returned from the SCAF via a Globus share link.

NCI CCR Genomics Core

According to the NCI CCR Genomics Core website

Next gen sequencing data will be delivered via pre-signed URLs in the form of a .tar or .zip archive containing all fastq files as well as a package containing QC metrics. A .tar archive of the entire run directory can also be delivered upon request. All pre-signed URLs are valid for one week from the delivery date.

All delivered NGS data will be uploaded for long-term storage on your behalf to the NCI Data Vault. All project data (both raw and processed) will be stored for a period of two years from the run completion date. Please back up and store your project data within this timeframe. While it is possible that project data may be retrieved after this time frame, we cannot guarantee that all raw files will be available.
For information about the NCI Data Vault, visit https://wiki.nci.nih.gov/display/DMEdoc

Understanding QA/QC reports

QA/QC reports are generated by programs such as FastQC and MultiQC.

FastQC runs several quality checks on raw NGS data to give you a general idea regarding the overall quality of your data. FastQC will generate a report for each sample.

MultiQC, on the other hand, parses and aggregates summary information from a number of bioinformatic tools into a single report. In our example below, we have simply used MultiQC to combine the FastQC summary information from all samples into a single report, but you can also include log files and output from other steps in your bioinformatic workflow, for example, quality trimming with tools such as trimmomatic and cutadapt.

`fastqc`

Note that each section of the report is marked by color coded flags (i.e., green, yellow, red). Yellow and red flags, which indicate "warning" and "fail" respectively, may indicate a problem with the quality of your data. Such flags suggest that you should take a closer look at the data, but whether they represent an actual quality issue is contextually dependent and based on your experiment.
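If you have many reports to check, the flags can also be read programmatically. The sketch below assumes the `summary.txt` file that FastQC writes inside each zipped report, with one tab-separated line per module (status, module name, file name). The function name `flagged_modules` and the file name `sample_R1.fastq.gz` are hypothetical and used only for illustration.

```python
def flagged_modules(summary_text):
    """Return (status, module) pairs for every module that did not PASS."""
    flagged = []
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        # Assumed layout of each line: STATUS <tab> module name <tab> file name
        status, module = line.split("\t")[:2]
        if status in ("WARN", "FAIL"):
            flagged.append((status, module))
    return flagged


# Toy summary text standing in for the contents of a real summary.txt file.
example = (
    "PASS\tBasic Statistics\tsample_R1.fastq.gz\n"
    "WARN\tPer tile sequence quality\tsample_R1.fastq.gz\n"
    "FAIL\tPer base sequence content\tsample_R1.fastq.gz\n"
)
for status, module in flagged_modules(example):
    print(status, module)
```

A list like this only tells you where to look. Whether a flag matters still depends on your experiment.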

Let's break down some of the components of this report.

Basic Statistics

Includes general summary information. You should note the "Total Sequences", "Sequence length", and "%GC". Are these what you expect?

Per Base Sequence Quality

Includes a box and whisker plot summarizing the quality scores of all sequences in a sample at each base pair position. The blue line tracks the mean quality score.

There may be lower quality scores across the first few positions, and you will likely see a general decline in quality along the length of the read. In general, a quality score greater than 28 indicates a high quality base call.
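To make the idea of a per-position mean quality concrete, here is a minimal sketch that decodes FASTQ quality strings and averages the scores at each position. It assumes the Phred+33 encoding (quality = ASCII code minus 33). The function name and the toy quality strings are made up for illustration, and this is not FastQC's own implementation.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, char in enumerate(qual):
            if i == len(totals):
                # First read that reaches this position: start new tallies.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Three toy quality strings: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
quals = ["IIII5", "IIII#", "III55"]
means = mean_quality_per_position(quals)
low = [i + 1 for i, q in enumerate(means) if q < 28]
print(means)
print("positions with mean quality below 28:", low)
```

In this toy input only the last position falls below 28, which mirrors the decline toward the end of the read that is common in real data.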

Per Tile Sequence Quality

This plot only appears if Illumina headers are retained. It allows you to assess quality across the flowcell. We want this plot to stay fairly blue across all tiles. The blue colors indicate "where the quality was at or above the average for that base in the run", whereas warmer colors indicate a decrease in quality for a tile compared to other tiles for that base. If there are warmer colors throughout, there may have been a problem with the Illumina flowcell.

Though we have a warning for our example fastqc report, overall the per tile sequence quality looks fine. See the linked fastqc documentation for an example of a bad plot.

Per Sequence Quality Scores

This plot shows the quantity of sequences associated with a given mean quality score. Ideally we want the majority of our reads to be of high quality, so we would expect a peak toward the right of the plot with no major peaks at lower quality scores.

The per sequence quality scores look fantastic for this sample.

Per Base Sequence Content

In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. --- fastqc documentation

However, this quality check often fails for RNAseq data:

This is because the first 10-12 bases result from the ‘random’ hexamer priming that occurs during RNA-seq library preparation. This priming is not as random as we might hope giving an enrichment in particular bases for these initial nucleotides. --- hbctraining
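The quantity behind this plot is the percentage of each base at each read position. The following sketch computes it for a handful of toy reads. The function name and the reads are hypothetical, and it is meant only to show what "lines running parallel" would mean numerically, not to reproduce FastQC.

```python
from collections import Counter


def base_composition(sequences):
    """Percentage of each base (A, C, G, T, N) at every read position."""
    counters = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counters):
                counters.append(Counter())
            counters[i][base] += 1
    # A Counter returns 0 for bases never seen at a position.
    return [
        {b: 100.0 * c[b] / sum(c.values()) for b in "ACGTN"}
        for c in counters
    ]


reads = ["GCAT", "GGAT", "GCTA", "ACTN"]
composition = base_composition(reads)
for pos, comp in enumerate(composition, start=1):
    print(pos, comp)
```

In an unbiased library the percentages for a given base would be similar from one position to the next. The skew at the first position of these toy reads is the kind of pattern that priming bias produces at the start of RNA-seq reads.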

Per Sequence GC Content

The per sequence GC content should demonstrate a normal distribution. The peak should match the underlying GC content from your genome of interest. Biases here could indicate a contaminated library.
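As a rough illustration of the value being plotted, the sketch below computes the GC percentage of individual reads. Excluding ambiguous bases such as N from the denominator is a choice made here, not necessarily what FastQC does, and the function name and reads are made up.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


reads = ["ACGTACGT", "GGGCCCAT", "ATATATAT", "GCNNGCAT"]
per_read = [gc_percent(r) for r in reads]
print(per_read)
print("mean GC%:", sum(per_read) / len(per_read))
```

A histogram of these per-read values over a whole file is what you would compare against the expected GC content of your genome of interest.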

Per Base N Content

An 'N' base call results when the sequencer cannot confidently determine the base. There may be a low number of Ns throughout your sequences. This is only a concern if the proportion of Ns is high, although you will likely see flags in earlier sections of the report if that is the case.

Sequence Length Distribution

This shows the number of sequences by sequence length. Variation here will depend on the sequencing platform that generated your sequences.

Sequence Duplication Levels

Sequence duplication levels are based on the sequences that appear in the first 100,000 reads of each file. This check looks for exact sequence matches, so even high read coverage of a region would not necessarily produce exact duplicates.

High duplication could result from:

- Low library diversity
- Vector or adapter contamination

For RNAseq projects, a low level of duplication with a small spike at the 10 bin may occur. This is due to greatly oversequencing high copy genes in order to detect low copy genes.

The sequence duplication levels can be paired with the overrepresented sequences to determine the source of duplication.
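The exact-match counting described above can be sketched in a few lines. This is a simplified illustration under our own assumptions: it only inspects the first `sample_size` reads and does not reproduce the additional steps of FastQC's real module. The function name and toy reads are hypothetical.

```python
from collections import Counter


def duplication_summary(sequences, sample_size=100_000):
    """Count exact duplicates among the first `sample_size` reads."""
    subset = sequences[:sample_size]
    if not subset:
        return {}, 0.0
    counts = Counter(subset)
    # Map duplication level (copies of a sequence) -> number of reads at that level.
    reads_per_level = Counter()
    for copies in counts.values():
        reads_per_level[copies] += copies
    # Share of reads that would remain if every duplicate were collapsed.
    percent_remaining = 100.0 * len(counts) / len(subset)
    return dict(reads_per_level), percent_remaining


reads = ["ACGT"] * 6 + ["TTGA"] * 2 + ["CCAT", "GGTA"]
levels, remaining = duplication_summary(reads)
print(levels)
print(remaining)
```

Here one sequence accounts for six of the ten reads, so only 40% of the reads would remain after collapsing duplicates. A single dominant sequence like this is worth checking against the overrepresented sequences list.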

Overrepresented Sequences

This module lists all of the sequence which make up more than 0.1% of the total. --- fastqc documentation

Based on our example file, it is worth making sure that all adapters have been removed from our sequences.
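The 0.1% threshold quoted above is easy to apply yourself. The sketch below flags any exact sequence whose copies exceed that share of the reads it is given. The function name and the toy data are illustrative assumptions, with a short adapter-like sequence spiked into a background of unique reads.

```python
from collections import Counter
from itertools import islice, product


def overrepresented(sequences, threshold_percent=0.1):
    """Sequences whose exact copies exceed `threshold_percent` of all reads."""
    total = len(sequences)
    hits = []
    for seq, count in Counter(sequences).most_common():
        percent = 100.0 * count / total
        if percent > threshold_percent:
            hits.append((seq, count, percent))
    return hits


# 1960 distinct 6-mers as background, plus 40 copies of an adapter-like sequence.
background = ["".join(p) for p in islice(product("ACGT", repeat=6), 1960)]
reads = background + ["AGATCGGAAGAGC"] * 40
for seq, count, percent in overrepresented(reads):
    print(seq, count, percent)
```

Each background read makes up 0.05% of the total and is not reported, whereas the spiked-in sequence makes up 2% and is.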

Adapter Content

This quality check searches your reads for known adapter sequences and shows, at each position, the cumulative percentage of reads in which an adapter has been seen.

Note: at times the adapter content and overrepresented sequences checks do not agree. If one or both point to adapter contamination, you should consider adapter trimming.

`multiqc`

In this example, multiqc is simply aggregating results from fastqc. This allows us to compare the overall quality of our entire sequencing run.
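Conceptually, this kind of aggregation pivots per-sample results into one table so that samples can be compared side by side. The sketch below illustrates the idea with made-up pass/warn/fail records. It is not how MultiQC is implemented, and the sample names and function name are hypothetical.

```python
from collections import defaultdict


def status_table(records):
    """Pivot (sample, module, status) records into {module: {sample: status}}."""
    table = defaultdict(dict)
    for sample, module, status in records:
        table[module][sample] = status
    return dict(table)


records = [
    ("sample_A", "Per base sequence quality", "PASS"),
    ("sample_B", "Per base sequence quality", "PASS"),
    ("sample_A", "Adapter Content", "PASS"),
    ("sample_B", "Adapter Content", "WARN"),
]
table = status_table(records)
samples = sorted({sample for sample, _, _ in records})
for module, by_sample in table.items():
    # One row per module, one column per sample ("NA" if a sample lacks that module).
    print(module, [by_sample.get(s, "NA") for s in samples])
```

Reading across a row makes it easy to spot a single sample that behaves differently from the rest of the run.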