Bulk RNA Sequencing Analysis Using Partek Flow
Joe Wu, PhD NCI CCR Bioinformatics Training and Education Program ncibtep@nih.gov
What is Partek Flow and why use it
- Partek Flow
- Point-and-click bioinformatics software that enables biologists to create workflows for analyzing high-throughput sequencing data, including:
- DNA
- Bulk and single cell RNA, ATAC/ChIP
- CITE
- Spatial transcriptomics.
- Hosted on Biowulf, which gives users more compute resources for analyzing large genomic datasets.
- Getting started with Partek Flow at NIH
- Institutional licenses available for NCI, NHGRI, NIH Library.
Class expectation
- After this class, participants will understand how to construct a bulk RNA analysis workflow, from file import through differential expression analysis and construction of visualizations. This class will not turn the audience into experts.
- Mention the Partek Flow bulk and single cell RNA training offered through the NIH Library in December.
- We will assume that our Partek Flow account is set up and that the data have already been transferred to the Partek Flow server (see Getting started with Partek Flow at NIH to learn about the different options for getting your data to the server).
Create new project and import data
- Click on the "Add project" tab to create a new analysis project.
- Import data:
- Partek Flow handles many data types but for this class, we will select bulk and then RNA.
- Partek Flow also allows users to start anywhere (want to start with a BAM file, you can do that!).
- As data is importing, we will see a light blue task node. After import is complete, we see a circular data node.
Pre-alignment QC
- QC all reads
- K-mer length: if specified, generates a report for each sample showing the positions of the most commonly occurring k-mers (nucleotide sequences of the specified length). Over-represented k-mers can hint at enrichment (possibly adapters).
- Summary table
- Sample names (click to access sample-level report)
- Quality - likelihood of a sequencing error (all samples have a quality score of 38, which corresponds to a 0.0158% error likelihood)
- Essentially no unknown reads as indicated by the "%N" column
- "Average base quality score per position" plot shows the average quality at each position for all reads/sequences in a sample
- Quality score distribution
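To make the k-mer report more concrete, here is a minimal Python sketch of counting k-mers across a set of reads. This is not how Partek Flow implements the report; the function name `count_kmers` and the example reads are made up for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read, one base at a time.
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Hypothetical reads, for illustration only.
example_reads = ["ACGTACGT", "TTACGT"]
kmer_counts = count_kmers(example_reads, k=4)

# The most frequent k-mers are candidates for over-represented sequence.
for kmer, n in kmer_counts.most_common(3):
    print(kmer, n)
```

A k-mer that is far more frequent than the others, particularly near read ends, is worth checking against known adapter sequences.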
Adapter trimming
- Adapter available from file
- Trim from both sides
- Run pre-alignment QC again after this step to confirm that trimming did not degrade the data. Quality is still high after trimming, although the average read length per sample is reduced (as expected from trimming).
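As a rough illustration of why read length drops after trimming, the sketch below removes an adapter from the 3' end of a read. It is a simplified, exact-match version written for these notes, not Partek Flow's trimming algorithm; the function name, the `min_overlap` parameter, and the example reads are assumptions, and the adapter string is used only as an example.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (exact match only) from the 3' end of a read."""
    # Case 1: the full adapter occurs in the read; cut from where it starts.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter; look for the
    # longest adapter prefix (at least min_overlap bases) at the read end.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found; return the read unchanged.
    return read


adapter = "AGATCGGAAGAGC"  # example adapter sequence, for illustration only
print(trim_3prime_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter))  # full adapter present
print(trim_3prime_adapter("ACGTACGTAGATC", adapter))             # partial adapter at end
print(trim_3prime_adapter("ACGTACGT", adapter))                  # no adapter
```

Real trimmers also tolerate mismatches and consider base quality, which this sketch ignores.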
Map to hg38 chromosome 22
- RNA sequencing requires a splice aware aligner to accommodate reads that map across exons and
- STAR
- HISAT2 (will use this here)
- HISAT2 indexes the reference genome prior to alignment to make alignment more efficient. If your index is not available in the menu, scroll down and choose "New assembly" to add it.
- Run HISAT2 with default settings on the adapter-trimmed reads. Users can adjust alignment stringency, such as the mismatch penalty, under "Configure" next to advanced options.
Post alignment QC
- Ensure that alignment went without issues.
- Alignment rate
- Unique mappers
- Mapping quality (a mapping quality of 60 corresponds to 0.0001% error)
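Both the base quality score from pre-alignment QC and the mapping quality here are Phred-scaled, so the error percentages quoted in these notes follow from P = 10^(-Q/10). The short Python sketch below reproduces them; the function name is made up for illustration.

```python
def phred_to_error_probability(q):
    """Convert a Phred-scaled score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


for label, q in [("base quality", 38), ("mapping quality", 60)]:
    p = phred_to_error_probability(q)
    # Print as a percentage to match how the values are quoted in the notes.
    print(f"{label} {q}: {p * 100:.4f}% error")
```

Every increase of 10 in the score corresponds to a tenfold drop in the error probability.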
Quantifying expression
- Use the Partek Quantification to Model (E/M) algorithm since a GTF annotation is available
- Uses statistics to assign expression to multi-mappers rather than discarding them
- Output includes gene and transcript level expression quantifications
- Summary table indicating the percentage of reads that mapped to features such as exons and introns.
- Count distribution table showing the minimum, maximum, median, 25th percentile (Q1), and 75th percentile (Q3) - these are also available as visualizations (box-and-whisker plot as well as density plot)
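The idea of assigning multi-mapped reads statistically rather than discarding them can be sketched with a toy expectation-maximization (E/M) loop. This is a generic textbook-style illustration, not Partek's algorithm: it ignores transcript length and read quality, and the function name, transcript names, and read counts are all hypothetical.

```python
def em_abundance(read_compat, transcripts, n_iter=50):
    """Toy E/M: split each read across the transcripts it is compatible with."""
    # Start by assuming all transcripts are equally abundant.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        # E-step: share each read in proportion to current abundances.
        counts = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        theta = {t: counts[t] / len(read_compat) for t in transcripts}
    return counts


# Hypothetical data: 6 reads unique to A, 2 unique to B, 4 compatible with both.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
estimated = em_abundance(reads, ["A", "B"])
print(estimated)
```

In this toy case the four ambiguous reads end up split roughly 3:1 in favor of A, mirroring the ratio of uniquely assigned reads, instead of being thrown away.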
Normalization
- Remove technical variation while keeping biological differences
- Will use median ratio normalization (for DESeq2), which removes variation from:
- Differing sequencing depth per sample
- Variations in RNA composition between biological conditions
- Post normalization report shows distribution table and before/after box and density plots
- Both gene and transcript level expression estimates will be normalized
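For intuition, the median ratio method can be sketched in a few lines of Python. This follows the general median-of-ratios idea (per-sample size factor = median ratio of each feature's count to that feature's geometric mean across samples) and is not Partek's or DESeq2's actual code; the function name and the counts are made up for illustration.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """counts: dict mapping feature -> list of raw counts, one per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        # Features with a zero count have no geometric mean; skip them.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / len(row)
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Size factor per sample = median ratio to the geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
raw = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 100], "g4": [0, 5]}
size_factors = median_ratio_size_factors(raw)
normalized = {g: [c / s for c, s in zip(row, size_factors)] for g, row in raw.items()}
print(size_factors)
print(normalized["g1"])
```

Because the median is used, a handful of highly expressed or strongly changing features does not dominate the size factor, which is how the method accounts for RNA composition differences as well as sequencing depth.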
Filtering counts
- Filter the normalized transcript expression to exclude transcripts whose sum across all samples is less than or equal to 3 (low-expression genes or transcripts may represent noise).
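The filter rule above amounts to a one-line condition. A minimal Python sketch, with hypothetical transcript names and values, and a function name invented for illustration:

```python
def filter_low_expression(expression, max_excluded_sum=3):
    """Keep features whose summed expression across samples exceeds the cutoff."""
    return {
        feature: values
        for feature, values in expression.items()
        if sum(values) > max_excluded_sum  # sums <= cutoff are excluded
    }


# Hypothetical normalized expression values (one list entry per sample).
normalized_expression = {
    "tx1": [0.0, 1.0, 2.0],    # sum = 3  -> excluded
    "tx2": [0.5, 0.5, 0.5],    # sum = 1.5 -> excluded
    "tx3": [12.0, 8.0, 15.0],  # sum = 35 -> kept
}
kept = filter_low_expression(normalized_expression)
print(sorted(kept))
```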
Differential expression
- Use Partek's implementation of DESeq2.
- Assign tumor as the numerator and normal as the denominator so that the expression ratio is calculated as average expression for tumor/average expression for normal
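To show what the numerator/denominator choice means for the sign of the result, here is a small Python sketch of the ratio described above. It only illustrates the direction of the comparison; DESeq2 estimates fold changes from a statistical model rather than from a simple ratio of averages, and the function name and expression values are hypothetical.

```python
import math
from statistics import mean


def fold_change(numerator_values, denominator_values):
    """Return (ratio, log2 ratio) of average expression, numerator/denominator."""
    ratio = mean(numerator_values) / mean(denominator_values)
    return ratio, math.log2(ratio)


# Hypothetical normalized expression for one gene.
tumor = [40, 60]
normal = [20, 30]

ratio, log2_ratio = fold_change(tumor, normal)
print(ratio, log2_ratio)
```

With tumor as the numerator, a ratio above 1 (positive log2 value) means higher expression in tumor, and a ratio below 1 (negative log2 value) means higher expression in normal. Swapping the two groups flips the sign of the log2 value.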