
Generate Gene Expression Counts

A gene expression table can be generated from the read alignments, and Partek Flow offers several options for doing so. Because a GTF annotation file is available, this exercise will use the Quantify to annotation model (Partek E/M) tool, although others are available (see the Partek Flow documentation to learn more).

Quantify to annotation model (Partek E/M) uses a statistical algorithm to determine how to assign multi-mapping reads to genomic features and avoids discarding these reads. When running this module, make sure that the "Strict paired-end compatibility" and "Require junction reads to match introns" options are checked.

  • "Strict paired-end compatibility": In the case of paired-end sequencing, this option tells Partek Flow to count a read pair only when both reads in the pair align to a transcript.
  • "Require junction reads to match introns": This option deals with scenarios 3 and 4 in the image below, where part of the read maps to an intron. When checked, this option counts a read only when its intronic portion matches the intron in the reference.
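To give a feel for how a statistical approach can keep multi-mapping reads instead of discarding them, here is a toy expectation-maximization (E/M) sketch. It is not Partek's implementation: the function name `em_assign`, the input format, and the decision to ignore transcript length are all assumptions made for illustration. Each read lists the features it is compatible with, and ambiguous reads are split in proportion to the current abundance estimates.

```python
def em_assign(read_compat, n_iter=50):
    """Toy E/M read assignment (illustrative only, not Partek's algorithm).

    read_compat: list of lists; each inner list names the features
    (hypothetical gene/transcript IDs) that one read is compatible with.
    Returns a dict of fractional read counts per feature.
    """
    features = sorted({f for feats in read_compat for f in feats})
    # Start by assuming every feature is equally abundant.
    abundance = {f: 1.0 / len(features) for f in features}
    counts = {f: 0.0 for f in features}
    n_reads = len(read_compat)

    for _ in range(n_iter):
        counts = {f: 0.0 for f in features}
        # E-step: split each read across its compatible features,
        # weighted by the current abundance estimates.
        for feats in read_compat:
            total = sum(abundance[f] for f in feats)
            for f in feats:
                counts[f] += abundance[f] / total
        # M-step: re-estimate abundances from the fractional counts.
        abundance = {f: counts[f] / n_reads for f in features}
    return counts


# Three reads unique to A, one unique to B, four that map to both.
reads = [["A"]] * 3 + [["B"]] + [["A", "B"]] * 4
counts = em_assign(reads)
```

In this toy input, the four ambiguous reads end up split roughly 3:1 in favour of A, mirroring the ratio of uniquely mapped reads, so the estimates approach 6 reads for A and 2 for B and no read is thrown away.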

Source: https://documentation.partek.com/display/FLOWDOC/Understanding+Reads+in+RNA-Seq+Analysis

Users can also control the amount of overlap that a read must have with a genomic feature (i.e., gene or transcript) for it to be counted. Finally, the "Filter features" option allows users to remove genes or transcripts whose read counts across all samples fall below a specified threshold. This helps filter out genes with low expression.
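The idea behind feature filtering can be sketched in a few lines of Python. This is only an illustration: the function name `filter_features`, the gene names, and the counts are made up, and it assumes "across all samples" means the sum of a feature's counts over every sample, which may differ from the criterion selected in Partek Flow.

```python
def filter_features(counts, threshold):
    """Keep features whose total count over all samples meets the threshold.

    counts: dict mapping a feature name to a list of per-sample read counts.
    Assumption: the filter is applied to the sum across samples.
    """
    return {
        feature: per_sample
        for feature, per_sample in counts.items()
        if sum(per_sample) >= threshold
    }


# Hypothetical counts for three genes in three samples.
example_counts = {
    "geneA": [0, 1, 0],
    "geneB": [120, 98, 143],
    "geneC": [3, 4, 2],
}
kept = filter_features(example_counts, threshold=10)
```

With a threshold of 10, only `geneB` is retained; `geneA` (total 1) and `geneC` (total 9) are removed as low-expression features.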

Under advanced options, select "auto-detect" if the strand specificity of the RNA sequencing protocol is not known.

Warning

Specifying the correct strandedness used in an RNA sequencing experiment helps to avoid miscounting reads or assigning them to the wrong gene or transcript. See https://chipster.csc.fi/manual/library-type-summary.html to learn more.

The quantification step generates gene-level and transcript-level expression estimates, so two data nodes appear upon completion of this task. This exercise will use the transcript-level data for differential expression analysis and the gene-level data for GSEA. Clicking on the gene-level expression data node will pull up a summary of the quantification. The first table shows the percentages of reads that overlapped exons, introns, intergenic regions, etc. (click on the icon under the "View" column to view the breakdown of overlap types for each sample).

The information in this summary table is also presented as a stacked bar chart, as shown below.

The next table shows the expression distribution information such as min, max, mean, median, 25th percentile (Q1), and 75th percentile (Q3).
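For readers who want to see how these summary values relate to the underlying counts, the following sketch computes them for a single sample using Python's standard `statistics` module. The helper name `summarize_expression` and the example counts are hypothetical, and the quartile method (`inclusive`) is an assumption; Partek Flow may compute percentiles differently.

```python
import statistics


def summarize_expression(values):
    """Return min, Q1, median, mean, Q3 and max for one sample's counts."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical counts for five genes in one sample.
summary = summarize_expression([0, 2, 5, 10, 100])
```

Note how a single highly expressed gene pulls the mean well above the median, which is typical of RNA-seq count data.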

A histogram showing the distribution of expression estimates is available as well. Across all samples, most genes have expression counts between 0 and 100, although some highly expressed genes have counts between 1,000 and 10,000.

The distributions of expression estimates for the samples in this dataset are also shown as box and density plots.