Normalizing Gene Expression Estimation
Normalizing the gene expression estimates obtained from the quantification step is important because it removes technical (non-biological) variation in the data, such as:
- Differences in sequencing depth between samples (i.e., not all samples have the same number of reads or sequences).
- RNA composition differences among samples (i.e., samples do not express the same RNA, for example when transcriptomes are compared between tissues from different organs or between biological conditions such as tumor versus normal).
- Gene length (longer genes will have more reads mapping to them).
- GC content.
Ultimately, the goal is to eliminate technical variation in the sequencing experiment so that the scientist is left with the biological variation, which is what is of interest.
In differential expression analysis between biological conditions, only the first two sources of technical variation listed above matter. This is because, when the expression of the same gene or transcript is compared between conditions, its length and GC content can be assumed to stay the same. Although Partek Flow offers many normalization techniques (see https://documentation.partek.com/display/FLOWDOC/Normalization), this class uses the median ratio (DESeq2 only) method because it normalizes for both sequencing depth and RNA composition.
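For intuition, the idea behind median ratio normalization can be sketched in a few lines of Python. This is a minimal illustration of the general median-of-ratios approach, not Partek Flow's or DESeq2's actual implementation; the function name `median_ratio_normalize` and the toy count matrix are made up for this example. Each sample gets a size factor equal to the median, across genes, of the ratio between that sample's count and the gene's geometric mean over all samples. Using the median is what makes the method robust to a few highly expressed genes that differ in RNA composition.

```python
import numpy as np

def median_ratio_normalize(counts):
    """Sketch of median-of-ratios normalization.

    counts: genes x samples array of raw counts (hypothetical input).
    Returns (normalized_counts, size_factors).
    """
    counts = np.asarray(counts, dtype=float)
    # Only genes with a nonzero count in every sample are used to
    # estimate size factors, since the log of zero is undefined.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Log of the geometric mean of each gene across samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor per sample: median over genes of count / geometric mean.
    size_factors = np.exp(np.median(log_counts - log_geo_mean, axis=0))
    # Divide every count in a sample by that sample's size factor.
    return counts / size_factors, size_factors

# Toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([
    [100, 200],
    [50, 100],
    [10, 20],
    [0, 5],   # excluded from size factor estimation (zero in sample 1)
])
normalized, size_factors = median_ratio_normalize(counts)
```

In this toy case the size factor of sample 2 is twice that of sample 1, so after division the first three genes have equal normalized values in both samples.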
Click on the transcript normalized estimates data node to view the summary for this step in the analysis.
The distribution, minimum, maximum, mean, median, 25th percentile (Q1), and 75th percentile (Q3) of expression estimates are presented as a table as well as box and density plots. Note that users can compare the pre- and post-normalization box and density plots of expression counts. After normalization, the expression estimate distributions of all samples roughly overlap.
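The summary statistics listed above can be reproduced outside Partek Flow for a quick sanity check. The sketch below is an illustration only: the `summarize` helper and the example values are hypothetical, and Partek Flow may compute percentiles with a different interpolation rule than NumPy's default.

```python
import numpy as np

def summarize(values):
    """Return the per-sample summary statistics shown in the table."""
    values = np.asarray(values, dtype=float)
    # NumPy's default percentile method uses linear interpolation.
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "min": values.min(),
        "q1": q1,
        "median": median,
        "mean": values.mean(),
        "q3": q3,
        "max": values.max(),
    }

# Hypothetical normalized expression estimates for one sample.
stats = summarize([1, 2, 3, 4, 5])
```

Running `summarize` on each sample before and after normalization gives a numeric counterpart to the box plots: after normalization, the medians and quartiles of the samples should be close to one another.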