
Principal Components Analysis

Principal Components Analysis (PCA) transforms high-dimensional data, such as those derived from RNA sequencing, so that researchers can see how study variables cluster together. PCA projects the original data onto a set of perpendicular axes, where each axis accounts for a percentage of the variance in the data. To learn the math behind PCA, see https://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf.
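The projection described above can be sketched in a few lines of Python. This is a minimal illustration using NumPy's singular value decomposition, not the implementation used by the software; the random `counts` matrix and the `pca` helper are hypothetical stand-ins for real normalized counts.

```python
import numpy as np

def pca(data, n_components):
    """PCA of a samples x features matrix via singular value decomposition."""
    centered = data - data.mean(axis=0)          # center each feature at zero
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2 / (data.shape[0] - 1)      # variance along each axis
    ratio = variance / variance.sum()            # fraction of total variance
    axes = vt[:n_components]                     # rows are the principal axes
    scores = centered @ axes.T                   # sample coordinates on those axes
    return scores, axes, ratio[:n_components]

rng = np.random.default_rng(0)
counts = rng.normal(size=(8, 50))                # hypothetical: 8 samples x 50 transcripts
scores, axes, ratio = pca(counts, 3)

print(np.round(axes @ axes.T, 6))                # close to the identity: axes are perpendicular
print(np.round(100 * ratio, 1))                  # percent of variance per axis, largest first
```

The first printout checks that the axes are mutually perpendicular, and the second lists the percentage of variance each axis accounts for.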

PCA is an excellent quality assurance tool for RNA sequencing analysis because, when the results are plotted, scientists can determine whether samples in the same biological condition cluster together. Click on the "Normalized counts" data node and select exploratory analysis from the menu. From there, select PCA. On the subsequent PCA configuration page, lower the number of dimensions to 3, since a 3D plot is the most that can be visualized.

Clicking on the PCA data node will reveal two plots and a table. First, there is an interactive three-dimensional PCA plot where the axis PC1 (i.e., principal component 1) accounts for the highest variance in the data (55.1%). As hoped, the normal and tumor samples are separated along this axis, indicating that it is the biology (i.e., normal or tumor) that differentiates the samples. PC2 and PC3 account for the second and third highest variance in the data, and samples within each group are separated along these two axes, suggesting that there may be differences between samples from the same condition or that batch effects exist. Together, PC1, PC2, and PC3 explain 79% of the variation in this dataset. Here, with just 3 dimensions, scientists can visualize and interpret the data. PCA is therefore known as a dimensionality reduction procedure: it reduces high-dimensional data to the most relevant dimensions while still allowing the data to be interpreted and conclusions to be drawn.
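To illustrate how a biological difference can end up on PC1, here is a small simulated sketch. The data are made up (four "normal" and four "tumor" samples, with a shift in 20 of 100 hypothetical transcripts), so the numbers will not match the percentages reported for the tutorial dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group, n_transcripts = 4, 100
labels = np.array(["normal"] * n_per_group + ["tumor"] * n_per_group)

# Hypothetical data: noise everywhere, plus a shift in 20 transcripts for tumor samples.
expr = rng.normal(size=(2 * n_per_group, n_transcripts))
expr[labels == "tumor", :20] += 5.0

centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u[:, :3] * s[:3]                      # sample coordinates on PC1..PC3
ratio = s ** 2 / np.sum(s ** 2)                # fraction of variance per component

for label in ("normal", "tumor"):
    print(label, np.round(scores[labels == label, 0], 1))   # PC1 coordinates per group
print("PC1-PC3 cumulative fraction:", round(float(ratio[:3].sum()), 3))
```

In this simulation the two groups fall on opposite sides of PC1, while the remaining spread within each group is left to PC2 and PC3. The sign of a principal component is arbitrary, so which group lands on the positive side carries no meaning.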

A scree plot showing the variance accounted for by each principal component axis is also available.
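The values behind a scree plot are simply the per-component percentages of variance, sorted from largest to smallest. The sketch below computes them for a hypothetical random matrix and prints a rough text version of the plot; it assumes nothing about how the software draws its own chart.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(8, 50))                 # hypothetical: 8 samples x 50 transcripts
data[:, :10] *= 4.0                             # give some transcripts more variance
centered = data - data.mean(axis=0)

s = np.linalg.svd(centered, compute_uv=False)   # singular values only
percent = 100 * s ** 2 / np.sum(s ** 2)         # percent of variance per component
cumulative = np.cumsum(percent)

# Text rendering of a scree plot: one bar per principal component.
for i, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{i:<2} {'#' * int(round(p)):<40} {p:5.1f}%  (cumulative {c:5.1f}%)")
```

A sharp drop followed by a flat tail suggests that the first few components capture most of the structure in the data.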

The table labeled "Component loadings" shows how the transcripts listed influence the separation of the samples along the three principal component axes.
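As a rough sketch of what such a table contains, the example below takes the weight of each transcript on PC1 to PC3 and ranks transcripts by their absolute weight on PC1. The data and transcript names are hypothetical, and "loadings" here means the unscaled axis weights; some tools scale them by the square root of each component's variance, and the exact definition used by the software is not stated here.

```python
import numpy as np

rng = np.random.default_rng(3)
names = np.array([f"transcript_{i}" for i in range(100)])   # hypothetical identifiers
expr = rng.normal(size=(8, 100))
expr[4:, :20] += 5.0                     # last 4 samples differ in the first 20 transcripts

centered = expr - expr.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
loadings = vt[:3].T                      # transcripts x components (PC1..PC3 weights)

# Transcripts with the largest absolute weight drive separation along PC1.
top = np.argsort(-np.abs(loadings[:, 0]))[:10]
for idx in top:
    print(names[idx], np.round(loadings[idx], 3))
```

In this simulation the top-ranked transcripts are the ones that were shifted between the two groups, which is the kind of pattern a loadings table helps to spot.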