Creating and modifying scatter plots: PCA and Volcano

Objectives

Learn how to make and modify scatter plots using ggplot2.
Learn how to visualize PCA results.
Learn how to create a Volcano plot.

Load the libraries

library(ggplot2)
library(dplyr)
library(DESeq2)

Other R packages used in this tutorial include patchwork, ggfortify, EnhancedVolcano, and ggrepel.

What is ggplot2?

ggplot2 is a popular R graphics package associated with a family of packages known as the tidyverse. Tidyverse packages work effectively on data stored in data frames (or tibbles), which store variables in columns and observations in rows. In ggplot2, plots are built in layers, allowing one to incorporate multiple data sets and advanced features, thus allowing the generation of fairly complex plots.

To build a plot, we need:

data to plot
one or more geoms - the visual representation of the plot.
mapping aesthetics - describe how the variables are mapped to the geoms

We can also include additional parameters:

facets - subplots
coordinate systems (default is Cartesian)
plot layout and rendering options (customize legends, lines, text, and other features).
statistical transformations / summarizations

The basic ggplot2 template:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>
  ) +
  <FACET_FUNCTION> +
  <COORDINATE SYSTEM> +
  <THEME>

Example data

Here we will use bulk RNA-Seq data available in the R package airway, which is from an experiment published by Himes et al. (2014). These data, which are available in R as a RangedSummarizedExperiment object, are from a bulk RNAseq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell samples. Each cell line includes a treated and untreated negative control.

We will use some normalized count data (rlog) and differential expression results processed according to the Bioconductor workflow, rnaseqGene. The normalized data is stored within a DESeqTransform object, and differential expression results are provided in a comma separated file. The data used in this tutorial is available for download here. See sessionInfo() at the end of this tutorial for information related to package versions.

#load R object
air<- readRDS("./Data/normalized_air.rds")

#read differential expression results
dexp<- read.csv("./Data/deseq2_DEGs.csv",row.names=1)
dexp<- dexp %>% filter(!is.na(padj))

Scatter plots

Scatterplots are useful for visualizing treatment–response comparisons, associations between variables, or paired data (e.g., a disease biomarker in several patients before and after treatment). - Holmes and Huber, 2021

Because scatter plots involve mapping each data point, the geom function used is geom_point().

Let's check out the basic scatter. We can look at the relationship of our normalized count data from all genes between samples.

ggplot(data=assay(air))+ #DATA
  geom_point(aes(x=SRR1039508,y=SRR1039512)) + #what's on x and y axis
  coord_fixed(xlim=c(0,20),ylim=c(0,20))

Here, we include the data we want to plot (assay(air)), a geom function to represent the plot (geom_point for a scatter plot), and mapping aesthetics assigning the x-axis to the genes from one sample (SRR1039508) and the y-axis to the genes from another sample (SRR1039512).

Note

The loaded gene counts include normalized data via a regularized-logarithm transformation (rlog). Read more on this here. This transformation effectively stabilizes the variance across the mean and is recommended for smaller data sets.

Common scatter plots used in genomics (PCA and Volcano)

PCA is mostly used in genomics for dimensionality reduction and uncovering patterns in highly dimensional data, including identifying technical and biological variation.

Examples of applications in genomics:

RNA-Seq
Batch correction.
Identifying Co-expressed Genes.
scRNA-Seq
Dimensionality reduction.
Evolutionary Genomics
Examining population structure.
Examining relationships between species.

Visualizing PCA also includes other types of plots

What we think of as a PCA plot, is usually a scatter plot of PC1 vs PC2 (sometimes called a biplot). However, PCA is multidimensional, and we may want to visualize other aspects (e.g., other PCs, degree of variance by PC (Scree plots), scores vs loadings (biplot)). For this, check out factoextra or PCAtools.

Volcano plots are used to visualize statistical significance versus magnitude of change (fold change) between treatments or conditions in large genomic data sets (e.g., RNA-Seq, ChIP-Seq). Volcano plots are great for identifying genes or other features for further exploration.

Note

There are many other scatter plots used in genomics. We are simply focusing on PCA and volcano in this tutorial.

What is PCA?

Principal component analysis (PCA) is an exploratory linear dimension reduction method applied to highly dimensional (multivariate) data. It is an usupervised learning technique that treats all variables equally. The goal of PCA is to reduce the dimensionality of the data by transforming the data in a way that maximizes the variance explained. Read more here and here.

There is an assumption that variables are highly correlated.

How does it work?

Data standardization - variables should be on the same scale (e.g., z-score transformation)
Calculate covariance- create a covariance matrix to understand the relationship between variables
Calculate eigenvectors (direction of variation) and eigenvalues (amount of variance) from the covariance matrix
Select the top k principal components
Transform the original data - obtain a representation of the data in the newly transformed space

Key points:

The PCs represent a linear combination of the original variables
PCs are uncorrelated
The number of PCs is equivalent to the number of original variables in your data set.
PC1 accounts for the most variance, and each subsequent PC will explain less and less variance.

For a more detailed explanation of PCA, see the following:

Perform PCA

There are many packages and functions available in R programming for performing PCA. Some of the most popular functions are stats::prcomp(), stats::princomp(), FactoMineR::PCA(), and ade4::dudi.pca(). These functions largely differ in the method used for PCA and the default arguments; see here. We will focus on stats::prcomp() for this tutorial. If you are using an alternative function, make sure you are aware of the function arguments (and defaults).

PCA is used frequently in -omics fields. Often than not, there will be package specific functions for PCA and plotting PCA for different -omics analyses.

Note

stats::princomp(), which uses eigenvalue decomposition of the covariance matrix, is faster to run than stats::prcomp(), which uses singular value decomposition (SVD) on the original data matrix.

We can use the function prcomp() to run PCA on our rlog transformed count data. This function requires numeric data. Notice by default, the function does not scale the data. If you have not transformed your data and you are working with variables of different units, consider scaling the data.

Why is scaling important?

Scaling and centering typically occur prior to PCA. Scaling removes biases from variables with high variances. Centering shifts the data to the origin by subtracting the mean of each feature. This is important for finding linear subspaces and removing the affect of the mean.

Large variance in observed variable will contribute most to the overall variance of computed principal component. If the observed values of variables have very different ranges, then the data needs to be normalized/scaled. - Aniket Patil

Setting a scaling parameter to TRUE will standardize the data (i.e., perform a z-score transformation) so that the data has a mean of 0 and standard deviation of 1.

Now we run prcomp():

#run PCA
pca<-prcomp(t(assay(air))) 

#get structure of df
str(pca)

List of 5
 $ sdev    : num [1:8] 23.28 17.78 14.84 12.08 7.12 ...
 $ rotation: num [1:16139, 1:8] -0.00677 0.00359 0.00056 -0.00108 0.01033 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16139] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
  .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:16139] 9.45 9.02 7.89 5.83 12.45 ...
  ..- attr(*, "names")= chr [1:16139] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
 $ scale   : logi FALSE
 $ x       : num [1:8, 1:8] -25.8 14.1 -17.5 26.8 -22.4 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
  .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"

The object pca is a list of 5: the standard deviations of the principal components, a matrix of variable loadings, the centering and scaling parameters used, and the data projected on the principal components.

Plot PCA

To plot the first two axes of variation along with species information, we will need to make a data frame with this information. The axes are in pca$x.

#Build a data frame
pcaData <- as.data.frame(pca$x[, 1:3]) # extract first three PCs

# add the sample information
pcaData<- data.frame(colData(air)) |> select(cell, dex, Run) |> cbind(pcaData)

# modify column names
colnames(pcaData) <- c("Cell","Treatment","Sample","PC1","PC2","PC3") 

#Plot
ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + # adding data points
  coord_fixed() # fixing coordinates

coord_fixed()

Tip 6, here, discusses the importance of aspect ratio and using coord_fixed() more in detail.

This is a decent plot showing us how our RNA-Seq samples are related based on gene expression. PC1 always explains the most variation followed by each successive PC. From this plot, we see that our samples cluster by treatment along PC1, while PC2 is capturing variation in the cell lines.

In the plot, the axes are missing the % explained variance. Let's add custom axes. We can do this with the xlab() and ylab() functions or the functions labs(). But first we need to grab some information from our PCA analysis. Let's use summary(pca). This function provides a summary of results for a variety of model fitting functions and methods.

#Extract % Variance Explained
summary(pca)

Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6    PC7
Standard deviation     23.2778 17.7783 14.8386 12.0827 7.11804 5.66206 4.9081
Proportion of Variance  0.4071  0.2375  0.1654  0.1097 0.03807 0.02409 0.0181
Cumulative Proportion   0.4071  0.6446  0.8101  0.9197 0.95781 0.98190 1.0000
                            PC8
Standard deviation     1.09e-13
Proportion of Variance 0.00e+00
Cumulative Proportion  1.00e+00

PC1 and PC2 combined account for 65% of variance in the data. We can add this information directly to our plot using custom axes labels.

Add custom axes labels

#Plot
ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  xlab("PC1: 41%")+ #x axis label text
  ylab("PC2: 24%") # y axis label text

Automating the "Proportion of Variance"

If you want to automate the "Proportion of Variance", you should call it directly in the code. For example,

ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  labs(x=paste0("PC1: ",round(summary(pca)$importance[2,1]*100),"%"),
      y=paste0("PC2: ",round(summary(pca)$importance[2,2]*100),"%"))

Add PC3

To add multiple principal components, you can build multiple plots

a<-ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  labs(x=paste0("PC1: ",round(summary(pca)$importance[2,1]*100),"%"),
      y=paste0("PC2: ",round(summary(pca)$importance[2,2]*100),"%"))

b<-ggplot(pcaData, aes(PC3, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  labs(x=paste0("PC3: ",round(summary(pca)$importance[2,3]*100),"%"),
      y=paste0("PC2: ",round(summary(pca)$importance[2,2]*100),"%"))

c<-ggplot(pcaData, aes(PC1, PC3, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  labs(x=paste0("PC1: ",round(summary(pca)$importance[2,1]*100),"%"),
      y=paste0("PC3: ",round(summary(pca)$importance[2,3]*100),"%"))+
  theme(legend.position="none")


library(patchwork)

Warning: package 'patchwork' was built under R version 4.4.1

a + b + c + plot_layout(nrow=2,guides = 'collect')

A great package for simplifying this is PCAtools, using the function pairsplot(). This package has other functionality that also may be of interest, and it plays nicely with ggplot2 functions for customization.

Add a stat to our plot with `stat_ellipse()`.

In scatter plots, the raw data is the focus of the plot, but for many other plots, this is not the case. You may wish to overlay a stat on your PCA. For example, ellipses are often added to PCA ordinations to emphasize group clustering with confidence intervals. By default, stat_ellipse() uses the bivariate t distribution, but this can be modified. Let's add ellipses with 95% confidence intervals to our plot.

ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  labs(x=paste0("PC1: ",round(summary(pca)$importance[2,1]*100),"%"),
       y=paste0("PC2: ",round(summary(pca)$importance[2,2]*100),"%"))+
  stat_ellipse(geom="polygon",aes(group=Treatment), 
              level=0.95, alpha=0.2) #adding a stat

Warning: The following aesthetics were dropped during statistical transformation: shape.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Using ggfortify

ggfortify is an excellent package to consider for easily generating PCA plots. ggfortify provides "unified plotting tools for statistics commonly used, such as GLM, time series, PCA families, clustering and survival analysis. The package offers a single plotting interface for these analysis results and plots in a unified style using 'ggplot2'."

library(ggfortify)
autoplot(pca, data = colData(air), colour = 'dex',shape='cell',size=2) +
  coord_fixed()

Since this is a ggplot2 object, this can easily be customized by adding ggplot2 customization layers.

Using DESeq2 function

DESeq2 has its own function for PCA analysis, plotPCA(). If you check out the source code for this function (getMethod("plotPCA","DESeqTransform")), you will see that the data is filtered prior to prcomp() to only include the 500 top most variable genes by default.

If you are interesting in examining the PC loadings, these are difficult to plot using ggplot2 alone. I recommend PCAtools or factoextra.

Plot Customization: Using themes

To create a publication quality plot, you will need to make several modifications to your basic PCA biplot code. We have already seen how to modify the default coordinate system, how to add additional statistics (e.g., ellipses with confidence intervals), and how to modify the axes labels. There are many more features that can be customized to make this publishable or fit a desired style.Changing non-data elements (related to axes, titles subtitles, gridlines, legends, etc.) of our plot can be done with theme(). GGplot2 has a definitive default style that falls under one of their precooked themes, theme_gray(). theme_gray() is one of eight complete themes provided by ggplot2.

**ggplot2 built-in themes**

We can also specify and build a theme within our plot code or develop a custom theme to be reused across multiple plots. The theme function is the bread and butter of plot customization. Check out ?ggplot2::theme() for a list of available parameters. There are many.

Let's see how this works by changing the fonts and text sizes and dropping minor grid lines:

ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
  coord_fixed() +
  labs(x=paste0("PC1: ",round(summary(pca)$importance[2,1]*100),"%"),
       y=paste0("PC2: ",round(summary(pca)$importance[2,2]*100),"%"))+
  stat_ellipse(geom="polygon",aes(group=Treatment), level=0.95, alpha=0.2)+
theme_bw() + #start with a custom theme 
  theme(axis.text=element_text(size=12,family="Times New Roman"),
        axis.title = element_text(size=12,family="Times New Roman"),
        legend.text = element_text(size=12,family="Times New Roman"),
        legend.title = element_text(size=12,family="Times New Roman"),
        panel.grid.minor = element_blank())

Warning: The following aesthetics were dropped during statistical transformation: shape.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

You may want to establish a custom theme for reuse with a number of plots. See this great tutorial by Madeline Pickens for steps on how to do that.

A helpful color trick

When you have a lot of colors and you want to keep these colors consistent, you can use the following convenient functions to set a name attribute for a vector of colors.

Let's do this for our asthma treatments.

#defining colors
gcolors<-setNames(c("black","orange"),levels(pcaData$Treatment))  

#Now plot
ggplot(pcaData, aes(PC1, PC2, color = Treatment, shape = Cell)) +
  geom_point(size = 2) + 
scale_color_manual(values=gcolors)+ #Adding the color argument
  coord_fixed() +
  labs(x=paste0("PC1: ",round(summary(pca)$importance[2,1]*100),"%"),
      y=paste0("PC2: ",round(summary(pca)$importance[2,2]*100),"%"))+
  stat_ellipse(geom="polygon",aes(group=Treatment,fill=Treatment), 
              level=0.95, alpha=0.2)+
  scale_fill_manual(values=gcolors)+
theme_bw() + #start with a custom theme 
  theme(axis.text=element_text(size=12,family="Times New Roman"),
        axis.title = element_text(size=12,family="Times New Roman"),
        legend.text = element_text(size=12,family="Times New Roman"),
        legend.title = element_text(size=12,family="Times New Roman"),
        panel.grid.minor = element_blank())

Warning: The following aesthetics were dropped during statistical transformation: shape.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

We can use this color palette for all plots including these two treatments to keep our figures consistent throughout a presentation or publication.

Creating a publication ready volcano plot

Now that we know how to create a PCA biplot, let's use what we have learned to also make a volcano plot.

A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. --- Maria Doyle, 2021

First, we need to do a bit of data wrangling. The tidyverse packages work great for this task.

#Wrangle the data

#add gene names to differential expression results
gene_names<- data.frame(rowData(air)) |> select(gene_name) |> tibble::rownames_to_column("ID")

dexp<- dexp |> tibble::rownames_to_column("ID") |> 
      left_join(gene_names, by="ID")

# Call genes significantly differentially expressed if they have a 
#p-value less than 0.05 and a logFC greater than or equal to 2.
dexp_sigtrnsc <- dexp |> 
  mutate(Significant = padj < 0.05 & abs(log2FoldChange) >= 2) |>
  arrange(padj)

#extract top 6 genes to use for the labels
topgenes <- dexp_sigtrnsc$gene_name[c(1:6)]

Let's take a quick look at the wrangled data and our top genes for later labeling:

head(dexp_sigtrnsc)

               ID   baseMean log2FoldChange     lfcSE        pvalue
1 ENSG00000152583   997.9612       4.555676 0.1858858 7.129668e-136
2 ENSG00000165995   495.5675       3.276456 0.1326038 5.859407e-136
3 ENSG00000120129  3411.9661       2.935137 0.1227877 2.414560e-128
4 ENSG00000101347 12712.9456       3.754493 0.1579042 7.501584e-128
5 ENSG00000189221  2343.5733       3.337353 0.1432593 1.140346e-122
6 ENSG00000211445 12298.1966       3.711558 0.1693331 5.201708e-110
           padj gene_name Significant
1 5.530127e-132   SPARCL1        TRUE
2 5.530127e-132    CACNB2        TRUE
3 1.248569e-124     DUSP1        TRUE
4 2.909302e-124    SAMHD1        TRUE
5 3.538038e-119      MAOA        TRUE
6 1.344901e-106      GPX3        TRUE

topgenes

[1] "SPARCL1" "CACNB2"  "DUSP1"   "SAMHD1"  "MAOA"    "GPX3"

Significant differential expression was assigned based on an absolute log fold change greater than or equal to 2 and an FDR corrected p-value less than 0.05.

Enhanced Volcano

There is a dedicated package for creating volcano plots available on Bioconductor, EnhancedVolcano. Plots created using this package can be customized using ggplot2 functions and syntax.

Using EnhancedVolcano:

#The default cut-off for log2FC is >|2|
#the default cut-off for P value is 10e-6
library(EnhancedVolcano)

Loading required package: ggrepel

Warning: package 'ggrepel' was built under R version 4.4.1

EnhancedVolcano(dexp,
                title = "Enhanced Volcano with Airways",
                lab = dexp_sigtrnsc$gene_name,
                x = 'log2FoldChange',
                y = 'padj')

If you are interested in using this package, check out this tutorial.

Let's start our plot with the <DATA>, <GEOM_FUNCTION>, and <MAPPING>. We do not need to fix the coordinate system because we are working with two different values on the x and y and we don't need any special coordinate system modifications. Let's plot logFC on the x axis and the mutated column with our adjusted p-values on the y-axis and set the significant p-values off from the non-significant by color. We can also go ahead and customize the color scale, since we have learned how to do that.

ggplot(data=dexp_sigtrnsc) +
  geom_point(aes(x = log2FoldChange, y = log10(padj), color = Significant,
                  alpha = Significant)) +
  scale_color_manual(values = c("black", "#e11f28"))

Warning: Using alpha for a discrete variable is not advised.

This is not exactly what we want so let's keep working.

Immediately, you should notice that the figure is upside down compared to what we would expect from a volcano plot. There are two possible ways to fix this. We could transform the adjusted p-values by multiplying by -1 OR we could work with our axes scales. Aside from text modifications, we haven't yet changed the scaling of the axes. Let's see how we can modify the scale of the y-axis.

Changing axes scales

To change the y axis scale, we will need a specific function. These functions generally start with scale_y.... In our case we want to reverse our axis so that increasingly negative is going in the positive direction rather than the negative direction. Luckily, there is a function to reverse our axis; see ?scale_y_reverse().

ggplot(data=dexp_sigtrnsc) +
  geom_point(aes(x = log2FoldChange, y = log10(padj), color = Significant, 
                  alpha = Significant)) +
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_y_reverse(limits=c(0,-150))

Warning: Using alpha for a discrete variable is not advised.

This looks pretty good, but we can tidy it up more by working with our legend guides and our theme.

Modifying legends

We can modify many aspects of the figure legend using the function guide(). Let's see how that works and go ahead and customize some theme arguments. Notice that the legend position is specified in theme().

ggplot(data=dexp_sigtrnsc) +
  geom_point(aes(x = log2FoldChange, y = log10(padj), color = Significant, 
                  alpha = Significant)) +
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_y_reverse(limits=c(0,-150))+
  guides(alpha= "none", 
         color= guide_legend(title="Significant DE")) +
       theme_bw() +
      theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size = 12),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
      )

Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Warning: Using alpha for a discrete variable is not advised.

Lastly, let's layer another geom function to label our top six differentially abundant genes based on significance. We can use geom_text_repel() from library(ggrepel), which is a variation on geom_text().

ggplot(data=dexp_sigtrnsc) +
  geom_point(aes(x = log2FoldChange, y = log10(padj), color = Significant, 
                  alpha = Significant)) +
  geom_text_repel(data=dexp_sigtrnsc %>% 
                    filter(gene_name %in% topgenes), 
                  aes(x = log2FoldChange, y = log10(padj),label=gene_name),
                  nudge_y=0.5,hjust=0.5,direction="both",
                  segment.color="gray")+
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_y_reverse(limits=c(0,-150))+
  guides(alpha= "none", 
         color= guide_legend(title="Significant DE")) +
       theme_bw() +
      theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size = 12),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
      )

Warning: Using alpha for a discrete variable is not advised.

To save the file, use ggsave().

Session Info

sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] EnhancedVolcano_1.22.0      ggrepel_0.9.6              
 [3] ggfortify_0.4.17            patchwork_1.3.0            
 [5] DESeq2_1.44.0               SummarizedExperiment_1.34.0
 [7] Biobase_2.64.0              MatrixGenerics_1.16.0      
 [9] matrixStats_1.4.1           GenomicRanges_1.56.2       
[11] GenomeInfoDb_1.40.1         IRanges_2.38.1             
[13] S4Vectors_0.42.1            BiocGenerics_0.50.0        
[15] dplyr_1.1.4                 ggplot2_3.5.1              

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.49               htmlwidgets_1.6.4      
 [4] lattice_0.22-6          vctrs_0.6.5             tools_4.4.0            
 [7] generics_0.1.3          parallel_4.4.0          tibble_3.2.1           
[10] pkgconfig_2.0.3         Matrix_1.7-1            lifecycle_1.0.4        
[13] GenomeInfoDbData_1.2.12 compiler_4.4.0          farver_2.1.2           
[16] stringr_1.5.1           munsell_0.5.1           codetools_0.2-20       
[19] htmltools_0.5.8.1       yaml_2.3.10             pillar_1.10.0          
[22] crayon_1.5.3            tidyr_1.3.1             MASS_7.3-61            
[25] BiocParallel_1.38.0     DelayedArray_0.30.1     abind_1.4-8            
[28] tidyselect_1.2.1        locfit_1.5-9.10         digest_0.6.37          
[31] stringi_1.8.4           purrr_1.0.2             labeling_0.4.3         
[34] fastmap_1.2.0           grid_4.4.0              colorspace_2.1-1       
[37] cli_3.6.3               SparseArray_1.4.8       magrittr_2.0.3         
[40] S4Arrays_1.4.1          withr_3.0.2             scales_1.3.0           
[43] UCSC.utils_1.0.0        rmarkdown_2.29          XVector_0.44.0         
[46] httr_1.4.7              gridExtra_2.3           evaluate_1.0.1         
[49] knitr_1.49              rlang_1.1.4             Rcpp_1.0.13-1          
[52] glue_1.8.0              rstudioapi_0.17.1       jsonlite_1.8.9         
[55] R6_2.5.1                zlibbioc_1.50.0

References

Sources for content included R4DS and Holmes and Huber, 2021.

Creating and modifying scatter plots: PCA and Volcano

Objectives

Load the libraries

What is ggplot2?

Example data

Scatter plots

Common scatter plots used in genomics (PCA and Volcano)

What is PCA?

Perform PCA

Plot PCA

Add custom axes labels

Add a stat to our plot with stat_ellipse().

Using ggfortify

Plot Customization: Using themes

Creating a publication ready volcano plot

Changing axes scales

Modifying legends

Session Info

References

Add a stat to our plot with `stat_ellipse()`.