From Data to Display: Crafting a Publishable Plot

Learning Objectives

Integrate previously learned ggplot2skills including data mapping, geoms, labels, scales, and themes to construct a complete visualization workflow.
Design and produce a polished, publication-ready plot from start to finish, making informed choices about plot type, aesthetics, and formatting.

For this exercise, we are going to use the information we have learned to create a volcano plot of our differential expression results.

Warning

This lesson requires audience participation.

Try not to cheat. Attempt to add the necessary code without referring to the documentation. To help you with this, code blocks are collapsed to hide the code.

A volcano plot is a type of scatterplot that shows statistical significance (P-value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. --- Maria Doyle, 2021

To generate a volcano plot, we need to know which genes were differentially expressed. Differential expression results can be obtained using a number of R packages (e.g., limma, edgeR, DESeq2). For today's lesson, we are using output generated from edgeR and available in the file, "./data/diffexp_results_edger_airways.txt".

Step 1: Load the required packages.

What package(s) do we need to create our plot?

Load the package(s)

library(tidyverse)

Primarily, we need ggplot2, but other packages from the tidyverse are useful for handling factors or wrangling the data as needed.

Step 2: Load and view the data.

For this lesson, we need to load the differential expression results. How can we load the data and save to an object called dexp? The data is at "./data/diffexp_results_edger_airways.txt".

Load the data

dexp <- read_delim("./data/diffexp_results_edger_airways.txt")

Rows: 15926 Columns: 10
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

How can we further examine these data?

Examine the data

glimpse(dexp)

Rows: 15,926
Columns: 10
$ feature    <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E~
$ albut      <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "untr~
$ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL~
$ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg~
$ .abundant  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ logFC      <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417~
$ logCPM     <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,~
$ F          <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9~
$ PValue     <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0~
$ FDR        <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376~

We can view the data using View(dexp) or select the data from the Global Environment pane.

To understand the structure of the data, use dplyr::glimpse() or str().

Step 3: Define significance

The volcano plot helps us identify our significant genes. Generally, we are interested in identifying genes above or below certain thresholds for significance and log fold change. These thresholds can be fairly arbitrary. Here, we will define significance based on values with an FDR less than 0.01 and an absolute value of logFC of 1. Of note, logFC here is represented by log2 transformed values, so logFC = 1 corresponds to a fold change of 2.

Create a new column in dexp called "Significant" that contains TRUE values where genes were significantly differentially expressed based on the above thresholds and FALSE where they were not significant. Order the data frame by FDR and logFC. Save these transformed data to a new object called dexp_sigtrnsc. Unfamiliar with how to wrangling the data? Check out Part 2 of this Series, Introduction to Data Wrangling.

Wrangle the data

dexp_sigtrnsc <- dexp %>%  
  mutate(Significant = FDR < 0.01 & abs(logFC) >= 1) %>% arrange(FDR, abs(logFC)) 
dexp_sigtrnsc[,-c(2,4,5)]

# A tibble: 15,926 x 8
   feature         transcript logFC logCPM     F    PValue       FDR Significant
   <chr>           <chr>      <dbl>  <dbl> <dbl>     <dbl>     <dbl> <lgl>      
 1 ENSG00000165995 CACNB2      3.28   4.51 1575. 3.34 e-11   4.07e-7 TRUE       
 2 ENSG00000109906 ZBTB16      7.15   4.15 1429. 5.11 e-11   4.07e-7 TRUE       
 3 ENSG00000106976 DNM1       -1.76   5.38  646. 1.62 e- 9   2.57e-6 TRUE       
 4 ENSG00000162493 PDPN        1.88   5.68  768. 7.60 e-10   2.57e-6 TRUE       
 5 ENSG00000154930 ACSS1       1.89   4.96  657. 1.50 e- 9   2.57e-6 TRUE       
 6 ENSG00000157214 STEAP2      1.97   7.13  685. 1.25 e- 9   2.57e-6 TRUE       
 7 ENSG00000146250 PRSS35     -2.76   3.91  807. 6.16 e-10   2.57e-6 TRUE       
 8 ENSG00000120129 DUSP1       2.94   7.31  694. 1.18 e- 9   2.57e-6 TRUE       
 9 ENSG00000152583 SPARCL1     4.56   5.53  721. 1.000e- 9   2.57e-6 TRUE       
10 ENSG00000168309 FAM107A     4.74   2.78  656. 1.51 e- 9   2.57e-6 TRUE       
# i 15,916 more rows

Because we arranged the data by significance, we can create an object with the top 6 significant genes to highlight these in our volcano plot. Save the names of these genes to an object called topgenes.

Get 6 top significant genes

topgenes<-dexp_sigtrnsc$transcript[1:6]
topgenes

[1] "CACNB2" "ZBTB16" "DNM1"   "PDPN"   "ACSS1"  "STEAP2"

Step 4: Create the plot beginning with our 3 required entities.

What are the 3 required components needed to create a plot?

Data

data - the data should include our differential expression results (dexp_sigtrnsc).
1 or more geoms

All data points are plotted using an x and y coordinate system. This requires geom_point().
Mapping aesthetics

x-axis - represents the logarithm of the fold change between two conditions.
y-axis - represents the negative logarithm (base 10) of the p-value on the y-axis, ensuring that data points with lower p-values—indicative of higher statistical significance—are positioned toward the top of the plot.

color - use color to differentiate between "significant" and "non-significant" genes.

Begin the plot

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant))

Step 5: Customize Our Figure

At this point, you have a relatively nice plot with just a couple of lines of code, but we really want our figure to shine for publication. Think about what changes can be made to make the plot nice but also effective!

Scale the Colors

How can we control the colors representing our TRUE / FALSE values? Assign "black" to FALSE and "red" to TRUE.

Scale the Colors

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant)) +
  scale_color_manual(values = c("black","red"))

Note

There are many scale functions and scale_color functions.

Add Size and Alpha attributes to our Mapping Aesthetics

The red and black colors nicely discriminate between significant and non-significant genes. However, we can make a few more changes to really highlight our "significant" genes. Two things come to mind. We can make the non-significant points less visible with alpha and size.

If we want to represent differences in a variable using alpha and size, where should we put these in our code?

Assign alpha and size aesthetics

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant, alpha =Significant, 
  size = Significant)) +
  scale_color_manual(values = c("black","red"))

Warning messages

You will likely see the following warning messages:

1: Using alpha for a discrete variable is not advised.
2: Using size for a discrete variable is not advised.

These are not errors, but you should consider what they mean for your plot. Make sure your choices are not misleading the audience.

The size mapping results in very large points for "Significant = TRUE". How can we fix this?

Scale the size aesthetic

# Use scale_size_discrete to set the range of sizes possible
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant, alpha =Significant, 
  size = Significant)) +
  scale_color_manual(values = c("black","red")) +
  scale_size_discrete(range=c(1,2))

Again, scale can be applied to the parameters in our mapping aesthetics including the x and y axes.

Legends

If we want separate legends for each aesthetic, we can set this using arguments in the scale functions. For example, see guide and name.

Fix the legend

The legend isn't great. It is neither informative nor visually appealing. How can we modify the legend?

Fix the legend

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant, alpha =Significant, 
  size = Significant)) +
  scale_color_manual(values = c("black","red")) +
  scale_size_discrete(range=c(1,2)) +
    guides(color = guide_legend(
      "Significance (logFC \u2265 |1|, FDR < 0.01)"), size = "none", alpha= "none")

There are multiple ways to modify the legend, including using guides() and theme.

Adding mathematical expressions

There are multiple ways to add mathematical expressions to ggplot2 figures.

Here are some useful resouces:

From ggplot2 docs: https://ggplot2.tidyverse.org/articles/faq-axes.html#how-can-i-add-superscripts-and-subscripts-to-axis-labels
Guide on special symbols: https://steffilazerte.ca/posts/ggplot-symbols/#table
Using expression(): https://library.virginia.edu/data/articles/mathematical-annotation-in-r
?plotmath and demo(plotmath)

In this example, I used unicode, which is a universal character encoding standard assigning a unique number / code to every character, symbol, etc. In R and ggplot2, unicode can be used to display special symbols (like mathematical operators, Greek letters, or arrows) in plot labels, legends, and titles by using escape sequences such as \u2265 for "≥". However, it doesn't work with all graphic devices, so use caution.

As the references above suggest, we could have also used bquote() or expression(). For example, try the following code instead: guides(color = guide_legend(bquote("Significance (logFC " >= " |1|, FDR < 0.01)")), size = "none", alpha= "none"). Or, make the x and y axis labels nicer: labs(x=expression(paste(Log[2],"FC")),y=expression(paste(-Log[10],italic("P")))).

I would not necessarily memorize how to do this, but would look it up as needed.

Clean it up with `theme`

Let's make this nicer by customizing the background, grid lines, legend position, and text. How can we modify theme elements?

Set theme elements

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant, alpha =Significant, 
  size = Significant)) +
  scale_color_manual(values = c("black","red")) +
  scale_size_discrete(range=c(1,2)) +
    guides(color = guide_legend(
      "Significance (logFC \u2265 |1|, FDR < 0.01)"), size = "none", alpha= "none") +
  theme_classic() +
  theme(panel.grid.major = element_line(size = 0.2, color="grey"),
        panel.grid.minor = element_line(size = 0.1, color="grey"),
        text = element_text(size = 12),
        legend.position = "bottom")

You are free to customize your plot however you see fit. Here, I decided to use the complete theme, theme_classic(). I then made some additional changes from there. For example, I added in major and minor grid lines, resized the text, and positioned the legend.

Add major grid lines: panel.grid.major = element_line(size = 0.2, color="grey")
Add minor grid lines: panel.grid.minor = element_line(size = 0.1, color="grey")
Assign all text 12 point font: text = element_text(size = 12)
Move the legend to the bottom of the plot: legend.position = "bottom"

Step 6: Label the most significant points.

How can we add text labels to some of our points?

To label our top significant genes, we can add an additional geom. In this case, geom_text().

Note

Here, we only want to plot labels for our significant genes. We can call these directly by filtering the data.

Add text labels

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant, alpha =Significant, 
  size = Significant)) +
  scale_color_manual(values = c("black","red")) +
  scale_size_discrete(range=c(1,2)) +
    guides(color = guide_legend(
      "Significance (logFC \u2265 |1|, FDR < 0.01)"), 
      size = "none", alpha= "none") +
  geom_text(data=dexp_sigtrnsc %>% 
                    filter(transcript %in% topgenes), #filter the data
                  aes(label=transcript)) +
  theme_classic() +
  theme(panel.grid.major = element_line(size = 0.2, color="grey"),
        panel.grid.minor = element_line(size = 0.1, color="grey"),
        text = element_text(size = 12),
        legend.position = "bottom")

As we can see, geom_text results in overlapping labels. To avoid overlapping labels, we can use check_overlap = TRUE - feel free to try it. However, this will drop labels, and we want all 6 top genes to have labels.

To get around this, we can use a package called ggrepel, which keeps the labels from overlapping.

Use ggrepel to avoid overlapping labels

# install the package with install.packages("ggrepel")
library(ggrepel)

ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes( color = Significant, alpha =Significant, 
  size = Significant)) +
  scale_size_discrete(range=c(1,2)) +
  scale_color_manual(values = c("black","red")) +
    guides(color = guide_legend(
      "Significance (logFC \u2265 |1|, FDR < 0.01)"), 
      size = "none", alpha= "none") +
  geom_text_repel(data=dexp_sigtrnsc %>% 
                    filter(transcript %in% topgenes), 
                  aes(label=transcript),
                  nudge_y=0.1,nudge_x=0.2,direction="both",
                  segment.color="gray") +
  theme_classic() +
  theme(panel.grid.major = element_line(size = 0.2, color="grey"),
        panel.grid.minor = element_line(size = 0.1, color="grey"),
        text = element_text(size = 12),
        legend.position = "bottom")

Geom ordering

In ggplot2, the order in which you add layers (such as geom_point(), geom_text(), geom_line(), etc.) directly affects how your plot is rendered:

Layers added later are drawn on top of earlier layers. For example, if you add geom_point() first and then geom_text(), the text labels will appear on top of the points. If you reverse the order, the points may cover or obscure the text.

Using an External Package.

There are many packages external to ggplot2 that can be used to create or enhance figures. We will learn about some of these in the next lesson. Such packages can save us a lot of time and energy. See the below example with EnhancedVolcano.

Note

Search for packages using a dedicated R search Engine.

EnhancedVolcano

There is a dedicated Bioconductor package for creating volcano plots specifically called EnhancedVolcano. Plots created using this package can be customized using ggplot2 functions and syntax.

#The default cut-off for log2FC is >|2|
#the default cut-off for log10 p-value is 10e-6
library(EnhancedVolcano)
EnhancedVolcano(dexp_sigtrnsc,
                title = "Enhanced Volcano with Airways",
                lab = dexp_sigtrnsc$transcript,
                x = 'logFC',
                y = 'FDR')

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
i Please use `linewidth` instead.
i The deprecated feature was likely used in the EnhancedVolcano package.
  Please report the issue to the authors.

This creates a very nice plot rather quickly.

Adding horizontal and vertical lines

The horizontal and vertical lines can be added to our ggplot2 figure using geom_hline() and geom_vline(), respectively.

Acknowledgements

The volcano plot code in this lesson was adapted from a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola.

From Data to Display: Crafting a Publishable Plot

Learning Objectives

Step 1: Load the required packages.

Step 2: Load and view the data.

Step 3: Define significance

Step 4: Create the plot beginning with our 3 required entities.

Step 5: Customize Our Figure

Scale the Colors

Add Size and Alpha attributes to our Mapping Aesthetics

Fix the legend

Clean it up with theme

Step 6: Label the most significant points.

Using an External Package.

EnhancedVolcano

Acknowledgements

Clean it up with `theme`