From Data to Display: Crafting a Publishable Plot
Learning Objectives
- Integrate previously learned
ggplot2skills including data mapping, geoms, labels, scales, and themes to construct a complete visualization workflow. - Design and produce a polished, publication-ready plot from start to finish, making informed choices about plot type, aesthetics, and formatting.
For this exercise, we are going to use the information we have learned to create a volcano plot of our differential expression results.
Warning
This lesson requires audience participation.
Try not to cheat. Attempt to add the necessary code without referring to the documentation. To help you with this, code blocks are collapsed to hide the code.
A volcano plot is a type of scatterplot that shows statistical significance (P-value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. --- Maria Doyle, 2021
To generate a volcano plot, we need to know which genes were differentially expressed. Differential expression results can be obtained using a number of R packages (e.g., limma, edgeR, DESeq2). For today's lesson, we are using output generated from edgeR and available in the file, "./data/diffexp_results_edger_airways.txt".
Step 1: Load the required packages.
What package(s) do we need to create our plot?
Load the package(s)
library(tidyverse)
Primarily, we need ggplot2, but other packages from the tidyverse are useful for handling factors or wrangling the data as needed.
Step 2: Load and view the data.
For this lesson, we need to load the differential expression results. How can we load the data and save to an object called dexp? The data is at "./data/diffexp_results_edger_airways.txt".
Load the data
dexp <- read_delim("./data/diffexp_results_edger_airways.txt")
Rows: 15926 Columns: 10
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
How can we further examine these data?
Examine the data
glimpse(dexp)
Rows: 15,926
Columns: 10
$ feature <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E~
$ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "untr~
$ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL~
$ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg~
$ .abundant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ logFC <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417~
$ logCPM <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,~
$ F <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9~
$ PValue <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0~
$ FDR <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376~
We can view the data using View(dexp) or select the data from the Global Environment pane.
To understand the structure of the data, use dplyr::glimpse() or str().
Step 3: Define significance
The volcano plot helps us identify our significant genes. Generally, we are interested in identifying genes above or below certain thresholds for significance and log fold change. These thresholds can be fairly arbitrary. Here, we will define significance based on values with an FDR less than 0.01 and an absolute value of logFC of 1. Of note, logFC here is represented by log2 transformed values, so logFC = 1 corresponds to a fold change of 2.
Create a new column in dexp called "Significant" that contains TRUE values where genes were significantly differentially expressed based on the above thresholds and FALSE where they were not significant. Order the data frame by FDR and logFC. Save these transformed data to a new object called dexp_sigtrnsc. Unfamiliar with how to wrangling the data? Check out Part 2 of this Series, Introduction to Data Wrangling.
Wrangle the data
dexp_sigtrnsc <- dexp %>%
mutate(Significant = FDR < 0.01 & abs(logFC) >= 1) %>% arrange(FDR, abs(logFC))
dexp_sigtrnsc[,-c(2,4,5)]
# A tibble: 15,926 x 8
feature transcript logFC logCPM F PValue FDR Significant
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 ENSG00000165995 CACNB2 3.28 4.51 1575. 3.34 e-11 4.07e-7 TRUE
2 ENSG00000109906 ZBTB16 7.15 4.15 1429. 5.11 e-11 4.07e-7 TRUE
3 ENSG00000106976 DNM1 -1.76 5.38 646. 1.62 e- 9 2.57e-6 TRUE
4 ENSG00000162493 PDPN 1.88 5.68 768. 7.60 e-10 2.57e-6 TRUE
5 ENSG00000154930 ACSS1 1.89 4.96 657. 1.50 e- 9 2.57e-6 TRUE
6 ENSG00000157214 STEAP2 1.97 7.13 685. 1.25 e- 9 2.57e-6 TRUE
7 ENSG00000146250 PRSS35 -2.76 3.91 807. 6.16 e-10 2.57e-6 TRUE
8 ENSG00000120129 DUSP1 2.94 7.31 694. 1.18 e- 9 2.57e-6 TRUE
9 ENSG00000152583 SPARCL1 4.56 5.53 721. 1.000e- 9 2.57e-6 TRUE
10 ENSG00000168309 FAM107A 4.74 2.78 656. 1.51 e- 9 2.57e-6 TRUE
# i 15,916 more rows
Because we arranged the data by significance, we can create an object with the top 6 significant genes to highlight these in our volcano plot. Save the names of these genes to an object called topgenes.
Get 6 top significant genes
topgenes<-dexp_sigtrnsc$transcript[1:6]
topgenes
[1] "CACNB2" "ZBTB16" "DNM1" "PDPN" "ACSS1" "STEAP2"
Step 4: Create the plot beginning with our 3 required entities.
What are the 3 required components needed to create a plot?
-
Data
data - the data should include our differential expression results (
dexp_sigtrnsc). -
1 or more geoms
All data points are plotted using an x and y coordinate system. This requires
geom_point(). -
Mapping aesthetics
x-axis - represents the logarithm of the fold change between two conditions.
y-axis - represents the negative logarithm (base 10) of the p-value on the y-axis, ensuring that data points with lower p-values—indicative of higher statistical significance—are positioned toward the top of the plot.color - use color to differentiate between "significant" and "non-significant" genes.
Begin the plot
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant))

Step 5: Customize Our Figure
At this point, you have a relatively nice plot with just a couple of lines of code, but we really want our figure to shine for publication. Think about what changes can be made to make the plot nice but also effective!
Scale the Colors
How can we control the colors representing our TRUE / FALSE values? Assign "black" to FALSE and "red" to TRUE.
Scale the Colors
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant)) +
scale_color_manual(values = c("black","red"))

Note
There are many scale functions and scale_color functions.
Add Size and Alpha attributes to our Mapping Aesthetics
The red and black colors nicely discriminate between significant and non-significant genes. However, we can make a few more changes to really highlight our "significant" genes. Two things come to mind. We can make the non-significant points less visible with alpha and size.
If we want to represent differences in a variable using alpha and size, where should we put these in our code?
Assign alpha and size aesthetics
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant, alpha =Significant,
size = Significant)) +
scale_color_manual(values = c("black","red"))

Warning messages
You will likely see the following warning messages:
1: Using alpha for a discrete variable is not advised.
2: Using size for a discrete variable is not advised.
These are not errors, but you should consider what they mean for your plot. Make sure your choices are not misleading the audience.
The size mapping results in very large points for "Significant = TRUE". How can we fix this?
Scale the size aesthetic
# Use scale_size_discrete to set the range of sizes possible
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant, alpha =Significant,
size = Significant)) +
scale_color_manual(values = c("black","red")) +
scale_size_discrete(range=c(1,2))

Again, scale can be applied to the parameters in our mapping aesthetics including the x and y axes.
Legends
If we want separate legends for each aesthetic, we can set this using arguments in the scale functions. For example, see guide and name.
Fix the legend
The legend isn't great. It is neither informative nor visually appealing. How can we modify the legend?
Fix the legend
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant, alpha =Significant,
size = Significant)) +
scale_color_manual(values = c("black","red")) +
scale_size_discrete(range=c(1,2)) +
guides(color = guide_legend(
"Significance (logFC \u2265 |1|, FDR < 0.01)"), size = "none", alpha= "none")

There are multiple ways to modify the legend, including using guides() and theme.
Adding mathematical expressions
There are multiple ways to add mathematical expressions to ggplot2 figures.
Here are some useful resouces:
- From ggplot2 docs: https://ggplot2.tidyverse.org/articles/faq-axes.html#how-can-i-add-superscripts-and-subscripts-to-axis-labels
- Guide on special symbols: https://steffilazerte.ca/posts/ggplot-symbols/#table
- Using
expression(): https://library.virginia.edu/data/articles/mathematical-annotation-in-r ?plotmathanddemo(plotmath)
In this example, I used unicode, which is a universal character encoding standard assigning a unique number / code to every character, symbol, etc. In R and ggplot2, unicode can be used to display special symbols (like mathematical operators, Greek letters, or arrows) in plot labels, legends, and titles by using escape sequences such as \u2265 for "≥". However, it doesn't work with all graphic devices, so use caution.
As the references above suggest, we could have also used bquote() or expression(). For example, try the following code instead: guides(color = guide_legend(bquote("Significance (logFC " >= " |1|, FDR < 0.01)")), size = "none", alpha= "none"). Or, make the x and y axis labels nicer: labs(x=expression(paste(Log[2],"FC")),y=expression(paste(-Log[10],italic("P")))).
I would not necessarily memorize how to do this, but would look it up as needed.
Clean it up with theme
Let's make this nicer by customizing the background, grid lines, legend position, and text. How can we modify theme elements?
Set theme elements
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant, alpha =Significant,
size = Significant)) +
scale_color_manual(values = c("black","red")) +
scale_size_discrete(range=c(1,2)) +
guides(color = guide_legend(
"Significance (logFC \u2265 |1|, FDR < 0.01)"), size = "none", alpha= "none") +
theme_classic() +
theme(panel.grid.major = element_line(size = 0.2, color="grey"),
panel.grid.minor = element_line(size = 0.1, color="grey"),
text = element_text(size = 12),
legend.position = "bottom")

You are free to customize your plot however you see fit. Here, I decided to use the complete theme, theme_classic(). I then made some additional changes from there. For example, I added in major and minor grid lines, resized the text, and positioned the legend.
- Add major grid lines:
panel.grid.major = element_line(size = 0.2, color="grey") - Add minor grid lines:
panel.grid.minor = element_line(size = 0.1, color="grey") - Assign all text 12 point font:
text = element_text(size = 12) - Move the legend to the bottom of the plot:
legend.position = "bottom"
Step 6: Label the most significant points.
How can we add text labels to some of our points?
To label our top significant genes, we can add an additional geom. In this case, geom_text().
Note
Here, we only want to plot labels for our significant genes. We can call these directly by filtering the data.
Add text labels
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant, alpha =Significant,
size = Significant)) +
scale_color_manual(values = c("black","red")) +
scale_size_discrete(range=c(1,2)) +
guides(color = guide_legend(
"Significance (logFC \u2265 |1|, FDR < 0.01)"),
size = "none", alpha= "none") +
geom_text(data=dexp_sigtrnsc %>%
filter(transcript %in% topgenes), #filter the data
aes(label=transcript)) +
theme_classic() +
theme(panel.grid.major = element_line(size = 0.2, color="grey"),
panel.grid.minor = element_line(size = 0.1, color="grey"),
text = element_text(size = 12),
legend.position = "bottom")

As we can see, geom_text results in overlapping labels. To avoid overlapping labels, we can use check_overlap = TRUE - feel free to try it. However, this will drop labels, and we want all 6 top genes to have labels.
To get around this, we can use a package called ggrepel, which keeps the labels from overlapping.
Use ggrepel to avoid overlapping labels
# install the package with install.packages("ggrepel")
library(ggrepel)
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = -log10(FDR))) +
geom_point(aes( color = Significant, alpha =Significant,
size = Significant)) +
scale_size_discrete(range=c(1,2)) +
scale_color_manual(values = c("black","red")) +
guides(color = guide_legend(
"Significance (logFC \u2265 |1|, FDR < 0.01)"),
size = "none", alpha= "none") +
geom_text_repel(data=dexp_sigtrnsc %>%
filter(transcript %in% topgenes),
aes(label=transcript),
nudge_y=0.1,nudge_x=0.2,direction="both",
segment.color="gray") +
theme_classic() +
theme(panel.grid.major = element_line(size = 0.2, color="grey"),
panel.grid.minor = element_line(size = 0.1, color="grey"),
text = element_text(size = 12),
legend.position = "bottom")

Geom ordering
In ggplot2, the order in which you add layers (such as geom_point(), geom_text(), geom_line(), etc.) directly affects how your plot is rendered:
Layers added later are drawn on top of earlier layers. For example, if you add geom_point() first and then geom_text(), the text labels will appear on top of the points. If you reverse the order, the points may cover or obscure the text.
Using an External Package.
There are many packages external to ggplot2 that can be used to create or enhance figures. We will learn about some of these in the next lesson. Such packages can save us a lot of time and energy. See the below example with EnhancedVolcano.
Note
Search for packages using a dedicated R search Engine.
EnhancedVolcano
There is a dedicated Bioconductor package for creating volcano plots specifically called EnhancedVolcano. Plots created using this package can be customized using ggplot2 functions and syntax.
#The default cut-off for log2FC is >|2|
#the default cut-off for log10 p-value is 10e-6
library(EnhancedVolcano)
EnhancedVolcano(dexp_sigtrnsc,
title = "Enhanced Volcano with Airways",
lab = dexp_sigtrnsc$transcript,
x = 'logFC',
y = 'FDR')
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
i Please use `linewidth` instead.
i The deprecated feature was likely used in the EnhancedVolcano package.
Please report the issue to the authors.

This creates a very nice plot rather quickly.
Adding horizontal and vertical lines
The horizontal and vertical lines can be added to our ggplot2 figure using geom_hline() and geom_vline(), respectively.
Acknowledgements
The volcano plot code in this lesson was adapted from a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola.