Introduction to Data Visualization with R (Part 2)
Objectives
- Review the grammar of graphics template.
- Learn about the statistical transformations inherent to geoms.
- Learn more about fine tuning figures with labels, legends, scales, and themes.
- Learn how to save plots with
ggsave()
. - Review general tips for creating publishable figures.
Our grammar of graphics template
Last lesson we discussed the three basic components of creating a ggplot2
plot: the data, one or more geoms, and aesthetic mappings.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
But, we also learned of other features that greatly improve our figures, and today we will be expanding our ggplot2
template even further to include:
one or more datasets,
one or more geometric objects that serve as the visual representations of the data, – for instance, points, lines, rectangles, contours,
descriptions of how the variables in the data are mapped to visual properties (aesthetics) of the geometric objects, and an associated scale (e. g., linear, logarithmic, rank),
a facet specification, i.e. the use of multiple similar subplots to look at subsets of the same data,
one or more coordinate systems,
optional parameters that affect the layout and rendering, such text size, font and alignment, legend positions.
statistical summarization rules
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>
) +
<FACET_FUNCTION> +
<COORDINATE SYSTEM> +
<THEME>
Loading the libraries
To begin plotting, let's load our tidyverse
library.
#load libraries
library(tidyverse) # Tidyverse automatically loads ggplot2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Importing the data
We also need some data to plot, so if you haven't already, let's load the data we will need for this lesson.
#scaled_counts
#We used this in lesson 2 so you may not need to reload
scaled_counts<-
read.delim("./data/filtlowabund_scaledcounts_airways.txt",
as.is=TRUE)
dexp<-read.delim("./data/diffexp_results_edger_airways.txt",
as.is=TRUE)
#let's get some data
#we are only interested in transcript counts greater than 100
#read in the data
sc<-read.csv("./data/sc.csv")
Statistical transformations
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
- bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
- smoothers fit a model to your data and then plot predictions from the model.
- boxplots compute a robust summary of the distribution and then display a specially formatted box. The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. --- R4DS
Let's plot a bar graph using the data (sc
).
#returns an error message. What went wrong?
ggplot(data=sc) +
geom_bar( aes(x=Num_transcripts, y = TotalCounts))
## Error in `geom_bar()`:
## ! Problem while computing stat.
## ℹ Error occurred in the 1st layer.
## Caused by error in `setup_params()`:
## ! `stat_count()` must only have an x or y aesthetic.
What's the difference between stat identity and stat count?
ggplot(data=sc) +
geom_bar( aes(x=Num_transcripts, y = TotalCounts), stat="identity")
As we can see, stat="identity"
returns the raw data.
Let's look at another example.
#Let's filter our data to only include 4 transcripts of interest
#We used this code in the tidyverse lesson
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-scaled_counts %>%
filter(transcript %in% keep_t)
#the default here is `stat_count()`
ggplot(data = interesting_trnsc) +
geom_bar(mapping = aes(x = transcript, y=counts_scaled))
## Error in `geom_bar()`:
## ! Problem while computing stat.
## ℹ Error occurred in the 1st layer.
## Caused by error in `setup_params()`:
## ! `stat_count()` must only have an x or y aesthetic.
#Let's take away the y aesthetic
ggplot(data = interesting_trnsc) +
geom_bar(mapping = aes(x = transcript))
This is not a very useful figure, and probably not worth plotting. We could have gotten this info using str()
. However, the point here is that there are default statistical transformations occurring with many geoms, and you can specify alternatives.
Let's change the stat parameter to "identity". This will plot the raw values of the normalized counts rather than how many rows are present for each transcript.
#defaulted to a stacked barplot
ggplot(data = interesting_trnsc) +
geom_bar(mapping = aes(x = transcript,y=counts_scaled,
fill=SampleName),
stat="identity",color="black") +
facet_wrap(~dex)
What if we wanted the columns side by side?
#introducing the position argument, position="dodge"
ggplot(data = interesting_trnsc) +
geom_bar(mapping = aes(x = transcript,y=counts_scaled,
fill=SampleName),
stat="identity",color="black",position="dodge") +
facet_wrap(~dex)
How do we know what the default stat is for geom_bar()
? Well, we could read the documentation, ?geom_bar()
. This is true of multiple geoms. The statistical transformation can often be customized, so if the default is not what you need, check out the documentation to learn more about how to make modifications. For example, you could provide custom mapping for a box plot. To do this, see the examples section of the geom_boxplot()
documentation.
Coordinate systems
ggplot2
uses a default coordinate system (the Cartesian coordinate system). This isn't super important until we want to do something like make a map (See coord_quickmap()
) or create a pie chart (See coord_polar()
).
When will we have to think about coordinate systems? We likely won't have to modify from default in too many cases (see those above). The most common circumstance in which we will likely need to change the coordinate system is in the event that we want to switch the x and y axes (?coord_flip()
) or if we want to fix our aspect ratio (?coord_fixed()
).
#let's return to our bar plot above
#get horizontal bars instead of vertical bars
ggplot(data = interesting_trnsc) +
geom_bar(mapping = aes(x = transcript,y=counts_scaled,
fill=SampleName),
stat="identity",color="black",position="dodge") +
facet_wrap(~dex) +
coord_flip()
Labels, legends, scales, and themes
How do we ultimately get our figures to a publishable state? The bread and butter of pretty plots really falls to the additional non-data layers of our ggplot2 code. These layers will include code to label the axes, scale the axes, and customize the legends and theme.
The default axes and legend titles come from the ggplot2 code.
ggplot(data=sc) +
geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"))
In the above plot, the y-axis label (TotalCounts) is the variable name mapped to the y aesthetic, while the x-axis label (Num_transcripts) is the variable name named to the x aesthetic. The fill aesthetic was set equal to "dex", and so this became the default title of the fill legend. We can change these labels using ylab()
, xlab()
, or labs()
, and guide()
for the legend.
ggplot(data=sc) +
geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
#can change labels of fill levels along with colors
xlab("Recovered transcripts per sample") + #add x label
ylab("Total sequences per sample") #add y label
Let's change the legend title.
ggplot(data=sc) +
geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
#can change labels of fill levels along with colors
xlab("Recovered transcripts per sample") + #add x label
ylab("Total sequences per sample") +#add y label
guides(fill = guide_legend(title="Treatment"))
We can modify the axes scales of continuous variables using scale_x_contiuous()
and scale_y_continuous()
. Discrete (categorical variable) axes can be modified using scale_x_discrete()
and scale_y_discrete()
.
ggplot(data=sc) +
geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
#can change labels of fill levels along with colors
xlab("Recovered transcripts per sample") + #add x label
ylab("Total sequences per sample") +#add y label
guides(fill = guide_legend(title="Treatment")) + #label the legend
scale_y_continuous(breaks=seq(1.0e7, 3.5e7, by = 2e6),
limits=c(1.0e7,3.5e7)) #change breaks and limits
Perhaps we want to represent these data on a logarithmic scale.
ggplot(data=sc) +
geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
#can change labels of fill levels along with colors
xlab("Recovered transcripts per sample") + #add x label
ylab("Total sequences per sample") +#add y label
guides(fill = guide_legend(title="Treatment")) + #label the legend
scale_y_continuous(trans="log10") #use the trans argument
Note
You could manually transform the data without transforming the scales. The figures would be the same, excluding the axes labels. When you use the transformed scale (e.g., scale_y_continuous(trans="log10")
or scale_y_log10()
), the axis labels remain in the original data space. When the data is transformed manually, the labels will also be transformed.
Finally, we can change the overall look of non-data elements of our plot (titles, labels, fonts, background, grid lines, and legends) by customizing ggplot2 themes. Check out ?ggplot2::theme()
. For a list of available parameters. ggplot2
provides 8 complete themes, with theme_gray()
as the default theme.
You can also create your own custom theme and then apply it to all figures in a plot.
Create a custom theme to use with multiple figures.
#Setting a theme
my_theme <-
theme_bw() +
theme(
panel.border = element_blank(),
axis.line = element_line(),
panel.grid.major = element_line(size = 0.2),
panel.grid.minor = element_line(size = 0.1),
text = element_text(size = 12),
legend.position = "bottom",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
)
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(data=sc) +
geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
#can change labels of fill levels along with colors
xlab("Recovered transcripts per sample") + #add x label
ylab("Total sequences per sample") +#add y label
guides(fill = guide_legend(title="Treatment")) + #label the legend
scale_y_continuous(trans="log10") + #use the trans argument
my_theme
Saving plots (ggsave()
)
Finally, we have a quality plot ready to publish. The next step is to save our plot to a file. The easiest way to do this with ggplot2 is ggsave()
. This function will save the last plot that you displayed by default. Look at the function parameters using ?ggsave()
.
ggsave("Plot1.png",width=5.5,height=3.5,units="in",dpi=300)
Nice plot example
These steps can be used to create a publish worthy figure. For example, let's create a volcano plot of our differential expression results.
A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. --- Maria Doyle, 2021
#get the data
dexp_sigtrnsc<-dexp %>%
mutate(Significant = FDR < 0.05 & abs(logFC) >= 2) %>% arrange(FDR)
topgenes<-dexp_sigtrnsc$transcript[1:6]
Plot
#install.packages(ggrepel)
library(ggrepel)
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = log10(FDR))) +
geom_point(aes( color = Significant, size = Significant,
alpha = Significant)) +
geom_text_repel(data=dexp_sigtrnsc %>%
filter(transcript %in% topgenes),
aes(label=transcript),
nudge_y=0.5,hjust=0.5,direction="y",
segment.color="gray") +
scale_y_reverse(limits=c(0,-7))+
scale_color_manual(values = c("black", "#e11f28")) +
scale_size_discrete(range = c(0, 2)) +
guides(size = "none", alpha= "none")+
my_theme
## Warning: Using size for a discrete variable is not advised.
## Warning: Using alpha for a discrete variable is not advised.
Enhanced Volcano
There is a dedicated package for creating volcano plots available in Bioconductor, EnhancedVolcano. Plots created using this package can be customized using ggplot2 functions and syntax.
#The default cut-off for log2FC is >|2|
#the default cut-off for log10 p-value is 10e-6
library(EnhancedVolcano)
EnhancedVolcano(dexp_sigtrnsc,
title = "Enhanced Volcano with Airways",
lab = dexp_sigtrnsc$transcript,
x = 'logFC',
y = 'FDR')
Recommendations for creating publishable figures
(Inspired by Visualizing Data in the Tidyverse, a Coursera lesson)
- Consider whether the plot type you have chosen is the best way to convey your message
-
Make your plot visually appealing
- Careful color selection - color blind friendly if possible (e.g.,
library(viridis)
) - Eliminate unnecessary white space
- Carefully choose themes including font types
- Careful color selection - color blind friendly if possible (e.g.,
-
Label all axes with concise and informative labels
- These labels should be straight forward and adequately describe the data
-
Ask yourself "Does the data make sense?"
- Does the data plotted address the question you are answering?
-
Try not to mislead the audience
- Often this means starting the y-axis at 0
- Keep axes consistent when arranging facets or multiple plots
- Keep colors consistent across plots
-
Do not try to convey too much information in the same plot
- Keep plots fairly simple
Complementary packages
There are many complementary R packages related to creating publishable figures using ggplot2. Check out the packages cowplot
and ggpubr
. Cowplot is particularly great for providing functions that facilitate arranging multiple plots in a grid panel. Usually publications restrict the number of figures allowed, and so it is helpful to be able to group multiple figures into a single figure panel. GGpubr is particularly great for beginners, providing easy code to make publish worthy figures. It is particularly great for stats integration and easily incorporating brackets and p-values for group comparisons.
Acknowledgements
Material from this lesson was adapted from Chapter 3 of R for Data Science and from a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola.