ggplot2 Continued

Objectives

Review the grammar of graphics template.
Learn about the statistical transformations inherent to geoms.
Learn more about fine tuning figures with labels, legends, scales, and themes.
Learn how to save plots with ggsave().
Review general tips for creating publishable figures.

Reminder: Uploading files from RStudio Server

Any files created by you today will be erased at the end of the session. You can upload any files you downloaded from the last session using the Upload option in the Files pane.

Our grammar of graphics template

Last lesson we discussed the three basic components of creating a ggplot2 plot: the data, one or more geoms, and aesthetic mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

But, we also learned of other features that greatly improve our figures, and today we will be expanding our ggplot2 template even further to include:

one or more datasets,

one or more geometric objects that serve as the visual representations of the data, – for instance, points, lines, rectangles, contours,

descriptions of how the variables in the data are mapped to visual properties (aesthetics) of the geometric objects, and an associated scale (e. g., linear, logarithmic, rank),

a facet specification, i.e. the use of multiple similar subplots to look at subsets of the same data,

one or more coordinate systems,

optional parameters that affect the layout and rendering, such text size, font and alignment, legend positions.

statistical summarization rules

---(Holmes and Huber, 2021)

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>
  ) +
  <FACET_FUNCTION> +
  <COORDINATE SYSTEM> +
  <THEME>

Loading the libraries

To begin plotting, let's load our tidyverse library.

#load libraries
library(tidyverse) # Tidyverse automatically loads ggplot2

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Importing the data

We also need some data to plot, so if you haven't already, let's load the data we will need for this lesson.

#scaled_counts
#We used this in lesson 2 so you may not need to reload
scaled_counts<-
  read.delim("./data/filtlowabund_scaledcounts_airways.txt", 
             as.is=TRUE)

dexp<-read.delim("./data/diffexp_results_edger_airways.txt", 
                 as.is=TRUE)  

#let's get some data
#we are only interested in transcript counts greater than 100
#read in the data
sc<-read.csv("./data/sc.csv")

Statistical transformations

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

smoothers fit a model to your data and then plot predictions from the model.

boxplots compute a robust summary of the distribution and then display a specially formatted box. The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. --- R4DS

Let's plot a bar graph using the data (sc).

#returns an error message. What went wrong?
ggplot(data=sc) + 
  geom_bar( aes(x=Num_transcripts, y = TotalCounts))

## Error in `f()`:
## ! stat_count() can only have an x or y aesthetic.

What's the difference between stat identity and stat count?

ggplot(data=sc) + 
  geom_bar( aes(x=Num_transcripts, y = TotalCounts), stat="identity")

As we can see, stat="identity" returns the raw data.

Let's look at another example.

#Let's filter our data to only include 4 transcripts of interest
#We used this code in the tidyverse lesson
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-scaled_counts %>% 
  filter(transcript %in% keep_t) 

#the default here is `stat_count()`
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript, y=counts_scaled))

## Error in `f()`:
## ! stat_count() can only have an x or y aesthetic.

#Let's take away the y aesthetic
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript))

This is not a very useful figure, and probably not worth plotting. We could have gotten this info using str(). However, the point here is that there are default statistical transformations occurring with many geoms, and you can specify alternatives.

Let's change the stat parameter to "identity". This will plot the raw values of the normalized counts rather than how many rows are present for each transcript.

#defaulted to a stacked barplot
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript,y=counts_scaled,
                         fill=SampleName),
           stat="identity",color="black") + 
  facet_wrap(~dex)

What if we wanted the columns side by side?

#introducing the position argument, position="dodge"
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript,y=counts_scaled,
                         fill=SampleName),
           stat="identity",color="black",position="dodge") + 
  facet_wrap(~dex)

How do we know what the default stat is for geom_bar()? Well, we could read the documentation, ?geom_bar(). This is true of multiple geoms. The statistical transformation can often be customized, so if the default is not what you need, check out the documentation to learn more about how to make modifications. For example, you could provide custom mapping for a box plot. To do this, see the examples section of the geom_boxplot() documentation.

Coordinate systems

ggplot2 uses a default coordinate system (the Cartesian coordinate system). This isn't super important until we want to do something like make a map (See coord_quickmap()) or create a pie chart (See coord_polar()).

When will we have to think about coordinate systems? We likely won't have to modify from default in too many cases (see those above). The most common circumstance in which we will likely need to change the coordinate system is in the event that we want to switch the x and y axes (?coord_flip()) or if we want to fix our aspect ratio (?coord_fixed()).

#let's return to our bar plot above
#get horizontal bars instead of vertical bars

ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript,y=counts_scaled,
                         fill=SampleName),
           stat="identity",color="black",position="dodge") + 
  facet_wrap(~dex) +
  coord_flip()

Labels, legends, scales, and themes

How do we ultimately get our figures to a publishable state? The bread and butter of pretty plots really falls to the additional non-data layers of our ggplot2 code. These layers will include code to label the axes, scale the axes, and customize the legends and theme.

The default axes and legend titles come from the ggplot2 code.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"))

In the above plot, the y-axis label (TotalCounts) is the variable name mapped to the y aesthetic, while the x-axis label (Num_transcripts) is the variable name named to the x aesthetic. The fill aesthetic was set equal to "dex", and so this became the default title of the fill legend. We can change these labels using ylab(), xlab(), and guide() for the legend.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") #add y label

Let's change the legend title.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") +#add y label
  guides(fill = guide_legend(title="Treatment"))

We can modify the axes scales of continuous variables using scale_x_contiuous() and scale_y_continuous(). Discrete (categorical variable) axes can be modified using scale_x_discrete() and scale_y_discrete().

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") +#add y label
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(breaks=seq(1.0e7, 3.5e7, by = 2e6),
                     limits=c(1.0e7,3.5e7)) #change breaks and limits

#maybe we want this on a logarithmic scale
ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") +#add y label
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(trans="log10") #use the trans argument

Finally, we can change the overall look of non-data elements of our plot (titles, labels, fonts, background, gridlines, and legends) by customizing ggplot2 themes. Check out ?ggplot2::theme(). For a list of available parameters. ggplot2 provides 8 complete themes, with theme_gray() as the default theme.
ggplot2 complete themes You can also create your own custom theme and then apply it to all figures in a plot.

Create a custom theme to use with multiple figures.

#Setting a theme
my_theme <-
    theme_bw() +
      theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size = 12),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
      )



ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") +#add y label
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(trans="log10") + #use the trans argument
  my_theme

Saving plots (`ggsave()`)

Finally, we have a quality plot ready to publish. The next step is to save our plot to a file. The easiest way to do this with ggplot2 is ggsave(). This function will save the last plot that you displayed by default. Look at the function parameters using ?ggsave().

ggsave("Plot1.png",width=5.5,height=3.5,units="in",dpi=300)

Nice plot example

These steps can be used to create a publish worthy figure. For example, let's create a volcano plot of our differential expression results.

A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. --- Maria Doyle, 2021

#get the data
dexp_sigtrnsc<-dexp %>%  
  mutate(Significant = FDR < 0.05 & abs(logFC) >= 2) %>% arrange(FDR)
topgenes<-dexp_sigtrnsc$transcript[c(1:6)]

Plot

#install.packages(ggrepel)
library(ggrepel)
ggplot(data=dexp_sigtrnsc,aes(x = logFC, y = log10(FDR))) +
  geom_point(aes( color = Significant, size = Significant, 
                  alpha = Significant)) +
  geom_text_repel(data=dexp_sigtrnsc %>% 
                    filter(transcript %in% topgenes), 
                  aes(label=transcript),
                  nudge_y=0.5,hjust=0.5,direction="y",
                  segment.color="gray") +
   scale_y_reverse(limits=c(0,-7))+
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2)) +
  guides(size = "none", alpha= "none")+
  my_theme

## Warning: Using size for a discrete variable is not advised.

## Warning: Using alpha for a discrete variable is not advised.

Recommendations for creating publishable figures

(Inspired by Visualizing Data in the Tidyverse, a Coursera lesson)

Consider whether the plot type you have chosen is the best way to convey your message
Make your plot visually appealing
- Careful color selection - color blind friendly if possible
- Eliminate unnecessary white space
- Carefully choose themes including font types
Label all axes with concise and informative labels
- These labels should be straight forward and adequately describe the data
Ask yourself "Does the data make sense?"
- Does the data plotted address the question you are answering?
Try not to mislead the audience
- Often this means starting the y-axis at 0
- Keep axes consistent when arranging facets or multiple plots
Do not try to convey too much information in the same plot
- Keep plots fairly simple

Complementary packages

There are many complementary R packages related to creating publishable figures using ggplot2. Check out the packages cowplot and ggpubr. Cowplot is particularly great for providing functions that facilitate arranging multiple plots in a grid panel. Usually publications restrict the number of figures allowed, and so it is helpful to be able to group multiple figures into a single figure panel. GGpubr is particularly great for beginners, providing easy code to make publish worthy figures. It is particularly great for stats integration and easily incorporating brackets and p-values for group comparisons.

Exporting files from RStudio

Remember, because we are using RStudio server through DNAnexus, any files created by you today will be erased at the end of the session.

To use the materials you generated on the RServer on DNAnexus on your local computer, you will need to export your files using More and Export in the Files pane.

Acknowledgements

Material from this lesson was adapted from Chapter 3 of R for Data Science and from a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola.