Plot Customization with ggplot2

Learning Objectives

  1. Review the grammar of graphics template.
  2. Understand the statistical transformations inherent to geoms.
  3. Customize figures with labels, legends, scales, and themes.
  4. Save plots with ggsave().

Our grammar of graphics template

Last lesson we discussed the three basic components of creating a ggplot2 plot: the data, one or more geoms, and aesthetic mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

But, we also learned of other features that greatly improve our figures (e.g., facets), and today we will be expanding our ggplot2 template even further to include:

  • one or more datasets,

  • one or more geometric objects that serve as the visual representations of the data, for instance, points, lines, rectangles, contours,

  • descriptions of how the variables in the data are mapped to visual properties (aesthetics) of the geometric objects, and an associated scale (e.g., linear, logarithmic, rank),

  • a facet specification, i.e. the use of multiple similar subplots to look at subsets of the same data,

  • one or more coordinate systems,

  • optional parameters that affect the layout and rendering, such as text size, font and alignment, legend positions,

  • statistical summarization rules

---(Holmes and Huber, 2021)

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>
  ) +
  <FACET_FUNCTION> +
  <COORDINATE SYSTEM> +
  <THEME>
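As a rough illustration of how the placeholders in this template can be filled in, here is a minimal sketch. It assumes the mpg dataset that ships with ggplot2 rather than this lesson's data, and the particular geom, facet, coordinate system, and theme are arbitrary choices.

```r
library(ggplot2)

# mpg is an example dataset bundled with ggplot2 (not this lesson's data)
p <- ggplot(data = mpg) +
  geom_bar(
    mapping = aes(x = class),  # <MAPPINGS>
    stat = "count"             # <STAT>; "count" is already the geom_bar() default
  ) +
  facet_wrap(~drv) +           # <FACET_FUNCTION>
  coord_flip() +               # <COORDINATE SYSTEM>
  theme_bw()                   # <THEME>

p
```

Each line after the first adds one optional component; removing any of them falls back to the ggplot2 default for that component.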

Loading the libraries

To begin plotting, let's load our tidyverse library. This includes ggplot2, which we will be using for plotting.

library(tidyverse) 
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.1     v stringr   1.5.2
v ggplot2   4.0.0     v tibble    3.3.0
v lubridate 1.9.4     v tidyr     1.3.1
v purrr     1.1.0     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing the data

We also need some data to plot, so if you haven't already, let's load the data we will need for this lesson.

#scaled_counts
scaled_counts <-
  read.delim("./data/filtlowabund_scaledcounts_airways.txt", 
             as.is=TRUE)

#differential expression results
dexp <- read.delim("./data/diffexp_results_edger_airways.txt", 
                 as.is=TRUE)  


#transcript counts greater than 100
sc <- read.csv("./data/sc.csv")

Using Multiple Geoms per Plot

In Lesson 1, we discovered that a geom, the geometrical representation of the plot, is required to create a visualization with ggplot2. This is true, but keep in mind that we can use 1 or more geoms to build our plot.

Because we build plots using layers in ggplot2, we can add multiple geoms to a plot to represent the data in unique ways. Let's see how this works.

Let's combine a scatter plot with a line plot.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,color=dex)) +
  geom_line(aes(x=Num_transcripts, y = TotalCounts,color=dex))

As you can see, we simply add a new geom, geom_line(), to overlay a line plot on the scatter plot.

Global vs local aesthetics

To make our code more concise, we can put shared aesthetics in the ggplot function (ggplot()). Aesthetics in the ggplot() function are global aesthetics, and will be applied to all geoms in the plot. Aesthetics in the geom functions are local aesthetics, and will only be applied to that specific geom.

Setting global aesthetics

ggplot(data=sc, aes(x=Num_transcripts, y = TotalCounts,color=dex)) + 
  geom_point() +
  geom_line()

Geoms can be added in many different ways to create unique representations. Remember that layers are drawn in the order they are added, so each new geom is drawn on top of the ones before it.
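To see the effect of layer order, the sketch below builds the same two layers in both orders. The data frame toy is hypothetical and exists only for this illustration; the sizes and colors are exaggerated so the overlap is easy to spot.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))

# Points are added first, so the line is drawn over them
points_first <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 6, color = "orange") +
  geom_line(linewidth = 2)

# The line is added first, so the points are drawn over it
line_first <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(linewidth = 2) +
  geom_point(size = 6, color = "orange")

points_first
line_first
```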

Setting local aesthetics

We can plot different aesthetics per geom.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,
                 color=SampleName)) +
  geom_line(aes(x=Num_transcripts, y = TotalCounts,color=dex))

Subsetting data per geom

We can represent only a subset of data in one geom and not the other.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,
                 color=SampleName)) +
  geom_line(data=filter(sc,dex=="trt"),
            aes(x=Num_transcripts, y = TotalCounts,color=dex))

To get multiple legends for the same aesthetic, check out the CRAN package ggnewscale. Legends for different aesthetics, on the other hand, can easily be controlled with the scale and guide functions.

Statistical transformations

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

  • bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
  • smoothers fit a model to your data and then plot predictions from the model.
  • boxplots compute a robust summary of the distribution and then display a specially formatted box.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

--- R4DS

Let's plot a bar graph using the data (sc).

#returns an error message. What went wrong?
ggplot(data=sc) + 
  geom_bar( aes(x=Num_transcripts, y = TotalCounts)) 
Error in `geom_bar()`:
! Problem while computing stat.
i Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.

An error was returned because geom_bar() uses stat="count" by default, which accepts an x or a y aesthetic but not both. What happens if we use stat="identity" instead?

ggplot(data=sc) + 
  geom_bar( aes(x=Num_transcripts, y = TotalCounts), stat="identity") 

As we can see, stat="identity" plots the raw values, whereas stat="count" "counts the number of cases at each x position". You should be aware of the default statistic used by a geom.
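One way to check what each stat computes is to inspect the values a layer will draw using layer_data(). The sketch below uses a hypothetical three-row data frame (toy), not the lesson data.

```r
library(ggplot2)

# Hypothetical toy data: three rows, two distinct x values
toy <- data.frame(group = c("a", "a", "b"), value = c(10, 20, 5))

# stat = "count" (the geom_bar() default) tallies the rows at each x value
p_count <- ggplot(toy) +
  geom_bar(aes(x = group))

# stat = "identity" uses the y values as given; bars sharing an x are stacked
p_identity <- ggplot(toy) +
  geom_bar(aes(x = group, y = value), stat = "identity")

# layer_data() returns the computed values behind the first layer of a plot
layer_data(p_count)[, c("x", "count")]
layer_data(p_identity)[, c("x", "ymin", "ymax")]
```

The first table holds one row per x value with its row count, while the second holds one row per input row, with ymin and ymax marking where each stacked segment starts and ends.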

Let's look at another example. Here, we are looking at 4 genes of interest from our scaled counts.

#filter our data to include 4 transcripts of interest
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-scaled_counts %>% 
  filter(transcript %in% keep_t) 

#the default here is `stat_count()`, which requires only an x aesthetic
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript, y=counts_scaled)) 
Error in `geom_bar()`:
! Problem while computing stat.
i Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.
#remove the y aesthetic
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript)) 

This is not a very useful figure, and probably not worth plotting. We could have gotten this info using str(), as we know we only have 8 samples. However, the point here is that there are default statistical transformations occurring with many geoms, and you can specify alternatives.

Let's change the stat parameter to "identity", and set a fill aesthetic to SampleName. This will plot the raw values of the normalized counts rather than how many rows are present for each transcript.

Note

Setting the color aesthetic in a bar plot results in a colored outline around the bar.

#defaulted to a stacked barplot
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript,y=counts_scaled,
                         fill=SampleName),
           stat="identity",color="black") + 
  facet_wrap(~dex)

Notice that the output is stacked. What if we wanted the columns side by side?

We can again refer to our function arguments. In this case, we can modify position and set to "dodge" (position="dodge"). We can add facets to additionally view by treatment ("dex").

#introducing the position argument, position="dodge"
ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript,y=counts_scaled,
                         fill=SampleName),
           stat="identity",color="black",position="dodge") + 
  facet_wrap(~dex)

How do we know what the default stat is for geom_bar()? Well, we could read the documentation, ?geom_bar(). This is true of multiple geoms. The statistical transformation can often be customized, so if the default is not what you need, check out the documentation to learn more about how to make modifications. For example, you could provide custom mapping for a box plot. To do this, see the examples section of the geom_boxplot() documentation.
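As one possible example of swapping in a different statistical transformation, the sketch below asks geom_bar() to plot the mean of y for each x by using the summary stat. The toy data frame is hypothetical, and the choice of mean is an assumption made only for illustration.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(group = c("a", "a", "b"), value = c(10, 20, 5))

# stat = "summary" collapses the y values at each x using the function in `fun`
p_mean <- ggplot(toy) +
  geom_bar(aes(x = group, y = value), stat = "summary", fun = mean)

p_mean

# One bar per group, with height equal to the group mean
layer_data(p_mean)[, c("x", "y")]
```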

geom_col()

If we read the documentation for geom_bar(), we see that there is an alternative function for when we want stat="identity" instead of stat="count". That function is geom_col(). By using geom_col, instead of geom_bar, we avoid many of the problems we saw above.

For example,

ggplot(data = interesting_trnsc) + 
  geom_col(mapping = aes(x = transcript,y=counts_scaled,
                        fill=SampleName),
          color="black",position="dodge") + 
  facet_wrap(~dex)
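Besides reading the documentation, the default stat of a geom can also be checked from the console. This sketch relies on the fact that each geom function returns a layer object with a stat field; the variable names are only illustrative.

```r
library(ggplot2)

# Each geom_*() call returns a layer object that records the stat it will use
bar_stat <- class(geom_bar()$stat)[1]
col_stat <- class(geom_col()$stat)[1]
box_stat <- class(geom_boxplot()$stat)[1]

bar_stat  # the count stat
col_stat  # the identity stat
box_stat  # the boxplot stat
```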

Coordinate systems

ggplot2 uses a default coordinate system (the Cartesian coordinate system). This isn't super important until we want to do something like make a map (See coord_quickmap()) or create a pie chart (See coord_polar()).

When will we have to think about coordinate systems? We likely won't have to modify from default in too many cases (see those above). The most common circumstance in which we will likely need to change the coordinate system is in the event that we want to switch the x and y axes (?coord_flip()) or if we want to fix our aspect ratio (?coord_fixed()). Fixing the aspect ratio is useful when we want to ensure that one unit on the x-axis is the same length as one unit on the y-axis.

#let's return to our bar plot above
#get horizontal bars instead of vertical bars

ggplot(data = interesting_trnsc) + 
  geom_bar(mapping = aes(x = transcript,y=counts_scaled,
                         fill=SampleName),
           stat="identity",color="black",position="dodge") + 
  facet_wrap(~dex) +
  coord_flip()

Note

In the case of a bar plot, coord_flip is no longer required to get this effect. We could instead switch the x and y arguments. You may, however, be interested in using coord_flip with a different geom in the future, so it is nice to be aware of.
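The two approaches mentioned in this note can be compared side by side. The sketch below uses a hypothetical data frame (toy) and geom_col(); mapping the categorical variable to y in this way requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(gene = c("g1", "g2", "g3"), value = c(3, 7, 5))

# Option 1: build vertical bars, then flip the coordinate system
flipped <- ggplot(toy) +
  geom_col(aes(x = gene, y = value)) +
  coord_flip()

# Option 2: map the categorical variable to y; the bars are drawn horizontally
swapped <- ggplot(toy) +
  geom_col(aes(x = value, y = gene))

flipped
swapped
```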

Labels, legends, scales, and themes

How do we ultimately get our figures to a publishable state? The bread and butter of pretty plots really falls to the additional non-data layers of our ggplot2 code. These layers will include code to label the axes, scale the axes, and customize the legends and theme.

The default axis and legend titles come from the ggplot2 code. Let's return to our simple data set, sc, to demonstrate.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"))

In the above plot, the y-axis label ("TotalCounts") is the variable name mapped to the y aesthetic, while the x-axis label ("Num_transcripts") is the variable name mapped to the x aesthetic. The fill aesthetic was set equal to "dex", and so this became the default title of the fill legend. We can change these labels using ylab(), xlab(), or labs(), and guides() for the legend.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  labs(x ="Recovered transcripts per sample",
      y="Total sequences per sample")#add x and y labels

titles and subtitles

labs() can also be used to assign a title, subtitle, tags, and caption. See options with ?labs().
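A minimal sketch of these additional labs() arguments is shown here. The toy data frame and all of the label text are placeholders invented for this example.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(x = 1:5, y = c(2, 3, 5, 4, 6))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  labs(
    title = "Main title",            # shown above the plot
    subtitle = "Smaller subtitle",   # shown under the title
    caption = "Source: toy data",    # shown below the plot
    tag = "A",                       # panel tag, e.g. for multi-panel figures
    x = "X axis label",
    y = "Y axis label"
  )

p
```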

Let's change the legend title.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  labs(x ="Recovered transcripts per sample",
      y="Total sequences per sample") + 
  guides(fill = guide_legend(title="Treatment"))

Legend titles can be modified with guides(), labs(), or within the scale function. For example, we could have also modified the legend title in scale_fill_manual() using the name argument.
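The three routes to a legend title can be sketched as follows. The data frame toy is hypothetical, with a dex column that only mimics the lesson data; all three plots should show the same legend title.

```r
library(ggplot2)

# Hypothetical toy data mimicking the dex column, for illustration only
toy <- data.frame(
  x = 1:4,
  y = c(2, 4, 3, 5),
  dex = c("trt", "untrt", "trt", "untrt")
)

base <- ggplot(toy, aes(x = x, y = y, fill = dex)) +
  geom_point(shape = 21, size = 3)

# 1. guides(): set the title on the legend guide
via_guides <- base + guides(fill = guide_legend(title = "Treatment"))

# 2. labs(): name the aesthetic whose legend should be retitled
via_labs <- base + labs(fill = "Treatment")

# 3. the scale function: use the name argument
via_scale <- base +
  scale_fill_manual(name = "Treatment",
                    values = c(trt = "purple", untrt = "yellow"))

via_guides
via_labs
via_scale
```

If a scale function is already in the plot, the name argument keeps everything about that legend in one place; labs() is the shortest option otherwise.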

We can modify the axes scales of continuous variables using scale_x_continuous() and scale_y_continuous(). Discrete (categorical variable) axes can be modified using scale_x_discrete() and scale_y_discrete().

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  labs(x ="Recovered transcripts per sample",
      y="Total sequences per sample") +
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(breaks=seq(1.0e7, 3.5e7, by = 2e6),
                     limits=c(1.0e7,3.5e7)) #change breaks and limits

library(scales)

Check out the scales package to make nice axis labels.
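For instance, large counts can be printed with thousands separators instead of scientific notation. This is a minimal sketch that assumes the scales package is installed and uses a hypothetical data frame (toy); label_comma() is only one of several labeling helpers in scales.

```r
library(ggplot2)
library(scales)

# Hypothetical toy data with large values, for illustration only
toy <- data.frame(x = 1:3, total = c(1.2e7, 2.5e7, 3.1e7))

p <- ggplot(toy, aes(x = x, y = total)) +
  geom_point() +
  # label_comma() returns a function that formats the axis breaks
  scale_y_continuous(labels = label_comma())

p

# The labeller can also be called directly to preview its output
label_comma()(c(1.2e7, 2.5e7))
```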

Perhaps we want to represent these data on a logarithmic scale.

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  labs(x ="Recovered transcripts per sample",
      y="Total sequences per sample") +
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(trans="log10") #use the trans argument

Note

You could manually transform the data without transforming the scales. The figures would be the same, excluding the axes labels. When you use the transformed scale (e.g., scale_y_continuous(trans="log10") or scale_y_log10()), the axis labels remain in the original data space. When the data is transformed manually, the labels will also be transformed.
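The equivalence described in this note can be checked with layer_data(). The sketch below uses a hypothetical data frame (toy) whose values are powers of ten so the result is easy to read.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(x = 1:3, total = c(10, 100, 1000))

# Transform the scale: axis labels stay in the original units
scaled <- ggplot(toy, aes(x = x, y = total)) +
  geom_point() +
  scale_y_log10()

# Transform the data: axis labels show the log10 values
manual <- ggplot(toy, aes(x = x, y = log10(total))) +
  geom_point()

scaled
manual

# Both plots place the points at the same positions internally
layer_data(scaled)$y
layer_data(manual)$y
```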

Finally, we can change the overall look of non-data elements of our plot (titles, labels, fonts, background, grid lines, and legends) by customizing ggplot2 themes. Check out ?ggplot2::theme() for a list of available parameters. ggplot2 provides 8 complete themes, with theme_gray() as the default theme.

[Figure: ggplot2 complete themes]

You can also create your own custom theme and then apply it to multiple figures.

Create a custom theme to use with multiple figures.

#Setting a theme
my_theme <-
    theme_bw() +
      theme(
        #Remove the border around the plot
        panel.border = element_blank(),
        # Add the axis lines back in
        axis.line = element_line(),
        #resize the major and minor grid lines
        panel.grid.major = element_line(linewidth = 0.2),
        panel.grid.minor = element_line(linewidth = 0.1),
        #set the text size
        text = element_text(size = 12),
        #Move the legend to the bottom
        legend.position = "bottom",
        #Angle the x axis text
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
      )

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,fill=dex),
             shape=21,size=2) + 
  scale_fill_manual(values=c("purple", "yellow"), 
                    labels=c('treated','untreated'))+ 
  labs(x ="Recovered transcripts per sample",
      y="Total sequences per sample") +
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(trans="log10") + #use the trans argument
  my_theme
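Instead of adding a theme object to every plot, a theme can also be made the session default with theme_set(). The sketch below is one way to do this; custom_theme and toy are hypothetical names defined here so the example is self-contained.

```r
library(ggplot2)

# A small hypothetical theme, defined here so the example is self-contained
custom_theme <- theme_bw() +
  theme(legend.position = "bottom",
        text = element_text(size = 12))

# theme_set() makes it the default and returns the previous default
old_theme <- theme_set(custom_theme)

# Hypothetical toy data, for illustration only
toy <- data.frame(x = 1:3, y = c(2, 4, 3), grp = c("a", "b", "a"))

# No theme is added here; the plot uses the current default when it is drawn
p <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point()

p

# Restore the previous default so later plots are unaffected
theme_set(old_theme)
```

Adding the theme explicitly, as in the plot above this example, keeps each figure's code self-describing; theme_set() is more convenient when every plot in a script should share one look.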

Saving plots (ggsave())

Finally, we have a quality plot ready to publish. The next step is to save our plot to a file. The easiest way to do this with ggplot2 is ggsave(). This function will save the last plot that you displayed by default. Look at the function parameters using ?ggsave().

ggsave("Plot1.png",width=5.5,height=3.5,units="in",dpi=300)
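Because ggsave() defaults to the last plot displayed, it can be safer to pass the plot object explicitly. The sketch below does that with a hypothetical plot and writes to a temporary directory; the file name is a placeholder, and the output format is inferred from the file extension (PDF here).

```r
library(ggplot2)

# Hypothetical toy data and plot, for illustration only
toy <- data.frame(x = 1:3, y = c(2, 4, 3))
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Placeholder output path in a temporary directory
out_file <- file.path(tempdir(), "toy_plot.pdf")

# plot = p saves this specific plot regardless of what was displayed last
ggsave(out_file, plot = p, width = 5.5, height = 3.5, units = "in")

file.exists(out_file)
```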

Acknowledgements

Material from this lesson was inspired by Chapter 3 of R for Data Science and by a 2021 workshop entitled Introduction to Tidy Transcriptomics by Maria Doyle and Stefano Mangiola.