Introduction to ggplot2 for R Data Visualization

Learning Objectives

Identify and describe the core components of a ggplot2 plot, including data, aesthetics, and geometric layers.
Learn the grammar of graphics for plot construction.
Construct basic plots in ggplot2 by mapping variables to aesthetics and adding simple geometric layers.

To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.

Why use R for Data Visualization?

Learning R and associated plotting packages is a great way to generate publishable figures in a reproducible fashion.

With R you can:
1. Create simple or complex figures.
2. Create high resolution figures.
3. Generate scripts that can be reused to create the same or similar plot.

Why not use Excel for data visualization?

Excel is a great program for managing data in a spreadsheet. However, it is poorly suited to "big data": large data sets are difficult to work with, and the resulting plots are generally too low in resolution to publish. Using R will not only keep you from accidentally editing your data, but it will also allow you to generate scripts that can be viewed later or reused to generate the same plot using different data. This will keep you from having to rely on your memory when wondering what data was used or how a plot was generated.

ggplot2 is an R graphics package from the tidyverse collection. It allows the user to create informative plots quickly by using a 'grammar of graphics' implementation, which is described as "a coherent system for describing and building graphs" (R4DS). The power of this package is that plots are built in layers and few changes to the code result in very different outcomes. This makes it easy to reuse parts of the code for very different figures.

Being a part of the tidyverse collection, ggplot2 works best with data frames (tidy data), which you should already be accustomed to.

To begin plotting, let's load our tidyverse library.

#load libraries
library(tidyverse) # Tidyverse automatically loads ggplot2

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.2     v tibble    3.3.0
v lubridate 1.9.4     v tidyr     1.3.1
v purrr     1.0.4     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Example Data

We also need some data to plot, so if you haven't already, let's load the data we will need for this lesson.

Getting the Data

If you have not already done so, please download the data for this course from here and unzip it to your working directory.

If you are using RStudio on Biowulf, you can use the following steps to download and unzip the data directly to your working directory.

Open the "Terminal" in RStudio (See the tab next to "Console").
Make sure you are in your working directory. You can check this by typing pwd and hitting enter. If you are not in your working directory, you can change to it using the cd command. For example, if your working directory is /data/username/, you would type cd /data/username/ and hit enter.

Download the data using the wget command:

wget https://bioinformatics.ccr.cancer.gov/docs/r_for_novices/Data_Visualization_with_R/data.zip

Unzip the data using the unzip command:
```
unzip data.zip 
```

Alternatively, you can download the data to your local machine and then upload it to your working directory in RStudio using the "Upload" button in the "Files" tab.

#scaled_counts data
scaled_counts<-
  read_delim("./data/filtlowabund_scaledcounts_airways.txt")

Rows: 127408 Columns: 18
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (11): feature, SampleName, cell, dex, albut, Run, Experiment, Sample, Bi...
dbl  (6): sample, counts, avgLength, TMM, multiplier, counts_scaled
lgl  (1): .abundant

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

dexp<-read_delim("./data/diffexp_results_edger_airways.txt")

Rows: 15926 Columns: 10
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

The example data we will use for today's lesson were generated from data available in the Bioconductor package airway, which "provides a RangedSummarizedExperiment object of read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone" and reported in Himes et al. (2014).

In this experiment, the authors compared transcriptomic differences in primary human airway smooth muscle cell lines treated with dexamethasone, a common therapy for asthma. Each cell line included a treated sample and an untreated negative control, resulting in a total sample size of 8.

Practice Data

There are a number of built-in data sets available for practicing with ggplot2. Check these out here!

For example, mpg is commonly used in ggplot2 documentation:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +     
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Occasionally, I will pull in practice data to demonstrate specific aspects of ggplot2.

The ggplot2 template

The following represents the basic ggplot2 template.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

We need three basic components to create a plot:

data we want to plot
geom function(s)
mapping aesthetics

Notice the + symbol following the ggplot() function. This symbol joins each additional layer of code to the plot, and it must be placed at the end of a line rather than at the start of the next one. More on geom functions and mapping aesthetics to come.
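
To make the placement of the + concrete, here is a minimal sketch of the template filled in. It uses mpg, a practice data set that ships with ggplot2, rather than the lesson data, and the object name p_template is only for illustration.

```r
library(ggplot2)

# mpg is a practice data set bundled with ggplot2 (not the lesson data)
p_template <- ggplot(data = mpg) +                 # the + ends the line, so R keeps reading
  geom_point(mapping = aes(x = displ, y = hwy))    # the geom layer with its mapping

p_template
```

If the + were moved to the start of the second line, R would treat the first line as a complete statement, and the second line would fail on its own.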

Let's see this template in practice.

We will examine the relationship between the total transcript sums per sample (total reads) and the number of recovered transcripts per sample.

We can generate these data using

sc <- scaled_counts |> group_by(dex, SampleName) |> 
  summarize(Num_transcripts=sum(counts>100),TotalCounts=sum(counts))

`summarise()` has grouped output by 'dex'. You can override using the `.groups`
argument.

sc

# A tibble: 8 x 4
# Groups:   dex [2]
  dex   SampleName Num_transcripts TotalCounts
  <chr> <chr>                <int>       <dbl>
1 trt   GSM1275863           10768    18783120
2 trt   GSM1275867           10051    15144524
3 trt   GSM1275871           11658    30776089
4 trt   GSM1275875           10900    21135511
5 untrt GSM1275862           11177    20608402
6 untrt GSM1275866           11526    25311320
7 untrt GSM1275870           11425    24411867
8 untrt GSM1275874           11000    19094104

Let's plot

ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts))

We can easily see that there is a relationship between the number of reads per sample and the total transcripts recovered per sample. ggplot2 default parameters are great for exploratory data analysis. But, with only a few tweaks, we can make some beautiful, publishable figures.

What did we do in the above code?
The first step to creating this plot was initializing the ggplot object using the function ggplot(). Remember, we can look further for help using ?ggplot(). The function ggplot() takes data, mapping, and further arguments. However, none of these need to actually be provided at the initialization phase, which creates the coordinate system from which we build our plot. But, typically, you should at least call the data at this point.

The data we called was from the data frame sc, which we created above. Next, we provided a geom function (geom_point()), which created a scatter plot. This scatter plot required mapping information, which we provided for the x and y axes. More on this in a moment.

Let's break down the individual components of the code.

#What does running ggplot() do?
ggplot(data=sc)

#What about just running a geom function?
geom_point(data=sc,aes(x=Num_transcripts, y = TotalCounts))

mapping: x = ~Num_transcripts, y = ~TotalCounts 
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity

#what about this
ggplot() +
geom_point(data=sc,aes(x=Num_transcripts, y = TotalCounts))

Geom functions

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. --- R4DS

There are multiple geom functions that change the basic plot type or the plot representation.

scatter plots (geom_point()),
line plots (geom_line(),geom_path()),
bar plots (geom_bar(), geom_col()),
line modeled to fitted data (geom_smooth()),
heat maps (geom_tile()) (Tip: Use ComplexHeatmap or pheatmap),
geographic maps (geom_polygon()), etc.

ggplot2 provides over 40 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://posit.co/resources/cheatsheets/. --- R4DS

You can also see a number of options pop up when you type geom into the console, or you can look up the ggplot2 documentation in the help tab. For more detailed reference pages and examples, see the ggplot2 website reference pages.

Create a line plot

We can see how easy it is to change the way the data is plotted. Let's plot the same data using geom_line().

ggplot(data=sc) + 
  geom_line(aes(x=Num_transcripts, y = TotalCounts))

Create a box plot

Let's plot the same data using geom_boxplot(). A boxplot can be used to summarize the distribution of a numeric variable across groups.

ggplot(data=sc) + 
  geom_boxplot(aes(x=dex, y = TotalCounts))

Note

This time we also modified the x argument.

Mapping and aesthetics (`aes()`)

Geom functions need a mapping to know which variables to draw. The mapping argument takes the output of the aes() function, which "describes how variables in the data are mapped to visual properties (aesthetics) of geoms" (ggplot2 R Documentation). If a geom is not given its own mapping, it inherits the one supplied to the ggplot() function.

An aesthetic is a visual property of the objects in your plot.---R4DS

Mapping aesthetics include some of the following:
1. the x and y data arguments
2. shapes
3. color
4. fill
5. size
6. linetype
7. alpha

This is not an all encompassing list. You can add multiple aesthetics to a plot to represent different variables.
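
As a rough sketch of how mappings are inherited and combined, the code below rebuilds the sc summary by hand from the values printed earlier (so it runs without the lesson files) and places x and y in ggplot(). The name p_inherit is only for illustration.

```r
library(ggplot2)

# Per-sample summary copied from the `sc` tibble printed earlier in the lesson
sc <- data.frame(
  dex = rep(c("trt", "untrt"), each = 4),
  SampleName = c("GSM1275863", "GSM1275867", "GSM1275871", "GSM1275875",
                 "GSM1275862", "GSM1275866", "GSM1275870", "GSM1275874"),
  Num_transcripts = c(10768, 10051, 11658, 10900, 11177, 11526, 11425, 11000),
  TotalCounts = c(18783120, 15144524, 30776089, 21135511,
                  20608402, 25311320, 24411867, 19094104)
)

# A mapping given to ggplot() is inherited by every geom layer
p_inherit <- ggplot(sc, aes(x = Num_transcripts, y = TotalCounts)) +
  geom_point(aes(color = dex)) +   # adds color on top of the inherited x and y
  geom_line()                      # uses only the inherited x and y

p_inherit
```

Putting shared aesthetics in ggplot() avoids repeating them in each geom, while aesthetics given inside a geom apply to that layer only.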

Map a Color to a Variable

Let's return to our plot above. Is there a relationship between treatment ("dex") and the number of transcripts or total counts?

#adding the color argument to our mapping aesthetic
ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,color=dex))

There is potentially a relationship. ASM cells treated with dexamethasone in general have lower total numbers of transcripts and lower total counts.

Notice how the color of our points now represents a variable. To do this, we set color equal to dex within the aes() function. This mapped our aesthetic, color, to a variable we were interested in exploring ("dex"). Aesthetics that are not mapped to variables are placed outside of the aes() function. These aesthetics are assigned manually and do not undergo the same scaling process as those within aes().

For example,

#map the shape aesthetic to the variable "dex"
#use the color purple across all points (NOT mapped to a variable)
ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,shape=dex),
             color="purple")

We can also see from this that 'dex' could be mapped to other aesthetics. In the above example, we see it mapped to shape rather than color. By default, ggplot2 will only map six shapes at a time, and if your number of categories goes beyond 6, the remaining groups will go unmapped. This is by design because it is hard to discriminate between more than six shapes at any given moment. This is a clue from ggplot2 that you should choose a different aesthetic to map to your variable. However, if you choose to ignore this functionality, you can manually assign more than six shapes.
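
For illustration only, here is one way to assign more than six shapes by hand with scale_shape_manual(). This sketch uses the class column of the built-in mpg data, which has seven categories; the shape codes 0 to 6 are an arbitrary choice.

```r
library(ggplot2)

# mpg$class has seven categories, one more than the default shape palette covers
n_classes <- length(unique(mpg$class))

p_shapes <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:(n_classes - 1))  # one hand-picked shape code per category

p_shapes
```

With only two categories, dex stays well within the default limit, so no manual scale is needed for our data.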

We could have just as easily mapped it to alpha, which adds a gradient to the point visibility by category.

#map the alpha aesthetic to the variable "dex"
#use the color purple across all points (NOT mapped to a variable)
ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,alpha=dex),
             color="purple") #note the warning.

Warning: Using alpha for a discrete variable is not advised.

Or we could map it to size. There are multiple options, so feel free to explore a little with your plots.

Defaults

Notice that the assignment of color, shape, or alpha to our variable was automatic, with a unique aesthetic level representing each category (i.e., 'trt', 'untrt') within our variable. Most of what we see on this plot is generated automatically with defaults (e.g., assigned colors, legend, axis titles, plot background, tick marks and labels). We can change these defaults, for example, which colors are used, by adding additional layers to our code.

R objects can also store figures

As we have discussed, R objects are used to store things created in R to memory. This includes plots created with ggplot2.

scatter_plot<-ggplot(data=sc) + 
  geom_point(aes(x=Num_transcripts, y = TotalCounts,
                 color=dex)) 

scatter_plot

We can add additional layers directly to our object.
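
A minimal sketch of this idea follows. The sc data frame is rebuilt by hand from the values printed earlier so the code runs on its own, and scatter_with_line is a hypothetical name for the extended plot.

```r
library(ggplot2)

# Per-sample summary copied from the `sc` tibble printed earlier in the lesson
sc <- data.frame(
  dex = rep(c("trt", "untrt"), each = 4),
  Num_transcripts = c(10768, 10051, 11658, 10900, 11177, 11526, 11425, 11000),
  TotalCounts = c(18783120, 15144524, 30776089, 21135511,
                  20608402, 25311320, 24411867, 19094104)
)

# Store the plot in an object
scatter_plot <- ggplot(data = sc) +
  geom_point(aes(x = Num_transcripts, y = TotalCounts, color = dex))

# Add a layer to the stored object; the original object is left unchanged
scatter_with_line <- scatter_plot +
  geom_line(aes(x = Num_transcripts, y = TotalCounts, color = dex))

scatter_with_line
```

Because the mapping was given inside geom_point() rather than ggplot(), the new layer needs its own mapping.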

How can we modify colors?

Colors are assigned to the fill and color aesthetics in aes(). We can change the default colors by providing an additional layer to our figure. To change the color, we use the scale_color functions:

scale_color_manual(),
scale_color_brewer(),
scale_color_grey(), etc.

Example:

ggplot(sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, 
                 color=dex)) +
  scale_color_manual(values=c("red","black"),
                     labels=c('treated','untreated'))

Similarly, if we want to change the fill, we would use the scale_fill options. To modify shapes, use scale_shape options.
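
As a small sketch of the fill case, the bar plot below maps fill to a variable and then overrides the default fill colors with scale_fill_manual(). It uses the built-in mpg data, and the three colors are arbitrary choices.

```r
library(ggplot2)

# mpg ships with ggplot2; drv (drive train) has three categories: 4, f, r
fill_values <- c("grey30", "steelblue", "orange")

p_fill <- ggplot(mpg) +
  geom_bar(aes(x = drv, fill = drv)) +       # fill colors the inside of each bar
  scale_fill_manual(values = fill_values)    # colors are assigned in category order

p_fill
```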

Additional arguments

We can modify the behavior of any function by adding additional arguments (if available). Here we changed the color labels in the legend using the labels argument. The labels must be in the correct order. You do not want to mislabel the legend.

Order of Categorical Variables

By default, ggplot2 will alphabetize categorical variables. If you want to change the order of a categorical variable, you can do so by converting the variable to a factor and specifying the levels in the order you want them to appear. The package forcats has a number of functions to help you work with factors. See the forcats documentation for more information.
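
One possible way to do this in base R is sketched below with factor(). The small data frame copies the dex and TotalCounts values printed earlier, and putting untreated samples first is just an example ordering.

```r
library(ggplot2)

# Treatment labels and total counts copied from the `sc` tibble printed earlier
sc <- data.frame(
  dex = rep(c("trt", "untrt"), each = 4),
  TotalCounts = c(18783120, 15144524, 30776089, 21135511,
                  20608402, 25311320, 24411867, 19094104)
)

# As a character column, dex would be ordered alphabetically: "trt", then "untrt".
# Converting to a factor with explicit levels puts untreated samples first.
sc$dex <- factor(sc$dex, levels = c("untrt", "trt"))

p_ordered <- ggplot(data = sc) +
  geom_boxplot(aes(x = dex, y = TotalCounts))

p_ordered
```

The same reordering can be done with forcats, for example with fct_relevel().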

More on Colors

There are a number of ways to specify the color argument including by name, number, and hex code. Here is a great resource from the R Graph Gallery for assigning colors in R.
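
The three styles can be sketched side by side as below, using the built-in mpg data. The particular colors are arbitrary; a number refers to a position in R's current palette(), so what it looks like depends on that palette.

```r
library(ggplot2)

# Color given by name
p_named <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "purple")

# Color given by hex code
p_hex <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "#1B9E77")

# Color given by number (a position in the current palette())
p_number <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = 2)

p_hex
```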

There are also a number of complementary packages in R that expand our color options.

viridis - provides colorblind friendly palettes.
randomcoloR - generates large numbers of random colors.
paletteer - provides a common interface to a comprehensive set of color palettes drawn from many packages.

library(viridis)

Loading required package: viridisLite

ggplot(sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, 
                 color=dex)) + 
  scale_color_viridis(discrete=TRUE, option="viridis")

Facets

A way to add variables to a plot beyond mapping them to an aesthetic is to use facets, or subplots. There are two primary functions to add facets, facet_wrap() and facet_grid(). If faceting by a single variable, use facet_wrap(). If faceting by two variables, use facet_grid(). The first argument of either function can be a formula, with variables separated by a ~ (see below). Variables must be discrete (not continuous). In newer versions of ggplot2, you can additionally use vars() to select variables for faceting. See ?facet_wrap() for more information.

Using ~ in ggplot2

The ~ is used in R formulas to split the dependent or response variable from the independent variable(s). For more information, see this explanation here.

In facet_wrap() / facet_grid() the ~ is used to generate a formula specifying rows by columns.
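
To illustrate the rows by columns idea, the sketch below facets the built-in mpg data by two variables, once with a formula and once with vars(). The choice of drv for rows and cyl for columns is arbitrary.

```r
library(ggplot2)

# Formula interface: rows ~ columns
p_formula <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

# The same layout written with vars()
p_vars <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv), cols = vars(cyl))

p_formula
```

Both calls should produce the same grid of panels, so the choice between them is mostly a matter of readability.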

Let's return to the airway count data to see how facets are useful. Here, we are going to compare scaled and unscaled count data using a density plot.

A density plot shows the distribution of a numeric variable. --- R Graph Gallery

In our example data, density_data, the gene counts were scaled to account for technical and composition differences using the trimmed mean of M values (TMM) from edgeR (Robinson and Oshlack 2010), but non-normalized values remained for comparison. Thus, we can compare scaled vs unscaled counts by sample using faceting.

Let's import and examine the data with head().

density_data<-read.csv("./data/density_data.csv",
                       stringsAsFactors=TRUE)

head(density_data)

          feature sample SampleName   cell   dex albut        Run avgLength
1 ENSG00000000003    508 GSM1275862 N61311 untrt untrt SRR1039508       126
2 ENSG00000000003    508 GSM1275862 N61311 untrt untrt SRR1039508       126
3 ENSG00000000419    508 GSM1275862 N61311 untrt untrt SRR1039508       126
4 ENSG00000000419    508 GSM1275862 N61311 untrt untrt SRR1039508       126
5 ENSG00000000457    508 GSM1275862 N61311 untrt untrt SRR1039508       126
6 ENSG00000000457    508 GSM1275862 N61311 untrt untrt SRR1039508       126
  Experiment    Sample    BioSample transcript ref_genome .abundant      TMM
1  SRX384345 SRS508568 SAMN02422669     TSPAN6       hg38      TRUE 1.055278
2  SRX384345 SRS508568 SAMN02422669     TSPAN6       hg38      TRUE 1.055278
3  SRX384345 SRS508568 SAMN02422669       DPM1       hg38      TRUE 1.055278
4  SRX384345 SRS508568 SAMN02422669       DPM1       hg38      TRUE 1.055278
5  SRX384345 SRS508568 SAMN02422669      SCYL3       hg38      TRUE 1.055278
6  SRX384345 SRS508568 SAMN02422669      SCYL3       hg38      TRUE 1.055278
  multiplier        source abundance
1   1.415149        counts  679.0000
2   1.415149 counts_scaled  960.8864
3   1.415149        counts  467.0000
4   1.415149 counts_scaled  660.8748
5   1.415149        counts  260.0000
6   1.415149 counts_scaled  367.9388

Notice the source column, which indicates whether the counts are scaled or unscaled. These data are in long rather than wide format: each gene and sample combination has one row per count type instead of one column per count type. You may need to reshape your own data this way to represent the information with ggplot2. Here, we can use the source variable to facet our density plot.
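
If your own data start out in wide format, one possible way to reshape them is pivot_longer() from tidyr. The toy data frame below is an assumption for illustration: it holds the three genes and count values shown in the head() output, arranged with one column per count type.

```r
library(tidyr)

# Hypothetical wide layout: one row per gene, one column per count type
wide_counts <- data.frame(
  transcript = c("TSPAN6", "DPM1", "SCYL3"),
  counts = c(679, 467, 260),
  counts_scaled = c(960.8864, 660.8748, 367.9388)
)

# Long layout: one row per gene and count type, as in density_data
long_counts <- pivot_longer(
  wide_counts,
  cols = c(counts, counts_scaled),
  names_to = "source",       # former column names go here
  values_to = "abundance"    # former cell values go here
)

long_counts
```

The resulting source column plays the same role as the one in density_data and can be passed to facet_wrap().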

#plot 
ggplot(data= density_data)+ #initialize ggplot
  geom_density(aes(x=abundance, color=SampleName)) + #call density plot geom
  facet_wrap(~source) + #use facet_wrap
  scale_x_log10()#scales the x axis using a base-10 log transformation

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Warning: Removed 140 rows containing non-finite outside the scale range
(`stat_density()`).

The distributions of sample counts did not differ greatly between samples before scaling, but regardless, we can see that the distributions are more similar after scaling.

Here, faceting allowed us to visualize multiple features of our data. We were able to see count distributions by sample as well as normalized vs non-normalized counts.

Note the help options with ?facet_wrap(). How would we make our plot facets vertical rather than horizontal?

ggplot(data= density_data)+ #initialize ggplot
  geom_density(aes(x=abundance, 
             color=SampleName)) + #call density plot geom
  facet_wrap(~source, ncol=1) + #use the ncol argument (facet_grid() has no ncol)
  scale_x_log10()

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Warning: Removed 140 rows containing non-finite outside the scale range
(`stat_density()`).

Building upon our template

This is the grammar of graphics: adding layers to create unique figures.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>)
  ) +
  <FACET_FUNCTION>

Note that there are a lot of invisible (default) layers that often go into each ggplot2, and there are ways to customize these layers. See this chapter from R for Data Science for more information on the grammar of graphics.

Labels, legends, scales, and themes

How do we ultimately get our figures to a publishable state? The bread and butter of pretty plots really falls to the additional non-data layers of our ggplot2 code. These layers will include code to label the axes, scale the axes, and customize the legends and theme. We will be working with these additional plot features in the weeks to come, so stay tuned.

Acknowledgements

Material from this lesson was inspired by Chapter 3 of R for Data Science and from "Data Visualization", Introduction to data analysis with R and Bioconductor, which is part of the Carpentries Incubator.