Data Visualization with ggplot2
Learning Objectives
- Understand the ggplot2 syntax.
- Learn the grammar of graphics for plot construction.
- Create simple, pretty, and effective figures.
What is R?
R is a computational language and environment for statistical computing and graphics.
Advantages of R programming:
- open-source
- extensible (> 19,000 packages on CRAN, plus GitHub and Bioconductor)
- wide community
- allows reproducibility (R scripts, R Markdown, Quarto)
- includes fantastic options for data viz (base R, ggplot2, lattice, plotly)
RStudio
An integrated development environment (IDE) for R, and now Python. RStudio includes a console, editor, and tools for plotting, history, debugging, and workspace management.
What is ggplot2?
An R graphics package from the tidyverse, a collection of popular data science packages that work really well with data organized in data frames (or tibbles).
Why ggplot2?
- Widespread popularity.
- Used to create informative plots quickly.
- Used to create high resolution plots.
- Used to customize many package specific plots.
- Over 100 related extensions
Outside of base R plotting, one of the most popular packages used to generate graphics in R is ggplot2, which is associated with a family of packages collectively known as the tidyverse. ggplot2 allows the user to create informative plots quickly by using a 'grammar of graphics' implementation, which is described as "a coherent system for describing and building graphs" (R4DS). We will see this in action shortly. The power of this package is that plots are built in layers, and a few changes to the code can result in very different outcomes. This makes it easy to reuse parts of the code for very different figures.
Being a part of the tidyverse collection, ggplot2 works best with data organized so that individual observations are in rows and variables are in columns.
Getting started with ggplot2
To begin plotting, we need to load the ggplot2 package. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users.
R packages must be loaded from your R library every time you open and use R. If you haven't yet installed the ggplot2 package on your local machine, you will need to do that using install.packages("ggplot2").
#load the ggplot2 library; you could also load library(tidyverse)
library(ggplot2)
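If you are not sure whether ggplot2 is already installed, a small guard like the one below can save a step. This is only a sketch: it assumes a CRAN mirror is reachable, and the explicit repos URL is an illustrative choice rather than part of the lesson.

```r
# Install ggplot2 only if it is missing from the local library
# (assumption: a CRAN mirror is reachable from this machine)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Attach the package for the current session
library(ggplot2)

# Confirm which version was loaded
packageVersion("ggplot2")
```

The installation only needs to happen once per machine, while library(ggplot2) is needed in every new session.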
Getting help
The R community is extensive, and getting help is now easier than ever with a simple web search. If you can't figure out how to plot something, give a quick web search a try. Great resources include internet tutorials, R bookdowns, and Stack Overflow. You should also use the help features within RStudio to get help on specific functions or to find vignettes. Try entering ggplot2 in the help search bar in the lower right panel under the Help tab.
Resources for Learning
Example Data
The example data we will use for plotting are from a bulk RNA-Seq experiment described by Himes et al. (2014) and available in the Bioconductor package airway. In this experiment, the authors compared transcriptomic differences in primary human airway smooth muscle (ASM) cell lines treated with dexamethasone, a common therapy for asthma. Each of the four cell lines included a treated sample and an untreated negative control, resulting in a total sample size of 8.
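#data import from Excel: read the first sheet, skip the 3 rows above the header,
#and repair the column names so they are valid R names
exdata <- readxl::read_xlsx("./data/RNASeq_totalcounts_vs_totaltrans.xlsx",
                            sheet = 1, .name_repair = "universal", skip = 3)
exdata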
# A tibble: 8 × 4
Sample.Name Treatment Number.of.Transcripts Total.Counts
<chr> <chr> <dbl> <dbl>
1 GSM1275863 Dexamethasone 10768 18783120
2 GSM1275867 Dexamethasone 10051 15144524
3 GSM1275871 Dexamethasone 11658 30776089
4 GSM1275875 Dexamethasone 10900 21135511
5 GSM1275862 None 11177 20608402
6 GSM1275866 None 11526 25311320
7 GSM1275870 None 11425 24411867
8 GSM1275874 None 11000 19094104
These derived data include total transcript read counts summed by sample and the total number of transcripts recovered by sample that had at least 100 reads.
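If you do not have the Excel file at hand, the eight rows printed above are enough to rebuild the data by hand. The sketch below does this with a base data.frame (an assumption made to avoid extra dependencies; the lesson's own object is a tibble), using the values and column names from the printed table.

```r
# Recreate the example data from the values printed above
# (base data.frame used here; the lesson's object is a tibble)
exdata <- data.frame(
  Sample.Name = c("GSM1275863", "GSM1275867", "GSM1275871", "GSM1275875",
                  "GSM1275862", "GSM1275866", "GSM1275870", "GSM1275874"),
  Treatment = rep(c("Dexamethasone", "None"), each = 4),
  Number.of.Transcripts = c(10768, 10051, 11658, 10900,
                            11177, 11526, 11425, 11000),
  Total.Counts = c(18783120, 15144524, 30776089, 21135511,
                   20608402, 25311320, 24411867, 19094104)
)

# Check the structure: 8 samples, 4 variables
str(exdata)

# Count the samples in each treatment group
table(exdata$Treatment)
```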
Get the data
You can grab this file here.
Practice Data
There are a number of built-in data sets available for practicing with ggplot2. Check these out here!
For example, mtcars is commonly used in ggplot2 documentation.
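To see what is available for practice, you can list the data sets that ship with ggplot2 and preview mtcars, which comes with base R. The sketch below is one way to do that; the object names ggplot2_data and practice_plot are made up for illustration.

```r
library(ggplot2)

# List the data sets bundled with ggplot2 (e.g., mpg, diamonds)
ggplot2_data <- data(package = "ggplot2")$results[, "Item"]
ggplot2_data

# mtcars ships with base R (datasets package); preview the first rows
head(mtcars)

# A quick practice plot: car weight vs. fuel efficiency
practice_plot <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg))
practice_plot
```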
The ggplot2 template
The basic ggplot2 template:
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
The only required components to begin plotting are the data we want to plot, geom function(s), and mapping aesthetics. Notice the + symbol following the ggplot() function. This symbol will precede each additional layer of code for the plot, and it is important that it is placed at the end of the line. More on geom functions and mapping aesthetics to come.
Using the template
To get familiar with the basic ggplot2 template, let's answer the following question:
What is the relationship between total transcript sums per sample and the number of recovered transcripts per sample?
We can plot using:
#let's plot our data
ggplot(data=exdata) +
  geom_point(aes(x=Number.of.Transcripts, y = Total.Counts))
How did we create this plot?
The first step in creating this plot was initializing the ggplot object using the function ggplot(). Remember, we can look further for help using ?ggplot. The function ggplot() takes data, mapping, and further arguments. However, none of this needs to actually be provided at the initialization phase, which creates the coordinate system from which we build our plot. But, typically, you should at least call the data at this point.
The data we called was from the data frame exdata, which we created above. Next, we provided a geom function (geom_point()), which created a scatter plot. This scatter plot required mapping information, which we provided for the x and y axes. More on this in a moment.
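The steps described above can be run one at a time to see what each contributes. The sketch below uses the built-in mtcars data as a stand-in for exdata (an assumption, so that the code runs without the Excel file), and the object names are illustrative only.

```r
library(ggplot2)

# Step 1: an empty plot object; no data, no mapping, no layers yet
p_empty <- ggplot()

# Step 2: attach the data at initialization (still nothing is drawn)
p_data <- ggplot(data = mtcars)

# Step 3: add a geom layer with its x and y mappings to get a scatter plot
p_points <- p_data +
  geom_point(aes(x = wt, y = mpg))

# The number of layers grows as geoms are added
length(p_data$layers)
length(p_points$layers)

p_points
```

Printing p_empty or p_data gives a blank panel, which is a quick way to confirm that nothing is drawn until a geom is added.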
Geom functions
"A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses." (R4DS)
There are multiple geom functions (>40 in ggplot2) that change the basic plot type or the plot representation.
- scatter plots (geom_point())
- line plots (geom_line(), geom_path())
- bar plots (geom_bar(), geom_col())
- line modeled to fitted data (geom_smooth())
- heat maps (geom_tile())
- geographic maps (geom_polygon())
- etc.
You can also see a number of options pop up when you type geom into the script pane of RStudio, or you can look up the ggplot2 documentation in the help tab.
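Outside of RStudio's autocomplete, the same list can be pulled up from the console. The sketch below uses apropos() from base R to search the attached packages for function names that start with "geom_"; the variable name geom_functions is just for illustration.

```r
library(ggplot2)

# Find every attached function whose name starts with "geom_"
geom_functions <- apropos("^geom_")

# How many are there, and what are the first few?
length(geom_functions)
head(geom_functions, 10)
```

Note that the result includes geoms from any other attached packages, so the count may be higher if extension packages are loaded.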
Changing the Geom function
We can see how easy it is to change the way the data is plotted. Let's plot the same data using geom_line().
Creating a line plot
ggplot(data=exdata) +
  geom_line(aes(x=Number.of.Transcripts, y = Total.Counts))
Here we can see one of the advantages of ggplot2, which is that it is easy to change the overall plot representation with small edits to the code.
Creating a boxplot
Let's plot the same data using geom_boxplot(). A boxplot can be used to summarize the distribution of a numeric variable across groups.
ggplot(data=exdata) +
  geom_boxplot(aes(x=Treatment, y = Total.Counts))
This time we also modified the x argument, mapping it to Treatment so that the total counts are grouped by treatment.
Mapping and aesthetics (aes())
The geom functions take a mapping argument. The mapping argument is built with the aes() function, which "describes how variables in the data are mapped to visual properties (aesthetics) of geoms" (ggplot2 R Documentation). If a geom is not given its own mapping, it inherits the mapping supplied to the ggplot() function.
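The inheritance rule is easiest to see with two layers. In the sketch below, mtcars stands in for exdata (an assumption, so the code runs on its own), and geom_smooth() with method = "lm" is used only as a convenient second layer.

```r
library(ggplot2)

# Mapping supplied once in ggplot(): every geom below inherits x and y
p_global <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Mapping supplied per geom: each layer must be given its own aes()
p_local <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  geom_smooth(aes(x = wt, y = mpg), method = "lm", formula = y ~ x)

# Both versions place the points at the same x and y positions
same_points <- isTRUE(all.equal(
  layer_data(p_global, 1)[, c("x", "y")],
  layer_data(p_local, 1)[, c("x", "y")]
))
same_points
```

Putting shared mappings in ggplot() keeps the code shorter when several geoms use the same variables, while per-geom mappings are useful when layers need different ones.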
"An aesthetic is a visual property of the objects in your plot." (R4DS)
Mapping aesthetics include some of the following:
- the x and y data arguments
- shapes
- color
- fill
- size
- linetype
- alpha
This is not an all encompassing list of mapping aesthetics.
Map a Color to a Variable
Now that we know what we mean by "aesthetics", let's map color to a variable within the data.
Is there a relationship between dexamethasone treatment and the number of transcripts or total counts?
#adding the color argument to our mapping aesthetic
ggplot(exdata) +
  geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
                 color=Treatment))
Notice how we changed the color of our points to represent the variable "Treatment". We did this by setting color equal to 'Treatment' within the aes() function. This mapped our aesthetic, color, to a variable we were interested in exploring.
From this, we can see that there is potentially a relationship between treatment and the number of transcripts or total counts. ASM cells treated with dexamethasone in general have lower total numbers of transcripts and lower total counts.
Changing the color of all points
Aesthetics that are not mapped to our variables are placed outside of the aes() function. These aesthetics are manually assigned and do not undergo the same scaling process as those within aes(). For example, we can color all points on the plot purple.
#map the shape aesthetic to the variable "Treatment"
#use the color purple across all points (NOT mapped to a variable)
ggplot(exdata) +
  geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
                 shape=Treatment), color="purple")
Here, we also mapped 'Treatment' to an aesthetic other than color, shape. By default, ggplot2 will only map six shapes at a time, and if your number of categories goes beyond 6, the remaining groups will go unmapped. This is by design because it is hard to discriminate between more than six shapes at any given moment. This is a clue from ggplot2 that you should choose a different aesthetic to map to your variable. However, if you choose to ignore this functionality, you can manually assign more than six shapes.
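If you do decide to go past six shapes, scale_shape_manual() is one way to assign them by hand. The sketch below uses the mpg data set that ships with ggplot2, because its class variable has seven categories; the choice of shape codes 0 to 6 is arbitrary.

```r
library(ggplot2)

# mpg$class has seven categories, one more than the default shape palette covers
n_classes <- length(unique(mpg$class))
n_classes

# Supply one plotting-symbol code per category by hand
p_shapes <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

p_shapes
```

Even when every category gets a shape, seven symbols are hard to tell apart, so color or facets are often the more readable choice.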
We could have just as easily mapped "Treatment" to alpha, which adds a gradient to the point visibility by category, or we could map it to size. There are multiple options, so feel free to explore a little with your plots.
Defaults
There are many defaults when generating a plot with ggplot2, but almost everything you see can be customized.
Here we can see:
- Assigned colors
- A legend
- axis titles
- a plot background
- tick marks
The assignment of color, shape, or alpha to our variable occurs automatically, with a unique aesthetic level representing each category (i.e., 'Dexamethasone', 'None') within our variable. Most of what we see on this plot is autogenerated with defaults, and we can change these defaults (for example, which colors are used) by adding additional layers to our code.
How can we modify colors?
Colors are assigned to the fill and color aesthetics in aes(). We can change the default colors by providing an additional layer to our figure. To change the color, we use the scale_color functions: scale_color_manual(), scale_color_brewer(), scale_color_grey(), etc.
Example:
scatter_plot <- ggplot(exdata) +
  geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
                 color=Treatment))
scatter_plot +
  scale_color_manual(values=c("red","black"),
                     labels=c('treated','untreated'))
We can also modify the behavior by adding additional arguments. Here we changed the color labels in the legend using the labels argument.
There are scale functions for other aesthetics (e.g., shape, alpha, linetype) as well.
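In the example above, colors and labels are matched to the categories by position, which follows the alphabetical order of the Treatment values. A named vector makes the pairing explicit. The sketch below assumes a small stand-in data frame, exdata_small, rebuilt from the values in the table shown earlier.

```r
library(ggplot2)

# Minimal stand-in for exdata, using values from the table shown earlier
exdata_small <- data.frame(
  Treatment = rep(c("Dexamethasone", "None"), each = 4),
  Number.of.Transcripts = c(10768, 10051, 11658, 10900,
                            11177, 11526, 11425, 11000),
  Total.Counts = c(18783120, 15144524, 30776089, 21135511,
                   20608402, 25311320, 24411867, 19094104)
)

# Named vectors tie each color and label to a specific category,
# so the result does not depend on the order in which they are listed
p_named <- ggplot(exdata_small) +
  geom_point(aes(x = Number.of.Transcripts, y = Total.Counts,
                 color = Treatment)) +
  scale_color_manual(
    values = c(None = "black", Dexamethasone = "red"),
    labels = c(None = "untreated", Dexamethasone = "treated")
  )

p_named
```

Here the vectors are deliberately listed in the opposite order from the earlier example to show that the names, not the positions, decide which category gets which color.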
More on Colors
There are a number of ways to specify the color argument including by name, number, and hex code. Here is a great resource from the R Graph Gallery for assigning colors in R.
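As a quick illustration of these options, the sketch below checks that a color name and its hex code describe the same color, and then passes the hex code to a geom. It uses mtcars as a stand-in data set (an assumption, so the code runs on its own).

```r
library(ggplot2)

# The same purple written as a name and as a hex code
by_name <- col2rgb("purple")
by_hex  <- col2rgb("#A020F0")
all(by_name == by_hex)

# A number selects an entry from the current palette()
palette()[2]

# Any of these forms can be passed to a ggplot2 color argument
p_hex <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), color = "#A020F0")
p_hex
```

Hex codes are the most portable choice, since numeric colors depend on the palette in use.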
There are also a number of complementary packages in R that expand our color options.
- viridis - provides colorblind-friendly palettes.
- randomcoloR - generates large numbers of random colors.
- paletteer - contains a comprehensive set of color palettes, so palettes from multiple packages can be loaded all at once.
library(viridis)
ggplot(exdata) +
  geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
                 color=Treatment)) +
  scale_color_viridis(discrete=TRUE, option="viridis")
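ggplot2 also ships its own viridis scales, so a similar result is possible without loading the viridis package. The sketch below uses scale_color_viridis_d() for a discrete variable, with mtcars standing in for exdata (an assumption, so the code runs on its own).

```r
library(ggplot2)

# Built-in discrete viridis scale; cyl is converted to a factor so it is
# treated as categorical (mtcars stands in for exdata here)
p_viridis <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(cyl))) +
  scale_color_viridis_d(option = "viridis") +
  labs(color = "Cylinders")

p_viridis
```

For a continuous variable, scale_color_viridis_c() plays the same role.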
Expanding our ggplot2 template
What do we need to make a plot?
- the data
- one or more geoms
- aesthetic mappings
- facets (i.e., subplots)
  - use facet_grid(), facet_wrap()
- optional parameters that customize our plot (e.g., themes, axis settings, legend settings).
- coordinate systems
- statistical transformations.
The first three line items are required, while the others have defaults and only need to be specified when you want to change them.
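Facets are the one item in this list that we have not yet seen in code. As a minimal sketch, the example below splits a scatter plot into one panel per category with facet_wrap(), using mtcars as a stand-in data set (an assumption, so the code runs on its own).

```r
library(ggplot2)

# One panel per number of cylinders (facet_wrap splits on a single variable)
p_facets <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  facet_wrap(~ cyl)

p_facets
```

With the lesson's data, the same idea would be facet_wrap(~ Treatment), giving one panel for treated and one for untreated samples.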
Making our plot ready for publication
How do we ultimately get our figures to a publishable state? The bread and butter of pretty plots really falls to the additional non-data layers of our ggplot2 code. These layers will include code to label the axes, scale the axes, and customize the legends and theme.
For example,
ggplot(exdata) +
  geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
                 fill=Treatment),
             shape=21, size=3) +
  #can change labels of fill levels along with colors
  scale_fill_manual(values=c("purple", "yellow"),
                    labels=c('treated','untreated')) +
  labs(x="Recovered transcripts per sample",
       y="Total sequences per sample", fill="Treatment") +
  #log10-transform the y axis; equivalent to scale_y_continuous(trans="log10"),
  #whose trans argument is deprecated in recent ggplot2 releases
  scale_y_log10() +
  theme_bw() #add a complete black-and-white theme
Saving your plot
The easiest way to save our plot with ggplot2 is ggsave(). This function will save the last plot that you displayed by default. Look at the function parameters using ?ggsave.
ggsave("Plot1.png",width=5.5,height=3.5,units="in",dpi=300)
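Relying on "the last plot displayed" can be fragile in longer scripts, so it is often safer to store the plot in an object and pass it to ggsave() through its plot argument. The sketch below does this with mtcars as a stand-in data set and writes to a temporary folder; both choices are assumptions made so the code runs anywhere.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave() explicitly
p_save <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  theme_bw()

# Hypothetical output location: a temporary folder, to avoid overwriting anything
out_file <- file.path(tempdir(), "Plot1.png")

ggsave(out_file, plot = p_save, width = 5.5, height = 3.5,
       units = "in", dpi = 300)

# Confirm that the file was written
file.exists(out_file)
```

The file format is taken from the file extension, so changing ".png" to ".pdf" or ".tiff" is enough to switch formats.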
Key Points
- ggplot2 is a popular package for data visualization.
- We learned how to create a plot, change plot types, and add layers for further customization.
- The best way to learn ggplot2 is to use ggplot2.
- Use online resources (e.g., Google) to help you build your plot.
- Reuse your code and modify as needed.
- Check out other resources:
- Email us at ncibtep@nih.gov
- General bioinformatics help
- Training requests
Related packages to check out
There are so many different extensions. Here are a few to check out:
- patchwork - combine multiple plots into a single figure
- ggfortify - autoplot functions for quick and easy plotting
- ggpubr - integrate statistical results
- ggExtra - add subplots along the plot margins
Other packages like EnhancedVolcano can be modified using ggplot2 layers.