Stat Transformations: Bar plots, box plots, and histograms

Objectives

Review the grammar of graphics template
Learn about the statistical transformations inherent to geoms
Review data types
Create bar plots, box & whisker plots, and histograms.

Our grammar of graphics template

one or more datasets,

one or more geometric objects that serve as the visual representations of the data, – for instance, points, lines, rectangles, contours,

descriptions of how the variables in the data are mapped to visual properties (aesthetics) of the geometric objects, and an associated scale (e. g., linear, logarithmic, rank),

a facet specification, i.e. the use of multiple similar subplots to look at subsets of the same data,

one or more coordinate systems,

optional parameters that affect the layout and rendering, such text size, font and alignment, legend positions.

statistical summarization rules,

---(Holmes and Huber, 2021)

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>
  ) +
  <FACET_FUNCTION> +
  <COORDINATE SYSTEM> +
  <THEME>

While we added stat_ellipse() to our PCAs, scatter plots do not have a built in statistical transformation. Today, we will talk more about statistical transformations and how these impact our plot representations.

Libraries

In this lesson, we will be using several external R packages. To utilize these, we need to first load them into our R working environment using the library command.

library(ggplot2) #for plotting
library(tidyr) #for data wrangling
library(stringr) ## will use to wrap text in plots

The Data

In this section we will learn how to construct bar plots using data obtained from a study that examined the effect that dietary supplements at various doses have on guinea pig tooth length. This data set is built into R, so if you want to take a look for yourself, use data("ToothGrowth") either in the console or in a script. In this exercise, we will import the data to our R work environment because it is more likely that we will import our own data for analysis.

Here, we are going to import two data sets using read.delim. The file data2.txt contains summary level information for the tooth growth dataset. The file data1.txt is the raw tooth growth data. We will assign data1.txt to object a1 and data2.txt to object a2. Within read.delim we usesep='\t' to indicate that the columns in the data file are tab separated.

a1 <- read.delim("./data/data1.txt", sep='\t')
a2 <- read.delim("./data/data2.txt", sep='\t')
head(a1)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

head(a2)

##   supp dose treat mean_len       sd
## 1   OJ  0.5 OJ0.5    13.23 4.459709
## 2   OJ  1.0   OJ1    22.70 3.910953
## 3   OJ  2.0   OJ2    26.06 2.655058
## 4   VC  0.5 VC0.5     7.98 2.746634
## 5   VC  1.0   VC1    16.77 2.515309
## 6   VC  2.0   VC2    26.14 4.797731

The tooth growth data set measured tooth length for two supplement types (OJ - orange juice, VC - vitamin c) at three different doses (0.5, 1, and 2). Each supplement and dose combination has 10 measurements so we have a total of 60 measurements in this data set. In a2, we pre-computed the mean tooth length and standard deviation for the 10 measurements taken for each supplement and dose combination. a1 contains the raw data.

The column headings (colnames(a1),colnames(a2)) in the raw data (a1) and summary level data (a2) are as follows:

len (tooth length - in raw data)
supp (supplement type)
dose
treat (treatment, which is a concatenation of supp and dose - in summary level data only)
mean_len (mean tooth length - in summary level data only)
sd (standard deviation - in summary level data only)

Variable Types

Before diving into the construction of bar plot, box & whisker plot, and histogram, we should do a quick review of the types of variables that we commonly work with in data analysis.

Categorical variables are qualitative, for instance
patient gender (male or female)
disease status (case or control)
Discrete variables are quantitative and their values are not on a continuous scale, for instance
number of male or female patients
number of controls or disease cases
Continuous variables can take on an infinite number of values
age
height
weight
Independent variable is a variable whose variation does not depend on another (IS THIS NECESSARY????)
Dependent variable is a variable whose variation depends on another (IS THIS NECESSARY????)
Factors are experimental or study variables and can be categorical or discrete. Each factor contains levels. For instance, in a cell culture experiment, drug treatment would be a factor while the type of drugs used (dexamethasone, albuterol) would be the corresponding levels

The `stat` argument

Until this point we have been plotting raw data with geom_point(), but now we will be introducing geoms that transform and plot new values from your data.

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

smoothers fit a model to your data and then plot predictions from the model.

boxplots compute a robust summary of the distribution and then display a specially formatted box.---R4DS

Let's explore the tooth growth data. The tooth growth data has two independent variables, supplement and dose (variables supp and dose, respectively) for which the dependent variable tooth length (len) is measured. We would like to use the scatter plot to learn about tooth growth as a function of both dose and supplement (supp). Remember from Lesson 3 that a scatter plot can be generated using geom_point.

Within the aesthetic mapping in geom_point, we assign dose to the x axis, len (tooth length) to y, and the second independent variable supp (supplement) by assigning it to color. This will give us a scatter plot where dose is plotted along the x axis and len plotted along the y axis. The color code indicating which supplement (supp) each of the points or measurements came from is provided in the legend. In short, the color argument allows us to visualize how the dependent variable changes as a function of two independent variables.

Within geom_point, position=position_dodge is included to separate the points by supplement (supp), otherwise the points for the two supplements will overlap each other. The parameter width in position_dodge allows us to specify how far apart we want the points for the different supplement groups to be.

ggplot(data=a1)+
  geom_point(mapping=aes(x=dose,y=len,color=supp), position=position_dodge(width=0.25))

It seems that tooth length increases as a function of dose. However, we can see that ggplot2 is trying to interpret dose as continuous numeric data, when it is really discrete. Scatter plots are useful for exploring two numeric data sets. Since this data includes categorical and discrete numeric data (rather than continuous), we may want to use alternative plots for exploratory purposes.

Let's explore various aspects of these data using bar plots, box plots, and histograms.

Bar Plot

A barplot is used to display the relationship between a numeric and a categorical variable. --- R Graph Gallery

The tooth growth data can be visualized via a bar plot using geom_bar(), as dose is a discrete variable that can be treated as categorical on the x-axis.

ggplot(data=a1)+
  geom_bar(mapping=aes(x=dose,y=len))

## Error in `f()`:
## ! stat_count() can only have an x or y aesthetic.

We received an error referencing stat_count(), which geom_bar uses by default.

stat = count

geom_bar() uses stat_count() by default: it counts the number of cases at each x position. --- ggplot2 documentation

stat_count() requires mapping for either an x OR a y variable but not both.

What happens if we only provide the mapping argument for x, as suggested by the error message?

Let's take a look at the a bar plot constructed using the default stat_count transformation. Below, we plot the number of tooth length measurements taken at each dose. Setting color="black" allows us to include a black outline to the bars for better readability. In accordance with the description of this data, we see from the plot that 20 measurements were taken at each dose.

ggplot(data=a1)+
  geom_bar(mapping=aes(x=dose), color="black")

So how many of the 20 measurements taken at each dose came from the OJ or VC group? To find out, we can set fill=supp, which allows us to include a second independent variable in our graph. Note: The use of color in aes() outlines the bars rather than fills them.

The plot below tells us that 10 measurements were taken from the OJ group and 10 were taken from the VC group at each dose.

ggplot(data=a1)+
  geom_bar(mapping=aes(x=dose, fill=supp), color="black")

stat_count() is a great way to compare differences in sample sizes between treatment categories or other variables and can be a great way to explore and represent metadata associated with -omics data.

By default, geom_bar stacks bars from different groups. If we do not like the arrangement, we can use position_dodge to arrange the bars from the OJ and VC groups side-by-side.

ggplot(data=a1)+
  geom_bar(mapping=aes(x=dose,fill=supp), color="black", position=position_dodge())

A stacked bar plot including the number of observations by dose and supplement was returned; these are the number of cases in each group (OJ, VC) by dose (0.5, 1.0, 2.0). If we do not want counts of observations by group but rather the raw data, we need to modify the default statistical transformation. In the case of the raw data, we want stat="identity", which will plot the y-values as given.

stat = identity

Above, we learned the number of tooth length measurements taken at each dose and supplement combination using the default stat_count transformation of geom_bar, but what if we want to specify and plot exactly the value in the y axis? This can be done in geom_bar by setting stat="identity".

Below, we are plotting the mean tooth length (mean_len) across each of the treatment groups (treat) using the summary level data, a2. Using stat="identity", the exact y value or mean_len is plotted rather than counts.

ggplot(data=a2)+
  geom_bar(mapping=aes(x=treat,y=mean_len),stat="identity")

This is better. Using stat identity the raw data (without transformation within ggplot2) were plotted. Note: stat="identity" is the default for geom_col, so we could have simply called geom_col rather than geom_bar for the same output.

What if we wanted to look at the raw values from the non-summarized data across treatment groups using a bar plot? We still use geom_bar but the aesthetic mapping will be similar to the scatter plot except we are filling the bars to provide color coding for the supplements using fill. Again, because we want ggplot2 to plot the exact values in the y axis, we specify stat="identity" inside geom_bar. However, to avoid stacking the values, we can use position_dodge2 in geom_bar to visualize each of the 10 measurements taken at each supplement and dose combination arranged side-by-side.

ggplot(data=a1)+
  geom_bar(mapping=aes(x=dose,y=len,fill=supp),stat="identity",position=position_dodge2(),color="black")

Note: that because R interprets dose as numeric continuous variable (class(a2$dose)) ggplot2 gives us an extra dose of 1.5. But, the study did not measure tooth length at a dose of 1.5 for either of the supplements, which is misleading. Thus, we would like to remove this dose.

Using factors

class(a1$dose)

## [1] "numeric"

As it turns out, dose is really an experimental factor, so if we specify factor(dose) it will be interpreted as a categorical or discrete. Before fixing the x axis in the above plot, let's diverge briefly and speak about factors in R.

Using the factor command we see that there are three levels for dose (0.5, 1, and 2)

factor(a1$dose)

##  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1   1   1   1   1   1   1   1   1  
## [20] 1   2   2   2   2   2   2   2   2   2   2   0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [39] 0.5 0.5 1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2  
## [58] 2   2   2  
## Levels: 0.5 1 2

To remove the dose of 1.5, we can set the x axis to factor(dose).

ggplot(data=a1)+
  geom_bar(mapping=aes(x=factor(dose),y=len,fill=supp),stat="identity",position=position_dodge2(),color="black")

We can reorder factors using the factor command. For instance if we want to plot the doses backwards from highest to low on the x axis (ie. 2, 1, 0.5) we can set the x axis in aesthetic mapping of geom_bar to factor(dose,levels=c(2,1,0.5))

ggplot(data=a1)+
  geom_bar(mapping=aes(x=factor(dose, levels=c(2,1,0.5)),y=len,fill=supp),stat="identity",position=position_dodge2(),color="black")

Adding error bars

Some use bar plots to illustrate mean plus / minus standard deviation in the data, so let's take a moment to learn how to incorporate error bars in our data. We will learn to do this using the summary level data (a2) where the mean and standard deviation have been pre-computed and then with the raw data (a1) to take advantage of more statistical transformations.

First, we use the summary level data (a2) to plot the mean tooth length (mean_len) across the treatment groups, remembering to set stat="identity" because we are providing a y axis and position=position_dodge to arrange the bars side-by-side.

ggplot(data=a2, mapping=aes(x=factor(dose),y=mean_len,fill=supp))+
geom_bar(stat="identity", position=position_dodge())

Next, we add geom_errorbar to incorporate error bars that illustrate plus/minus 1 standard deviation from the mean. To set the upper and lower bounds of the error bar, we simply set ymax=mean_len+sd and ymin=mean_len-sd within the aesthetic mapping of geom_errorbar. Setting position=position_dodge in geom_errorbar again allows us to separate the error bars from each of the supplement groups. Within the position_dodge argument we set width=0.9 to help center the error bars with their respective bars. The width parameter within geom_errorbar allows us to adjust the error bar width (set to 0.1 here).

ggplot(data=a2, mapping=aes(x=factor(dose),y=mean_len,fill=supp))+
geom_bar(stat="identity", position=position_dodge())+
geom_errorbar(aes(ymax=mean_len+sd,ymin=mean_len-sd), position=position_dodge(width=0.9), width=0.1)

If you do not have pre-computed means and standard deviations, ggplot2 has a variety of statistical transformations that allow us to incorporate various statistics into a plot. We saw two of the statistical transformations already (ie. count and identity). For a list of statistical transformations available in ggplot2 see https://ggplot2-book.org/layers.html?q=stat#stat. stat="summary", which adds summary level information to plots, can be used to plot this same information.

Histogram

Understanding data distribution can help us decide appropriate downstream steps in analysis such as which statistical test to use. A histogram is a good way to visualize distribution. It divides the data into bins or increments and informs the number of occurrences in each of the bins. Thus, the default statistical transformation for geom_histogram is stat_bin, which bins the data into a user specified integer (default is 30) and then count the occurrences in each bin. In geom_histogram we have the ability to control both the number of bins through the bins argument or binwidth through the binwidth argument. Important to note is that stat_bin only works with continuous variables.

Below we constructed a basic histogram using the len column in a1 (the raw data for the Tooth Growth study). It is not very aesthetically pleasing - there are gaps and it's difficult to see the separation of the bins.

ggplot(data=a1, mapping=aes(x=len))+
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

First, we will use the color argument in geom_histogram to assign a border color to help distinguish the bins. Then we use the fill argument in geom_histogram to change the bars associated with the bins to a color other than gray. Below we have a histogram of tooth length with a default bin of 30.

ggplot(data=a1, mapping=aes(x=len))+
geom_histogram(color="black", fill="cornflowerblue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let's alter the number of bins

ggplot(data=a1, mapping=aes(x=len))+
geom_histogram(color="black", fill="cornflowerblue", bins=7)

From the above, we see that altering the number of bins alters the binwidth (ie. the range in which occurrence are counted). (WHY DO WE CARE???)

(WHAT DOES THE HISTOGRAM TELL US ABOUT THE TOOTH DATA?)

Box plot

Box and whisker plots also show data distribution. Unlike a histogram, we can readily see summary statistics such as median, 25th and 75th percentile, and maximum and minimum.

The default statistical tranformation of a box plot in ggplot2 is stat_boxplot which calculates components of the boxplot. To construct a box plot in ggplot2, we use the geom_boxplot argument. Below, we have a default boxplot depicting tooth length across the treatment groups. Potential outliers are presented as points.

ggplot(data=a1, mapping=aes(x=factor(dose), y=len, color=supp))+
geom_boxplot()

Rather than showing the outliers, we could instead add on geom_point to overlay the data points. Here, in geom_boxplot we set outlier.shape=NA to remove the outliers for the purpose of avoiding duplicating a data point. Within geom_point we set the position of the points to position_jitterdodge to avoid overlapping the points whose values are close together (set by jitter.width) and overlapping of points from measurements derived from different supplements (dodge.width).

ggplot(data=a1, mapping=aes(x=factor(dose), y=len, color=supp))+
geom_boxplot(outlier.shape=NA)+
geom_point(position=position_jitterdodge(jitter.width=0.3,dodge.width=0.8))

(WHAT DOES THIS TELL US ABOUT OUR DATA?)