Data Visualization and Statistical Integration with ggpubr
Learning objectives
- Explore the package ggpubR
- Learn about statistical functions accessible within the ggpubr package.
- Learn about options for visualization including how to include comparison brackets, p-values, and other statistical results
Warning
This tutorial is NOT meant to teach you about statistics. If you are not familiar with the statistical tests included, please look for other content on statistical theory. We have linked some content for your convenience.
ggplot2 is a popular R package for data visualization that uses layers to build high quality plots. There are over 100 packages that extend the functionality of ggplot2. This session of the BTEP Coding Club will focus on the package ggpubr
, which facilitates plot customization and statistical integration, making it much easier to create publication ready plots with ggplot2. Specifically, this lesson will demonstrate how to visualize the results of common statistical tests (e.g., t-tests, Wilcoxon rank sum, ANOVA, Pearson correlation).
Package installation and loading
Both ggplot2 and ggpubr can be installed using install.packages(c("ggplot2", "ggpubr"))
. We will also use some additional packages for this tutorial including rstatix
, which I will describe in more detail later, patchwork
, and dplyr
. patchwork
is a package for arranging multiple plots in a single figure, while dplyr
is a tidyverse package for data wrangling. rstatix
, patchwork
, and dplyr
can also be installed using install.packages()
.
library(ggplot2)
library(ggpubr)
library(rstatix)
library(patchwork)
library(dplyr)
ggplot2 review
If you are interested in ggpubr, you are likely familiar with ggplot2. As a reminder, ggplot2 is a popular R graphics package associated with a family of packages known as the tidyverse. Tidyverse packages work effectively on data stored in data frames (or tibbles), which store variables in columns and observations in rows. In ggplot2, plots are built in layers, allowing one to incorporate multiple data sets and advanced features, thus allowing the generation of fairly complex plots.
To build a plot, we need data to plot, one or more geoms to represent the data, and mapping aesthetics, which describe how the variables are mapped to the geoms.
For example, let's plot the length of odontoblasts (tooth growth cells) in response to dose levels of Vitamin C in mg/day from the built in data set ToothGrowth
.
# Load data
data("ToothGrowth") #obtain the data
df <- ToothGrowth #save to object
a<-ggplot(data= df, aes(x=factor(dose),y=len)) + #call data
geom_boxplot(aes(color=factor(dose))) + #add boxplot
geom_jitter(aes(color=factor(dose),shape=factor(dose))) + #add points
theme_classic()+ #use built-in theme
ggtitle("Toothgrowth (ggplot2)") #add title
a
What is ggpubR?
The strength of ggplot2
is also its weakness. Using a layered approach means that it can often take quite a bit of code to reach a publishable state, even for routine plots. For new users, it is additionally difficult to remember what elements are needed to adjust various plot features, as there are hundreds of functions in the ggplot2
package. ggpubr
is a package designed to address some of these problems. The ‘ggpubr’ package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
For example, we can use ggpubr
to make the ggplot2 plot above with a single function (?ggboxplot
).
b<-ggboxplot(df, x = "dose", y = "len", #add data, x, and y arguments
color = "dose", #set the color aesthetic to a variable
add = "jitter", shape = "dose", #add jitter points
title="Toothgrowth (ggpubr)") #add a title
b
The data
for ggboxplot
should be organized in a data frame; however, we will see in a moment that the organization isn't as strict as ggplot2
, which generally relies on data organized in long format.
What is a boxplot?
A boxplot summarizes the distribution of a numeric variable across one or more groups, making it a convenient tool for quickly grasping differences between those groups. - From Data to Viz.
From the boxplot we can see the first and third quartiles of data in the lower and upper hinges (25th and 75th percentiles). We can also see the median, range of data, and potential outliers.
Adding a jitter to the boxplot also allows us to visualize the distribution of data points, which would otherwise be hidden.
ggpubr includes a number of functions for different visualizations, not just a boxplot. There are functions for visualizing distributions such as density plots, histograms, empirical cumulative density plots, and QQ plots (e.g., ggdensity()
, stat_overlay_normal_density()
, gghistogram()
, ggecdf()
, ggqqplot()
).
For example, let's see how easy it is to make a density plot.
ggdensity(df, x = "len", #adding the data and x variable
add = "mean", rug = TRUE, #including a line for the mean and rug
fill = "supp",color="supp",facet.by="dose",palette = "npg")
What is a density plot?
A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. - from Data to Viz
As you can see, using ggpubr
, we were able to create subplots (facets), a rug plot, and add a summary statistic (the mean). This was done with a single function, ggdensity()
. There are also many options for plot customization, which you can see in the associated help documentation, ?ggdensity()
.
What is a rug plot?
A rug plot displays individual cases along the margins of distributions.
In addition, ggpubr
also includes singular functions for violin plots (ggviolin
), dot plots (ggdotplot
), strip charts (ggstripchart
), bar plots (ggbarplot
), line plots (ggline
), error plots (ggerrorplot
), pie charts (ggpie
), donut charts (ggdonutchart
), Cleveland's Dot Plots (ggdotchart
), and scatter plots. See the ggpubr
package reference for more information.
Statistical Integration
ggpubr
allows us to easily incorporate additional information into our plots (e.g., mean, individual case information, etc.). Where this package really shines is its ability to easily include the results of common statistical tests. Here, I will focus on how this package is useful for displaying the results of common statistical tests such as those used to compare means between groups and those involved in deciphering relationships between variables (e.g., correlation).
Consult a statistician
Bioinformaticians have training in statistics. However, we are not statisticians. I recommend consulting with a statistician for statistics related questions. The Advanced Biomedical Computation Science (ABCS) Group offers no fee statistical support for NCI researchers. If you have a stats question about your specific project, fill out a project request.
This tutorial will use the following data to demo ggpubr
statistical functions.
-
Blood storage data
This data set contains data on 316 men who had undergone radical prostatectomy and received transfusion during or within 30 days of the surgical procedure and had available prostate serum antigen (PSA) follow-up data. The main exposure of interest was RBC storage duration group. A number of demographic, baseline and prognostic factors were also collected. - R package documentation: medicaldata
These data (
?blood_storage
) can be installed withdevtools::install_github("higgi13425/medicaldata")
.library(medicaldata) bdata<-blood_storage glimpse(bdata)
Rows: 316 Columns: 20 $ RBC.Age.Group <dbl> 3, 3, 3, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, 1,… $ Median.RBC.Age <dbl> 25, 25, 25, 15, 15, 25, 25, 10, 10, 15, 15, 10, 10, 1… $ Age <dbl> 72.1, 73.6, 67.5, 65.8, 63.2, 65.4, 65.5, 67.1, 63.9,… $ AA <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,… $ FamHx <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ PVol <dbl> 54.0, 43.2, 102.7, 46.0, 60.0, 45.9, 42.6, 40.7, 45.0… $ TVol <dbl> 3, 3, 1, 1, 2, 2, 2, 3, 2, 2, 1, 3, 2, 2, 2, 2, 1, 2,… $ T.Stage <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, NA, 1… $ bGS <dbl> 3, 2, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1,… $ `BN+` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ OrganConfined <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,… $ PreopPSA <dbl> 14.08, 10.50, 6.98, 4.40, 21.40, 5.10, 6.03, 8.70, 3.… $ PreopTherapy <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,… $ Units <dbl> 6, 2, 1, 2, 3, 1, 2, 4, 1, 2, 2, 2, 2, 4, 2, 4, 5, 4,… $ sGS <dbl> 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 3, 1, 2, 2,… $ AnyAdjTherapy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ AdjRadTherapy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ Recurrence <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ Censor <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ TimeToRecurrence <dbl> 2.67, 47.63, 14.10, 59.47, 1.23, 74.70, 13.87, 8.37, …
-
TCGA data (included within the
ggpubr
package)I will also show how these methods, specifically the Wilcoxon Rank Sum and Pearson Correlation, can be applied to gene expression data from The Caner Genome Atlas (TCGA). TCGA is a massive (~2.5 petabytes) and largely open-source dataset comprised of genomic, epigenomic, transcriptomic, and proteomic data combined with rich clinical information and related metadata from over 11,000 patients representing 33 cancer types. Learn more about TCGA here.
For this, we will use
?gene_expression
; a built in data set in theggpubr
package, which includesgene expression data extracted from TCGA using the 'RTCGA' and 'RTCGA.mRNA' R packages. It contains the mRNA expression for 3 genes - GATA3, PTEN and XBP1- from 3 different datasets: Breast invasive carcinoma (BRCA), Ovarian serous cystadenocarcinoma (OV) and Lung squamous cell carcinoma (LUSC) - gene_expression {ggpubr}
exp<-gene_expression glimpse(exp)
Rows: 1,305 Columns: 5 $ bcr_patient_barcode <chr> "BRCA1", "BRCA2", "BRCA3", "BRCA4", "BRCA5", "BRCA… $ dataset <chr> "BRCA", "BRCA", "BRCA", "BRCA", "BRCA", "BRCA", "B… $ GATA3 <dbl> 2.870500, 2.166250, 1.323500, 1.841625, -6.025250,… $ PTEN <dbl> 1.361357143, 0.428357143, 1.305642857, 0.809642857… $ XBP1 <dbl> 2.9833333, 2.5508333, 3.0204167, 3.1313333, -1.451…
I will also use a small subset of lung adenocarcinoma (TCGA-LUAD) data obtained using the package TCGAbiolinks
to demonstrate features of ggpubr
useful for correlation. These data were prepared outside of this tutorial. Feel free to download the prepared data here.
Methods for Comparing Means
The following include statistical tests commonly employed to compare the means of two groups. Importantly, both parametric and non-parametric tests are highlighted. The assumptions underlying parametric tests (e.g., normally distributed data, homogeneity of variance, independence, etc.) are not the same as non-parametric tests. Keep in mind that if the assumptions of an underlying statistical test are not met, the results are likely to be inaccurate. In general, non-parametric tests are more applicable to high throughput data, which rarely meets the assumptions of parametric tests.
Methods | R function | Description |
---|---|---|
T-test | t.test() | Compare two groups (parametric) |
Wilcoxon test | wilcox.test() | Compare two groups (non-parametric) |
ANOVA | aov() or anova() | Compare multiple groups (parametric) |
Kruskal-Wallis | kruskal.test() | Compare multiple groups (non-parametric) |
This table was taken from a tutorial entitled Add P-values and Significance Levels to ggplots.
Note
The T-test or Wilcoxon test can be paired or unpaired. Paired tests are for dependent data.
Functions available in ggpubr
for comparing means
Interested in more tutorials, check out this course on comparing means in R from DATANOVIA.
Using compare_means()
The main function in ggpubr
to compare means between two or more groups is compare_means()
. This function includes methods for each of the tests listed in the above table.
Let's see how this function works. Let's consider our blood storage data set. In order to determine whether blood storage time (young, middle, old) impacted the recurrence of prostate cancer following prostatectomy with perioperative transfusion, the authors first looked at variables that could be univariably significant with different exposure groups (blood storage groups). Variables that were significant could then be accounted for in survival models.
Let's see if patient age significantly differed between the three blood storage groups using an Analysis of Variance test (ANOVA). (Again, don't forget to check your assumptions).
Assumptions Testing
Here is a quick tutorial on checking the assumptions of parametric tests.
#Shapiro-Wilk test is one way to check for normality
shapiro.test(bdata$Age)
Shapiro-Wilk normality test
data: bdata$Age
W = 0.99183, p-value = 0.07886
#Levene's test can be used to check for homogeneity of variance.
car::leveneTest(Age~factor(RBC.Age.Group),center="mean",bdata)
Levene's Test for Homogeneity of Variance (center = "mean")
Df F value Pr(>F)
group 2 0.4671 0.6272
313
compare_means(Age~RBC.Age.Group,bdata,method="anova")
# A tibble: 1 × 6
.y. p p.adj p.format p.signif method
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Age 0.448 0.45 0.45 ns Anova
compare_means()
Usage:
compare_means(
formula
, # This should be in the format numeric variable ~ factor (group)
data
, # The data to be used for the test (a data frame)
method
= "wilcox.test", # The method to be used (Wilcoxon rank sum, T-test, ANOVA, Kruskal-Wallis).
paired
= FALSE, # Are your data paired?
group.by
= NULL, # Include grouping variables
ref.group
= NULL, # Can assign a reference group to compare all other groups to.
symnum.args = list()
, # See package documentation
p.adjust.method
= "holm", # Can specify the p-value correction method for multiple comparisons (e.g., Holm, Hochberg, Hommel, Bonferroni, FDR, etc.)
...
)
From our results, we can see that there is not a significant difference in patient ages by blood storage group. Notice, this is not nearly as descriptive as using aov()
, and this can only perform a one-way ANOVA. However, it is a convenient function and plays nicely with ggpubr
graphing options, which we will look at in more detail.
What about the number of allogeneic units? Did the number of units a patient received significantly differ between the exposure groups? Here we can use a Kruskal-Wallis by changing the method
argument (method="kruskal.test"
).
compare_means(Units~RBC.Age.Group,bdata,method="kruskal.test")
# A tibble: 1 × 6
.y. p p.adj p.format p.signif method
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Units 0.0199 0.02 0.02 * Kruskal-Wallis
While an ANOVA or Kruskal-Wallis can tell us there are significant differences between our groups, pairwise testing can then tell us which groups are different. With compare_means()
we can use a T-test or Wilcoxon Rank Sum as a post-hoc for pairwise comparisons.
For example,
wtest<-compare_means(Units~RBC.Age.Group,bdata)
wtest
# A tibble: 3 × 8
.y. group1 group2 p p.adj p.format p.signif method
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 Units 3 2 0.0762 0.15 0.0762 ns Wilcoxon
2 Units 3 1 0.00718 0.022 0.0072 ** Wilcoxon
3 Units 2 1 0.276 0.28 0.2755 ns Wilcoxon
Here, we are making use of default arguments. See ?compare_means()
.
Integrate these results into our visualizations
If we are just interested in quickly adding the p-value of our Kruskal-Wallis, we can use stat_compare_means()
. You do not need to run compare_means()
prior to stat_compare_means()
.
ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
color = "RBC.Age.Group",
palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
stat_compare_means(label.y = 28)
However, if we want to include multiple comparisons and report p-values adjusted for multiple comparisons, we will need to use compare_means()
or related function (See the section on rstatix
).
ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
color = "RBC.Age.Group",
palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
stat_compare_means(label.y = 28,) +
stat_pvalue_manual(
wtest, y.position=c(26,22,15),
label = "p.adj"
)
hide.ns
If you have a bunch of non-significant comparisons, hide.ns
will remove these from the results.
R redundancy
There are many ways to reach the same or similar outcome in R, even with ggpubr
. For example, we could have used the following:
ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
color = "RBC.Age.Group",
palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
stat_compare_means(label.y = 28,) +
stat_pwc( method = "wilcox_test", label = "p.adj.format")
If you want more information on your plot, try one of the functions specific to your statistical test (e.g., stat_kruskal_test()
).
ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
color = "RBC.Age.Group",
palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
stat_kruskal_test(label = "as_detailed_italic", label.y=17)
TCGA data example
Let's look at a more complicated example using the gene_expression
data. We can use ggpubr
to explore genes of interest. However, in cases where comparisons are more complicated and faceting is used, this can get even more complicated.
Data Reshaping
In our exp
data frame, the results of each gene are in their own column. With ggplot2
, these data would need to be reshaped for effective plotting. However, with ggpubr
, we can include multiple arguments in the y
argument. For example, y = c("GATA3", "PTEN", "XBP1")
. Here, we need to also specify how we would like to merge the results. If we want these variable in the same plotting area, we use merge=TRUE
; alternatively, if we want subplots or facets, we use combine=FALSE
.
Because we are working with multiple comparisons, facets, and adjusted p-values, we can use geom_pwc()
or stat_pws()
. These functions work with box plots, dot plots, and stripcharts.
compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = exp,
p.adjust.method="bonferroni")
# A tibble: 9 × 8
.y. group1 group2 p p.adj p.format p.signif method
<fct> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 GATA3 BRCA OV 1.11e-177 3.30e-177 < 2e-16 **** Wilcoxon
2 GATA3 BRCA LUSC 6.68e- 73 2 e- 72 < 2e-16 **** Wilcoxon
3 GATA3 OV LUSC 2.97e- 8 8.90e- 8 3.0e-08 **** Wilcoxon
4 PTEN BRCA OV 6.79e- 5 2 e- 4 6.8e-05 **** Wilcoxon
5 PTEN BRCA LUSC 1.04e- 16 3.10e- 16 < 2e-16 **** Wilcoxon
6 PTEN OV LUSC 1.28e- 7 3.8 e- 7 1.3e-07 **** Wilcoxon
7 XBP1 BRCA OV 2.55e-123 7.70e-123 < 2e-16 **** Wilcoxon
8 XBP1 BRCA LUSC 1.95e- 42 5.90e- 42 < 2e-16 **** Wilcoxon
9 XBP1 OV LUSC 4.24e- 11 1.3 e- 10 4.2e-11 **** Wilcoxon
#This does not use output from line above
ggboxplot(exp, x = "dataset",
y = c("GATA3", "PTEN", "XBP1"), # multiple y arguments
combine = TRUE, # adding facets
ylab = "Expression",
color = "dataset",
palette = "jco") +
geom_pwc( #geom for pairwise comparisons
aes(group = dataset),
method = "wilcox_test", label = "p.adj.format",
p.adjust.method = "bonferroni") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
scale_y_continuous()
If you want to create more space at the top of the plot for the p-values, use scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
. mult = c(0, 0.1)
adds 0% space at the bottom of the plot and 10% of space at the top of the plot. See here.
Wilcoxon rank-sum test and differential expression
Evidence suggests that the Wilcoxon rank-sum test can be used for differential expression testing in large sample size RNA-seq data sets. However, usually the data would be filtered and normalized prior to such testing.
If you are interested in using ggpubr to explore TCGA data, you may be interested in the following tutorials from Alboukadel Kassambara, the author of ggpubr:
- Facilitating Exploratory Data Visualization: Application to TCGA Genomic Data
- GGPUBR: How to Add Adjusted P-values to a Multi-Panel GGPlot
Using rstatix
What is rstatix?
rstatix
is a package created by Alboukadel Kassambara, the author of ggpubr
. The rstatix
package includes tidyverse friendly functions for performing various statistical tests. Because functions from rstatix
are tidyverse compatible, they play well with ggpubr
and ggplot2
. The number of stats functions in rstatix
far exceeds those available in ggpubr
, so this is a great package to look for additional or alternative statistical tests.
For example, if we want to perform Dunn's test for pairwise multiple comparisons as a posthoc to Kruskal-Wallis, we can use rstatix::dunn_test()
.
dunn<-bdata %>%
dunn_test(Units ~ RBC.Age.Group, p.adjust.method = "bonferroni") %>%
add_xy_position(x="RBC.Age.Group")
ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
color = "RBC.Age.Group",
palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
stat_pvalue_manual(dunn, label = "p = {scales::pvalue(p.adj)}")
label = "p = {scales::pvalue(p.adj)}"
This line of code allows us to round our adjusted p-values; you could also use round
.
Additionally, using rstatix
can (sort of) simplify some of the code that we used previously. For example, let's return to the TCGA output above. Instead of using compare_means
, we can use the function wilcox_test()
directly. With the help of rstatix::add_xy_position()
, the results are ready to add to our box plot, negating the use of geom_pwc()
.
rexp<- exp %>% tidyr::pivot_longer(cols=3:5, names_to="Gene",
values_to="Expression")
w<- rexp %>% group_by(Gene) %>%
wilcox_test(data=.,Expression~dataset,
p.adjust.method="bonferroni") %>%
add_xy_position(x = "dataset")
ggboxplot(data=rexp, x = "dataset",
y = "Expression",
facet.by = "Gene",
ylab = "Expression",
color = "dataset",
palette = "jco") +
stat_pvalue_manual(w, label = "p.adj.signif", tip.length=0.01,
label.size=3,vjust=0.2) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
Methods for Correlation
Correlation analysis refers to a group of statistical methods used to discover the relationship between two or more variables of interest. Common methods of correlation are included in the table below.
Methods | R function | Description |
---|---|---|
Pearson correlation | cor(); cor.test() | measures a linear dependence between two variables (parametric) |
Kendall tau | cor(); cor.test() | rank-based correlation coefficient (non-parametric) |
Spearman rho | cor(); cor.test() | rank-based correlation coefficient (non-parametric) |
Here is a nice post on Stack Exchange outlining some of the differences between these methods.
It can be fairly cumbersome to add correlation results and regression model output to scatter plots generated with ggplot2. ggpubr
simplifies this process with the following functions:
Let's see an example using the dataset TCGA-LUAD.
TCGA correlation example
Scatter plots are useful to visually inspect similarity (correlation) between variables. For example, we can see how closely two replicates from the TCGA lung adenocarcinoma data set correlate.
luad<-readr::read_csv("luad_example.csv")
#just correlation coeeficient
a<-ggscatter(luad, x = "Sample1A", y = "Sample1A2",color = "black",
shape = 21, size = 3, add = "reg.line",
add.params = list(color = "blue", fill = "lightgray"),
cor.coef = TRUE,
fullrange= TRUE,
cor.coeff.args = list(method = "pearson", label.x = 3)
) +
coord_fixed(ratio=1,xlim=c(3,20),ylim=c(3,20)) # add fixed coord
# to add r-square
b<-ggscatter(luad, x = "Sample1A", y = "Sample1A2",color = "black",
shape = 21, size = 3, add = "reg.line",
add.params = list(color = "blue", fill = "lightgray")) +
stat_cor(aes(label =
paste(..rr.label.., ..p.label.., sep = '~","~')),
label.x = 3) +
coord_fixed(ratio=1,xlim=c(3,20),ylim=c(3,20))
a+b
Warning: The dot-dot notation (`..rr.label..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(rr.label)` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
With ggpubr::ggscatter()
we can plot our data and add correlation coefficients and regression lines. To add the r-square value, you can use stat_corr()
directly. Since the axes of these plots are on the same scale, we can make sure the physical space of the plot is represented the same across axes using coord_fixed()
from ggplot2
.
Additionally, if you want the regression line equation, you can use stat_regline_equation()
.
ggpubr functions for genomics
Lastly, because this is a bioinformatics coding club, ggpubr
does include a function to create an MA-plot (ggmaplot()
). An MA plot is a common output from differential expression results. These plots show the log fold-change (base-2) on the y-axis versus normalized mean expression between two conditions.
Let's exemplify this using differential expression data included in the ggpubr
package, which were obtained from comparing the RNAseq data of two different cell populations using DESeq2.
data("diff_express")
dexp<-diff_express
head(dexp)
name baseMean log2FoldChange padj
ENSG00000000003 TSPAN6 0.1184475 0.0000000 NA
ENSG00000000419 DPM1 1654.4618144 0.6789538 5.280802e-02
ENSG00000000457 SCYL3 681.0463277 1.5263838 3.915112e-07
ENSG00000000460 C1orf112 389.7226640 3.8933573 1.180333e-14
ENSG00000000938 FGR 364.7810090 -2.3554014 1.559228e-04
ENSG00000000971 CFH 1.1346239 1.2932740 4.491812e-01
detection_call
ENSG00000000003 0
ENSG00000000419 1
ENSG00000000457 1
ENSG00000000460 1
ENSG00000000938 1
ENSG00000000971 0
ggmaplot(diff_express,
fdr = 0.01, fc = 2, size = 0.4,
genenames = as.vector(diff_express$name),
legend = "top", top = 20,
font.label = c("bold", 11),
font.legend = "bold",
font.main = "bold",
ggtheme = ggplot2::theme_minimal())
Arguments:
data
can be an object of class DESeqResults, get_diff, DE_Results, matrix or data frame containing the columns baseMean (or baseMeanLog2), log2FoldChange, and padj. Rows are genes.fdr
is the selected false discovery rate at which we will consider genes differentially expressed.fc
is the fold change threshold; only genes greater than or equal to this threshold with a significance less than or equal tofdr
will be considered significantly differentially expressed.top
is the number of genes to label on the plot. Top genes is decided by sorted adjusted p-values and fold change.
label.select
You can control which genes are labeled using the argument (label.select
). See the example from the help documentation for the function (?ggmaplot
).
Additional tutorials GGpubr
If you are interested in ggpubr
, you may find the following tutorials useful:
- http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/77-facilitating-exploratory-data-visualization-application-to-tcga-genomic-data/.
- https://malucalle.github.io/statistical-pills/nice-plots-with-r.html.
- https://documentation1458.rssing.com/chan-49683810/all_p8.html.
- https://biodash.github.io/codeclub/s02e10_ggpubr/
- https://rpubs.com/jhossain/901116
- http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/78-perfect-scatter-plots-with-correlation-and-marginal-histograms/