Data Visualization and Statistical Integration with ggpubr

Learning objectives

Explore the package ggpubR
Learn about statistical functions accessible within the ggpubr package.
Learn about options for visualization including how to include comparison brackets, p-values, and other statistical results

Warning

This tutorial is NOT meant to teach you about statistics. If you are not familiar with the statistical tests included, please look for other content on statistical theory. We have linked some content for your convenience.

ggplot2 is a popular R package for data visualization that uses layers to build high quality plots. There are over 100 packages that extend the functionality of ggplot2. This session of the BTEP Coding Club will focus on the package ggpubr, which facilitates plot customization and statistical integration, making it much easier to create publication ready plots with ggplot2. Specifically, this lesson will demonstrate how to visualize the results of common statistical tests (e.g., t-tests, Wilcoxon rank sum, ANOVA, Pearson correlation).

Package installation and loading

Both ggplot2 and ggpubr can be installed using install.packages(c("ggplot2", "ggpubr")). We will also use some additional packages for this tutorial including rstatix, which I will describe in more detail later, patchwork, and dplyr. patchwork is a package for arranging multiple plots in a single figure, while dplyr is a tidyverse package for data wrangling. rstatix, patchwork, and dplyr can also be installed using install.packages().

library(ggplot2)
library(ggpubr)
library(rstatix)
library(patchwork)
library(dplyr)

ggplot2 review

If you are interested in ggpubr, you are likely familiar with ggplot2. As a reminder, ggplot2 is a popular R graphics package associated with a family of packages known as the tidyverse. Tidyverse packages work effectively on data stored in data frames (or tibbles), which store variables in columns and observations in rows. In ggplot2, plots are built in layers, allowing one to incorporate multiple data sets and advanced features, thus allowing the generation of fairly complex plots.

To build a plot, we need data to plot, one or more geoms to represent the data, and mapping aesthetics, which describe how the variables are mapped to the geoms.

For example, let's plot the length of odontoblasts (tooth growth cells) in response to dose levels of Vitamin C in mg/day from the built in data set ToothGrowth.

# Load data
data("ToothGrowth") #obtain the data
df <- ToothGrowth #save to object

a<-ggplot(data= df, aes(x=factor(dose),y=len)) + #call data
  geom_boxplot(aes(color=factor(dose))) + #add boxplot
  geom_jitter(aes(color=factor(dose),shape=factor(dose))) + #add points
  theme_classic()+ #use built-in theme
  ggtitle("Toothgrowth (ggplot2)") #add title

a

What is ggpubR?

The strength of ggplot2 is also its weakness. Using a layered approach means that it can often take quite a bit of code to reach a publishable state, even for routine plots. For new users, it is additionally difficult to remember what elements are needed to adjust various plot features, as there are hundreds of functions in the ggplot2 package. ggpubr is a package designed to address some of these problems. The ‘ggpubr’ package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.

For example, we can use ggpubr to make the ggplot2 plot above with a single function (?ggboxplot).

b<-ggboxplot(df, x = "dose", y = "len", #add data, x, and y arguments
             color = "dose", #set the color aesthetic to a variable
             add = "jitter", shape = "dose", #add jitter points
             title="Toothgrowth (ggpubr)")  #add a title
b

The data for ggboxplot should be organized in a data frame; however, we will see in a moment that the organization isn't as strict as ggplot2, which generally relies on data organized in long format.

What is a boxplot?

A boxplot summarizes the distribution of a numeric variable across one or more groups, making it a convenient tool for quickly grasping differences between those groups. - From Data to Viz.

From the boxplot we can see the first and third quartiles of data in the lower and upper hinges (25th and 75th percentiles). We can also see the median, range of data, and potential outliers.

Adding a jitter to the boxplot also allows us to visualize the distribution of data points, which would otherwise be hidden.

ggpubr includes a number of functions for different visualizations, not just a boxplot. There are functions for visualizing distributions such as density plots, histograms, empirical cumulative density plots, and QQ plots (e.g., ggdensity(), stat_overlay_normal_density(), gghistogram(), ggecdf(), ggqqplot()).

For example, let's see how easy it is to make a density plot.

ggdensity(df, x = "len", #adding the data and x variable
   add = "mean", rug = TRUE, #including a line for the mean and rug 
   fill = "supp",color="supp",facet.by="dose",palette = "npg")

What is a density plot?

A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. - from Data to Viz

As you can see, using ggpubr, we were able to create subplots (facets), a rug plot, and add a summary statistic (the mean). This was done with a single function, ggdensity(). There are also many options for plot customization, which you can see in the associated help documentation, ?ggdensity().

What is a rug plot?

A rug plot displays individual cases along the margins of distributions.

In addition, ggpubr also includes singular functions for violin plots (ggviolin), dot plots (ggdotplot), strip charts (ggstripchart), bar plots (ggbarplot), line plots (ggline), error plots (ggerrorplot), pie charts (ggpie), donut charts (ggdonutchart), Cleveland's Dot Plots (ggdotchart), and scatter plots. See the ggpubr package reference for more information.

Statistical Integration

ggpubr allows us to easily incorporate additional information into our plots (e.g., mean, individual case information, etc.). Where this package really shines is its ability to easily include the results of common statistical tests. Here, I will focus on how this package is useful for displaying the results of common statistical tests such as those used to compare means between groups and those involved in deciphering relationships between variables (e.g., correlation).

Consult a statistician

Bioinformaticians have training in statistics. However, we are not statisticians. I recommend consulting with a statistician for statistics related questions. The Advanced Biomedical Computation Science (ABCS) Group offers no fee statistical support for NCI researchers. If you have a stats question about your specific project, fill out a project request.

This tutorial will use the following data to demo ggpubr statistical functions.

Blood storage data

This data set contains data on 316 men who had undergone radical prostatectomy and received transfusion during or within 30 days of the surgical procedure and had available prostate serum antigen (PSA) follow-up data. The main exposure of interest was RBC storage duration group. A number of demographic, baseline and prognostic factors were also collected. - R package documentation: medicaldata

Data citation:
Cata JP, Klein EA, Hoeltge GA, Dalton JE, Mascha E, O'Hara J, Russell A, Kurz A, Ben-Elihayhu S, Sessler DI. Blood storage duration and biochemical recurrence of cancer after radical prostatectomy. Mayo Clin Proc. 2011 Feb;86(2):120-7. doi: 10.4065/mcp.2010.0313. Erratum in: Mayo Clin Proc. 2011 Apr;86(4):364. PMID: 21282486; PMCID: PMC3031436.

These data (?blood_storage) can be installed with devtools::install_github("higgi13425/medicaldata").

library(medicaldata)
bdata<-blood_storage
glimpse(bdata)

Rows: 316
Columns: 20
$ RBC.Age.Group    <dbl> 3, 3, 3, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, 1,…
$ Median.RBC.Age   <dbl> 25, 25, 25, 15, 15, 25, 25, 10, 10, 15, 15, 10, 10, 1…
$ Age              <dbl> 72.1, 73.6, 67.5, 65.8, 63.2, 65.4, 65.5, 67.1, 63.9,…
$ AA               <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,…
$ FamHx            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ PVol             <dbl> 54.0, 43.2, 102.7, 46.0, 60.0, 45.9, 42.6, 40.7, 45.0…
$ TVol             <dbl> 3, 3, 1, 1, 2, 2, 2, 3, 2, 2, 1, 3, 2, 2, 2, 2, 1, 2,…
$ T.Stage          <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, NA, 1…
$ bGS              <dbl> 3, 2, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1,…
$ `BN+`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ OrganConfined    <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
$ PreopPSA         <dbl> 14.08, 10.50, 6.98, 4.40, 21.40, 5.10, 6.03, 8.70, 3.…
$ PreopTherapy     <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,…
$ Units            <dbl> 6, 2, 1, 2, 3, 1, 2, 4, 1, 2, 2, 2, 2, 4, 2, 4, 5, 4,…
$ sGS              <dbl> 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 3, 1, 2, 2,…
$ AnyAdjTherapy    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ AdjRadTherapy    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Recurrence       <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Censor           <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ TimeToRecurrence <dbl> 2.67, 47.63, 14.10, 59.47, 1.23, 74.70, 13.87, 8.37, …

TCGA data (included within the ggpubr package)

I will also show how these methods, specifically the Wilcoxon Rank Sum and Pearson Correlation, can be applied to gene expression data from The Caner Genome Atlas (TCGA). TCGA is a massive (~2.5 petabytes) and largely open-source dataset comprised of genomic, epigenomic, transcriptomic, and proteomic data combined with rich clinical information and related metadata from over 11,000 patients representing 33 cancer types. Learn more about TCGA here.

For this, we will use ?gene_expression; a built in data set in the ggpubr package, which includes

gene expression data extracted from TCGA using the 'RTCGA' and 'RTCGA.mRNA' R packages. It contains the mRNA expression for 3 genes - GATA3, PTEN and XBP1- from 3 different datasets: Breast invasive carcinoma (BRCA), Ovarian serous cystadenocarcinoma (OV) and Lung squamous cell carcinoma (LUSC) - gene_expression {ggpubr}
```
exp<-gene_expression
glimpse(exp)
```
```
Rows: 1,305
Columns: 5
$ bcr_patient_barcode <chr> "BRCA1", "BRCA2", "BRCA3", "BRCA4", "BRCA5", "BRCA…
$ dataset             <chr> "BRCA", "BRCA", "BRCA", "BRCA", "BRCA", "BRCA", "B…
$ GATA3               <dbl> 2.870500, 2.166250, 1.323500, 1.841625, -6.025250,…
$ PTEN                <dbl> 1.361357143, 0.428357143, 1.305642857, 0.809642857…
$ XBP1                <dbl> 2.9833333, 2.5508333, 3.0204167, 3.1313333, -1.451…
```

I will also use a small subset of lung adenocarcinoma (TCGA-LUAD) data obtained using the package TCGAbiolinks to demonstrate features of ggpubr useful for correlation. These data were prepared outside of this tutorial. Feel free to download the prepared data here.

Methods for Comparing Means

The following include statistical tests commonly employed to compare the means of two groups. Importantly, both parametric and non-parametric tests are highlighted. The assumptions underlying parametric tests (e.g., normally distributed data, homogeneity of variance, independence, etc.) are not the same as non-parametric tests. Keep in mind that if the assumptions of an underlying statistical test are not met, the results are likely to be inaccurate. In general, non-parametric tests are more applicable to high throughput data, which rarely meets the assumptions of parametric tests.

Methods	R function	Description
T-test	t.test()	Compare two groups (parametric)
Wilcoxon test	wilcox.test()	Compare two groups (non-parametric)
ANOVA	aov() or anova()	Compare multiple groups (parametric)
Kruskal-Wallis	kruskal.test()	Compare multiple groups (non-parametric)

This table was taken from a tutorial entitled Add P-values and Significance Levels to ggplots.

Note

The T-test or Wilcoxon test can be paired or unpaired. Paired tests are for dependent data.

Functions available in `ggpubr` for comparing means

"Comparing Means and Adding p-values" — Image from the ggpubr function reference page.

Interested in more tutorials, check out this course on comparing means in R from DATANOVIA.

Using `compare_means()`

The main function in ggpubr to compare means between two or more groups is compare_means(). This function includes methods for each of the tests listed in the above table.

Let's see how this function works. Let's consider our blood storage data set. In order to determine whether blood storage time (young, middle, old) impacted the recurrence of prostate cancer following prostatectomy with perioperative transfusion, the authors first looked at variables that could be univariably significant with different exposure groups (blood storage groups). Variables that were significant could then be accounted for in survival models.

Let's see if patient age significantly differed between the three blood storage groups using an Analysis of Variance test (ANOVA). (Again, don't forget to check your assumptions).

Assumptions Testing

Here is a quick tutorial on checking the assumptions of parametric tests.

#Shapiro-Wilk test is one way to check for normality
shapiro.test(bdata$Age)

    Shapiro-Wilk normality test

data:  bdata$Age
W = 0.99183, p-value = 0.07886

#Levene's test can be used to check for homogeneity of variance.
car::leveneTest(Age~factor(RBC.Age.Group),center="mean",bdata)

Levene's Test for Homogeneity of Variance (center = "mean")
       Df F value Pr(>F)
group   2  0.4671 0.6272
      313

compare_means(Age~RBC.Age.Group,bdata,method="anova")

# A tibble: 1 × 6
  .y.       p p.adj p.format p.signif method
  <chr> <dbl> <dbl> <chr>    <chr>    <chr> 
1 Age   0.448  0.45 0.45     ns       Anova

compare_means() Usage:

compare_means(
formula, # This should be in the format numeric variable ~ factor (group)
data, # The data to be used for the test (a data frame)
method = "wilcox.test", # The method to be used (Wilcoxon rank sum, T-test, ANOVA, Kruskal-Wallis).
paired = FALSE, # Are your data paired?
group.by = NULL, # Include grouping variables
ref.group = NULL, # Can assign a reference group to compare all other groups to.
symnum.args = list(), # See package documentation
p.adjust.method = "holm", # Can specify the p-value correction method for multiple comparisons (e.g., Holm, Hochberg, Hommel, Bonferroni, FDR, etc.)
...
)

From our results, we can see that there is not a significant difference in patient ages by blood storage group. Notice, this is not nearly as descriptive as using aov(), and this can only perform a one-way ANOVA. However, it is a convenient function and plays nicely with ggpubr graphing options, which we will look at in more detail.

What about the number of allogeneic units? Did the number of units a patient received significantly differ between the exposure groups? Here we can use a Kruskal-Wallis by changing the method argument (method="kruskal.test").

compare_means(Units~RBC.Age.Group,bdata,method="kruskal.test")

# A tibble: 1 × 6
  .y.        p p.adj p.format p.signif method        
  <chr>  <dbl> <dbl> <chr>    <chr>    <chr>         
1 Units 0.0199  0.02 0.02     *        Kruskal-Wallis

While an ANOVA or Kruskal-Wallis can tell us there are significant differences between our groups, pairwise testing can then tell us which groups are different. With compare_means() we can use a T-test or Wilcoxon Rank Sum as a post-hoc for pairwise comparisons.

For example,

wtest<-compare_means(Units~RBC.Age.Group,bdata)
wtest

# A tibble: 3 × 8
  .y.   group1 group2       p p.adj p.format p.signif method  
  <chr> <chr>  <chr>    <dbl> <dbl> <chr>    <chr>    <chr>   
1 Units 3      2      0.0762  0.15  0.0762   ns       Wilcoxon
2 Units 3      1      0.00718 0.022 0.0072   **       Wilcoxon
3 Units 2      1      0.276   0.28  0.2755   ns       Wilcoxon

Here, we are making use of default arguments. See ?compare_means().

Integrate these results into our visualizations

If we are just interested in quickly adding the p-value of our Kruskal-Wallis, we can use stat_compare_means(). You do not need to run compare_means() prior to stat_compare_means().

ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
          color = "RBC.Age.Group",
          palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
  stat_compare_means(label.y = 28)

However, if we want to include multiple comparisons and report p-values adjusted for multiple comparisons, we will need to use compare_means() or related function (See the section on rstatix).

ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
          color = "RBC.Age.Group", 
          palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
  stat_compare_means(label.y = 28,) +
  stat_pvalue_manual(
  wtest, y.position=c(26,22,15),
  label = "p.adj"
)

hide.ns

If you have a bunch of non-significant comparisons, hide.ns will remove these from the results.

R redundancy

There are many ways to reach the same or similar outcome in R, even with ggpubr. For example, we could have used the following:

    ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
          color = "RBC.Age.Group", 
          palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
  stat_compare_means(label.y = 28,) +
  stat_pwc( method = "wilcox_test", label = "p.adj.format")

If you want more information on your plot, try one of the functions specific to your statistical test (e.g., stat_kruskal_test()).

ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
          color = "RBC.Age.Group",
          palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
  stat_kruskal_test(label = "as_detailed_italic", label.y=17)

TCGA data example

Let's look at a more complicated example using the gene_expression data. We can use ggpubr to explore genes of interest. However, in cases where comparisons are more complicated and faceting is used, this can get even more complicated.

Data Reshaping

In our exp data frame, the results of each gene are in their own column. With ggplot2, these data would need to be reshaped for effective plotting. However, with ggpubr, we can include multiple arguments in the y argument. For example, y = c("GATA3", "PTEN", "XBP1"). Here, we need to also specify how we would like to merge the results. If we want these variable in the same plotting area, we use merge=TRUE; alternatively, if we want subplots or facets, we use combine=FALSE.

Because we are working with multiple comparisons, facets, and adjusted p-values, we can use geom_pwc() or stat_pws(). These functions work with box plots, dot plots, and stripcharts.

compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = exp, 
              p.adjust.method="bonferroni")

# A tibble: 9 × 8
  .y.   group1 group2         p     p.adj p.format p.signif method  
  <fct> <chr>  <chr>      <dbl>     <dbl> <chr>    <chr>    <chr>   
1 GATA3 BRCA   OV     1.11e-177 3.30e-177 < 2e-16  ****     Wilcoxon
2 GATA3 BRCA   LUSC   6.68e- 73 2   e- 72 < 2e-16  ****     Wilcoxon
3 GATA3 OV     LUSC   2.97e-  8 8.90e-  8 3.0e-08  ****     Wilcoxon
4 PTEN  BRCA   OV     6.79e-  5 2   e-  4 6.8e-05  ****     Wilcoxon
5 PTEN  BRCA   LUSC   1.04e- 16 3.10e- 16 < 2e-16  ****     Wilcoxon
6 PTEN  OV     LUSC   1.28e-  7 3.8 e-  7 1.3e-07  ****     Wilcoxon
7 XBP1  BRCA   OV     2.55e-123 7.70e-123 < 2e-16  ****     Wilcoxon
8 XBP1  BRCA   LUSC   1.95e- 42 5.90e- 42 < 2e-16  ****     Wilcoxon
9 XBP1  OV     LUSC   4.24e- 11 1.3 e- 10 4.2e-11  ****     Wilcoxon

#This does not use output from line above
ggboxplot(exp, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"), # multiple y arguments
          combine = TRUE, # adding facets
          ylab = "Expression",
          color = "dataset",
          palette = "jco") +
  geom_pwc( #geom for pairwise comparisons 
    aes(group = dataset), 
    method = "wilcox_test", label = "p.adj.format",
    p.adjust.method = "bonferroni") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

scale_y_continuous()

If you want to create more space at the top of the plot for the p-values, use scale_y_continuous(expand = expansion(mult = c(0, 0.1))). mult = c(0, 0.1) adds 0% space at the bottom of the plot and 10% of space at the top of the plot. See here.

Wilcoxon rank-sum test and differential expression

Evidence suggests that the Wilcoxon rank-sum test can be used for differential expression testing in large sample size RNA-seq data sets. However, usually the data would be filtered and normalized prior to such testing.

If you are interested in using ggpubr to explore TCGA data, you may be interested in the following tutorials from Alboukadel Kassambara, the author of ggpubr:

Using rstatix

What is rstatix?

rstatix is a package created by Alboukadel Kassambara, the author of ggpubr. The rstatix package includes tidyverse friendly functions for performing various statistical tests. Because functions from rstatix are tidyverse compatible, they play well with ggpubr and ggplot2. The number of stats functions in rstatix far exceeds those available in ggpubr, so this is a great package to look for additional or alternative statistical tests.

For example, if we want to perform Dunn's test for pairwise multiple comparisons as a posthoc to Kruskal-Wallis, we can use rstatix::dunn_test().

dunn<-bdata %>% 
  dunn_test(Units ~ RBC.Age.Group, p.adjust.method = "bonferroni") %>%
  add_xy_position(x="RBC.Age.Group") 

ggboxplot(bdata, x = "RBC.Age.Group", y = "Units",
          color = "RBC.Age.Group", 
          palette =c("#00AFBB", "#E7B800", "#FC4E07")) +
  stat_pvalue_manual(dunn, label = "p = {scales::pvalue(p.adj)}")

label = "p = {scales::pvalue(p.adj)}"

This line of code allows us to round our adjusted p-values; you could also use round.

Additionally, using rstatix can (sort of) simplify some of the code that we used previously. For example, let's return to the TCGA output above. Instead of using compare_means, we can use the function wilcox_test() directly. With the help of rstatix::add_xy_position(), the results are ready to add to our box plot, negating the use of geom_pwc().

rexp<- exp %>% tidyr::pivot_longer(cols=3:5, names_to="Gene",
                                   values_to="Expression")
w<- rexp %>% group_by(Gene) %>% 
  wilcox_test(data=.,Expression~dataset,
              p.adjust.method="bonferroni") %>% 
  add_xy_position(x = "dataset")


ggboxplot(data=rexp, x = "dataset",
          y = "Expression",
          facet.by = "Gene",
          ylab = "Expression",
          color = "dataset",
          palette = "jco") +
  stat_pvalue_manual(w, label = "p.adj.signif", tip.length=0.01,
                     label.size=3,vjust=0.2) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

Methods for Correlation

Correlation analysis refers to a group of statistical methods used to discover the relationship between two or more variables of interest. Common methods of correlation are included in the table below.

Methods	R function	Description
Pearson correlation	cor(); cor.test()	measures a linear dependence between two variables (parametric)
Kendall tau	cor(); cor.test()	rank-based correlation coefficient (non-parametric)
Spearman rho	cor(); cor.test()	rank-based correlation coefficient (non-parametric)

Here is a nice post on Stack Exchange outlining some of the differences between these methods.

It can be fairly cumbersome to add correlation results and regression model output to scatter plots generated with ggplot2. ggpubr simplifies this process with the following functions:

Image from the ggpubr function reference page.

Let's see an example using the dataset TCGA-LUAD.

TCGA correlation example

Scatter plots are useful to visually inspect similarity (correlation) between variables. For example, we can see how closely two replicates from the TCGA lung adenocarcinoma data set correlate.

luad<-readr::read_csv("luad_example.csv")

#just correlation coeeficient
a<-ggscatter(luad, x = "Sample1A", y = "Sample1A2",color = "black", 
          shape = 21, size = 3, add = "reg.line",
          add.params = list(color = "blue", fill = "lightgray"),
          cor.coef = TRUE,
          fullrange= TRUE,
   cor.coeff.args = list(method = "pearson", label.x = 3)
   ) +
  coord_fixed(ratio=1,xlim=c(3,20),ylim=c(3,20)) # add fixed coord


# to add r-square
b<-ggscatter(luad, x = "Sample1A", y = "Sample1A2",color = "black", 
          shape = 21, size = 3, add = "reg.line",
          add.params = list(color = "blue", fill = "lightgray")) +
  stat_cor(aes(label = 
                 paste(..rr.label.., ..p.label.., sep = '~","~')),
           label.x = 3) +
  coord_fixed(ratio=1,xlim=c(3,20),ylim=c(3,20))

a+b

Warning: The dot-dot notation (`..rr.label..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(rr.label)` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.

With ggpubr::ggscatter() we can plot our data and add correlation coefficients and regression lines. To add the r-square value, you can use stat_corr() directly. Since the axes of these plots are on the same scale, we can make sure the physical space of the plot is represented the same across axes using coord_fixed() from ggplot2.

Additionally, if you want the regression line equation, you can use stat_regline_equation().

ggpubr functions for genomics

Lastly, because this is a bioinformatics coding club, ggpubr does include a function to create an MA-plot (ggmaplot()). An MA plot is a common output from differential expression results. These plots show the log fold-change (base-2) on the y-axis versus normalized mean expression between two conditions.

Let's exemplify this using differential expression data included in the ggpubr package, which were obtained from comparing the RNAseq data of two different cell populations using DESeq2.

data("diff_express")
dexp<-diff_express
head(dexp)

                    name     baseMean log2FoldChange         padj
ENSG00000000003   TSPAN6    0.1184475      0.0000000           NA
ENSG00000000419     DPM1 1654.4618144      0.6789538 5.280802e-02
ENSG00000000457    SCYL3  681.0463277      1.5263838 3.915112e-07
ENSG00000000460 C1orf112  389.7226640      3.8933573 1.180333e-14
ENSG00000000938      FGR  364.7810090     -2.3554014 1.559228e-04
ENSG00000000971      CFH    1.1346239      1.2932740 4.491812e-01
                detection_call
ENSG00000000003              0
ENSG00000000419              1
ENSG00000000457              1
ENSG00000000460              1
ENSG00000000938              1
ENSG00000000971              0

ggmaplot(diff_express,
   fdr = 0.01, fc = 2, size = 0.4,
   genenames = as.vector(diff_express$name),
   legend = "top", top = 20,
   font.label = c("bold", 11),
   font.legend = "bold",
   font.main = "bold",
   ggtheme = ggplot2::theme_minimal())

Arguments:

data can be an object of class DESeqResults, get_diff, DE_Results, matrix or data frame containing the columns baseMean (or baseMeanLog2), log2FoldChange, and padj. Rows are genes.
fdr is the selected false discovery rate at which we will consider genes differentially expressed.
fc is the fold change threshold; only genes greater than or equal to this threshold with a significance less than or equal to fdr will be considered significantly differentially expressed.
top is the number of genes to label on the plot. Top genes is decided by sorted adjusted p-values and fold change.

label.select

You can control which genes are labeled using the argument (label.select). See the example from the help documentation for the function (?ggmaplot).

Additional tutorials GGpubr

If you are interested in ggpubr, you may find the following tutorials useful: