Lesson 8: Data Wrangling Review and Practice
Objectives
- Review important data wrangling functions
- Put our wrangling skills to use on a realistic RNA-Seq data set
Data Wrangling Review
Important functions by topic
Importing / Exporting Data
Importing and exporting data into the R environment is done using base R and readR
(readxl
in the case of excel files) functions. Most of these functions begin with read.
and read_
for importing, or write.
and write_
for exporting. You can use tab complete to find the appropriate function.
Data reshape (library(tidyr)
)
pivot_wider()
- Makes a long data set wide.
pivot_longer()
- Makes a wide data set long.
separate()
- Splits column content across multiple columns.
unite()
- Condenses content across multiple columns into a single column.
Dealing with row names (library(tibble)
)
rownames_to_column()
- Creates a column in your data frame from existing row names.
column_to_rownames()
- Creates row names from a column in your data frame.
Data Visualization (library(ggplot2)
)
There are 3 components required for all ggplot2 plots: DATA, GEOM_FUNCTION, and aes
MAPPINGS.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>))
Subsetting Data (library(dplyr)
)
select()
- Filters data by column.
-
Check out associated helper functions:
-
Select specific columns:
everything()
last_col()
-
Select columns by matching some aspect of the column name:
starts_with()
ends_with()
contains()
matches()
num_range()
-
Select columns named in a character vector:
all_of()
any_of()
-
Select a column using a function
where()
rename()
- rename a column without selecting
filter()
- Filters data by row.
filter()
often uses relational operators. Type?Comparison
and?match
in the console for more information.- For filtering across multiple rows, check out
if_all()
andif_any()
.
slice()
- positional subsetting
Reordering rows (library(dplyr)
)
arrange()
- orders rows by the values in specific columns
- Commonly used with
desc()
, which sorts values in descending order.
Joining Data frames (library(dplyr)
)
Mutating Joins
left_join()
- Output contains all rows from x
right_join()
- Output contains all rows from y
inner_join()
- Output contains matched rows from x
full_join()
- Output contains all rows from x and y
Filtering joins
semi_join()
- Outputs columns from x and rows from x that match y.
anti_join()
- Outputs columns from x and rows from x that DO NOT match y.
Transforming Variables (library(dplyr)
)
mutate()
- Create new columns while keeping all existing columns
transmute()
- Create new columns while dropping all existing columns.
To mutate across multiple columns using a single phrase use across()
. Because across()
requires you to select which columns you want to mutate, it can be used with the select()
helper functions.
Split, apply, combine (library(dplyr)
)
To apply some function to smaller subsets (groups) within your data, and return a summarized data frame, use group_by()
with summarize()
.
group_by()
- group a data frame by a categorical variable so that a given operation can be performed per group / category.
summarize()
- reduces multiple values down to a single summary using specified functions (e.g., mean, median, standard deviation, etc.).
Other useful tidyverse
packages
Working with dates
Check out library(lubridate)
.
Working with factors
Check out library(forcats)
.
Looking for a for loop
alternative
Check out library(purr)
.
Using R on Biowulf
If you wish to use R on Biowulf, you can view modules available using module -r avail '^R$'
. Loading the module requires an interactive session, sinteractive
(unless submitting a job).
-
Example of module load
module load R/3.5
-
After loading the module, to begin using R, type
R
.
For a list of currently installed packages in the default R environment, see https://hpc.nih.gov/apps/R.html#packages.
For instructions on using R interactively or via a batch job, see the R Biowulf help page. You may also find our course documentation for Toward Reproducibility with R on Biowulf helpful.
For information on obtaining a Biowulf account, click here.
Wrangling a realistic dataset
Provided Data
The provided data set is an example of a real count matrix returned from the NCI CCR Sequencing Facility (CCR-SF). The provided file (./data/SF_example_RNASeq_1.txt
) contains RNAseq data for two sets of samples (Cell1 and Cell2) run in triplicate.
Data Challenges
There are some immediate problems with the data.
- The column names begin with numbers, which are not syntactic with R.
- The gene names are hybrids of Ensembl ID and gene symbols and will match neither if needed in the future. Therefore, these should be split into their own columns.
- The values of the gene counts are in decimals. Most differential expression packages will expect integers.
Our challenge
First, we will want to fix the three immediate problems discussed above. Once those have been fixed, we are going to create a bar plot, using ggplot2
to plot the total gene counts per sample. With that accomplished we are then going to format the data so that it can be used to construct a DESeqDataSet
object. This object is necessary to apply different functions from DESeq2
including differential expression analysis.
Note: While the instructions below include some guidance on the different functions required, you may need to include helper functions or modify function arguments to complete each task. Expected output will be shown below each prompt.
Step 1: Load the data.
Begin by loading the data and saving to an object named dmat
.
Solution}
library(tidyverse)
dmat<-read_delim("./data/SF_example_RNASeq_1.txt",col_names =TRUE)
Great! You should have a 10,000 x 7 tibble if you used a readr
function for data loading.
dmat
## # A tibble: 10,000 × 7
## gene_id `1_Cell1_Rep1` `2_Cell1_Rep2` `3_Cell1_Rep3` `4_Cell2_Rep1`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000001630.… 6877. 6614 7058. 11305.
## 2 ENSG00000002016.… 283. 287. 287. 265.
## 3 ENSG00000002330.… 1946 1662 2121 608
## 4 ENSG00000002834.… 17636 19333 18917 4583
## 5 ENSG00000003056.… 3874 4107 4005 5741
## 6 ENSG00000003393.… 2041 2150 2141 1687
## 7 ENSG00000003989.… 279 345 305 18586
## 8 ENSG00000004534.… 2695. 3031 2871 1948
## 9 ENSG00000004838.… 42 52 39 61
## 10 ENSG00000004848.… 1 0 0 18
## # ℹ 9,990 more rows
## # ℹ 2 more variables: `5_Cell2_Rep2` <dbl>, `6_Cell2_Rep3` <dbl>
Step 2: Rename the Samples (column names)
Notice that the column names each begin with a number (e.g., 1_MB231_RNA1). Place an "S" at the beginning of each sample name using rename_with()
, which is similar to rename()
, but it uses a function to rename the columns. Look up the function ?paste()
to use with rename()
. Overwrite dmat
with dmat
.
Solution}
#The easiest way to do this is with the function paste
dmat<-rename_with(dmat,~ paste0("S",.x),contains("Cell"))
#for a gsub solution
#rename_with(dmat,~ gsub("^", "S", .x), contains("Cell")) #^ is a regex (at the beginning)
colnames(dmat)
## [1] "gene_id" "S1_Cell1_Rep1" "S2_Cell1_Rep2" "S3_Cell1_Rep3"
## [5] "S4_Cell2_Rep1" "S5_Cell2_Rep2" "S6_Cell2_Rep3"
Step 3: Separate the gene abbreviation from the Ensembl ID
Separate the gene abbreviation from the Ensembl ID in column 1 using separate()
. Save the output to a new object named dmat2
.
Solution}.
#separate gene abbreviation from ensembl id
dmat2<-separate(dmat, gene_id, into=c("Ensembl_ID","gene_abb"),sep="_",remove=TRUE,extra="merge")
Store the newly separated columns (Ensembl_ID
and gene_abb
) in a new data frame named gene_names
and drop the gene abbreviations from our working data frame (dmat2
) because we ultimately want this to be a data matrix (overwrite dmat2
with dmat2
).
Solution}
#separate gene abbreviation from ensembl id
gene_names<- dmat2 %>% select(1:2)
dmat2<-dmat2 %>% select(!gene_abb)
gene_names
## # A tibble: 10,000 × 2
## Ensembl_ID gene_abb
## <chr> <chr>
## 1 ENSG00000001630.11 CYP51A1
## 2 ENSG00000002016.12 RAD52
## 3 ENSG00000002330.9 BAD
## 4 ENSG00000002834.13 LASP1
## 5 ENSG00000003056.3 M6PR
## 6 ENSG00000003393.10 ALS2
## 7 ENSG00000003989.12 SLC7A2
## 8 ENSG00000004534.10 RBM6
## 9 ENSG00000004838.9 ZMYND10
## 10 ENSG00000004848.6 ARX
## # ℹ 9,990 more rows
dmat2
## # A tibble: 10,000 × 7
## Ensembl_ID S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep1
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000001630.11 6877. 6614 7058. 11305.
## 2 ENSG00000002016.12 283. 287. 287. 265.
## 3 ENSG00000002330.9 1946 1662 2121 608
## 4 ENSG00000002834.13 17636 19333 18917 4583
## 5 ENSG00000003056.3 3874 4107 4005 5741
## 6 ENSG00000003393.10 2041 2150 2141 1687
## 7 ENSG00000003989.12 279 345 305 18586
## 8 ENSG00000004534.10 2695. 3031 2871 1948
## 9 ENSG00000004838.9 42 52 39 61
## 10 ENSG00000004848.6 1 0 0 18
## # ℹ 9,990 more rows
## # ℹ 2 more variables: S5_Cell2_Rep2 <dbl>, S6_Cell2_Rep3 <dbl>
Step 4: Convert the gene counts to integers
Many of the packages that handle RNASeq count data do not work correctly with decimal numbers. We need to convert these numbers to integers using mutate()
. Save your transformed data frame to an object named dmat3
.
Solution}
dmat3<-dmat2 %>% mutate(across(where(is.numeric),~as.integer(round(.))))
dmat3
## # A tibble: 10,000 × 7
## Ensembl_ID S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep1
## <chr> <int> <int> <int> <int>
## 1 ENSG00000001630.11 6877 6614 7058 11305
## 2 ENSG00000002016.12 283 287 287 265
## 3 ENSG00000002330.9 1946 1662 2121 608
## 4 ENSG00000002834.13 17636 19333 18917 4583
## 5 ENSG00000003056.3 3874 4107 4005 5741
## 6 ENSG00000003393.10 2041 2150 2141 1687
## 7 ENSG00000003989.12 279 345 305 18586
## 8 ENSG00000004534.10 2695 3031 2871 1948
## 9 ENSG00000004838.9 42 52 39 61
## 10 ENSG00000004848.6 1 0 0 18
## # ℹ 9,990 more rows
## # ℹ 2 more variables: S5_Cell2_Rep2 <int>, S6_Cell2_Rep3 <int>
Step 5: Create a bar plot, using ggplot2
to plot the total gene counts per sample.
In order to create a ggplot2 bar plot we will need to reshape the data. The sample names should be in a single column named Sample
and the gene counts in a single column named Count
. Use pivot_longer()
and save the reshaped data to an object named ldmat
.
Solution}
ldmat<-pivot_longer(dmat3,starts_with("S"),names_to =c("Sample"),values_to= "Count")
ldmat
## # A tibble: 60,000 × 3
## Ensembl_ID Sample Count
## <chr> <chr> <int>
## 1 ENSG00000001630.11 S1_Cell1_Rep1 6877
## 2 ENSG00000001630.11 S2_Cell1_Rep2 6614
## 3 ENSG00000001630.11 S3_Cell1_Rep3 7058
## 4 ENSG00000001630.11 S4_Cell2_Rep1 11305
## 5 ENSG00000001630.11 S5_Cell2_Rep2 10761
## 6 ENSG00000001630.11 S6_Cell2_Rep3 10047
## 7 ENSG00000002016.12 S1_Cell1_Rep1 283
## 8 ENSG00000002016.12 S2_Cell1_Rep2 287
## 9 ENSG00000002016.12 S3_Cell1_Rep3 287
## 10 ENSG00000002016.12 S4_Cell2_Rep1 265
## # ℹ 59,990 more rows
Now create the plot.
Solution}
ldmat %>%
group_by(Sample) %>%
summarize(Tcounts= sum(Count)) %>%
ggplot() +
geom_bar(aes(x=Sample,y=Tcounts),stat="identity") +
geom_text(aes(x=Sample,y=Tcounts,label = Tcounts), vjust = 0)+
theme_classic()+
theme(axis.text.x = element_text(angle=90,vjust=0.5)) +
labs(x=NULL,y="Total Counts")
#ggplot(ldmat) + #Shows the same but is less clean
# geom_bar(aes(x=Sample,y=Count),stat="identity")
Step 6: Prepare objects needed to create a DESeqDataSet
object.
The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet. --- Analyzing RNA-seq data with DESeq2
Constructing this object from a count matrix requires count data, sample information, and an experiment design. Our final objective is to prep a count matrix and sample information that can be used to create a DESeqDataSet
object.
The count data currently stored in a data frame will need to be a matrix. To do this, first convert the current column Ensembl_ID to row names. Next, before converting to a matrix, filter genes with low expression. Keep only genes with 10 or more reads across all samples (Hint: check out the function rowSums()
). Finally, convert the resulting data frame to a matrix (use as.matrix()
). Save this object back to dmat3
.
Solution}
dmat3<-dmat3 %>% column_to_rownames("Ensembl_ID") %>%
filter(rowSums(.) >= 10) %>%
as.matrix()
head(dmat3)
## S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep1
## ENSG00000001630.11 6877 6614 7058 11305
## ENSG00000002016.12 283 287 287 265
## ENSG00000002330.9 1946 1662 2121 608
## ENSG00000002834.13 17636 19333 18917 4583
## ENSG00000003056.3 3874 4107 4005 5741
## ENSG00000003393.10 2041 2150 2141 1687
## S5_Cell2_Rep2 S6_Cell2_Rep3
## ENSG00000001630.11 10761 10047
## ENSG00000002016.12 235 254
## ENSG00000002330.9 711 576
## ENSG00000002834.13 4464 3892
## ENSG00000003056.3 5703 4978
## ENSG00000003393.10 1624 1426
Next, we need to create some sample information to include with our count matrix. Because our RNAseq data includes two sets of samples run in triplicate for two cell lines, Cell1 and Cell2, we can create sample information from the sample names using the function data.frame()
. The column names of our data matrix (dmat3
) and the row names of our sample metadata (samp_df
) should be in the same order.
Solution}
samp_df<-data.frame(Sample=colnames(dmat3),Cell_line=rep(c("Cell1","Cell2"),each=3),Replicate=rep(1:3)) %>% column_to_rownames("Sample")
samp_df
## Cell_line Replicate
## S1_Cell1_Rep1 Cell1 1
## S2_Cell1_Rep2 Cell1 2
## S3_Cell1_Rep3 Cell1 3
## S4_Cell2_Rep1 Cell2 1
## S5_Cell2_Rep2 Cell2 2
## S6_Cell2_Rep3 Cell2 3
Now we have the files needed to create a DESeqDataSet
object. Let's use the function DESeqDataSetFromMatrix()
. The general template is as follows:
dds <- DESeqDataSetFromMatrix(countData = data_matrix,
colData = sample_info,
design = ~ condition)
DESeqDataSet
object and using the package DESeq2
, check out this vignette.
To install DESeq2
use:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
And create the object
Solution}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = dmat3,
colData = samp_df,
design = ~ Cell_line)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet
## dim: 3622 6
## metadata(1): version
## assays(1): counts
## rownames(3622): ENSG00000001630.11 ENSG00000002016.12 ...
## ENSGR0000169100.8 ENSGR0000223773.2
## rowData names(0):
## colnames(6): S1_Cell1_Rep1 S2_Cell1_Rep2 ... S5_Cell2_Rep2
## S6_Cell2_Rep3
## colData names(2): Cell_line Replicate