Skip to content

Lesson 8: Data Wrangling Review and Practice

Objectives

  1. Review important data wrangling functions
  2. Put our wrangling skills to use on a realistic RNA-Seq data set

Data Wrangling Review

Important functions by topic

Importing / Exporting Data

Importing and exporting data into the R environment is done using base R and readR (readxl in the case of excel files) functions. Most of these functions begin with read. and read_ for importing, or write. and write_ for exporting. You can use tab complete to find the appropriate function.

Data reshape (library(tidyr))

pivot_wider() - Makes a long data set wide.

pivot_longer() - Makes a wide data set long.

separate() - Splits column content across multiple columns.

unite() - Condenses content across multiple columns into a single column.

Dealing with row names (library(tibble))

rownames_to_column() - Creates a column in your data frame from existing row names.

column_to_rownames() - Creates row names from a column in your data frame.

Data Visualization (library(ggplot2))

There are 3 components required for all ggplot2 plots: DATA, GEOM_FUNCTION, and aes MAPPINGS.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>))  

Subsetting Data (library(dplyr))

select() - Filters data by column.

  • Check out associated helper functions:

  • Select specific columns:

    • everything()
    • last_col()
  • Select columns by matching some aspect of the column name:

    • starts_with()
    • ends_with()
    • contains()
    • matches()
    • num_range()
  • Select columns named in a character vector:

    • all_of()
    • any_of()
  • Select a column using a function

    • where()

rename() - rename a column without selecting

filter() - Filters data by row.

  • filter() often uses relational operators. Type ?Comparison and ?match in the console for more information.
  • For filtering across multiple rows, check out if_all() and if_any().

slice() - positional subsetting

Reordering rows (library(dplyr))

arrange() - orders rows by the values in specific columns

  • Commonly used with desc(), which sorts values in descending order.

Joining Data frames (library(dplyr))

Mutating Joins

left_join() - Output contains all rows from x
right_join() - Output contains all rows from y
inner_join() - Output contains matched rows from x
full_join() - Output contains all rows from x and y

Image from R4DS

Filtering joins

semi_join() - Outputs columns from x and rows from x that match y.
anti_join() - Outputs columns from x and rows from x that DO NOT match y.

Transforming Variables (library(dplyr))

mutate() - Create new columns while keeping all existing columns

transmute() - Create new columns while dropping all existing columns.

To mutate across multiple columns using a single phrase use across(). Because across() requires you to select which columns you want to mutate, it can be used with the select() helper functions.

Split, apply, combine (library(dplyr))

To apply some function to smaller subsets (groups) within your data, and return a summarized data frame, use group_by() with summarize().

group_by() - group a data frame by a categorical variable so that a given operation can be performed per group / category.

summarize() - reduces multiple values down to a single summary using specified functions (e.g., mean, median, standard deviation, etc.).

Other useful tidyverse packages

Working with dates

Check out library(lubridate).

Working with factors

Check out library(forcats).

Looking for a for loop alternative

Check out library(purr).

Using R on Biowulf

If you wish to use R on Biowulf, you can view modules available using module -r avail '^R$'. Loading the module requires an interactive session, sinteractive (unless submitting a job).

  • Example of module load module load R/3.5

  • After loading the module, to begin using R, type R.

For a list of currently installed packages in the default R environment, see https://hpc.nih.gov/apps/R.html#packages.

For instructions on using R interactively or via a batch job, see the R Biowulf help page. You may also find our course documentation for Toward Reproducibility with R on Biowulf helpful.

For information on obtaining a Biowulf account, click here.

Wrangling a realistic dataset

Provided Data

The provided data set is an example of a real count matrix returned from the NCI CCR Sequencing Facility (CCR-SF). The provided file (./data/SF_example_RNASeq_1.txt) contains RNAseq data for two sets of samples (Cell1 and Cell2) run in triplicate.

Data Challenges

There are some immediate problems with the data.

  1. The column names begin with numbers, which are not syntactic with R.
  2. The gene names are hybrids of Ensembl ID and gene symbols and will match neither if needed in the future. Therefore, these should be split into their own columns.
  3. The values of the gene counts are in decimals. Most differential expression packages will expect integers.

Our challenge

First, we will want to fix the three immediate problems discussed above. Once those have been fixed, we are going to create a bar plot, using ggplot2 to plot the total gene counts per sample. With that accomplished we are then going to format the data so that it can be used to construct a DESeqDataSet object. This object is necessary to apply different functions from DESeq2 including differential expression analysis.

Note: While the instructions below include some guidance on the different functions required, you may need to include helper functions or modify function arguments to complete each task. Expected output will be shown below each prompt.

Step 1: Load the data.

Begin by loading the data and saving to an object named dmat.

Solution}

library(tidyverse)
dmat<-read_delim("./data/SF_example_RNASeq_1.txt",col_names =TRUE)

Great! You should have a 10,000 x 7 tibble if you used a readr function for data loading.

dmat
## # A tibble: 10,000 × 7
##    gene_id           `1_Cell1_Rep1` `2_Cell1_Rep2` `3_Cell1_Rep3` `4_Cell2_Rep1`
##    <chr>                      <dbl>          <dbl>          <dbl>          <dbl>
##  1 ENSG00000001630.…          6877.          6614           7058.         11305.
##  2 ENSG00000002016.…           283.           287.           287.           265.
##  3 ENSG00000002330.…          1946           1662           2121            608 
##  4 ENSG00000002834.…         17636          19333          18917           4583 
##  5 ENSG00000003056.…          3874           4107           4005           5741 
##  6 ENSG00000003393.…          2041           2150           2141           1687 
##  7 ENSG00000003989.…           279            345            305          18586 
##  8 ENSG00000004534.…          2695.          3031           2871           1948 
##  9 ENSG00000004838.…            42             52             39             61 
## 10 ENSG00000004848.…             1              0              0             18 
## # ℹ 9,990 more rows
## # ℹ 2 more variables: `5_Cell2_Rep2` <dbl>, `6_Cell2_Rep3` <dbl>

Step 2: Rename the Samples (column names)

Notice that the column names each begin with a number (e.g., 1_MB231_RNA1). Place an "S" at the beginning of each sample name using rename_with(), which is similar to rename(), but it uses a function to rename the columns. Look up the function ?paste() to use with rename(). Overwrite dmat with dmat.

Solution}

#The easiest way to do this is with the function paste
dmat<-rename_with(dmat,~ paste0("S",.x),contains("Cell"))

#for a gsub solution
#rename_with(dmat,~ gsub("^", "S", .x), contains("Cell")) #^ is a regex (at the beginning)

colnames(dmat)
## [1] "gene_id"       "S1_Cell1_Rep1" "S2_Cell1_Rep2" "S3_Cell1_Rep3"
## [5] "S4_Cell2_Rep1" "S5_Cell2_Rep2" "S6_Cell2_Rep3"

Step 3: Separate the gene abbreviation from the Ensembl ID

Separate the gene abbreviation from the Ensembl ID in column 1 using separate(). Save the output to a new object named dmat2.

Solution}.

#separate gene abbreviation from ensembl id 
dmat2<-separate(dmat, gene_id, into=c("Ensembl_ID","gene_abb"),sep="_",remove=TRUE,extra="merge")

Store the newly separated columns (Ensembl_ID and gene_abb) in a new data frame named gene_names and drop the gene abbreviations from our working data frame (dmat2) because we ultimately want this to be a data matrix (overwrite dmat2 with dmat2).

Solution}

#separate gene abbreviation from ensembl id 
gene_names<- dmat2 %>% select(1:2)

dmat2<-dmat2 %>% select(!gene_abb)

gene_names
## # A tibble: 10,000 × 2
##    Ensembl_ID         gene_abb
##    <chr>              <chr>   
##  1 ENSG00000001630.11 CYP51A1 
##  2 ENSG00000002016.12 RAD52   
##  3 ENSG00000002330.9  BAD     
##  4 ENSG00000002834.13 LASP1   
##  5 ENSG00000003056.3  M6PR    
##  6 ENSG00000003393.10 ALS2    
##  7 ENSG00000003989.12 SLC7A2  
##  8 ENSG00000004534.10 RBM6    
##  9 ENSG00000004838.9  ZMYND10 
## 10 ENSG00000004848.6  ARX     
## # ℹ 9,990 more rows
dmat2
## # A tibble: 10,000 × 7
##    Ensembl_ID         S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep1
##    <chr>                      <dbl>         <dbl>         <dbl>         <dbl>
##  1 ENSG00000001630.11         6877.         6614          7058.        11305.
##  2 ENSG00000002016.12          283.          287.          287.          265.
##  3 ENSG00000002330.9          1946          1662          2121           608 
##  4 ENSG00000002834.13        17636         19333         18917          4583 
##  5 ENSG00000003056.3          3874          4107          4005          5741 
##  6 ENSG00000003393.10         2041          2150          2141          1687 
##  7 ENSG00000003989.12          279           345           305         18586 
##  8 ENSG00000004534.10         2695.         3031          2871          1948 
##  9 ENSG00000004838.9            42            52            39            61 
## 10 ENSG00000004848.6             1             0             0            18 
## # ℹ 9,990 more rows
## # ℹ 2 more variables: S5_Cell2_Rep2 <dbl>, S6_Cell2_Rep3 <dbl>

Step 4: Convert the gene counts to integers

Many of the packages that handle RNASeq count data do not work correctly with decimal numbers. We need to convert these numbers to integers using mutate(). Save your transformed data frame to an object named dmat3.

Solution}

dmat3<-dmat2 %>% mutate(across(where(is.numeric),~as.integer(round(.)))) 

dmat3
## # A tibble: 10,000 × 7
##    Ensembl_ID         S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep1
##    <chr>                      <int>         <int>         <int>         <int>
##  1 ENSG00000001630.11          6877          6614          7058         11305
##  2 ENSG00000002016.12           283           287           287           265
##  3 ENSG00000002330.9           1946          1662          2121           608
##  4 ENSG00000002834.13         17636         19333         18917          4583
##  5 ENSG00000003056.3           3874          4107          4005          5741
##  6 ENSG00000003393.10          2041          2150          2141          1687
##  7 ENSG00000003989.12           279           345           305         18586
##  8 ENSG00000004534.10          2695          3031          2871          1948
##  9 ENSG00000004838.9             42            52            39            61
## 10 ENSG00000004848.6              1             0             0            18
## # ℹ 9,990 more rows
## # ℹ 2 more variables: S5_Cell2_Rep2 <int>, S6_Cell2_Rep3 <int>

Step 5: Create a bar plot, using ggplot2 to plot the total gene counts per sample.

In order to create a ggplot2 bar plot we will need to reshape the data. The sample names should be in a single column named Sample and the gene counts in a single column named Count. Use pivot_longer() and save the reshaped data to an object named ldmat.

Solution}

ldmat<-pivot_longer(dmat3,starts_with("S"),names_to =c("Sample"),values_to= "Count")

ldmat
## # A tibble: 60,000 × 3
##    Ensembl_ID         Sample        Count
##    <chr>              <chr>         <int>
##  1 ENSG00000001630.11 S1_Cell1_Rep1  6877
##  2 ENSG00000001630.11 S2_Cell1_Rep2  6614
##  3 ENSG00000001630.11 S3_Cell1_Rep3  7058
##  4 ENSG00000001630.11 S4_Cell2_Rep1 11305
##  5 ENSG00000001630.11 S5_Cell2_Rep2 10761
##  6 ENSG00000001630.11 S6_Cell2_Rep3 10047
##  7 ENSG00000002016.12 S1_Cell1_Rep1   283
##  8 ENSG00000002016.12 S2_Cell1_Rep2   287
##  9 ENSG00000002016.12 S3_Cell1_Rep3   287
## 10 ENSG00000002016.12 S4_Cell2_Rep1   265
## # ℹ 59,990 more rows

Now create the plot.

Solution}

ldmat %>% 
  group_by(Sample) %>% 
  summarize(Tcounts= sum(Count)) %>%
  ggplot() +
  geom_bar(aes(x=Sample,y=Tcounts),stat="identity") +
  geom_text(aes(x=Sample,y=Tcounts,label = Tcounts), vjust = 0)+
  theme_classic()+
  theme(axis.text.x = element_text(angle=90,vjust=0.5)) +
  labs(x=NULL,y="Total Counts")

#ggplot(ldmat) + #Shows the same but is less clean
#  geom_bar(aes(x=Sample,y=Count),stat="identity") 

Step 6: Prepare objects needed to create a DESeqDataSet object.

The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet. --- Analyzing RNA-seq data with DESeq2

Constructing this object from a count matrix requires count data, sample information, and an experiment design. Our final objective is to prep a count matrix and sample information that can be used to create a DESeqDataSet object.

The count data currently stored in a data frame will need to be a matrix. To do this, first convert the current column Ensembl_ID to row names. Next, before converting to a matrix, filter genes with low expression. Keep only genes with 10 or more reads across all samples (Hint: check out the function rowSums()). Finally, convert the resulting data frame to a matrix (use as.matrix()). Save this object back to dmat3.

Solution}

dmat3<-dmat3 %>% column_to_rownames("Ensembl_ID") %>%
  filter(rowSums(.) >= 10) %>% 
  as.matrix()

head(dmat3)
##                    S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep1
## ENSG00000001630.11          6877          6614          7058         11305
## ENSG00000002016.12           283           287           287           265
## ENSG00000002330.9           1946          1662          2121           608
## ENSG00000002834.13         17636         19333         18917          4583
## ENSG00000003056.3           3874          4107          4005          5741
## ENSG00000003393.10          2041          2150          2141          1687
##                    S5_Cell2_Rep2 S6_Cell2_Rep3
## ENSG00000001630.11         10761         10047
## ENSG00000002016.12           235           254
## ENSG00000002330.9            711           576
## ENSG00000002834.13          4464          3892
## ENSG00000003056.3           5703          4978
## ENSG00000003393.10          1624          1426

Next, we need to create some sample information to include with our count matrix. Because our RNAseq data includes two sets of samples run in triplicate for two cell lines, Cell1 and Cell2, we can create sample information from the sample names using the function data.frame(). The column names of our data matrix (dmat3) and the row names of our sample metadata (samp_df) should be in the same order.

Solution}

samp_df<-data.frame(Sample=colnames(dmat3),Cell_line=rep(c("Cell1","Cell2"),each=3),Replicate=rep(1:3)) %>% column_to_rownames("Sample")

samp_df
##               Cell_line Replicate
## S1_Cell1_Rep1     Cell1         1
## S2_Cell1_Rep2     Cell1         2
## S3_Cell1_Rep3     Cell1         3
## S4_Cell2_Rep1     Cell2         1
## S5_Cell2_Rep2     Cell2         2
## S6_Cell2_Rep3     Cell2         3

Now we have the files needed to create a DESeqDataSet object. Let's use the function DESeqDataSetFromMatrix(). The general template is as follows:

dds <- DESeqDataSetFromMatrix(countData = data_matrix,
                              colData = sample_info,
                              design = ~ condition)
For more information on constructing the DESeqDataSet object and using the package DESeq2, check out this vignette.

To install DESeq2 use:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

And create the object

Solution}

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = dmat3,
                              colData = samp_df,
                              design = ~ Cell_line)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds

## class: DESeqDataSet 
## dim: 3622 6 
## metadata(1): version
## assays(1): counts
## rownames(3622): ENSG00000001630.11 ENSG00000002016.12 ...
##   ENSGR0000169100.8 ENSGR0000223773.2
## rowData names(0):
## colnames(6): S1_Cell1_Rep1 S2_Cell1_Rep2 ... S5_Cell2_Rep2
##   S6_Cell2_Rep3
## colData names(2): Cell_line Replicate