Lesson 5: R Data Structures - Data Frames

Learning Objectives

This is the last lesson in Part 1 of Introductory R for Novices: Getting Started with R. This lesson will focus exclusively on working with data frames. Attendees will learn how to examine, summarize, and access data in data frames.

Specific learning objectives include:

Review data import.
Learn how to view and summarize data in a data frame.
Learn how to use data accessors.
Learn the syntax for sub-setting a data frame.

To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.

Load the libraries

This lesson will use some functions from the tidyverse.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Examining and summarizing data frames

All of the objects we imported in the previous lesson were data frames. In this lesson, we will learn how to view a data frame and find out more about the data stored in it. Let's use the R object smeta as an example.

smeta<-read.delim("./data/airway_sampleinfo.txt")
head(smeta)

  SampleName    cell   dex albut        Run avgLength Experiment    Sample
1 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345 SRS508568
2 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349 SRS508571
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353 SRS508575
6 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354 SRS508576
     BioSample
1 SAMN02422669
2 SAMN02422675
3 SAMN02422678
4 SAMN02422670
5 SAMN02422682
6 SAMN02422673

We can view these data by clicking on the name of the object in the Environment pane or by using View().

To understand more about the underlying structure of our data, we can use str() or a similar function, dplyr::glimpse().

str(smeta)

'data.frame':   8 obs. of  9 variables:
 $ SampleName: chr  "GSM1275862" "GSM1275863" "GSM1275866" "GSM1275867" ...
 $ cell      : chr  "N61311" "N61311" "N052611" "N052611" ...
 $ dex       : chr  "untrt" "trt" "untrt" "trt" ...
 $ albut     : chr  "untrt" "untrt" "untrt" "untrt" ...
 $ Run       : chr  "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
 $ avgLength : int  126 126 126 87 120 126 101 98
 $ Experiment: chr  "SRX384345" "SRX384346" "SRX384349" "SRX384350" ...
 $ Sample    : chr  "SRS508568" "SRS508567" "SRS508571" "SRS508572" ...
 $ BioSample : chr  "SAMN02422669" "SAMN02422675" "SAMN02422678" "SAMN02422670" ...

str() shows us that we are looking at a data frame object with 8 rows and 9 columns. The column names are listed at the far left, each preceded by a $. The $ is a data frame accessor, and we will see how it works later. The data type of each column appears after its name (here, chr for character and int for integer; other data frames may also contain logical or double columns). Knowing the types helps us understand how we can transform and visualize the data in these columns.
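When the column types are needed in code rather than read off the printed str() output, one option is to ask each column for its class. This is a minimal sketch on a small made-up data frame; the name toy and its treated column are hypothetical and not part of the lesson data.

```r
# toy is a hypothetical stand-in for smeta
toy <- data.frame(
  SampleName = c("GSM1275862", "GSM1275863"),
  avgLength  = c(126L, 87L),
  treated    = c(FALSE, TRUE)
)

# sapply() applies class() to every column and returns a named character vector
col_types <- sapply(toy, class)
col_types

# Names of the numeric columns only
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
numeric_cols
```

The same two calls should work unchanged on smeta once it has been imported.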

We can also get an overview of summary statistics of this data frame using summary().

summary(smeta)

  SampleName            cell               dex               albut          
 Length:8           Length:8           Length:8           Length:8          
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  



     Run              avgLength      Experiment           Sample         
 Length:8           Min.   : 87.0   Length:8           Length:8          
 Class :character   1st Qu.:100.2   Class :character   Class :character  
 Mode  :character   Median :123.0   Mode  :character   Mode  :character  
                    Mean   :113.8                                        
                    3rd Qu.:126.0                                        
                    Max.   :126.0                                        
  BioSample        
 Length:8          
 Class :character  
 Mode  :character

Our data frame has 9 variables, so we get 9 fields that summarize the data. The only column with numerical data is avgLength, for which we see the minimum and maximum values, the mean, the median, and the first and third quartiles.
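The quartiles printed by summary() can also be computed directly, which is handy if the interquartile range itself is needed. The sketch below types in the eight avgLength values shown in the str() output above; the object names avg_length and iqr are only illustrative.

```r
# avgLength values as printed by str(smeta) above
avg_length <- c(126, 126, 126, 87, 120, 126, 101, 98)

summary(avg_length)                  # the same six statistics as in the data frame summary
quantile(avg_length, c(0.25, 0.75))  # first and third quartiles

# Interquartile range: third quartile minus first quartile
iqr <- IQR(avg_length)
iqr
```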

Using summary() with factors

summary() is also useful for obtaining quick information about a categorical (factor) variable: it shows how many groups there are and the sample size of each group.

smeta2 <- smeta %>% mutate(dex = as.factor(dex))
summary(smeta2)

  SampleName            cell              dex       albut          
  Length:8           Length:8           trt  :4   Length:8          
  Class :character   Class :character   untrt:4   Class :character  
  Mode  :character   Mode  :character             Mode  :character  



      Run              avgLength      Experiment           Sample         
  Length:8           Min.   : 87.0   Length:8           Length:8          
  Class :character   1st Qu.:100.2   Class :character   Class :character  
  Mode  :character   Median :123.0   Mode  :character   Mode  :character  
                    Mean   :113.8                                        
                    3rd Qu.:126.0                                        
                    Max.   :126.0                                        
  BioSample        
  Length:8          
  Class :character  
  Mode  :character
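If only the group information is of interest, levels(), nlevels(), and table() give the same counts without summarizing every column. A small sketch, with the dex values typed in by hand for illustration (in the lesson they would come from smeta$dex):

```r
# Treatment labels in the same order as the rows of smeta
dex <- factor(c("untrt", "trt", "untrt", "trt", "untrt", "trt", "untrt", "trt"))

levels(dex)                 # the groups, sorted alphabetically by default
n_groups <- nlevels(dex)    # how many groups there are
group_sizes <- table(dex)   # sample size of each group
group_sizes
```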

What is the length of our data.frame? What are the dimensions?

Other attributes we may want to know regarding our data frame include the number of columns (ncol(), length()) and the dimensions (dim()).

#length returns the number of columns
length(smeta)

[1] 9

#dimensions, returns the row and column numbers
dim(smeta)

[1] 8 9

Other useful functions for inspecting data frames

Size:
nrow() - number of rows
ncol() - number of columns

Content:
head() - returns first 6 rows by default
tail() - returns last 6 rows by default

Names:
colnames() - returns column names
rownames() - returns row names

Section content from "Starting with Data", Introduction to data analysis with R and Bioconductor.
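The inspection functions listed above can be tried on any data frame. The following sketch uses a made-up ten-row data frame (toy is a hypothetical name) so that the effect of the n argument to head() and tail() is visible.

```r
toy <- data.frame(
  id    = 1:10,
  group = rep(c("a", "b"), times = 5)
)

nrow(toy)          # number of rows
ncol(toy)          # number of columns
head(toy, n = 3)   # first 3 rows instead of the default 6
tail(toy, n = 2)   # last 2 rows instead of the default 6
colnames(toy)      # column names
rownames(toy)      # row names; by default the row numbers stored as text
```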

Data frame coercion and accessors

Let's pretend that the sample IDs were numeric rather than of type character.

smeta$SampleID <- seq_len(nrow(smeta))
smeta

  SampleName    cell   dex albut        Run avgLength Experiment    Sample
1 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345 SRS508568
2 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349 SRS508571
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353 SRS508575
6 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354 SRS508576
7 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357 SRS508579
8 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358 SRS508580
     BioSample SampleID
1 SAMN02422669        1
2 SAMN02422675        2
3 SAMN02422678        3
4 SAMN02422670        4
5 SAMN02422682        5
6 SAMN02422673        6
7 SAMN02422683        7
8 SAMN02422677        8

Unless stated otherwise, "SampleID" will be treated as numeric rather than as a character vector. If we intend to work with this column and treat it as an ID, we will need to convert it or coerce it to a character or factor vector.

We can access a column of our data frame using [], [[]], or $. These behave slightly differently, as we will see.

Let's access "SampleID" from smeta.

#Using $
smeta$SampleID

[1] 1 2 3 4 5 6 7 8

#Using []  
smeta["SampleID"]

#Using [[]]  
smeta[["SampleID"]]

[1] 1 2 3 4 5 6 7 8

Notice that $ and [[]] behave similarly. These return a vector, while [] maintains the original structure, in this case a data frame.
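One way to confirm what each accessor returns is to check the class of the result. This sketch uses a small made-up data frame (toy is a hypothetical name) rather than smeta:

```r
toy <- data.frame(SampleID = 1:3, cell = c("a", "b", "c"))

class(toy$SampleID)        # "integer": $ returns the column as a vector
class(toy[["SampleID"]])   # "integer": [[ ]] also returns a vector
class(toy["SampleID"])     # "data.frame": [ ] keeps the data frame structure

# The two vector results are identical
identical(toy$SampleID, toy[["SampleID"]])
```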

Let's convert the "SampleID" column from an integer to a character vector. This is known as coercion.

#We can see that SampleID is being treated as numeric
is.numeric(smeta$SampleID)

[1] TRUE

#let's convert it to a character vector
smeta$SampleID<-as.character(smeta$SampleID)
#check this
is.character(smeta$SampleID)

[1] TRUE

#check this
is.numeric(smeta$SampleID)

[1] FALSE

See other related functions (e.g., as.factor(), as.numeric()).

Be careful with data coercion. What happens if we change a character vector into a numeric vector?

#A warning is issued and every value becomes NA because none can be read as a number
head(as.numeric(smeta$Run))

Warning in head(as.numeric(smeta$Run)): NAs introduced by coercion

[1] NA NA NA NA NA NA
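Before overwriting a column with a coerced version, it can help to check which values would be lost. The sketch below is one possible approach, not the only one; the vector x and the helper names converted and failed are hypothetical.

```r
x <- c("10", "20", "SRR1039508")   # hypothetical input mixing numbers and an ID

# Convert quietly, then compare against the input to find values that failed to parse
converted <- suppressWarnings(as.numeric(x))
failed <- is.na(converted) & !is.na(x)

x[failed]    # the values that could not be converted
converted    # the converted vector, with NA where conversion failed
```

If x[failed] is empty, the conversion lost nothing and the result is safe to assign back.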

Some helpful things to remember

When you explicitly coerce one data type into another (this is known as explicit coercion), be careful to check the result. Ideally, you should try to see if it's possible to avoid steps in your analysis that force you to coerce.

R will sometimes coerce without you asking for it. This is called (appropriately) implicit coercion. For example [if you try] to create a vector with multiple data types, R [will choose] one type through implicit coercion.

Check the structure (str()) of your data frames before working with them! ---datacarpentry.org
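Implicit coercion is easy to see by combining values of different types with c(). A short sketch, with made-up values:

```r
# Mixing numbers, text, and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"

# Mixing integer, double, and logical: everything becomes double
nums <- c(1L, 2.5, TRUE)
class(nums)    # "numeric"
nums           # TRUE is stored as 1
```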

Using `colnames()` to rename columns

colnames() will return a vector of column names from our data frame. We can use this vector and [] sub-setting to modify our column names.

For example, let's rename the column "SampleID" to "ID".

#Let's rename "SampleID" to "ID"
colnames(smeta)[10] <- "ID"

#if unsure of the index of a column, you could use which()
which(colnames(smeta)=="ID")

[1] 10

#or match on the column name instead of its position
#note that this renames "ID" back to "SampleID"
colnames(smeta)[colnames(smeta) == "ID"] <- "SampleID"

Subsetting data frames with base R

The tidyverse package dplyr makes it easy to subset data frames with select(), filter(), and slice(); however, it is still worth knowing how to subset data frames using base R bracket notation.

Subsetting a data frame is similar to subsetting a vector; we can use bracket notation []. However, a data frame is two-dimensional, with both rows and columns, so we can supply either one argument (e.g., df[column]) or two arguments (e.g., df[row, column]). If you provide one argument, it is interpreted as selecting columns. This is because a data frame has characteristics of both a list and a matrix.

For now, let's focus on providing two arguments to subset. (Note which calls return a data frame and which return a vector.)

smeta[2,4] #Returns the value in the 4th column and 2nd row

[1] "untrt"

smeta[2, ] #Returns a df with row two

  SampleName   cell dex albut        Run avgLength Experiment    Sample
2 GSM1275863 N61311 trt untrt SRR1039509       126  SRX384346 SRS508567
     BioSample SampleID
2 SAMN02422675        2

smeta[-1, ] #Returns a df without row 1

  SampleName    cell   dex albut        Run avgLength Experiment    Sample
2 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349 SRS508571
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353 SRS508575
6 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354 SRS508576
7 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357 SRS508579
8 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358 SRS508580
     BioSample SampleID
2 SAMN02422675        2
3 SAMN02422678        3
4 SAMN02422670        4
5 SAMN02422682        5
6 SAMN02422673        6
7 SAMN02422683        7
8 SAMN02422677        8

smeta[1:4,1] #returns a vector of rows 1-4 of column 1

[1] "GSM1275862" "GSM1275863" "GSM1275866" "GSM1275867"

#call names of columns directly
smeta[1:5,c("Sample","avgLength")]

     Sample avgLength
1 SRS508568       126
2 SRS508567       126
3 SRS508571       126
4 SRS508572        87
5 SRS508575       120

#use comparison operators
smeta[smeta$SampleID == "2",]

  SampleName   cell dex albut        Run avgLength Experiment    Sample
2 GSM1275863 N61311 trt untrt SRR1039509       126  SRX384346 SRS508567
     BioSample SampleID
2 SAMN02422675        2
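Comparisons can be combined with & (and) and | (or) to filter rows on more than one condition. A minimal sketch on a made-up data frame; toy, long_trt, and either are hypothetical names.

```r
toy <- data.frame(
  dex       = c("untrt", "trt", "untrt", "trt"),
  avgLength = c(126L, 126L, 101L, 87L)
)

# & keeps a row only when both conditions are TRUE
long_trt <- toy[toy$dex == "trt" & toy$avgLength > 100, ]
long_trt

# | keeps a row when at least one condition is TRUE
either <- toy[toy$dex == "untrt" | toy$avgLength < 100, ]
either
```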

Subsetting Tibbles

Tibbles behave differently from data frames when used with base R accessors. See here for more information.
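One difference can be sketched briefly: selecting a single column with [row, column] simplifies a data frame to a vector, while a tibble stays a tibble. The example assumes the tibble package is installed (it is attached as part of the tidyverse); df and tb are made-up objects.

```r
library(tibble)

df <- data.frame(id = 1:3, cell = c("a", "b", "c"))
tb <- as_tibble(df)

class(df[, 2])   # "character": the data frame result is simplified to a vector
class(tb[, 2])   # "tbl_df" "tbl" "data.frame": the tibble result is still a tibble

# [[ ]] and $ still return a vector from a tibble
tb[[2]]
tb$cell
```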

What happens when we select a single column, or provide only a single argument?

#notice the difference here
smeta[,2] #returns column two

[1] "N61311"  "N61311"  "N052611" "N052611" "N080611" "N080611" "N061011"
[8] "N061011"

#treated similar to a matrix
#does not return a df if the output is a single column

smeta[2] #returns column two

#treated similar to a list; maintains the df structure.
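If the matrix-style form is preferred but a data frame is still wanted, the drop argument of [ can be set to FALSE. A short sketch on a made-up data frame (toy is a hypothetical name):

```r
toy <- data.frame(id = 1:3, cell = c("a", "b", "c"))

class(toy[, 2])                 # "character": simplified to a vector
class(toy[2])                   # "data.frame": list-style, structure preserved
class(toy[, 2, drop = FALSE])   # "data.frame": matrix-style, simplification turned off
```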

Note

We can also use [[]] or $ for selecting specific columns.

Using `%in%`

%in% "returns a logical vector indicating if there is a match or not for its left operand". This logical vector can then be used to filter the data frame to only matched values.

Perhaps we only want to return a data frame with the following samples: "SRR1039508", "SRR1039513", "SRR1039520".

Using == is a bit tedious.

smeta[smeta$Run == "SRR1039508" | smeta$Run == "SRR1039513" | 
        smeta$Run == "SRR1039520",]

  SampleName    cell   dex albut        Run avgLength Experiment    Sample
1 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345 SRS508568
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572
7 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357 SRS508579
     BioSample SampleID
1 SAMN02422669        1
4 SAMN02422670        4
7 SAMN02422683        7

Instead, we can create a vector of values to keep.

s_keep<- c("SRR1039508", "SRR1039513", "SRR1039520")
s_keep

[1] "SRR1039508" "SRR1039513" "SRR1039520"

We can then see which values in our column smeta$Run match a value in our vector.

smeta$Run %in% s_keep

[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE


We can then use this logical vector to keep only the rows of our data frame where it is TRUE.

smeta[smeta$Run %in% s_keep, ]

  SampleName    cell   dex albut        Run avgLength Experiment    Sample
1 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345 SRS508568
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572
7 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357 SRS508579
     BioSample SampleID
1 SAMN02422669        1
4 SAMN02422670        4
7 SAMN02422683        7

%in% can also be used with dplyr::filter() and subset().
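As a sketch of the subset() form, and of negating %in% to drop rather than keep rows, consider the made-up data frame below. The names toy, kept, and dropped are hypothetical; the equivalent dplyr call is shown only as a comment.

```r
toy <- data.frame(
  Run       = c("SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513"),
  avgLength = c(126L, 126L, 126L, 87L)
)
s_keep <- c("SRR1039508", "SRR1039513")

# subset() evaluates the condition inside the data frame, so toy$ is not needed
kept <- subset(toy, Run %in% s_keep)
kept

# Wrapping the test in !( ) returns the rows that are NOT in s_keep
dropped <- toy[!(toy$Run %in% s_keep), ]
dropped

# With dplyr attached, the first call could be written as: filter(toy, Run %in% s_keep)
```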

Tips to remember for subsetting

Typically provide two values separated by commas: data.frame[row, column]

In cases where you are taking a continuous range of numbers use a colon between the numbers (start:stop, inclusive)

For a non continuous set of numbers, pass a vector using c()

Index using the name of a column(s) by passing them as vectors using c() ---datacarpentry.org
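The tips above can be sketched on a small made-up data frame (toy and the result names are hypothetical):

```r
toy <- data.frame(
  a = 1:5,
  b = letters[1:5],
  c = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Continuous range of rows with start:stop (both ends included)
rows_2_to_4 <- toy[2:4, ]

# Non-continuous rows with c()
rows_odd <- toy[c(1, 3, 5), ]

# Columns selected by name with c()
cols_a_c <- toy[, c("a", "c")]

rows_2_to_4
rows_odd
cols_a_c
```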

Info

Subsetting, including the difference between simplifying and preserving, can get confusing. Here is a great chapter, though a bit more advanced, that may clear things up if you are confused.

Data Wrangling

Part 2 of this course will focus on Data Wrangling. Learn how to filter, modify, summarize, and reshape your data. Check the BTEP calendar for updates on upcoming classes / courses.

Acknowledgements

Material from this lesson was either taken directly or adapted from Intro to R and RStudio for Genomics provided by datacarpentry.org.