Lesson 5: R Data Structures - Data Frames
Learning Objectives
This is the last lesson in Part 1 of Introductory R for Novices: Getting Started with R. This lesson will focus exclusively on working with data frames. Attendees will learn how to examine, summarize, and access data in data frames.
Specific learning objectives include:
- Review data import.
- Learn how to view and summarize data in a data frame.
- Learn how to use data accessors.
- Learn the syntax for sub-setting a data frame.
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Load the libraries
This lesson will use some functions from the tidyverse.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Examining and summarizing data frames
All of the objects we imported in the previous lesson were data frames. In this lesson, we will learn how to view and find out more about the data stored in a data frame. Let's use the R object smeta as an example.
smeta<-read.delim("./data/airway_sampleinfo.txt")
head(smeta)
SampleName cell dex albut Run avgLength Experiment Sample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
BioSample
1 SAMN02422669
2 SAMN02422675
3 SAMN02422678
4 SAMN02422670
5 SAMN02422682
6 SAMN02422673
We can view these data by clicking on the name of the object in the Environment pane or by using View().
To understand more about the underlying structure of our data, we can use str() or a similar function, dplyr::glimpse().
str(smeta)
'data.frame': 8 obs. of 9 variables:
$ SampleName: chr "GSM1275862" "GSM1275863" "GSM1275866" "GSM1275867" ...
$ cell : chr "N61311" "N61311" "N052611" "N052611" ...
$ dex : chr "untrt" "trt" "untrt" "trt" ...
$ albut : chr "untrt" "untrt" "untrt" "untrt" ...
$ Run : chr "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
$ avgLength : int 126 126 126 87 120 126 101 98
$ Experiment: chr "SRX384345" "SRX384346" "SRX384349" "SRX384350" ...
$ Sample : chr "SRS508568" "SRS508567" "SRS508571" "SRS508572" ...
$ BioSample : chr "SAMN02422669" "SAMN02422675" "SAMN02422678" "SAMN02422670" ...
str() shows us that we are looking at a data frame object with 8 rows and 9 columns. The column names are at the far left, each preceded by a $. This is a data frame accessor, and we will see how this works later. We can also see the data type of each column (e.g., character, integer, logical, double) after the column name. This will help us understand how we can transform and visualize the data in these columns.
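The column types that str() reports can also be pulled out programmatically. The sketch below is only an illustration: it assumes a small stand-in data frame (the name toy is hypothetical) built from a few of the smeta values shown above, and uses sapply() with class() and is.numeric() to list the type of each column.

```r
# Toy data frame using values from the first four rows of smeta shown above
# (only a subset of columns; the name `toy` is illustrative)
toy <- data.frame(
  SampleName = c("GSM1275862", "GSM1275863", "GSM1275866", "GSM1275867"),
  dex        = c("untrt", "trt", "untrt", "trt"),
  avgLength  = c(126L, 126L, 126L, 87L),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_types <- sapply(toy, class)
print(col_types)

# Names of the numeric columns only
num_cols <- names(toy)[sapply(toy, is.numeric)]
print(num_cols)
```

Knowing which columns are numeric ahead of time makes it easier to decide which ones can be summarized or plotted directly.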
We can also get an overview of summary statistics of this data frame using summary().
summary(smeta)
SampleName cell dex albut
Length:8 Length:8 Length:8 Length:8
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Run avgLength Experiment Sample
Length:8 Min. : 87.0 Length:8 Length:8
Class :character 1st Qu.:100.2 Class :character Class :character
Mode :character Median :123.0 Mode :character Mode :character
Mean :113.8
3rd Qu.:126.0
Max. :126.0
BioSample
Length:8
Class :character
Mode :character
Our data frame has 9 variables, so we get 9 fields that summarize the data. The only column with numerical data is avgLength, for which we can see the minimum and maximum values as well as the mean, median, and first and third quartiles.
Using summary() with factors
summary() is also useful for obtaining quick information about a categorical (factor) variable: it reports the groups (levels) and the sample size of each group.
smeta2 <- smeta %>% mutate(dex = as.factor(dex))
summary(smeta2)
SampleName cell dex albut
Length:8 Length:8 trt :4 Length:8
Class :character Class :character untrt:4 Class :character
Mode :character Mode :character Mode :character
Run avgLength Experiment Sample
Length:8 Min. : 87.0 Length:8 Length:8
Class :character 1st Qu.:100.2 Class :character Class :character
Mode :character Median :123.0 Mode :character Mode :character
Mean :113.8
3rd Qu.:126.0
Max. :126.0
BioSample
Length:8
Class :character
Mode :character
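If only the group counts for a single factor are needed, a few base R functions give the same information as the dex field above. This is a minimal sketch that assumes a standalone factor holding the eight dex values from smeta; levels(), nlevels(), and table() are base R functions.

```r
# The eight dex values from smeta, stored as a standalone factor
dex <- factor(c("untrt", "trt", "untrt", "trt", "untrt", "trt", "untrt", "trt"))

# Group labels (sorted alphabetically by default)
levels(dex)

# Number of groups
nlevels(dex)

# Sample size of each group
dex_counts <- table(dex)
print(dex_counts)
```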
What is the length of our data.frame? What are the dimensions?
Other attributes we may want to know regarding our data frame include the number of columns (ncol(), length()) and the dimensions (dim()).
#length returns the number of columns
length(smeta)
[1] 9
#dim returns the number of rows followed by the number of columns
dim(smeta)
[1] 8 9
Other useful functions for inspecting data frames
Size:
- nrow() - number of rows
- ncol() - number of columns
Content:
- head() - returns first 6 rows by default
- tail() - returns last 6 rows by default
Names:
- colnames() - returns column names
- rownames() - returns row names
Section content from "Starting with Data", Introduction to data analysis with R and Bioconductor.
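The inspection functions listed above can be tried on any data frame. The following sketch assumes a small stand-in data frame (toy is a hypothetical name) built from values shown earlier in this lesson, and also shows the n argument of head() and tail(), which overrides the default of 6 rows.

```r
# Stand-in data frame with four rows and three columns
toy <- data.frame(
  SampleName = c("GSM1275862", "GSM1275863", "GSM1275866", "GSM1275867"),
  dex        = c("untrt", "trt", "untrt", "trt"),
  avgLength  = c(126L, 126L, 126L, 87L),
  stringsAsFactors = FALSE
)

nrow(toy)          # number of rows
ncol(toy)          # number of columns
head(toy, n = 2)   # first 2 rows instead of the default 6
tail(toy, n = 1)   # last row only
colnames(toy)      # column names as a character vector
rownames(toy)      # row names; "1", "2", ... when none were set
```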
Data frame coercion and accessors
Let's pretend that the sample IDs were numeric rather than of type character.
smeta$SampleID <- c(1:nrow(smeta))
smeta
SampleName cell dex albut Run avgLength Experiment Sample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
7 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579
8 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580
BioSample SampleID
1 SAMN02422669 1
2 SAMN02422675 2
3 SAMN02422678 3
4 SAMN02422670 4
5 SAMN02422682 5
6 SAMN02422673 6
7 SAMN02422683 7
8 SAMN02422677 8
Unless stated otherwise, "SampleID" will be treated as numeric rather than as a character vector. If we intend to work with this column and treat it as an ID, we will need to convert it or coerce it to a character or factor vector.
We can access a column of our data frame using [], [[]], or $. These behave slightly differently, as we will see.
Let's access "SampleID" from smeta.
#Using $
smeta$SampleID
[1] 1 2 3 4 5 6 7 8
#Using []
smeta["SampleID"]
SampleID
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
#Using [[]]
smeta[["SampleID"]]
[1] 1 2 3 4 5 6 7 8
Notice that $ and [[]] behave similarly. Both return a vector, while [] maintains the original structure, in this case a data frame.
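One way to confirm what each accessor returns is to check the class of the result. The sketch below assumes a two-column stand-in data frame (toy is a hypothetical name) and is meant only to illustrate the difference described above.

```r
# Stand-in data frame
toy <- data.frame(
  SampleName = c("GSM1275862", "GSM1275863", "GSM1275866"),
  avgLength  = c(126L, 126L, 126L),
  stringsAsFactors = FALSE
)

# $ and [[ ]] extract the column itself (a vector)
class(toy$avgLength)        # "integer"
class(toy[["avgLength"]])   # "integer"

# [ ] keeps the data frame structure, even for one column
class(toy["avgLength"])     # "data.frame"
```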
Let's convert the "SampleID" column from an integer to a character vector. This is known as coercion.
#We can see that SampleID is being treated as numeric
is.numeric(smeta$SampleID)
[1] TRUE
#let's convert it to a character vector
smeta$SampleID<-as.character(smeta$SampleID)
#check this
is.character(smeta$SampleID)
[1] TRUE
#check this
is.numeric(smeta$SampleID)
[1] FALSE
See other related functions (e.g., as.factor(), as.numeric()).
Be careful with data coercion. What happens if we change a character vector into a numeric?
#A warning is thrown and the entire column is filled with NA
head(as.numeric(smeta$Run))
Warning in head(as.numeric(smeta$Run)): NAs introduced by coercion
[1] NA NA NA NA NA NA
Some helpful things to remember
- When you explicitly coerce one data type into another (this is known as explicit coercion), be careful to check the result. Ideally, you should try to see if it's possible to avoid steps in your analysis that force you to coerce.
- R will sometimes coerce without you asking for it. This is called (appropriately) implicit coercion. For example [if you try] to create a vector with multiple data types, R [will choose] one type through implicit coercion.
- Check the structure (str()) of your data frames before working with them! ---datacarpentry.org
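Both kinds of coercion mentioned in the tips above can be seen with a few short vectors. This is a minimal sketch with made-up values (mixed, num_lgl, and ids are hypothetical names): c() coerces implicitly to the most general type, and counting NA values after as.numeric() is one simple way to check an explicit coercion.

```r
# Implicit coercion: mixing types in c() yields a single common type
mixed <- c(1, "a", TRUE)
class(mixed)     # "character"

num_lgl <- c(1.5, TRUE, FALSE)
class(num_lgl)   # "numeric"; TRUE becomes 1 and FALSE becomes 0

# Explicit coercion: values that cannot be converted become NA
ids <- c("1", "2", "three")
ids_num <- suppressWarnings(as.numeric(ids))  # warning silenced for the demo

# Check the result: how many values were lost to NA?
n_failed <- sum(is.na(ids_num))
print(n_failed)
```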
Using colnames() to rename columns
colnames() will return a vector of column names from our data frame. We can use this vector and [] sub-setting to modify our column names.
For example, let's rename the column "SampleID" to "ID".
#Let's rename "SampleID" to "ID"
colnames(smeta)[10] <- "ID"
#if unsure of the index of a column, you could use which()
which(colnames(smeta)=="ID")
[1] 10
#or something like this
colnames(smeta)[colnames(smeta) ==
"ID"] <- "SampleID"
Subsetting data frames with base R
The tidyverse package dplyr makes it easy to subset data frames with select(), filter(), and slice(); however, it is still worth knowing how to subset data frames using Base R brackets.
Subsetting a data frame is similar to subsetting a vector; we can use bracket notation []. However, a data frame is two dimensional with both rows and columns, so we can specify either one argument or two arguments (e.g., df[row,column]), depending on whether we want to index by rows, columns, or both. If you provide one argument, columns will be assumed. This is because a data frame has characteristics of both a list and a matrix.
For now, let's focus on providing two arguments to subset. (Note when a data frame structure is returned.)
smeta[2,4] #Returns the value in the 4th column and 2nd row
[1] "untrt"
smeta[2, ] #Returns a df with row two
SampleName cell dex albut Run avgLength Experiment Sample
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
BioSample SampleID
2 SAMN02422675 2
smeta[-1, ] #Returns a df without row 1
SampleName cell dex albut Run avgLength Experiment Sample
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
7 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579
8 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580
BioSample SampleID
2 SAMN02422675 2
3 SAMN02422678 3
4 SAMN02422670 4
5 SAMN02422682 5
6 SAMN02422673 6
7 SAMN02422683 7
8 SAMN02422677 8
smeta[1:4,1] #returns a vector of rows 1-4 of column 1
[1] "GSM1275862" "GSM1275863" "GSM1275866" "GSM1275867"
#call names of columns directly
smeta[1:5,c("Sample","avgLength")]
Sample avgLength
1 SRS508568 126
2 SRS508567 126
3 SRS508571 126
4 SRS508572 87
5 SRS508575 120
#use comparison operators
smeta[smeta$SampleID == "2",]
SampleName cell dex albut Run avgLength Experiment Sample
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
BioSample SampleID
2 SAMN02422675 2
Subsetting Tibbles
Tibbles behave differently from data frames when using base R accessors. See here for more information.
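As a brief illustration of one such difference, the sketch below assumes the tibble package is installed (it is loaded as part of the tidyverse) and uses a two-row stand-in data frame. Selecting a single column with [, ] simplifies to a vector for a data frame, whereas a tibble returns a tibble.

```r
library(tibble)

# Stand-in data frame and its tibble equivalent
df <- data.frame(
  cell      = c("N61311", "N052611"),
  avgLength = c(126L, 87L),
  stringsAsFactors = FALSE
)
tb <- as_tibble(df)

class(df[, 1])   # "character": the data frame result is simplified to a vector
class(tb[, 1])   # a tibble is returned, not a vector

# [[ ]] and $ still extract a vector from a tibble
tb[["cell"]]
```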
What happens when we provide a single argument?
#notice the difference between the next two calls
smeta[,2] #row argument left empty; returns column two
[1] "N61311" "N61311" "N052611" "N052611" "N080611" "N080611" "N061011"
[8] "N061011"
#treated similar to a matrix
#does not return a df if the output is a single column
smeta[2] #returns column two
cell
1 N61311
2 N61311
3 N052611
4 N052611
5 N080611
6 N080611
7 N061011
8 N061011
#treated similar to a list; maintains the df structure.
Note
We can also use [[]] or $ for selecting specific columns.
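When the matrix-style form is used and a data frame is wanted back even for a single column, the drop argument of [ can be set to FALSE. The sketch below assumes a small stand-in data frame (toy is a hypothetical name) to show the effect.

```r
# Stand-in data frame
toy <- data.frame(
  cell      = c("N61311", "N61311", "N052611"),
  avgLength = c(126L, 126L, 126L),
  stringsAsFactors = FALSE
)

# Default: a single column is simplified to a vector
simplified <- toy[, "cell"]
class(simplified)   # "character"

# drop = FALSE preserves the data frame structure
preserved <- toy[, "cell", drop = FALSE]
class(preserved)    # "data.frame"
dim(preserved)      # 3 rows, 1 column
```

Preserving the structure can be helpful in scripts where later steps expect a data frame regardless of how many columns were selected.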
Using %in%
%in% "returns a logical vector indicating if there is a match or not for its left operand". This logical vector can then be used to filter the data frame to only matched values.
Perhaps we only want to return a data frame with the following samples: "SRR1039508", "SRR1039513", "SRR1039520".
Using == is a bit tedious.
smeta[smeta$Run == "SRR1039508" | smeta$Run == "SRR1039513" |
smeta$Run == "SRR1039520",]
SampleName cell dex albut Run avgLength Experiment Sample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
7 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579
BioSample SampleID
1 SAMN02422669 1
4 SAMN02422670 4
7 SAMN02422683 7
Instead, we can create a vector of values to keep.
s_keep<- c("SRR1039508", "SRR1039513", "SRR1039520")
s_keep
[1] "SRR1039508" "SRR1039513" "SRR1039520"
We can then see where the values in our vector match values in our column smeta$Run.
smeta$Run %in% s_keep
[1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
We can further use this logical vector to filter our data frame by true values.
smeta[smeta$Run %in% s_keep, ]
SampleName cell dex albut Run avgLength Experiment Sample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
7 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579
BioSample SampleID
1 SAMN02422669 1
4 SAMN02422670 4
7 SAMN02422683 7
%in% can also be used with dplyr::filter() and subset().
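As a sketch of the base R option, the example below uses subset() with %in% on a small stand-in data frame (runs is a hypothetical name; the Run values come from smeta). It also shows that negating the logical vector with ! keeps the rows that do not match.

```r
# Stand-in data frame with four of the runs from smeta
runs <- data.frame(
  Run       = c("SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513"),
  avgLength = c(126L, 126L, 126L, 87L),
  stringsAsFactors = FALSE
)

s_keep <- c("SRR1039508", "SRR1039513")

# subset() evaluates the condition using the column names directly
kept <- subset(runs, Run %in% s_keep)
print(kept)

# Negate with ! to keep everything that is NOT in s_keep
dropped <- runs[!(runs$Run %in% s_keep), ]
print(dropped)
```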
Tips to remember for subsetting
- Typically provide two values separated by a comma: data.frame[row, column]
- In cases where you are taking a continuous range of numbers, use a colon between the numbers (start:stop, inclusive)
- For a non-continuous set of numbers, pass a vector using c()
- Index using the name(s) of one or more columns by passing them as a vector using c() ---datacarpentry.org
Info
Subsetting, including simplifying vs. preserving, can get confusing. Here is a great chapter, though a bit more advanced, that may clear things up if you are confused.
Data Wrangling
Part 2 of this course will focus on Data Wrangling. Learn how to filter, modify, summarize, and reshape your data. Check the BTEP calendar for updates on upcoming classes / courses.
Acknowledgements
Material from this lesson was either taken directly or adapted from Intro to R and RStudio for Genomics provided by datacarpentry.org.