Lesson 4: Introduction to R Data Structures - Data Import
Learning Objectives
- Learn about data structures including factors, lists, matrices, and data frames.
- Learn how to import data in a tabular format (data frames).
- Learn to write out (export) data from the R environment.
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Installing and Loading Packages
In this lesson, we will learn how to import data with different file extensions, including Excel files. We will make use of Base R functions for data import as well as popular functions from readr and readxl.
So far we have only worked with objects that we created in RStudio. We have not installed or loaded any packages. R packages extend the use of R programming beyond base R.
Where do we get R packages?
As a reminder, R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages().
GitHub is another common source used to store R packages, though these packages do not necessarily meet CRAN standards, so approach with caution. To install a GitHub package, use library(devtools) followed by install_github(). devtools is a CRAN package; if you have not installed it, run install.packages("devtools") first.
Many genomics packages and other packages useful to biologists / molecular biologists can be found on Bioconductor. To install a Bioconductor package, you will first need to install BiocManager, a CRAN package (install.packages("BiocManager")). You can then use BiocManager to install the Bioconductor core packages or any specific package (e.g., BiocManager::install("DESeq2")).
Packages are installed into your file system at a location given by .libPaths(). This is your R library, a directory of installed R packages. To use one or more packages, you have to load them within your R session, and this must be done with each new R session.
Key functions:
- install.packages() - install packages from CRAN.
- library() - load packages in your R session.
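Pulling the installation commands above together, a one-time setup for this lesson might look like the following sketch (installation only needs to happen once per R library; the GitHub repository shown is a placeholder):
#One-time installation (not needed every session)
install.packages(c("readr", "readxl"))     #CRAN packages used in this lesson
#install.packages("BiocManager")           #once, to enable Bioconductor installs
#BiocManager::install("DESeq2")            #example Bioconductor package
#install.packages("devtools")              #once, to enable GitHub installs
#devtools::install_github("user/repo")     #"user/repo" is a placeholder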
Load the libraries:
library(readxl)
library(readr)
Tip
It is good practice to load libraries needed for a script at the beginning of the script.
Data Structures
Data structures are objects that store data.
Previously, we learned that vectors are collections of values of the same type. A vector is also one of the most basic data structures.
Other common data structures in R include:
- factors
- lists
- data frames
- matrices
What are factors?
Factors are an important data structure in statistical computing. They are specialized vectors (ordered or unordered) for the storage of categorical data (data with fixed values). While they appear to be character vectors, data in factors are stored as integers. These integers are associated with pre-defined levels, which represent the different groups or categories in the vector.
Reference level
Generally for statistical models, the reference or control level is set to level 1. You can reorder the levels using factor() or forcats::fct_relevel().
Important functions
- factor() - create a factor and reorder levels
- as.factor() - coerce to a factor
- levels() - view and/or rename the levels of a factor
- nlevels() - return the number of levels
For example:
sex <- factor(c("M","F","F","M","M","M"))
levels(sex)
[1] "F" "M"
Check out the package forcats for managing and reordering factors.
Note
R will organize factor levels alphabetically by default. This will be especially noticeable when plotting.
Warning
Pay attention when coercing from a factor to a numeric. To do this, you should first convert to a character vector. Otherwise, you will get back the underlying integer level codes rather than the numbers shown as the factor labels.
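A minimal illustration of this warning, using a small hypothetical factor f whose labels happen to look like numbers:
f <- factor(c("10", "20", "20", "30"))
as.numeric(f)                 #returns the underlying level codes: 1 2 2 3
as.numeric(as.character(f))   #convert to character first to get 10 20 20 30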
See more about working with factors here.
Lists
Unlike an atomic vector, a list can contain multiple elements of different types (e.g., character vector, numeric vector, list, data frame, matrix). Lists are not the focus of this lesson, but you should be aware of them, as many functions, including those specific to bioinformatics, output data in the form of a list.
Important functions
- list() - create a list
- names() - create named elements (also useful for vectors)
- lapply(), sapply() - for looping over elements of a list
Example
#Create a list
My_exp <- list(c("N052611", "N061011", "N080611", "N61311" ),
c("SRR1039508", "SRR1039509", "SRR1039512",
"SRR1039513", "SRR1039516", "SRR1039517",
"SRR1039520", "SRR1039521"),c(100,200,300,400))
#Look at the structure
str(My_exp)
List of 3
$ : chr [1:4] "N052611" "N061011" "N080611" "N61311"
$ : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
$ : num [1:4] 100 200 300 400
#Name the elements of the list
names(My_exp)<-c("cell_lines","sample_id","counts")
#See how the structure changes
str(My_exp)
List of 3
$ cell_lines: chr [1:4] "N052611" "N061011" "N080611" "N61311"
$ sample_id : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
$ counts : num [1:4] 100 200 300 400
#Subset the list
My_exp[[1]][2]
[1] "N061011"
My_exp$cell_lines[2]
[1] "N061011"
#Apply a function (remove the first element from each vector)
lapply(My_exp,function(x){x[-1]})
$cell_lines
[1] "N061011" "N080611" "N61311"
$sample_id
[1] "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516" "SRR1039517"
[6] "SRR1039520" "SRR1039521"
$counts
[1] 200 300 400
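sapply() works like lapply() but simplifies the result where possible; for example, applying length() to My_exp returns a named vector rather than a list:
#Count the elements in each component of the list
sapply(My_exp, length)   #cell_lines = 4, sample_id = 8, counts = 4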
We are not going to spend a lot of time on lists, but you should consider learning more about them in the future, as many functions return their output in the form of a list.
Data Matrices
Another important data structure in R is the data matrix. Data frames and data matrices are similar in that both are tabular in nature and are defined by dimensions (i.e., rows (m) and columns (n), commonly denoted m x n). However, a matrix contains only values of a single type (e.g., numeric, character, or logical).
Note
A vector can be viewed as a one-dimensional matrix.
Elements in a matrix and a data frame can be referenced by using their row and column indices (for example, a[1,1] references the element in row 1 and column 1).
Below, we create the object a1, a 3-row by 4-column matrix.
a1 <- matrix(c(3,4,2,4,6,3,8,1,7,5,3,2), ncol=4)
a1
[,1] [,2] [,3] [,4]
[1,] 3 4 8 5
[2,] 4 6 1 3
[3,] 2 3 7 2
Using the typeof() and class() commands, we see that the elements in a1 are doubles and that a1 is a matrix, respectively.
typeof(a1)
[1] "double"
class(a1)
[1] "matrix" "array"
As with lists, we aren't going to focus much on matrices.
Data Frames: Working with Tabular Data
In genomics, we work with a lot of tabular data - data organized in rows and columns. The data structure that stores this type of data is a data frame. Data frames are collections of vectors of the same length but can be of different types. Because we often have data of multiple types, it is natural to examine that data in a data frame.
You may be tempted to open and manually work with these data in Excel. However, there are a number of reasons why this can be to your detriment. First, it is very easy to make mistakes when working with large amounts of tabular data in Excel. Have you ever mistakenly left out a column or row while sorting data? Second, many of the files that we work with are so large (big data) that Excel and your local machine do not have the bandwidth to handle them. Third, you will likely need to apply analyses that are unavailable in Excel. Lastly, it is difficult to keep track of any data manipulation steps or analyses in a point-and-click environment like Excel.
R, on the other hand, can make analyzing tabular data more efficient and reproducible. But before getting into working with this data in R, let's review some best practices for data management.
Best Practices for organizing genomic data
- "Keep raw data separate from analyzed data" -- datacarpentry.org
For large genomic data sets, you may want to include a project folder with two main subdirectories (i.e., raw_data and data_analysis). You may even consider changing the permissions (check out the unix command chmod) in your raw directory to make those files read-only. Keeping raw data separate is not a problem in R, as one must explicitly import and export data.
- "Keep spreadsheet data Tidy" -- datacarpentry.org
Data organization can be frustrating, and many scientists devote a great deal of time and energy toward this task. Keeping data tidy can make data science more efficient, effective, and reproducible. There is a collection of packages in R that embrace the philosophy of tidy data and facilitate working with data frames. That collection is known as the tidyverse.
- "Trust but verify" -- datacarpentry.org
R makes data analysis more reproducible and can eliminate some mistakes from human error. However, you should approach data analysis with a plan and make sure you understand what a function is doing before applying it to your data. Testing a function on a small subset of your data is a useful form of debugging to confirm that it produces the expected result.
Some functions for creating practice data include data.frame(), rep(), seq(), rnorm(), sample(), and others. See some examples here.
Let's use some of these to create a data frame.
df <- data.frame(Samples = 1:10,
                 Counts = sample(1:5000, size = 10, replace = TRUE),
                 Treatment = rep(c("control", "treated"), each = 5))
df
Samples Counts Treatment
1 1 4939 control
2 2 191 control
3 3 3697 control
4 4 4933 control
5 5 2938 control
6 6 1721 treated
7 7 214 treated
8 8 2999 treated
9 9 2084 treated
10 10 2196 treated
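Note that sample() draws random values, so your Counts column will not match the output shown above. To make the draw reproducible, set a seed first; a quick sketch (the seed value 123 is an arbitrary choice):
set.seed(123)  #any integer works; this makes sample() return the same values each run
df <- data.frame(Samples = 1:10,
                 Counts = sample(1:5000, size = 10, replace = TRUE),
                 Treatment = rep(c("control", "treated"), each = 5))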
Example Data
There are data sets available in R to practice with or showcase different packages; for example, see library(help = "datasets"). For the next two lessons, we will use data derived from the Bioconductor package airway as well as data internal to or derived from Base R and packages within the tidyverse. Check out the Acknowledgements section for additional data sources.
Obtaining the data
- To download the data used in this lesson to your local computer, click here. You can then move the downloaded directory to your working directory in R.
- To use the data on Biowulf, open the Terminal in RStudio and follow these steps:
cd /data/$USER/Getting_Started_with_R
wget https://bioinformatics.ccr.cancer.gov/docs/r_for_novices/Getting_Started_with_R/data.zip
unzip data.zip
Note
"Getting_Started_with_R" is the name of the project directory I created in Lesson 1. If you do not have this directory, make sure you change directories to your working directory in R.
Importing Data
Before we can do anything with our data, we need to first import it into R. There are several ways to do this.
First, the RStudio IDE has a drop-down menu for data import. Simply go to File > Import Dataset, select one of the options, and follow the prompts.
Pay close attention to the import functions and their arguments. Using the import arguments correctly can save you from a headache later down the road. You will notice two types of import functions under Import Dataset "from text": base R import functions and readr import functions. We will use both in this course.
Row names
Tidyverse packages are generally against assigning row names (rownames) and instead prefer that all column data be treated the same. However, there are times when row names are beneficial, and they are required for some genomics data structures (e.g., see SummarizedExperiment from Bioconductor).
What is a tibble?
When loading tabular data with readr, the default object created will be a tibble. Tibbles are like data frames with some small but noticeable differences. For example, they can have numbers for column names, and the column types are displayed when the object is printed. Additionally, when you call a tibble by running the object name, the entire data frame does not print to the screen; rather, the first ten rows along with the columns that fit the screen are shown.
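One quick way to see these differences is to convert the df object created earlier with as_tibble() from the tibble package (installed as a dependency of readr); a short sketch:
library(tibble)
df_tbl <- as_tibble(df)
df_tbl          #prints dimensions, column types, and at most the first ten rows
class(df_tbl)   #"tbl_df" "tbl" "data.frame" - a tibble is still a data frame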
Reasons to use readr functions
Compared to the corresponding base functions, readr functions:
- Use a consistent naming scheme for the parameters (e.g., col_names and col_types, not header and colClasses).
- Are generally much faster (up to 10x-100x) depending on the dataset.
- Leave strings as is by default, and automatically parse common date/time formats.
- Have a helpful progress bar if loading is going to take a while.
- Work exactly the same way regardless of the current locale. To override the US-centric defaults, use locale(). -- readr.tidyverse.org
Excel files (.xls, .xlsx)
Excel files are the primary means by which many people save spreadsheet data. .xls or .xlsx files store workbooks composed of one or more spreadsheets.
Importing Excel files requires the R package readxl. While this is a tidyverse package, it is not a core package and must be loaded separately. We loaded it above.
The functions to import Excel files are read_excel(), read_xls(), and read_xlsx(). The latter two are more specific based on file format, whereas the first will guess which format (.xls or .xlsx) we are working with.
Let's look at its basic usage using an example data set from the readxl package. To access the example data, we use readxl_example().
#makes example data accessible by storing the path
ex_xl<-readxl_example("datasets.xlsx")
ex_xl
[1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/readxl/extdata/datasets.xlsx"
Now, let's read in the data. The only required argument is a path to the file to be imported.
irisdata<-read_excel(ex_xl)
irisdata
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
Notice that the resulting imported data is a tibble. This is a feature specific to the tidyverse. Now, let's check out some of the additional arguments. We can view the help information using ?read_excel().
The arguments likely to be most pertinent to you are:
- sheet - the name or numeric position of the Excel sheet to read.
- col_names - defaults to TRUE, which uses the first row read in for the column names. You can also provide a vector of names to name the columns.
- skip - allows us to skip rows that we do not wish to read in.
- .name_repair - set to "unique" by default, which makes sure that the column names are not empty and are all unique. read_excel() and readr functions will not correct column names to make them syntactic. If you want corrected names, use .name_repair = "universal".
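For example, the sheet argument pairs nicely with readxl's excel_sheets(), which lists the sheets in a workbook; a quick sketch using the example workbook from above:
excel_sheets(ex_xl)                                #list the sheet names in the workbook
read_excel(ex_xl, sheet = 2)                       #read a sheet by position
read_excel(ex_xl, sheet = excel_sheets(ex_xl)[2])  #or by name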
Let's check out another example:
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx")
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
sum_air
# A tibble: 11 × 4
`Uses Airway Data` ...2 ...3 ...4
<chr> <chr> <chr> <chr>
1 Some RNA-Seq summary information <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA>
3 Sample Name Treatment Number of Transcripts Total C…
4 GSM1275863 Dexamethasone 10768 18783120
5 GSM1275867 Dexamethasone 10051 15144524
6 GSM1275871 Dexamethasone 11658 30776089
7 GSM1275875 Dexamethasone 10900 21135511
8 GSM1275862 None 11177 20608402
9 GSM1275866 None 11526 25311320
10 GSM1275870 None 11425 24411867
11 GSM1275874 None 11000 19094104
Upon importing these data, we can immediately see that something is wrong with the column names.
colnames(sum_air)
[1] "Uses Airway Data" "...2" "...3" "...4"
There are some extra rows of information at the beginning of the data frame that should be excluded. We can take advantage of additional arguments to load only the data we are interested in. We are also going to tell read_excel() that we want the names repaired to eliminate spaces.
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx",
skip=3,.name_repair = "universal")
New names:
• `Sample Name` -> `Sample.Name`
• `Number of Transcripts` -> `Number.of.Transcripts`
• `Total Counts` -> `Total.Counts`
sum_air
# A tibble: 8 × 4
Sample.Name Treatment Number.of.Transcripts Total.Counts
<chr> <chr> <dbl> <dbl>
1 GSM1275863 Dexamethasone 10768 18783120
2 GSM1275867 Dexamethasone 10051 15144524
3 GSM1275871 Dexamethasone 11658 30776089
4 GSM1275875 Dexamethasone 10900 21135511
5 GSM1275862 None 11177 20608402
6 GSM1275866 None 11526 25311320
7 GSM1275870 None 11425 24411867
8 GSM1275874 None 11000 19094104
Tab-delimited files (.tsv, .txt)
In tab-delimited files, data columns are separated by tabs.
To import tab-delimited files there are several options. There are base R functions such as read.delim() and read.table(), as well as the readr functions read_delim(), read_tsv(), and read_table().
Let's take a look at ?read.delim() and ?read_delim(), which are most appropriate if you are working with tab-delimited data stored in a .txt file.
For read.delim(), you will notice that the default separator (sep) is the tab character ("\t"), while read.table() defaults to white space (one or more spaces, tabs, or newlines). However, you could use these functions to load a comma separated file as well; you simply need to use sep = ",". The same is true of read_delim(), except the argument is delim rather than sep.
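As a quick sketch, the two calls below read the same comma-separated file (surveys_datacarpentry.csv from the downloaded data directory, which we use again later in this lesson); only the delimiter argument differs:
#base R: general-purpose function with an explicit separator
surveys_base <- read.delim("./data/surveys_datacarpentry.csv", sep = ",")
#readr: same idea, but the argument is named delim
surveys_readr <- read_delim("./data/surveys_datacarpentry.csv", delim = ",")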
Let's load sample information from the RNA-Seq project airway. We will refer back to some of these data frequently throughout our lessons. The airway data are from Himes et al. (2014). These data, which are available in R as a RangedSummarizedExperiment object, are from a bulk RNA-Seq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNA-Seq count data from 8 airway smooth muscle cell samples: a dexamethasone-treated sample and an untreated negative control for each of the four cell lines.
Using read.delim():
smeta<-read.delim("./data/airway_sampleinfo.txt")
head(smeta)
SampleName cell dex albut Run avgLength Experiment Sample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
BioSample
1 SAMN02422669
2 SAMN02422675
3 SAMN02422678
4 SAMN02422670
5 SAMN02422682
6 SAMN02422673
Some other arguments of interest for read.delim():
- row.names - used to specify row names.
- col.names - used to specify column names if header = FALSE.
- skip - similar to read_excel(), used to skip a number of lines preceding the data we are interested in importing.
- check.names - makes names syntactically valid and unique.
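A sketch combining header, skip, and col.names to rename columns on import (the replacement column names below are hypothetical, chosen only for illustration):
#skip the original header row and supply our own column names
smeta_renamed <- read.delim("./data/airway_sampleinfo.txt",
                            header = FALSE, skip = 1,
                            col.names = c("geo_id", "cell_line", "treatment",
                                          "albuterol", "run", "avg_length",
                                          "experiment", "sample", "biosample"))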
Using read_delim():
smeta2<-read_delim("./data/airway_sampleinfo.txt")
Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smeta2
# A tibble: 8 × 9
SampleName cell dex albut Run avgLength Experiment Sample BioSample
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
What if we want to retain row names?
Let's load in a count matrix from airway.
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt")
head(aircount)
X Accession.SRR1039508 Accession.SRR1039509
1 ENSG00000000003.TSPAN6 679 448
2 ENSG00000000005.TNMD 0 0
3 ENSG00000000419.DPM1 467 515
4 ENSG00000000457.SCYL3 260 211
5 ENSG00000000460.C1orf112 60 55
6 ENSG00000000938.FGR 0 0
Accession.SRR1039512 Accession.SRR1039513 Accession.SRR1039516
1 873 408 1138
2 0 0 0
3 621 365 587
4 263 164 245
5 40 35 78
6 2 0 1
Accession.SRR1039517 Accession.SRR1039520 Accession.SRR1039521
1 1047 770 572
2 0 0 0
3 799 417 508
4 331 233 229
5 63 76 60
6 0 0 0
Because this is a count matrix, we want to store column 'X', which was automatically named, as row names rather than as a column. Remember, readr is part of the tidyverse and does not play well with row names. Therefore, we will use read.delim() with the argument row.names.
Let's reload and overwrite the previous object:
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt",
row.names = 1)
head(aircount)
Accession.SRR1039508 Accession.SRR1039509
ENSG00000000003.TSPAN6 679 448
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 467 515
ENSG00000000457.SCYL3 260 211
ENSG00000000460.C1orf112 60 55
ENSG00000000938.FGR 0 0
Accession.SRR1039512 Accession.SRR1039513
ENSG00000000003.TSPAN6 873 408
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 621 365
ENSG00000000457.SCYL3 263 164
ENSG00000000460.C1orf112 40 35
ENSG00000000938.FGR 2 0
Accession.SRR1039516 Accession.SRR1039517
ENSG00000000003.TSPAN6 1138 1047
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 587 799
ENSG00000000457.SCYL3 245 331
ENSG00000000460.C1orf112 78 63
ENSG00000000938.FGR 1 0
Accession.SRR1039520 Accession.SRR1039521
ENSG00000000003.TSPAN6 770 572
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 417 508
ENSG00000000457.SCYL3 233 229
ENSG00000000460.C1orf112 76 60
ENSG00000000938.FGR 0 0
Comma separated files (.csv)
In comma-separated files, the columns are separated by commas and the rows are separated by new lines.
To read comma-separated files, we can use the more specific functions read.csv() and read_csv() (see ?read.csv and ?read_csv).
Let's see this in action:
cexamp<-read.csv("./data/surveys_datacarpentry.csv")
head(cexamp)
record_id month day year plot_id species_id sex hindfoot_length weight
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
The arguments are the same as those of read.delim().
Let's check out read_csv():
cexamp2<-read_csv("./data/surveys_datacarpentry.csv")
Rows: 35549 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): species_id, sex
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cexamp2
# A tibble: 35,549 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
Other file types
There are a number of other file types you may be interested in. For genomics-specific formats, you will likely need to install specific packages; check out Bioconductor for packages relevant to bioinformatics.
For information on importing other file types (e.g., json, xml, google sheets), check out this chapter from Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks, and Roger D. Peng.
Data Export
To export data to a file, you will use similar functions (write.table(), write.csv(), readr::write_csv(), saveRDS(), etc.).
For example, let's save df to a csv file.
write_csv(df,"./data/small_df_example.csv")
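If you need to preserve row names (e.g., for the aircount matrix imported above) or save an R object exactly as is, the following sketches may help (the output file names are just examples):
#write.table() keeps row names by default; col.names = NA writes a blank header
#cell above the row names so the columns line up when the file is read back in
write.table(aircount, "./data/aircount_export.txt",
            sep = "\t", quote = FALSE, col.names = NA)
#saveRDS() stores a single R object (row names, factors, and all) in a binary
#format that readRDS() restores exactly
saveRDS(smeta, "./data/airway_sampleinfo.rds")
#smeta_again <- readRDS("./data/airway_sampleinfo.rds")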
Acknowledgements
Some material from this lesson was either taken directly or adapted from Intro to R and RStudio for Genomics provided by datacarpentry.org. Other material from this lesson was inspired by R4DS and Tidyverse Skills for Data Science. The survey data loaded in this lesson was taken from datacarpentry.org.