Data import and reshape
Lesson Objectives
- Briefly review how to import data
- Data reshaping with tidyr: pivot_longer(), pivot_wider(), separate(), and unite()
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Load the tidyverse
We will use core packages from the tidyverse for our data wrangling needs. Data reshaping primarily involves the tidyverse package tidyr, but we will use additional packages as well, such as tibble.
Packages must be loaded with each new R session.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
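Because dplyr masks two Base R functions, calls to filter() or lag() after loading the tidyverse will use the dplyr versions. If you ever need a masked Base R function, you can call it explicitly with its package namespace. A minimal sketch:

```r
library(tidyverse)

# after loading the tidyverse, filter() refers to dplyr::filter()
filter(mtcars, cyl == 6)

# use the stats:: prefix to reach the masked Base R function
# (stats::filter() applies a linear filter to a time series)
stats::filter(1:10, rep(1 / 3, 3))
```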
Importing data
Before we can do anything with our data, we need to first import it into R.
We can either use the data import tool from the RStudio drop-down menu (File > Import Dataset), or we can use R functions for reading in data (recommended). These functions generally start with read. The Base R read functions are followed by a . (e.g., read.csv()), while the readr functions are followed by an _ (e.g., read_csv()). The readr functions come from the readr package, which is part of the tidyverse. readr functions are typically faster, more reproducible and consistent, and better at recognizing certain types of data (e.g., dates). However, they also return tibbles rather than data frames, and they are not row-name friendly.
Info
Tibbles are like data frames with some small but important differences. For example, they can have numbers for column names, and the column types are immediately visible when viewing. Additionally, when you call a tibble by running the object name, the entire data frame does not print to the screen; instead, only the first ten rows and the columns that fit on the screen are shown.
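To see the printing difference for yourself, you can convert a familiar Base R data frame to a tibble with as_tibble() from the tibble package:

```r
library(tibble)

# a Base R data frame prints every row to the console
iris

# the tibble prints only the first ten rows, along with column types
as_tibble(iris)
```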
Some different import functions
Import Excel files:
- readxl::read_excel()
- readxl::read_xls()
- readxl::read_xlsx()
Import tab-delimited files (.tsv, .txt):
- read.delim()
- read.table()
- readr::read_delim()
- readr::read_tsv()
- readr::read_table()
Import comma-separated files (.csv):
- read.csv()
- readr::read_csv()
The most important argument of all of these functions is the file path.
File paths
A file path tells us the location of a file or folder (a.k.a., directory). Because it is a character string, it must be surrounded by quotes. Each directory is separated by a /. It is best practice to work in R projects and use relative file paths to make scripts more reproducible.
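For example, assuming an R project containing a data directory (the file name below is hypothetical), a relative path starts from the project's working directory:

```r
library(readr)

# relative path: resolved from the project's working directory
# (hypothetical file name, for illustration only)
counts <- read_csv("data/my_counts.csv")

# an absolute path also works, but it ties the script to one machine:
# counts <- read_csv("/home/user/my_project/data/my_counts.csv")
```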
Genomic Data:
- For genomic specific formats, you will likely need to install specific packages; check out Bioconductor for packages relevant to bioinformatics.
Other:
- For information on importing other files types (e.g., json, xml, google sheets), check out this chapter from Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks and Roger D. Peng.
Load the lesson data
For today's lesson, we will work with data available from R (Base R and the tidyverse) as well as an example RNA-Seq count matrix. The count matrix is currently in the format "genes x samples", with the gene IDs, which are a combination of Ensembl IDs and gene symbols, as row names.
Get the Data
To download the data used in this lesson and future lessons to your local computer, click here. You can then move the downloaded directory to your working directory in R.
To use the data on Biowulf, open your Terminal in R and follow these steps:
# change to your working directory
cd /data/$USER/Data_Wrangling_with_R
# use wget to grab the zipped directory
wget https://bioinformatics.ccr.cancer.gov/docs/r_for_novices/Intro_to_Data_Wrangling/data.zip
# unzip the data
unzip -d data data.zip
Load the Data
Let's use read.delim() to load the data.
aircount <- read.delim("./data/head50_airway_nonnorm_count.txt",
                       row.names = 1)
head(aircount)
Accession.SRR1039508 Accession.SRR1039509
ENSG00000000003.TSPAN6 679 448
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 467 515
ENSG00000000457.SCYL3 260 211
ENSG00000000460.C1orf112 60 55
ENSG00000000938.FGR 0 0
Accession.SRR1039512 Accession.SRR1039513
ENSG00000000003.TSPAN6 873 408
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 621 365
ENSG00000000457.SCYL3 263 164
ENSG00000000460.C1orf112 40 35
ENSG00000000938.FGR 2 0
Accession.SRR1039516 Accession.SRR1039517
ENSG00000000003.TSPAN6 1138 1047
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 587 799
ENSG00000000457.SCYL3 245 331
ENSG00000000460.C1orf112 78 63
ENSG00000000938.FGR 1 0
Accession.SRR1039520 Accession.SRR1039521
ENSG00000000003.TSPAN6 770 572
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 417 508
ENSG00000000457.SCYL3 233 229
ENSG00000000460.C1orf112 76 60
ENSG00000000938.FGR 0 0
The first thing we should do following data import is to examine the data. We need to know what is included in this data frame. What are the dimensions? What types of data are stored in each column?
How can we examine these data further?
str(aircount)
'data.frame': 50 obs. of 8 variables:
$ Accession.SRR1039508: int 679 0 467 260 60 0 3251 1433 519 394 ...
$ Accession.SRR1039509: int 448 0 515 211 55 0 3679 1062 380 236 ...
$ Accession.SRR1039512: int 873 0 621 263 40 2 6177 1733 595 464 ...
$ Accession.SRR1039513: int 408 0 365 164 35 0 4252 881 493 175 ...
$ Accession.SRR1039516: int 1138 0 587 245 78 1 6721 1424 820 658 ...
$ Accession.SRR1039517: int 1047 0 799 331 63 0 11027 1439 714 584 ...
$ Accession.SRR1039520: int 770 0 417 233 76 0 5176 1359 696 360 ...
$ Accession.SRR1039521: int 572 0 508 229 60 0 7995 1109 704 269 ...
glimpse(aircount)
Rows: 50
Columns: 8
$ Accession.SRR1039508 <int> 679, 0, 467, 260, 60, 0, 3251, 1433, 519, 394, 17…
$ Accession.SRR1039509 <int> 448, 0, 515, 211, 55, 0, 3679, 1062, 380, 236, 16…
$ Accession.SRR1039512 <int> 873, 0, 621, 263, 40, 2, 6177, 1733, 595, 464, 26…
$ Accession.SRR1039513 <int> 408, 0, 365, 164, 35, 0, 4252, 881, 493, 175, 118…
$ Accession.SRR1039516 <int> 1138, 0, 587, 245, 78, 1, 6721, 1424, 820, 658, 2…
$ Accession.SRR1039517 <int> 1047, 0, 799, 331, 63, 0, 11027, 1439, 714, 584, …
$ Accession.SRR1039520 <int> 770, 0, 417, 233, 76, 0, 5176, 1359, 696, 360, 15…
$ Accession.SRR1039521 <int> 572, 0, 508, 229, 60, 0, 7995, 1109, 704, 269, 17…
Data reshape
Now that we have some data to work with, let's learn how we can reshape it. Recall how we defined tidy data.
Specifically, tidy data has 3 components:
- Each column is a variable, a quantity, quality, or property that can be collected or measured.
- Each row is an observation, or set of values collected under similar conditions.
- Each cell is a value, or state of a variable when you measure it. --- r4ds
We can organize data in many different ways. Some of these ways will be easier to work with, generally the tidy way.
What do we mean by reshaping data?
Data reshaping is one aspect of tidying our data. The shape of our data is determined by how values are organized across rows and columns. When reshaping data, we are most often wrangling the data from wide to long format or vice versa. To tidy the data we will need to (1) know the difference between observations and variables, and (2) potentially resolve cases in which a single variable is spread across multiple columns or a single observation is spread across multiple rows R4DS.
It is difficult to provide a single definition for what is wide data vs long data, as both can take different forms, and both can be considered tidy depending on the circumstance (e.g., analysis goals).
Note
While we are interested in getting data into a "tidy" format, your data should ultimately be wrangled into a format that is going to work with downstream analyses.
In general, in wide data there is often a single metric spread across multiple columns. This type of data often, but not always, takes on a matrix-like appearance. In long data, by contrast, each variable tends to have its own column.
See this example from R4DS:
However, these definitions depend on what you are ultimately considering a variable and what you are considering an observation.
For example, which of the following data representations is the tidy option?
Wide format:
tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
Long format:
iris_long <- tibble(iris) %>%
  rownames_to_column("Iris_id") %>%
  pivot_longer(2:5, names_to = "Flower_property", values_to = "Measurement")
With the long format it is easier to summarize information about the properties of the flowers, while in the wide format it is easier to explore relationships between those properties.
For example, this code is simpler
iris_long %>%
group_by(Species, Flower_property) %>%
summarize(mean= mean(Measurement), sd = sd(Measurement))
`summarise()` has grouped output by 'Species'. You can override using the
`.groups` argument.
# A tibble: 12 × 4
# Groups: Species [3]
Species Flower_property mean sd
<fct> <chr> <dbl> <dbl>
1 setosa Petal.Length 1.46 0.174
2 setosa Petal.Width 0.246 0.105
3 setosa Sepal.Length 5.01 0.352
4 setosa Sepal.Width 3.43 0.379
5 versicolor Petal.Length 4.26 0.470
6 versicolor Petal.Width 1.33 0.198
7 versicolor Sepal.Length 5.94 0.516
8 versicolor Sepal.Width 2.77 0.314
9 virginica Petal.Length 5.55 0.552
10 virginica Petal.Width 2.03 0.275
11 virginica Sepal.Length 6.59 0.636
12 virginica Sepal.Width 2.97 0.322
than this
iris %>% group_by(Species) %>%
summarize(across(where(is.numeric),list(mean = mean, sd=sd)))
# A tibble: 3 × 9
Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.352 3.43 0.379
2 versicolor 5.94 0.516 2.77 0.314
3 virginica 6.59 0.636 2.97 0.322
# ℹ 4 more variables: Petal.Length_mean <dbl>, Petal.Length_sd <dbl>,
# Petal.Width_mean <dbl>, Petal.Width_sd <dbl>
Regardless, you may want one format or the other depending on your analysis goals. Many of the tidyverse tools (e.g., ggplot2) tend to work better with long format data, but this, again, will depend on your task.
The tools we use to go from wide to long and long to wide are from the package tidyr. Because we already loaded tidyverse, we do not need to load tidyr separately, as it is a core package.
pivot_wider() and pivot_longer()
pivot_wider() and pivot_longer() have replaced the older functions spread() and gather(). pivot_wider() converts long format data to wide, while pivot_longer() converts wide format data to long.
If you haven't guessed already, our count matrix is currently in wide format. If we wanted to merge these data with sample metadata and plot various aspects of the data using ggplot2, we would likely want these data in long format.
Pivot_longer
Let's check out the help documentation (?pivot_longer). This function requires the data and the columns we want to combine (cols). There are also a number of optional arguments involving the name column and the value column.
For the cols argument, we can select columns using the same syntax we would use with select(), including column names, indices, or the select helper functions, for example, contains(), starts_with(), ends_with(), etc.
Columns in the Tidyverse
In Base R, we often have to refer to data variables (columns) directly using an accessor like $. However, this is not the case in the tidyverse. In the tidyverse, columns that already exist generally do not need quotes, while columns that do not yet exist do need quotes. This difference has important implications for writing for loops and functions. Learn more about tidy evaluation here.
Let's pivot aircount.
l_air <- pivot_longer(aircount, 1:length(aircount),
                      names_to = "Sample", values_to = "Count")
head(l_air)
# A tibble: 6 × 2
Sample Count
<chr> <int>
1 Accession.SRR1039508 679
2 Accession.SRR1039509 448
3 Accession.SRR1039512 873
4 Accession.SRR1039513 408
5 Accession.SRR1039516 1138
6 Accession.SRR1039517 1047
Notice that the row names were dropped. While we would want to keep row names if we were working with this matrix as is, because we want a long data frame, we will need to first put the row names into a column. For this, we will use rownames_to_column() from the tidyverse package tibble.
# save row names as a column
aircount <- rownames_to_column(aircount, "Feature")
head(aircount["Feature"])
Feature
1 ENSG00000000003.TSPAN6
2 ENSG00000000005.TNMD
3 ENSG00000000419.DPM1
4 ENSG00000000457.SCYL3
5 ENSG00000000460.C1orf112
6 ENSG00000000938.FGR
# pivot longer...again
l_air <- pivot_longer(aircount, starts_with("Accession"),
                      names_to = "Sample", values_to = "Count")
head(l_air)
# A tibble: 6 × 3
Feature Sample Count
<chr> <chr> <int>
1 ENSG00000000003.TSPAN6 Accession.SRR1039508 679
2 ENSG00000000003.TSPAN6 Accession.SRR1039509 448
3 ENSG00000000003.TSPAN6 Accession.SRR1039512 873
4 ENSG00000000003.TSPAN6 Accession.SRR1039513 408
5 ENSG00000000003.TSPAN6 Accession.SRR1039516 1138
6 ENSG00000000003.TSPAN6 Accession.SRR1039517 1047
Pivot_wider
How can we get this back to a wide format? We can use pivot_wider() (see ?pivot_wider). This requires two arguments beyond the data argument: names_from and values_from. The first, names_from, should be the name of the column containing the new column names for your wide data. values_from is the column that contains the values to fill the rows of your wide data columns. Because these columns already exist, we do not need to put them in quotes.
Let's pivot the data from long to wide.
w_air <- pivot_wider(l_air, names_from = Sample,
                     values_from = Count)
head(w_air)
# A tibble: 6 × 9
Feature Accession.SRR1039508 Accession.SRR1039509 Accession.SRR1039512
<chr> <int> <int> <int>
1 ENSG0000000000… 679 448 873
2 ENSG0000000000… 0 0 0
3 ENSG0000000041… 467 515 621
4 ENSG0000000045… 260 211 263
5 ENSG0000000046… 60 55 40
6 ENSG0000000093… 0 0 2
# ℹ 5 more variables: Accession.SRR1039513 <int>, Accession.SRR1039516 <int>,
# Accession.SRR1039517 <int>, Accession.SRR1039520 <int>,
# Accession.SRR1039521 <int>
Note
There are many optional arguments for both of these functions. These are there to help you reshape seemingly complicated data schemes. Don't get discouraged. The examples in the help documentation are extremely helpful.
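As one example of these optional arguments, names_prefix takes a regular expression and strips matching text from the start of each column name while pivoting. A sketch using our data (assuming aircount already has the Feature column created above):

```r
# strip the leading "Accession." from the sample names while pivoting
l_air_trim <- pivot_longer(aircount, starts_with("Accession"),
                           names_to = "Sample",
                           names_prefix = "Accession\\.",
                           values_to = "Count")

# the Sample column now holds bare run IDs such as "SRR1039508"
head(l_air_trim)
```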
Test our knowledge
What function would we use to transform table A to table B?
Table A:
# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
7 4849 1 1 NA NA NA NA NA NA NA NA NA
8 4850 1 1 NA 1 1 1 1 NA NA NA NA
9 4851 1 1 NA NA NA NA NA NA NA NA NA
10 4854 1 1 NA NA NA NA NA NA NA NA NA
11 4855 1 1 1 1 1 NA NA NA NA NA NA
12 4857 1 1 1 1 1 1 1 1 1 NA NA
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 NA NA NA NA NA NA
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 NA NA
17 4863 1 1 NA NA NA NA NA NA NA NA NA
18 4864 1 1 NA NA NA NA NA NA NA NA NA
19 4865 1 1 1 NA NA NA NA NA NA NA NA
Table B:
# A tibble: 114 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
7 4842 BCW 1
8 4842 BCE2 1
9 4842 BCW2 1
10 4842 MAE 1
# ℹ 104 more rows
Solution
pivot_longer
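A sketch of the full call, assuming Table A is stored in an object called table_a (a hypothetical name). Setting values_drop_na = TRUE drops the NA cells so that only observed detections remain, which is why Table B has fewer rows than 19 fish x 11 stations:

```r
table_b <- pivot_longer(table_a, -fish,
                        names_to = "station",
                        values_to = "seen",
                        values_drop_na = TRUE)
```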
Unite and separate
There are two additional functions from tidyr that are very useful for organizing data: unite() and separate(). These are used to combine or split columns.
Separate
For example, you may have noticed that the Feature column from our example data is really two types of information combined (an Ensembl ID and a gene abbreviation). If we want to separate this column into two, we can easily do this with the help of separate().
Let's see this in action. We want to separate the column Feature at the ".". This requires the data, the column we want to separate (col), and the names of the new variables to create from the separated column (into). The default separator used to split the column is "[^[:alnum:]]+". This is a regular expression that matches one or more non-alphanumeric characters (i.e., characters that are neither alphabetical (a-z) nor numerical (0-9)).
l_air2 <- separate(l_air, Feature, into = c("Ensembl_ID", "gene_abb"),
                   remove = TRUE)
head(l_air2)
# A tibble: 6 × 4
Ensembl_ID gene_abb Sample Count
<chr> <chr> <chr> <int>
1 ENSG00000000003 TSPAN6 Accession.SRR1039508 679
2 ENSG00000000003 TSPAN6 Accession.SRR1039509 448
3 ENSG00000000003 TSPAN6 Accession.SRR1039512 873
4 ENSG00000000003 TSPAN6 Accession.SRR1039513 408
5 ENSG00000000003 TSPAN6 Accession.SRR1039516 1138
6 ENSG00000000003 TSPAN6 Accession.SRR1039517 1047
separate_wider_position() and separate_wider_delim()
separate() has been superseded in favor of separate_wider_position(), separate_wider_delim(), and separate_wider_regex(). "A superseded function has a known better alternative, but the function itself is not going away."
- separate_wider_delim() - splits by delimiter.
- separate_wider_position() - splits at fixed widths.
- separate_wider_regex() - splits with regular expression matches.
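As a sketch, the separate() call above could be rewritten with separate_wider_delim() by supplying the delimiter explicitly (note that the argument naming the new columns is names rather than into):

```r
# split Feature at the "." into an Ensembl ID and a gene abbreviation
l_air2b <- separate_wider_delim(l_air, Feature, delim = ".",
                                names = c("Ensembl_ID", "gene_abb"))
head(l_air2b)
```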
Unite
unite() is simply the opposing function to separate(). Let's use unite() to combine our columns (Ensembl_ID and gene_abb) back together. This time we will use a _ between the Ensembl IDs and gene abbreviations.
l_air3 <- unite(l_air2, "Feature", c(Ensembl_ID, gene_abb), sep = "_")
head(l_air3)
# A tibble: 6 × 3
Feature Sample Count
<chr> <chr> <int>
1 ENSG00000000003_TSPAN6 Accession.SRR1039508 679
2 ENSG00000000003_TSPAN6 Accession.SRR1039509 448
3 ENSG00000000003_TSPAN6 Accession.SRR1039512 873
4 ENSG00000000003_TSPAN6 Accession.SRR1039513 408
5 ENSG00000000003_TSPAN6 Accession.SRR1039516 1138
6 ENSG00000000003_TSPAN6 Accession.SRR1039517 1047
A word about regular expressions
As you continue to work in R, at some point you will need to incorporate regular expressions into your code. Regular expressions can be exceedingly complicated and, like anything, require time and practice. We will not take a deep dive into regular expressions in this course. A great place to start is Chapter 14: Strings from R4DS. You may also find this stringr vignette helpful.
The janitor package
Check out the janitor package for additional functions for exploring and cleaning messy data.
Acknowledgements
Material from this lesson was inspired by R4DS and Tidyverse Skills for Data Science.
Resources
readr / readxl cheatsheet
Tidyr cheatsheet
Stringr / regex cheatsheet