Data import and reshape
Lesson Objectives
- Briefly review how to import data
- Data reshaping with tidyr: pivot_longer(), pivot_wider(), separate(), and unite()
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Load the tidyverse
We will use core packages from the tidyverse for our data wrangling needs. Data reshaping primarily involves the tidyverse package tidyr, but we will use additional packages as well, such as tibble.
Packages must be loaded with each new R session.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
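Because dplyr masks two Base R functions, calls to filter() or lag() after loading the tidyverse will use the dplyr versions. If you ever need a masked Base R function, you can call it explicitly with its package namespace. A minimal sketch:

```r
library(tidyverse)

# after loading the tidyverse, filter() refers to dplyr::filter()
filter(mtcars, cyl == 6)

# use the stats:: prefix to reach the masked Base R function
# (stats::filter() applies a linear filter to a time series)
stats::filter(1:10, rep(1 / 3, 3))
```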
Importing data
Before we can do anything with our data, we need to first import it into R.
We can either use the data import tool from the RStudio drop-down menu (File > Import Dataset), or we can use R functions for reading in data (recommended). These functions generally start with read. The Base R read functions are followed by a . (e.g., read.csv()), while the readr functions are followed by an _ (e.g., read_csv()). The readr functions come from the readr package, which is part of the tidyverse. readr functions are typically faster, more reproducible and consistent, and better at recognizing certain types of data (e.g., dates). However, they also return tibbles rather than data frames, and they are not row-name friendly.
Info
Tibbles are like data frames with some small but important differences. For example, they can have numbers for column names, and the column types are immediately visible when viewing. Additionally, when you call a tibble by running the object name, the entire data frame does not print to the screen; instead, only the first ten rows and the columns that fit on the screen are shown.
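To see the printing difference for yourself, you can convert a familiar Base R data frame to a tibble with as_tibble() from the tibble package:

```r
library(tibble)

# a Base R data frame prints every row to the console
iris

# the tibble prints only the first ten rows, along with column types
as_tibble(iris)
```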
Some different import functions
Import Excel files:
- readxl::read_excel()
- readxl::read_xls()
- readxl::read_xlsx()
Import tab-delimited files (.tsv, .txt):
- read.delim()
- read.table()
- readr::read_delim()
- readr::read_tsv()
- readr::read_table()
Import comma-separated files (.csv):
- read.csv()
- readr::read_csv()
The most important argument of all of these functions is the file path.
File paths
A file path tells us the location of a file or folder (a.k.a., directory). Because it is a character string, it must be surrounded by quotes. Each directory is separated by a /. It is best practice to work in R projects and use relative file paths to make scripts more reproducible.
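For example, assuming an R project containing a data directory (the file name below is hypothetical), a relative path starts from the project's working directory:

```r
library(readr)

# relative path: resolved from the project's working directory
# (hypothetical file name, for illustration only)
counts <- read_csv("data/my_counts.csv")

# an absolute path also works, but it ties the script to one machine:
# counts <- read_csv("/home/user/my_project/data/my_counts.csv")
```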
Genomic Data:
- For genomic specific formats, you will likely need to install specific packages; check out Bioconductor for packages relevant to bioinformatics.
Other:
- For information on importing other files types (e.g., json, xml, google sheets), check out this chapter from Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks and Roger D. Peng.
Load the lesson data
For today's lesson, we will work with data available from R (Base R and the tidyverse) as well as an example RNA-Seq count matrix. The count matrix is currently in the format "genes x samples", with the gene IDs, which are a combination of Ensembl IDs and gene symbols, as row names.
Get the Data
To download the data used in this lesson and future lessons to your local computer, click here. You can then move the downloaded directory to your working directory in R.
To use the data on Biowulf, open your Terminal in R and follow these steps:
# change to your working directory
cd /data/$USER/Data_Wrangling_with_R
# use wget to grab the zipped directory
wget https://bioinformatics.ccr.cancer.gov/docs/r_for_novices/Intro_to_Data_Wrangling/data.zip
# unzip the data
unzip -d data data.zip
Load the Data
Let's use read.delim() to load the data.
aircount <- read.delim("./data/head50_airway_nonnorm_count.txt",
                       row.names = 1)
head(aircount)
Accession.SRR1039508 Accession.SRR1039509
ENSG00000000003.TSPAN6 679 448
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 467 515
ENSG00000000457.SCYL3 260 211
ENSG00000000460.C1orf112 60 55
ENSG00000000938.FGR 0 0
Accession.SRR1039512 Accession.SRR1039513
ENSG00000000003.TSPAN6 873 408
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 621 365
ENSG00000000457.SCYL3 263 164
ENSG00000000460.C1orf112 40 35
ENSG00000000938.FGR 2 0
Accession.SRR1039516 Accession.SRR1039517
ENSG00000000003.TSPAN6 1138 1047
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 587 799
ENSG00000000457.SCYL3 245 331
ENSG00000000460.C1orf112 78 63
ENSG00000000938.FGR 1 0
Accession.SRR1039520 Accession.SRR1039521
ENSG00000000003.TSPAN6 770 572
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 417 508
ENSG00000000457.SCYL3 233 229
ENSG00000000460.C1orf112 76 60
ENSG00000000938.FGR 0 0
The first thing we should do following data import is to examine the data. We need to know what is included in this data frame. What are the dimensions? What types of data are stored in each column?
How can we examine these data further?
str(aircount)
'data.frame': 50 obs. of 8 variables:
$ Accession.SRR1039508: int 679 0 467 260 60 0 3251 1433 519 394 ...
$ Accession.SRR1039509: int 448 0 515 211 55 0 3679 1062 380 236 ...
$ Accession.SRR1039512: int 873 0 621 263 40 2 6177 1733 595 464 ...
$ Accession.SRR1039513: int 408 0 365 164 35 0 4252 881 493 175 ...
$ Accession.SRR1039516: int 1138 0 587 245 78 1 6721 1424 820 658 ...
$ Accession.SRR1039517: int 1047 0 799 331 63 0 11027 1439 714 584 ...
$ Accession.SRR1039520: int 770 0 417 233 76 0 5176 1359 696 360 ...
$ Accession.SRR1039521: int 572 0 508 229 60 0 7995 1109 704 269 ...
glimpse(aircount)
Rows: 50
Columns: 8
$ Accession.SRR1039508 <int> 679, 0, 467, 260, 60, 0, 3251, 1433, 519, 394, 17…
$ Accession.SRR1039509 <int> 448, 0, 515, 211, 55, 0, 3679, 1062, 380, 236, 16…
$ Accession.SRR1039512 <int> 873, 0, 621, 263, 40, 2, 6177, 1733, 595, 464, 26…
$ Accession.SRR1039513 <int> 408, 0, 365, 164, 35, 0, 4252, 881, 493, 175, 118…
$ Accession.SRR1039516 <int> 1138, 0, 587, 245, 78, 1, 6721, 1424, 820, 658, 2…
$ Accession.SRR1039517 <int> 1047, 0, 799, 331, 63, 0, 11027, 1439, 714, 584, …
$ Accession.SRR1039520 <int> 770, 0, 417, 233, 76, 0, 5176, 1359, 696, 360, 15…
$ Accession.SRR1039521 <int> 572, 0, 508, 229, 60, 0, 7995, 1109, 704, 269, 17…
Data reshape
Now that we have some data to work with, let's learn how we can reshape it. Recall how we defined tidy data.
Specifically, tidy data has 3 components:
- Each column is a variable, a quantity, quality, or property that can be collected or measured.
- Each row is an observation, or set of values collected under similar conditions.
- Each cell is a value, or state of a variable when you measure it. --- r4ds
We can organize data in many different ways. Some of these ways will be easier to work with, generally the tidy way.
What do we mean by reshaping data?
Data reshaping is one aspect of tidying our data. The shape of our data is determined by how values are organized across rows and columns. When reshaping data, we are most often wrangling the data from wide to long format or vice versa. To tidy the data we will need to (1) know the difference between observations and variables, and (2) potentially resolve cases in which a single variable is spread across multiple columns or a single observation is spread across multiple rows R4DS.
It is difficult to provide a single definition for what is wide data vs long data, as both can take different forms, and both can be considered tidy depending on the circumstance (e.g., analysis goals).
Note
While we are interested in getting data into a "tidy" format, your data should ultimately be wrangled into a format that is going to work with downstream analyses.
In general, in wide data there is often a single metric spread across multiple columns. This type of data often, but not always, takes on a matrix-like appearance. In long data, by contrast, each variable tends to have its own column.
See this example from R4DS:
However, these definitions depend on what you are ultimately considering a variable and what you are considering an observation.
For example, which of the following data representations is the tidy option?
Wide format:
tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
Long format:
iris_long <- tibble(iris) %>%
  rownames_to_column("Iris_id") %>%
  pivot_longer(2:5, names_to = "Flower_property", values_to = "Measurement")
With the long format it is easier to summarize information about the properties of the flowers, while in the wide format it is easier to explore relationships between those properties.
For example, this code is simpler
iris_long %>%
group_by(Species, Flower_property) %>%
summarize(mean= mean(Measurement), sd = sd(Measurement))
`summarise()` has grouped output by 'Species'. You can override using the
`.groups` argument.
# A tibble: 12 × 4
# Groups: Species [3]
Species Flower_property mean sd
<fct> <chr> <dbl> <dbl>
1 setosa Petal.Length 1.46 0.174
2 setosa Petal.Width 0.246 0.105
3 setosa Sepal.Length 5.01 0.352
4 setosa Sepal.Width 3.43 0.379
5 versicolor Petal.Length 4.26 0.470
6 versicolor Petal.Width 1.33 0.198
7 versicolor Sepal.Length 5.94 0.516
8 versicolor Sepal.Width 2.77 0.314
9 virginica Petal.Length 5.55 0.552
10 virginica Petal.Width 2.03 0.275
11 virginica Sepal.Length 6.59 0.636
12 virginica Sepal.Width 2.97 0.322
than this
iris %>% group_by(Species) %>%
summarize(across(where(is.numeric),list(mean = mean, sd=sd)))
# A tibble: 3 × 9
Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.352 3.43 0.379
2 versicolor 5.94 0.516 2.77 0.314
3 virginica 6.59 0.636 2.97 0.322
# ℹ 4 more variables: Petal.Length_mean <dbl>, Petal.Length_sd <dbl>,
# Petal.Width_mean <dbl>, Petal.Width_sd <dbl>
Regardless, you may want one format or the other depending on your analysis goals. Many of the tidyverse tools (e.g., ggplot2) tend to work better with long format data, but this, again, will depend on your task.
The tools we use to go from wide to long and long to wide are from the package tidyr. Because we already loaded tidyverse, we do not need to load tidyr separately, as it is a core package.
pivot_wider() and pivot_longer()
pivot_wider() and pivot_longer() have replaced the older functions spread() and gather(). pivot_wider() converts long format data to wide, while pivot_longer() converts wide format data to long.
If you haven't guessed already, our count matrix is currently in wide format. If we wanted to merge these data with sample metadata and plot various aspects of the data using ggplot2, we would likely want these data in long format.
Pivot_longer
Let's check out the help documentation (?pivot_longer). This function requires the data and the columns we want to combine (cols). There are also a number of optional arguments involving the name column and the value column.
For the cols argument, we can select columns using the same syntax we would use with select(), including column names, indices, or the select helper functions, for example, contains(), starts_with(), ends_with(), etc.
Columns in the Tidyverse
In Base R, we often have to refer to data variables (columns) directly using an accessor like $. However, this is not the case in the tidyverse. In the tidyverse, columns that already exist generally do not need quotes, while columns that do not yet exist do need quotes. This difference has important implications for writing for loops and functions. Learn more about tidy evaluation here.
Let's pivot aircount.
l_air <- pivot_longer(aircount, 1:length(aircount),
                      names_to = "Sample", values_to = "Count")
head(l_air)
# A tibble: 6 × 2
Sample Count
<chr> <int>
1 Accession.SRR1039508 679
2 Accession.SRR1039509 448
3 Accession.SRR1039512 873
4 Accession.SRR1039513 408
5 Accession.SRR1039516 1138
6 Accession.SRR1039517 1047
Notice that the row names were dropped. While we would want to keep row names if we were working with this matrix as is, because we want a long data frame, we will need to first put the row names into a column. For this, we will use rownames_to_column() from the tidyverse package tibble.
# save row names as a column
aircount <- rownames_to_column(aircount, "Feature")
head(aircount["Feature"])
Feature
1 ENSG00000000003.TSPAN6
2 ENSG00000000005.TNMD
3 ENSG00000000419.DPM1
4 ENSG00000000457.SCYL3
5 ENSG00000000460.C1orf112
6 ENSG00000000938.FGR
# pivot longer...again
l_air <- pivot_longer(aircount, starts_with("Accession"),
                      names_to = "Sample", values_to = "Count")
head(l_air)
# A tibble: 6 × 3
Feature Sample Count
<chr> <chr> <int>
1 ENSG00000000003.TSPAN6 Accession.SRR1039508 679
2 ENSG00000000003.TSPAN6 Accession.SRR1039509 448
3 ENSG00000000003.TSPAN6 Accession.SRR1039512 873
4 ENSG00000000003.TSPAN6 Accession.SRR1039513 408
5 ENSG00000000003.TSPAN6 Accession.SRR1039516 1138
6 ENSG00000000003.TSPAN6 Accession.SRR1039517 1047
Pivot_wider
How can we get this back to a wide format? We can use pivot_wider() (see ?pivot_wider). This requires two arguments beyond the data argument: names_from and values_from. The first, names_from, should be the name of the column containing the new column names for your wide data. values_from is the column that contains the values to fill the rows of your wide data columns. Because these columns already exist, we do not need to put them in quotes.
Let's pivot the data from long to wide.
w_air <- pivot_wider(l_air, names_from = Sample,
                     values_from = Count)
head(w_air)
# A tibble: 6 × 9
Feature Accession.SRR1039508 Accession.SRR1039509 Accession.SRR1039512
<chr> <int> <int> <int>
1 ENSG0000000000… 679 448 873
2 ENSG0000000000… 0 0 0
3 ENSG0000000041… 467 515 621
4 ENSG0000000045… 260 211 263
5 ENSG0000000046… 60 55 40
6 ENSG0000000093… 0 0 2
# ℹ 5 more variables: Accession.SRR1039513 <int>, Accession.SRR1039516 <int>,
# Accession.SRR1039517 <int>, Accession.SRR1039520 <int>,
# Accession.SRR1039521 <int>
Note
There are many optional arguments for both of these functions. These are there to help you reshape seemingly complicated data schemes. Don't get discouraged. The examples in the help documentation are extremely helpful.
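As one example of these optional arguments, names_prefix takes a regular expression and strips matching text from the start of each column name while pivoting. A sketch using our data (assuming aircount already has the Feature column created above):

```r
# strip the leading "Accession." from the sample names while pivoting
l_air_trim <- pivot_longer(aircount, starts_with("Accession"),
                           names_to = "Sample",
                           names_prefix = "Accession\\.",
                           values_to = "Count")

# the Sample column now holds bare run IDs such as "SRR1039508"
head(l_air_trim)
```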
Test our knowledge
What function would we use to transform table A to table B?
Table A:
# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
7 4849 1 1 NA NA NA NA NA NA NA NA NA
8 4850 1 1 NA 1 1 1 1 NA NA NA NA
9 4851 1 1 NA NA NA NA NA NA NA NA NA
10 4854 1 1 NA NA NA NA NA NA NA NA NA
11 4855 1 1 1 1 1 NA NA NA NA NA NA
12 4857 1 1 1 1 1 1 1 1 1 NA NA
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 NA NA NA NA NA NA
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 NA NA
17 4863 1 1 NA NA NA NA NA NA NA NA NA
18 4864 1 1 NA NA NA NA NA NA NA NA NA
19 4865 1 1 1 NA NA NA NA NA NA NA NA
Table B:
# A tibble: 114 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
7 4842 BCW 1
8 4842 BCE2 1
9 4842 BCW2 1
10 4842 MAE 1
# ℹ 104 more rows
Solution
pivot_longer
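A sketch of the full call, assuming Table A is stored in an object called table_a (a hypothetical name). Setting values_drop_na = TRUE drops the NA cells so that only observed detections remain, which is why Table B has fewer rows than 19 fish x 11 stations:

```r
table_b <- pivot_longer(table_a, -fish,
                        names_to = "station",
                        values_to = "seen",
                        values_drop_na = TRUE)
```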
Unite and separate
There are two additional functions from tidyr that are very useful for organizing data: unite() and separate(). These are used to combine or split columns.
Separate
For example, you may have noticed that the Feature column from our example data is really two types of information combined (an Ensembl ID and a gene abbreviation). If we want to separate this column into two, we can easily do this with the help of separate().
Let's see this in action. We want to separate the column Feature at the ".". This requires the data, the column we want to separate (col), and the names of the new variables to create from the separated column (into). The default separator used to split the column is "[^[:alnum:]]+". This is a regular expression that matches one or more non-alphanumeric characters (i.e., characters that are neither alphabetical (a-z) nor numerical (0-9)).
l_air2 <- separate(l_air, Feature, into = c("Ensembl_ID", "gene_abb"),
                   remove = TRUE)
head(l_air2)
# A tibble: 6 × 4
Ensembl_ID gene_abb Sample Count
<chr> <chr> <chr> <int>
1 ENSG00000000003 TSPAN6 Accession.SRR1039508 679
2 ENSG00000000003 TSPAN6 Accession.SRR1039509 448
3 ENSG00000000003 TSPAN6 Accession.SRR1039512 873
4 ENSG00000000003 TSPAN6 Accession.SRR1039513 408
5 ENSG00000000003 TSPAN6 Accession.SRR1039516 1138
6 ENSG00000000003 TSPAN6 Accession.SRR1039517 1047
separate_wider_position() and separate_wider_delim()
separate() has been superseded in favor of separate_wider_position(), separate_wider_delim(), and separate_wider_regex(). "A superseded function has a known better alternative, but the function itself is not going away."
- separate_wider_delim() - splits by delimiter.
- separate_wider_position() - splits at fixed widths.
- separate_wider_regex() - splits with regular expression matches.
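As a sketch, the separate() call above could be rewritten with separate_wider_delim() by supplying the delimiter explicitly (note that the argument naming the new columns is names rather than into):

```r
# split Feature at the "." into an Ensembl ID and a gene abbreviation
l_air2b <- separate_wider_delim(l_air, Feature, delim = ".",
                                names = c("Ensembl_ID", "gene_abb"))
head(l_air2b)
```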
Unite
unite() is simply the opposing function to separate(). Let's use unite() to combine our columns (Ensembl_ID and gene_abb) back together. This time we will use a _ between the Ensembl IDs and gene abbreviations.
l_air3 <- unite(l_air2, "Feature", c(Ensembl_ID, gene_abb), sep = "_")
head(l_air3)
# A tibble: 6 × 3
Feature Sample Count
<chr> <chr> <int>
1 ENSG00000000003_TSPAN6 Accession.SRR1039508 679
2 ENSG00000000003_TSPAN6 Accession.SRR1039509 448
3 ENSG00000000003_TSPAN6 Accession.SRR1039512 873
4 ENSG00000000003_TSPAN6 Accession.SRR1039513 408
5 ENSG00000000003_TSPAN6 Accession.SRR1039516 1138
6 ENSG00000000003_TSPAN6 Accession.SRR1039517 1047
A word about regular expressions
As you continue to work in R, at some point you will need to incorporate regular expressions into your code. Regular expressions can be exceedingly complicated and, like anything, require time and practice. We will not take a deep dive into regular expressions in this course. A great place to start is Chapter 14: Strings from R4DS. You may also find this stringr vignette helpful.
The janitor package
Check out the janitor package for additional functions for exploring and cleaning messy data.
Acknowledgements
Material from this lesson was inspired by R4DS and Tidyverse Skills for Data Science.
Resources
readr / readxl cheatsheet
Tidyr cheatsheet
Stringr / regex cheatsheet