Data import and reshape
Objectives
1. Learn to import multiple data types
2. Reshape data with tidyr: pivot_longer(), pivot_wider(), separate(), and unite()
So far we have only worked with objects that we created in RStudio. We have not installed or loaded any packages. R packages extend the use of R programming beyond base R.
Where do we get R packages?
As a reminder, R packages are loadable extensions that contain code, data, documentation, and tests in a standardized, shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages().
GitHub is another common source used to store R packages; however, these packages do not necessarily meet CRAN standards, so approach with caution. To install a GitHub package, use library(devtools) followed by install_github(). devtools is itself a CRAN package; if you have not installed it, run install.packages("devtools") prior to the previous steps.
Many genomics and other packages useful to biologists / molecular biologists can be found on Bioconductor. To install a Bioconductor package, you will first need to install BiocManager, a CRAN package (install.packages("BiocManager")). You can then use BiocManager to install the Bioconductor core packages or any specific package (e.g., BiocManager::install("DESeq2")).
Importing / exporting data
Before we can do anything with our data, we need to first import it into R. There are several ways to do this.
First, the RStudio IDE has a drop-down menu for data import. Simply go to File > Import Dataset, select one of the options, and follow the prompts.
We should pay close attention to the import functions and their arguments. Using the import arguments correctly can save us from a headache later down the road. You will notice two types of import functions under Import Dataset "from text": base R import functions and readr import functions. We will use both in this course.
Note
Tidyverse packages generally discourage assigning rownames, preferring instead that all column data be treated the same. However, there are times when row names are beneficial, and they will be required for some genomics data (e.g., see SummarizedExperiment from Bioconductor).
Loading Tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
What is a tibble?
When loading tabular data with readr, the default object created will be a tibble. Tibbles are like data frames with some small but useful modifications. For example, they can have numbers for column names, and the column types are immediately apparent when viewing. Additionally, when you call a tibble by running the object name, the entire data frame does not print to the screen; rather, the first ten rows are shown, along with the columns that fit the screen.
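As a quick illustration (the column name and values here are made up), a tibble accepts a non-syntactic, numeric-looking column name that data.frame() would normally repair:

```r
library(tibble)

# A tibble keeps a non-syntactic (numeric) column name as-is
tb <- tibble(`2024` = 1:3, species = c("a", "b", "c"))
names(tb)      # "2024" "species"

# data.frame() repairs the same name by default (check.names = TRUE)
df <- data.frame(`2024` = 1:3, check.names = TRUE)
names(df)      # "X2024"
```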
Reasons to use readr functions
They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s. ---R4DS
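A small sketch of these points (the inline data here is hypothetical): readr functions return tibbles and do not munge column names:

```r
library(readr)

# I() marks a literal string as inline data for read_csv()
tbl <- read_csv(I("sample id,count\nS1,10\nS2,20\n"), show_col_types = FALSE)

class(tbl)     # a tibble ("tbl_df"), not a plain data.frame
names(tbl)     # "sample id" is kept verbatim, not converted to "sample.id"
```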
Excel files (.xls, .xlsx)
Excel files are the primary means by which many people save spreadsheet data. .xls or .xlsx files store workbooks composed of one or more spreadsheets.
Importing excel files requires the R package readxl. While this is a tidyverse package, it is not core and must be loaded separately.
library(readxl)
The functions to import excel files are read_excel(), read_xls(), and read_xlsx(). The latter two are more specific based on file format, whereas the first will guess which format (.xls or .xlsx) we are working with.
Let's look at its basic usage using an example data set from the readxl package. To access the example data we use readxl_example().
#makes example data accessible by storing the path
ex_xl<-readxl_example("datasets.xlsx")
ex_xl
## [1] "/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/readxl/extdata/datasets.xlsx"
Now, let's read in the data. The only required argument is a path to the file to be imported.
irisdata<-read_excel(ex_xl)
irisdata
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
Notice that the resulting imported data is a tibble. This is a feature specific to tidyverse. Now, let's check out some of the additional arguments. We can view the help information using ?read_excel().
The arguments likely to be most pertinent to you are:
sheet - the name or numeric position of the excel sheet to read.
col_names - default TRUE uses the first row read in for the column names. You can also provide a vector of names to name the columns.
skip - will allow us to skip rows that we do not wish to read in.
.name_repair - automatically set to "unique", which makes sure that the column names are not empty and are all unique. read_excel() and readr functions will not correct column names to make them syntactic. If you want corrected names, use .name_repair = "universal".
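We can see the sheet argument in action on the same readxl example workbook, which stores several classic data sets as separate sheets:

```r
library(readxl)

path <- readxl_example("datasets.xlsx")
excel_sheets(path)                        # list all sheet names in the workbook

mt <- read_excel(path, sheet = "mtcars")  # read a specific sheet by name
dim(mt)
```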
Let's check out another example:
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
sum_air
## # A tibble: 11 × 4
## `Uses Airway Data` ...2 ...3 ...4
## <chr> <chr> <chr> <chr>
## 1 Some RNA-Seq summary information <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 Sample Name Treatment Number of Transcripts Total C…
## 4 GSM1275863 Dexamethasone 10768 18783120
## 5 GSM1275867 Dexamethasone 10051 15144524
## 6 GSM1275871 Dexamethasone 11658 30776089
## 7 GSM1275875 Dexamethasone 10900 21135511
## 8 GSM1275862 None 11177 20608402
## 9 GSM1275866 None 11526 25311320
## 10 GSM1275870 None 11425 24411867
## 11 GSM1275874 None 11000 19094104
Upon importing this data, we can immediately see that something is wrong with the column names.
colnames(sum_air)
## [1] "Uses Airway Data" "...2" "...3" "...4"
There are some extra rows of information at the beginning of the data frame that should be excluded. We can take advantage of additional arguments to load only the data we are interested in. We are also going to tell read_excel() that we want the names repaired to eliminate spaces.
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx",
skip=3,.name_repair = "universal")
## New names:
## • `Sample Name` -> `Sample.Name`
## • `Number of Transcripts` -> `Number.of.Transcripts`
## • `Total Counts` -> `Total.Counts`
sum_air
## # A tibble: 8 × 4
## Sample.Name Treatment Number.of.Transcripts Total.Counts
## <chr> <chr> <dbl> <dbl>
## 1 GSM1275863 Dexamethasone 10768 18783120
## 2 GSM1275867 Dexamethasone 10051 15144524
## 3 GSM1275871 Dexamethasone 11658 30776089
## 4 GSM1275875 Dexamethasone 10900 21135511
## 5 GSM1275862 None 11177 20608402
## 6 GSM1275866 None 11526 25311320
## 7 GSM1275870 None 11425 24411867
## 8 GSM1275874 None 11000 19094104
Name repair
Learn more about the .name_repair argument here.
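To see what "universal" repair does, here is a small sketch with made-up duplicate, non-syntactic column names (tibble() also accepts .name_repair):

```r
library(tibble)

# .name_repair = "universal" makes names syntactic *and* unique
tb <- tibble(`Sample Name` = 1, `Sample Name` = 2,
             .name_repair = "universal")
names(tb)   # repaired names are valid R names with no duplicates
```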
Tab-delimited files (.tsv, .txt)
In tab-delimited files, data columns are separated by tabs.
To import tab-delimited files there are several options: the base R functions read.delim() and read.table(), as well as the readr functions read_delim(), read_tsv(), and read_table().
Let's take a look at ?read.delim() and ?read_delim(), which are most appropriate if you are working with tab-delimited data stored in a .txt file.
For read.delim(), you will notice that the default separator (sep) is the tab character ("\t"); for read.table() it is white space, which can be one or more spaces, tabs, or newlines. However, you could use these functions to load a comma separated file as well; you simply need to use sep = ",". The same is true of read_delim(), except the argument is delim rather than sep.
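As a small sketch (inline text with hypothetical values), the same comma-separated data can be read with either family by pointing the separator argument at a comma:

```r
library(readr)

txt <- "gene,count\nTSPAN6,679\nDPM1,467"

# base R: the separator argument is `sep`
base_df <- read.delim(text = txt, sep = ",")

# readr: the equivalent argument is `delim`
tidy_df <- read_delim(I(txt), delim = ",", show_col_types = FALSE)

dim(base_df)   # 2 rows, 2 columns in both cases
dim(tidy_df)
```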
Let's load sample information from the RNA-Seq project airway. We will refer back to some of these data frequently throughout our lessons. The airway data are from Himes et al. (2014). These data, which are available in R as a RangedSummarizedExperiment object, are from a bulk RNA-Seq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNA-Seq count data from 8 airway smooth muscle cell samples. Each cell line includes a treated sample and an untreated negative control.
Using read.delim():
smeta<-read.delim("./data/airway_sampleinfo.txt")
head(smeta)
## SampleName cell dex albut Run avgLength Experiment Sample
## 1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
## 2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
## 3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
## 4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
## 5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
## 6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
## BioSample
## 1 SAMN02422669
## 2 SAMN02422675
## 3 SAMN02422678
## 4 SAMN02422670
## 5 SAMN02422682
## 6 SAMN02422673
Some other arguments of interest for read.delim():
row.names - used to specify row names.
col.names - used to specify column names if header = FALSE.
skip - similar to read_excel(), used to skip a number of lines preceding the data we are interested in importing.
check.names - makes names syntactically valid and unique.
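A quick sketch of check.names using inline text (hypothetical values):

```r
# Header contains spaces; check.names = TRUE (the default) repairs the names
txt <- "Sample Name\tTotal Counts\nGSM1275862\t20608402"

df1 <- read.delim(text = txt)                      # names made syntactic
names(df1)    # "Sample.Name" "Total.Counts"

df2 <- read.delim(text = txt, check.names = FALSE) # names kept verbatim
names(df2)    # "Sample Name" "Total Counts"
```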
Using read_delim():
smeta2<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smeta2
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
Comma separated files (.csv)
In comma separated files the columns are separated by commas and the rows are separated by new lines.
To read comma separated files, we can use the specific functions read.csv() and read_csv() (see ?read.csv() and ?read_csv()).
Let's see this in action:
cexamp<-read.csv("./data/surveys_datacarpentry.csv")
head(cexamp)
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
The arguments are the same as read.delim().
Let's check out read_csv():
cexamp2<-read_csv("./data/surveys_datacarpentry.csv")
## Rows: 35549 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): species_id, sex
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cexamp2
## # A tibble: 35,549 × 9
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## 7 7 7 16 1977 2 PE F NA NA
## 8 8 7 16 1977 1 DM M 37 NA
## 9 9 7 16 1977 1 DM F 34 NA
## 10 10 7 16 1977 6 PF F 20 NA
## # ℹ 35,539 more rows
Other file types
There are a number of other file types you may be interested in. For genomics-specific formats, you will likely need to install specific packages; check out Bioconductor for packages relevant to bioinformatics.
For information on importing other file types (e.g., json, xml, google sheets), check out this chapter from Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks, and Roger D. Peng.
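For example, JSON can be read with the jsonlite package (assumed installed; the data below is made up):

```r
library(jsonlite)

j <- '[{"sample": "S1", "count": 10}, {"sample": "S2", "count": 20}]'
df <- fromJSON(j)   # simplifies the JSON array into a data.frame
df
```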
An Example
Let's load in a count matrix from airway to work with and reshape.
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt")
head(aircount)
## X Accession.SRR1039508 Accession.SRR1039509
## 1 ENSG00000000003.TSPAN6 679 448
## 2 ENSG00000000005.TNMD 0 0
## 3 ENSG00000000419.DPM1 467 515
## 4 ENSG00000000457.SCYL3 260 211
## 5 ENSG00000000460.C1orf112 60 55
## 6 ENSG00000000938.FGR 0 0
## Accession.SRR1039512 Accession.SRR1039513 Accession.SRR1039516
## 1 873 408 1138
## 2 0 0 0
## 3 621 365 587
## 4 263 164 245
## 5 40 35 78
## 6 2 0 1
## Accession.SRR1039517 Accession.SRR1039520 Accession.SRR1039521
## 1 1047 770 572
## 2 0 0 0
## 3 799 417 508
## 4 331 233 229
## 5 63 76 60
## 6 0 0 0
Because this is a count matrix, we want to save column 'X', which was automatically named, as row names rather than a column.
Let's reload and overwrite the previous object:
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt",
row.names = 1)
head(aircount)
## Accession.SRR1039508 Accession.SRR1039509
## ENSG00000000003.TSPAN6 679 448
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 467 515
## ENSG00000000457.SCYL3 260 211
## ENSG00000000460.C1orf112 60 55
## ENSG00000000938.FGR 0 0
## Accession.SRR1039512 Accession.SRR1039513
## ENSG00000000003.TSPAN6 873 408
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 621 365
## ENSG00000000457.SCYL3 263 164
## ENSG00000000460.C1orf112 40 35
## ENSG00000000938.FGR 2 0
## Accession.SRR1039516 Accession.SRR1039517
## ENSG00000000003.TSPAN6 1138 1047
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 587 799
## ENSG00000000457.SCYL3 245 331
## ENSG00000000460.C1orf112 78 63
## ENSG00000000938.FGR 1 0
## Accession.SRR1039520 Accession.SRR1039521
## ENSG00000000003.TSPAN6 770 572
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 417 508
## ENSG00000000457.SCYL3 233 229
## ENSG00000000460.C1orf112 76 60
## ENSG00000000938.FGR 0 0
Working with row names
There are functions specific to the tibble package for working with row names. column_to_rownames() could also have been used to assign column X to row names.
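A minimal sketch of column_to_rownames() on a made-up two-gene frame:

```r
library(tibble)

df <- data.frame(X = c("ENSG01.GENE1", "ENSG02.GENE2"),
                 SRR1 = c(679, 0), SRR2 = c(448, 0))

df2 <- column_to_rownames(df, var = "X")  # move column X into the row names
rownames(df2)
names(df2)    # "SRR1" "SRR2" -- X is no longer a column
```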
Data reshape
What do we mean by reshaping data?
Data reshaping is one aspect of tidying our data. The shape of our data is determined by how values are organized across rows and columns. When reshaping data we are most often wrangling the data from wide to long format or vice versa. To tidy the data we will need to (1) know the difference between observations and variables, and (2) potentially resolve cases in which a single variable is spread across multiple columns or a single observation is spread across multiple rows (R4DS).
It is difficult to provide a single definition of wide vs. long data, as both can take different forms, and both can be considered tidy depending on the circumstance.
Important
Remember, while we are interested in getting data into a "tidy" format, your data should ultimately be wrangled into a format that is going to work with downstream analyses.
In general, in wide data there is often a single metric spread across multiple columns. This type of data often, but not always, takes on a matrix like appearance.
While in long data, each variable tends to have its own column.
See this example from R4DS:
However, these definitions depend on what you are ultimately considering a variable and what you are considering an observation.
For example, which of the following data representations is the tidy option?
tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
iris_long<-tibble(iris) %>% rownames_to_column("Iris_id")
pivot_longer(iris_long,2:5,names_to="Measurement_location",values_to="Measurement")
## # A tibble: 600 × 4
## Iris_id Species Measurement_location Measurement
## <chr> <fct> <chr> <dbl>
## 1 1 setosa Sepal.Length 5.1
## 2 1 setosa Sepal.Width 3.5
## 3 1 setosa Petal.Length 1.4
## 4 1 setosa Petal.Width 0.2
## 5 2 setosa Sepal.Length 4.9
## 6 2 setosa Sepal.Width 3
## 7 2 setosa Petal.Length 1.4
## 8 2 setosa Petal.Width 0.2
## 9 3 setosa Sepal.Length 4.7
## 10 3 setosa Sepal.Width 3.2
## # ℹ 590 more rows
Either could be considered tidy, depending on what you treat as an observation; however, many functions (e.g., those from ggplot2) seem to work better with long format data.
The tools we use to go from wide to long and long to wide are from the package tidyr. Because we already loaded the package tidyverse, we do not need to load tidyr separately, as it is a core package.
pivot_wider() and pivot_longer()
pivot_wider() and pivot_longer() have replaced the functions gather() and spread(). pivot_wider() converts long format data to wide, while pivot_longer() converts wide format data to long.
If you haven't guessed already, our count matrix is currently in wide format. If we wanted to merge these data with sample metadata and plot various aspects of the data using ggplot2, we would likely want these data in long format.
Let's check out the help documentation (?pivot_longer()). This function requires the data and the columns we want to combine. There are also a number of optional arguments involving the name column and the value column.
l_air<-pivot_longer(aircount,1:length(aircount),names_to ="Sample",
values_to= "Count")
head(l_air)
## # A tibble: 6 × 2
## Sample Count
## <chr> <int>
## 1 Accession.SRR1039508 679
## 2 Accession.SRR1039509 448
## 3 Accession.SRR1039512 873
## 4 Accession.SRR1039513 408
## 5 Accession.SRR1039516 1138
## 6 Accession.SRR1039517 1047
Notice that the row names were dropped. While we would want to keep row names if we were working with this matrix as is, because we want a long data frame, we will need to first put the row names into a column. For this, we will use rownames_to_column() from the tidyverse package tibble.
#save row names as a column
aircount<-rownames_to_column(aircount,"Feature")
head(aircount["Feature"])
## Feature
## 1 ENSG00000000003.TSPAN6
## 2 ENSG00000000005.TNMD
## 3 ENSG00000000419.DPM1
## 4 ENSG00000000457.SCYL3
## 5 ENSG00000000460.C1orf112
## 6 ENSG00000000938.FGR
#pivot longer...again
l_air<-pivot_longer(aircount,starts_with("Accession"),
names_to =c("Sample"),values_to= "Count")
head(l_air)
## # A tibble: 6 × 3
## Feature Sample Count
## <chr> <chr> <int>
## 1 ENSG00000000003.TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003.TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003.TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003.TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003.TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003.TSPAN6 Accession.SRR1039517 1047
How can we get this back to a wide format? We can use ?pivot_wider(). This requires two additional arguments beyond the data argument: names_from and values_from. The first, names_from, should be the name of the column containing the new column names for your wide data. values_from is the column that contains the values to fill the rows of your wide data columns.
Let's pivot the data from long to wide.
w_air<-tidyr::pivot_wider(l_air,names_from = Sample,
values_from = Count)
head(w_air)
## # A tibble: 6 × 9
## Feature Accession.SRR1039508 Accession.SRR1039509 Accession.SRR1039512
## <chr> <int> <int> <int>
## 1 ENSG0000000000… 679 448 873
## 2 ENSG0000000000… 0 0 0
## 3 ENSG0000000041… 467 515 621
## 4 ENSG0000000045… 260 211 263
## 5 ENSG0000000046… 60 55 40
## 6 ENSG0000000093… 0 0 2
## # ℹ 5 more variables: Accession.SRR1039513 <int>, Accession.SRR1039516 <int>,
## # Accession.SRR1039517 <int>, Accession.SRR1039520 <int>,
## # Accession.SRR1039521 <int>
Note
There are many optional arguments for both of these functions. These are there to help you reshape seemingly complicated data schemes. Don't get discouraged. The examples in the help documentation are extremely helpful.
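For instance, the optional names_prefix argument could have stripped the repeated "Accession." text during our pivot (sketched here on a made-up two-gene frame):

```r
library(tidyr)

df <- data.frame(Feature = c("g1", "g2"),
                 Accession.SRR1 = c(5, 0), Accession.SRR2 = c(3, 1))

# names_prefix is a regular expression removed from the start of each name
long <- pivot_longer(df, starts_with("Accession"),
                     names_to = "Sample", names_prefix = "Accession\\.",
                     values_to = "Count")
unique(long$Sample)   # "SRR1" "SRR2"
```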
Unite and separate
There are two additional functions from tidyr that are very useful for organizing data: unite() and separate(). These are used to combine or split columns.
For example, you may have noticed that the Feature column from our example data is really two types of information combined (an Ensembl ID and a gene abbreviation). If we want to separate this column into two, we could easily do this with the help of separate().
Let's see this in action. We want to separate the column Feature at the first ".".
l_air2<-separate(l_air, Feature, into=c("Ensembl_ID","gene_abb"),
sep=".",remove=TRUE)
## Warning: Expected 2 pieces. Additional pieces discarded in 400 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(l_air2)
## # A tibble: 6 × 4
## Ensembl_ID gene_abb Sample Count
## <chr> <chr> <chr> <int>
## 1 "" "" Accession.SRR1039508 679
## 2 "" "" Accession.SRR1039509 448
## 3 "" "" Accession.SRR1039512 873
## 4 "" "" Accession.SRR1039513 408
## 5 "" "" Accession.SRR1039516 1138
## 6 "" "" Accession.SRR1039517 1047
Did you notice that warning?
Always take note of warnings. In this case, our column did not separate as expected. It appears that the sep argument needs some adjustment. The description of the function separate() notes that it can use a regular expression to separate columns. This is our first clue: the sep argument can't interpret what we are telling it to do, because the "." has a special meaning in regular expressions; it matches any character.
If . matches any character, how do you match a literal .? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, \, to escape special behaviour. So to match a ., you need the regexp \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.". --- stringr vignette
Lost?
If this explanation of the \ makes little sense to you, check out this reddit post for more help.
l_air2<-separate(l_air, Feature, into=c("Ensembl_ID","gene_abb"),
sep="\\.",remove=TRUE)
head(l_air2)
## # A tibble: 6 × 4
## Ensembl_ID gene_abb Sample Count
## <chr> <chr> <chr> <int>
## 1 ENSG00000000003 TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003 TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003 TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003 TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003 TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003 TSPAN6 Accession.SRR1039517 1047
unite() is simply the opposing function to separate(). Let's use unite() to combine our columns (Ensembl_ID and gene_abb) back together. This time we will use a _ between our Ensembl IDs and gene abbreviations.
l_air3<-unite(l_air2, "Feature", c(Ensembl_ID,gene_abb),sep="_")
head(l_air3)
## # A tibble: 6 × 3
## Feature Sample Count
## <chr> <chr> <int>
## 1 ENSG00000000003_TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003_TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003_TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003_TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003_TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003_TSPAN6 Accession.SRR1039517 1047
A word about regular expressions
As you continue to work in R, at some point you will need to incorporate regular expressions into your code. We used a regular expression escape above ("\\.") to denote that we wanted to match a literal "." rather than its regex (short for regular expression) behaviour, which matches any character except a new line.
Regular expressions can be exceedingly complicated and like anything require time and practice. We will not take a deep dive into regular expressions in this course. A great place to start with regular expressions is Chapter 14: Strings from R4DS. You may also find this stringr vignette helpful.
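Base R offers the same choice between literal and regex matching; a brief sketch:

```r
x <- "ENSG00000000003.TSPAN6"

strsplit(x, ".", fixed = TRUE)[[1]]  # fixed = TRUE treats "." as a literal
sub("\\..*$", "", x)                 # regex: "\\." escapes the dot
```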
Acknowledgements
Material from this lesson was inspired by R4DS and Tidyverse Skills for Data Science. The survey data loaded in the section on comma separated files was taken from a datacarpentry.org lesson.
Additional Resources
readr / readxl cheatsheet
Tidyr cheatsheet
Stringr / regex cheatsheet