Data import and reshape
Objectives
1. Learn to import multiple data types
2. Reshape data with tidyr: pivot_longer(), pivot_wider(), separate(), and unite()
So far we have only worked with objects that we created in RStudio. We have not installed or loaded any packages. R packages extend the use of R programming beyond base R.
Where do we get R packages?
As a reminder, R packages are loadable extensions that contain code, data, documentation, and tests in a standardized, shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages().
GitHub is another common source used to store R packages; however, these packages do not necessarily meet CRAN standards, so approach with caution. To install a GitHub package, use library(devtools) followed by install_github(). devtools is itself a CRAN package; if you have not installed it, run install.packages("devtools") prior to the previous steps.
Many genomics and other packages useful to biologists / molecular biologists can be found on Bioconductor. To install a Bioconductor package, you will first need to install BiocManager, a CRAN package (install.packages("BiocManager")). You can then use BiocManager to install the Bioconductor core packages or any specific package (e.g., BiocManager::install("DESeq2")).
Importing / exporting data
Before we can do anything with our data, we need to first import it into R. There are several ways to do this.
First, the RStudio IDE has a drop-down menu for data import. Simply go to File > Import Dataset, select one of the options, and follow the prompts.
We should pay close attention to the import functions and their arguments. Using the import arguments correctly can save us from a headache later down the road. You will notice two types of import functions under Import Dataset "from text": base R import functions and readr import functions. We will use both in this course.
Note
Tidyverse packages generally discourage assigning rownames, preferring instead that all column data be treated the same. However, there are times when row names are beneficial, and they will be required for some genomics data (e.g., see SummarizedExperiment from Bioconductor).
Loading Tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
What is a tibble?
When loading tabular data with readr, the default object created will be a tibble. Tibbles are like data frames with some small but useful modifications. For example, they can have numbers for column names, and the column types are immediately apparent when viewing. Additionally, when you call a tibble by running the object name, the entire data frame does not print to the screen; rather, the first ten rows are shown, along with the columns that fit the screen.
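As a quick illustration (the column name and values here are made up), a tibble accepts a non-syntactic, numeric-looking column name that data.frame() would normally repair:

```r
library(tibble)

# A tibble keeps a non-syntactic (numeric) column name as-is
tb <- tibble(`2024` = 1:3, species = c("a", "b", "c"))
names(tb)      # "2024" "species"

# data.frame() repairs the same name by default (check.names = TRUE)
df <- data.frame(`2024` = 1:3, check.names = TRUE)
names(df)      # "X2024"
```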
Reasons to use readr functions
They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s. ---R4DS
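A small sketch of these points (the inline data here is hypothetical): readr functions return tibbles and do not munge column names:

```r
library(readr)

# I() marks a literal string as inline data for read_csv()
tbl <- read_csv(I("sample id,count\nS1,10\nS2,20\n"), show_col_types = FALSE)

class(tbl)     # a tibble ("tbl_df"), not a plain data.frame
names(tbl)     # "sample id" is kept verbatim, not converted to "sample.id"
```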
Excel files (.xls, .xlsx)
Excel files are the primary means by which many people save spreadsheet data. .xls or .xlsx files store workbooks composed of one or more spreadsheets.
Importing excel files requires the R package readxl. While this is a tidyverse package, it is not core and must be loaded separately.
library(readxl)
The functions to import excel files are read_excel(), read_xls(), and read_xlsx(). The latter two are more specific based on file format, whereas the first will guess which format (.xls or .xlsx) we are working with.
Let's look at its basic usage using an example data set from the readxl package. To access the example data we use readxl_example().
#makes example data accessible by storing the path
ex_xl<-readxl_example("datasets.xlsx")
ex_xl
## [1] "/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/readxl/extdata/datasets.xlsx"
Now, let's read in the data. The only required argument is a path to the file to be imported.
irisdata<-read_excel(ex_xl)
irisdata
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
Notice that the resulting imported data is a tibble. This is a feature specific to tidyverse. Now, let's check out some of the additional arguments. We can view the help information using ?read_excel().
The arguments likely to be most pertinent to you are:
sheet - the name or numeric position of the excel sheet to read.
col_names - default TRUE uses the first row read in for the column names. You can also provide a vector of names to name the columns.
skip - will allow us to skip rows that we do not wish to read in.
.name_repair - automatically set to "unique", which makes sure that the column names are not empty and are all unique. read_excel() and readr functions will not correct column names to make them syntactic. If you want corrected names, use .name_repair = "universal".
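We can see the sheet argument in action on the same readxl example workbook, which stores several classic data sets as separate sheets:

```r
library(readxl)

path <- readxl_example("datasets.xlsx")
excel_sheets(path)                        # list all sheet names in the workbook

mt <- read_excel(path, sheet = "mtcars")  # read a specific sheet by name
dim(mt)
```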
Let's check out another example:
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
sum_air
## # A tibble: 11 × 4
## `Uses Airway Data` ...2 ...3 ...4
## <chr> <chr> <chr> <chr>
## 1 Some RNA-Seq summary information <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 Sample Name Treatment Number of Transcripts Total C…
## 4 GSM1275863 Dexamethasone 10768 18783120
## 5 GSM1275867 Dexamethasone 10051 15144524
## 6 GSM1275871 Dexamethasone 11658 30776089
## 7 GSM1275875 Dexamethasone 10900 21135511
## 8 GSM1275862 None 11177 20608402
## 9 GSM1275866 None 11526 25311320
## 10 GSM1275870 None 11425 24411867
## 11 GSM1275874 None 11000 19094104
Upon importing this data, we can immediately see that something is wrong with the column names.
colnames(sum_air)
## [1] "Uses Airway Data" "...2" "...3" "...4"
There are some extra rows of information at the beginning of the data frame that should be excluded. We can take advantage of additional arguments to load only the data we are interested in. We are also going to tell read_excel() that we want the names repaired to eliminate spaces.
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx",
skip=3,.name_repair = "universal")
## New names:
## • `Sample Name` -> `Sample.Name`
## • `Number of Transcripts` -> `Number.of.Transcripts`
## • `Total Counts` -> `Total.Counts`
sum_air
## # A tibble: 8 × 4
## Sample.Name Treatment Number.of.Transcripts Total.Counts
## <chr> <chr> <dbl> <dbl>
## 1 GSM1275863 Dexamethasone 10768 18783120
## 2 GSM1275867 Dexamethasone 10051 15144524
## 3 GSM1275871 Dexamethasone 11658 30776089
## 4 GSM1275875 Dexamethasone 10900 21135511
## 5 GSM1275862 None 11177 20608402
## 6 GSM1275866 None 11526 25311320
## 7 GSM1275870 None 11425 24411867
## 8 GSM1275874 None 11000 19094104
Name repair
Learn more about the .name_repair argument here.
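To see what "universal" repair does, here is a small sketch with made-up duplicate, non-syntactic column names (tibble() also accepts .name_repair):

```r
library(tibble)

# .name_repair = "universal" makes names syntactic *and* unique
tb <- tibble(`Sample Name` = 1, `Sample Name` = 2,
             .name_repair = "universal")
names(tb)   # repaired names are valid R names with no duplicates
```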
Tab-delimited files (.tsv, .txt)
In tab-delimited files, data columns are separated by tabs.
To import tab-delimited files there are several options: the base R functions read.delim() and read.table(), as well as the readr functions read_delim(), read_tsv(), and read_table().
Let's take a look at ?read.delim() and ?read_delim(), which are most appropriate if you are working with tab-delimited data stored in a .txt file.
For read.delim(), you will notice that the default separator (sep) is the tab character ("\t"); for read.table() it is white space, which can be one or more spaces, tabs, or newlines. However, you could use these functions to load a comma separated file as well; you simply need to use sep = ",". The same is true of read_delim(), except the argument is delim rather than sep.
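As a small sketch (inline text with hypothetical values), the same comma-separated data can be read with either family by pointing the separator argument at a comma:

```r
library(readr)

txt <- "gene,count\nTSPAN6,679\nDPM1,467"

# base R: the separator argument is `sep`
base_df <- read.delim(text = txt, sep = ",")

# readr: the equivalent argument is `delim`
tidy_df <- read_delim(I(txt), delim = ",", show_col_types = FALSE)

dim(base_df)   # 2 rows, 2 columns in both cases
dim(tidy_df)
```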
Let's load sample information from the RNA-Seq project airway. We will refer back to some of these data frequently throughout our lessons. The airway data are from Himes et al. (2014). These data, which are available in R as a RangedSummarizedExperiment object, are from a bulk RNA-Seq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNA-Seq count data from 8 airway smooth muscle cell samples. Each cell line includes a treated sample and an untreated negative control.
Using read.delim():
smeta<-read.delim("./data/airway_sampleinfo.txt")
head(smeta)
## SampleName cell dex albut Run avgLength Experiment Sample
## 1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
## 2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
## 3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
## 4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
## 5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
## 6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
## BioSample
## 1 SAMN02422669
## 2 SAMN02422675
## 3 SAMN02422678
## 4 SAMN02422670
## 5 SAMN02422682
## 6 SAMN02422673
Some other arguments of interest for read.delim():
row.names - used to specify row names.
col.names - used to specify column names if header = FALSE.
skip - similar to read_excel(), used to skip a number of lines preceding the data we are interested in importing.
check.names - makes names syntactically valid and unique.
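A quick sketch of check.names using inline text (hypothetical values):

```r
# Header contains spaces; check.names = TRUE (the default) repairs the names
txt <- "Sample Name\tTotal Counts\nGSM1275862\t20608402"

df1 <- read.delim(text = txt)                      # names made syntactic
names(df1)    # "Sample.Name" "Total.Counts"

df2 <- read.delim(text = txt, check.names = FALSE) # names kept verbatim
names(df2)    # "Sample Name" "Total Counts"
```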
Using read_delim():
smeta2<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smeta2
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
Comma separated files (.csv)
In comma separated files the columns are separated by commas and the rows are separated by new lines.
To read comma separated files, we can use the specific functions read.csv() and read_csv() (see ?read.csv() and ?read_csv()).
Let's see this in action:
cexamp<-read.csv("./data/surveys_datacarpentry.csv")
head(cexamp)
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
The arguments are the same as read.delim().
Let's check out read_csv():
cexamp2<-read_csv("./data/surveys_datacarpentry.csv")
## Rows: 35549 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): species_id, sex
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cexamp2
## # A tibble: 35,549 × 9
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## 7 7 7 16 1977 2 PE F NA NA
## 8 8 7 16 1977 1 DM M 37 NA
## 9 9 7 16 1977 1 DM F 34 NA
## 10 10 7 16 1977 6 PF F 20 NA
## # ℹ 35,539 more rows
Other file types
There are a number of other file types you may be interested in. For genomics-specific formats, you will likely need to install specific packages; check out Bioconductor for packages relevant to bioinformatics.
For information on importing other file types (e.g., json, xml, google sheets), check out this chapter from Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks, and Roger D. Peng.
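For example, JSON can be read with the jsonlite package (assumed installed; the data below is made up):

```r
library(jsonlite)

j <- '[{"sample": "S1", "count": 10}, {"sample": "S2", "count": 20}]'
df <- fromJSON(j)   # simplifies the JSON array into a data.frame
df
```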
An Example
Let's load in a count matrix from airway to work with and reshape.
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt")
head(aircount)
## X Accession.SRR1039508 Accession.SRR1039509
## 1 ENSG00000000003.TSPAN6 679 448
## 2 ENSG00000000005.TNMD 0 0
## 3 ENSG00000000419.DPM1 467 515
## 4 ENSG00000000457.SCYL3 260 211
## 5 ENSG00000000460.C1orf112 60 55
## 6 ENSG00000000938.FGR 0 0
## Accession.SRR1039512 Accession.SRR1039513 Accession.SRR1039516
## 1 873 408 1138
## 2 0 0 0
## 3 621 365 587
## 4 263 164 245
## 5 40 35 78
## 6 2 0 1
## Accession.SRR1039517 Accession.SRR1039520 Accession.SRR1039521
## 1 1047 770 572
## 2 0 0 0
## 3 799 417 508
## 4 331 233 229
## 5 63 76 60
## 6 0 0 0
Because this is a count matrix, we want to save column 'X', which was automatically named, as row names rather than a column.
Let's reload and overwrite the previous object:
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt",
row.names = 1)
head(aircount)
## Accession.SRR1039508 Accession.SRR1039509
## ENSG00000000003.TSPAN6 679 448
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 467 515
## ENSG00000000457.SCYL3 260 211
## ENSG00000000460.C1orf112 60 55
## ENSG00000000938.FGR 0 0
## Accession.SRR1039512 Accession.SRR1039513
## ENSG00000000003.TSPAN6 873 408
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 621 365
## ENSG00000000457.SCYL3 263 164
## ENSG00000000460.C1orf112 40 35
## ENSG00000000938.FGR 2 0
## Accession.SRR1039516 Accession.SRR1039517
## ENSG00000000003.TSPAN6 1138 1047
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 587 799
## ENSG00000000457.SCYL3 245 331
## ENSG00000000460.C1orf112 78 63
## ENSG00000000938.FGR 1 0
## Accession.SRR1039520 Accession.SRR1039521
## ENSG00000000003.TSPAN6 770 572
## ENSG00000000005.TNMD 0 0
## ENSG00000000419.DPM1 417 508
## ENSG00000000457.SCYL3 233 229
## ENSG00000000460.C1orf112 76 60
## ENSG00000000938.FGR 0 0
Working with row names
There are functions specific to the tibble package for working with row names. column_to_rownames() could also have been used to assign column X to row names.
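A minimal sketch of column_to_rownames() on a made-up two-gene frame:

```r
library(tibble)

df <- data.frame(X = c("ENSG01.GENE1", "ENSG02.GENE2"),
                 SRR1 = c(679, 0), SRR2 = c(448, 0))

df2 <- column_to_rownames(df, var = "X")  # move column X into the row names
rownames(df2)
names(df2)    # "SRR1" "SRR2" -- X is no longer a column
```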
Data reshape
What do we mean by reshaping data?
Data reshaping is one aspect of tidying our data. The shape of our data is determined by how values are organized across rows and columns. When reshaping data we are most often wrangling the data from wide to long format or vice versa. To tidy the data we will need to (1) know the difference between observations and variables, and (2) potentially resolve cases in which a single variable is spread across multiple columns or a single observation is spread across multiple rows (R4DS).
It is difficult to provide a single definition of wide vs. long data, as both can take different forms, and both can be considered tidy depending on the circumstance.
Important
Remember, while we are interested in getting data into a "tidy" format, your data should ultimately be wrangled into a format that is going to work with downstream analyses.
In general, in wide data there is often a single metric spread across multiple columns. This type of data often, but not always, takes on a matrix like appearance.
While in long data, each variable tends to have its own column.
See this example from R4DS:
However, these definitions depend on what you are ultimately considering a variable and what you are considering an observation.
For example, which of the following data representations is the tidy option?
tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
iris_long<-tibble(iris) %>% rownames_to_column("Iris_id")
pivot_longer(iris_long,2:5,names_to="Measurement_location",values_to="Measurement")
## # A tibble: 600 × 4
## Iris_id Species Measurement_location Measurement
## <chr> <fct> <chr> <dbl>
## 1 1 setosa Sepal.Length 5.1
## 2 1 setosa Sepal.Width 3.5
## 3 1 setosa Petal.Length 1.4
## 4 1 setosa Petal.Width 0.2
## 5 2 setosa Sepal.Length 4.9
## 6 2 setosa Sepal.Width 3
## 7 2 setosa Petal.Length 1.4
## 8 2 setosa Petal.Width 0.2
## 9 3 setosa Sepal.Length 4.7
## 10 3 setosa Sepal.Width 3.2
## # ℹ 590 more rows
Either could be considered tidy, depending on what you treat as an observation; however, many functions (e.g., those from ggplot2) seem to work better with long format data.
The tools we use to go from wide to long and long to wide are from the package tidyr. Because we already loaded the package tidyverse, we do not need to load tidyr separately, as it is a core package.
pivot_wider() and pivot_longer()
pivot_wider() and pivot_longer() have replaced the functions gather() and spread(). pivot_wider() converts long format data to wide, while pivot_longer() converts wide format data to long.
If you haven't guessed already, our count matrix is currently in wide format. If we wanted to merge these data with sample metadata and plot various aspects of the data using ggplot2, we would likely want these data in long format.
Let's check out the help documentation (?pivot_longer()). This function requires the data and the columns we want to combine. There are also a number of optional arguments involving the name column and the value column.
l_air<-pivot_longer(aircount,1:length(aircount),names_to ="Sample",
values_to= "Count")
head(l_air)
## # A tibble: 6 × 2
## Sample Count
## <chr> <int>
## 1 Accession.SRR1039508 679
## 2 Accession.SRR1039509 448
## 3 Accession.SRR1039512 873
## 4 Accession.SRR1039513 408
## 5 Accession.SRR1039516 1138
## 6 Accession.SRR1039517 1047
Notice that the row names were dropped. While we would want to keep row names if we were working with this matrix as is, because we want a long data frame, we will need to first put the row names into a column. For this, we will use rownames_to_column() from the tidyverse package tibble.
#save row names as a column
aircount<-rownames_to_column(aircount,"Feature")
head(aircount["Feature"])
## Feature
## 1 ENSG00000000003.TSPAN6
## 2 ENSG00000000005.TNMD
## 3 ENSG00000000419.DPM1
## 4 ENSG00000000457.SCYL3
## 5 ENSG00000000460.C1orf112
## 6 ENSG00000000938.FGR
#pivot longer...again
l_air<-pivot_longer(aircount,starts_with("Accession"),
names_to =c("Sample"),values_to= "Count")
head(l_air)
## # A tibble: 6 × 3
## Feature Sample Count
## <chr> <chr> <int>
## 1 ENSG00000000003.TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003.TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003.TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003.TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003.TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003.TSPAN6 Accession.SRR1039517 1047
How can we get this back to a wide format? We can use ?pivot_wider(). This requires two additional arguments beyond the data argument: names_from and values_from. The first, names_from, should be the name of the column containing the new column names for your wide data. values_from is the column that contains the values to fill the rows of your wide data columns.
Let's pivot the data from long to wide.
w_air<-tidyr::pivot_wider(l_air,names_from = Sample,
values_from = Count)
head(w_air)
## # A tibble: 6 × 9
## Feature Accession.SRR1039508 Accession.SRR1039509 Accession.SRR1039512
## <chr> <int> <int> <int>
## 1 ENSG0000000000… 679 448 873
## 2 ENSG0000000000… 0 0 0
## 3 ENSG0000000041… 467 515 621
## 4 ENSG0000000045… 260 211 263
## 5 ENSG0000000046… 60 55 40
## 6 ENSG0000000093… 0 0 2
## # ℹ 5 more variables: Accession.SRR1039513 <int>, Accession.SRR1039516 <int>,
## # Accession.SRR1039517 <int>, Accession.SRR1039520 <int>,
## # Accession.SRR1039521 <int>
Note
There are many optional arguments for both of these functions. These are there to help you reshape seemingly complicated data schemes. Don't get discouraged. The examples in the help documentation are extremely helpful.
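For instance, the optional names_prefix argument could have stripped the repeated "Accession." text during our pivot (sketched here on a made-up two-gene frame):

```r
library(tidyr)

df <- data.frame(Feature = c("g1", "g2"),
                 Accession.SRR1 = c(5, 0), Accession.SRR2 = c(3, 1))

# names_prefix is a regular expression removed from the start of each name
long <- pivot_longer(df, starts_with("Accession"),
                     names_to = "Sample", names_prefix = "Accession\\.",
                     values_to = "Count")
unique(long$Sample)   # "SRR1" "SRR2"
```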
Unite and separate
There are two additional functions from tidyr that are very useful for organizing data: unite() and separate(). These are used to combine or split columns.
For example, you may have noticed that the Feature column from our example data is really two types of information combined (an Ensembl ID and a gene abbreviation). If we want to separate this column into two, we could easily do this with the help of separate().
Let's see this in action. We want to separate the column Feature at the first ".".
l_air2<-separate(l_air, Feature, into=c("Ensembl_ID","gene_abb"),
sep=".",remove=TRUE)
## Warning: Expected 2 pieces. Additional pieces discarded in 400 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(l_air2)
## # A tibble: 6 × 4
## Ensembl_ID gene_abb Sample Count
## <chr> <chr> <chr> <int>
## 1 "" "" Accession.SRR1039508 679
## 2 "" "" Accession.SRR1039509 448
## 3 "" "" Accession.SRR1039512 873
## 4 "" "" Accession.SRR1039513 408
## 5 "" "" Accession.SRR1039516 1138
## 6 "" "" Accession.SRR1039517 1047
Did you notice that warning?
Always take note of warnings. In this case, our column did not separate as expected. It appears that the sep argument needs some adjustment. The description of the function separate() notes that it can use a regular expression to separate columns. This is our first clue: the sep argument can't interpret what we are telling it to do, because the "." has a special meaning in regular expressions; it matches any character.
If . matches any character, how do you match a literal .? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, \, to escape special behaviour. So to match a ., you need the regexp \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.". --- stringr vignette
Lost?
If this explanation of the \ makes little sense to you, check out this reddit post for more help.
l_air2<-separate(l_air, Feature, into=c("Ensembl_ID","gene_abb"),
sep="\\.",remove=TRUE)
head(l_air2)
## # A tibble: 6 × 4
## Ensembl_ID gene_abb Sample Count
## <chr> <chr> <chr> <int>
## 1 ENSG00000000003 TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003 TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003 TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003 TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003 TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003 TSPAN6 Accession.SRR1039517 1047
unite() is simply the opposing function to separate(). Let's use unite() to combine our columns (Ensembl_ID and gene_abb) back together. This time we will use a _ between our Ensembl IDs and gene abbreviations.
l_air3<-unite(l_air2, "Feature", c(Ensembl_ID,gene_abb),sep="_")
head(l_air3)
## # A tibble: 6 × 3
## Feature Sample Count
## <chr> <chr> <int>
## 1 ENSG00000000003_TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003_TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003_TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003_TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003_TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003_TSPAN6 Accession.SRR1039517 1047
A word about regular expressions
As you continue to work in R, at some point you will need to incorporate regular expressions into your code. We used a regular expression escape above ("\\.") to denote that we wanted to match a literal "." rather than its regex (short for regular expression) behaviour, which matches any character except a new line.
Regular expressions can be exceedingly complicated and like anything require time and practice. We will not take a deep dive into regular expressions in this course. A great place to start with regular expressions is Chapter 14: Strings from R4DS. You may also find this stringr vignette helpful.
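Base R offers the same choice between literal and regex matching; a brief sketch:

```r
x <- "ENSG00000000003.TSPAN6"

strsplit(x, ".", fixed = TRUE)[[1]]  # fixed = TRUE treats "." as a literal
sub("\\..*$", "", x)                 # regex: "\\." escapes the dot
```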
Acknowledgements
Material from this lesson was inspired by R4DS and Tidyverse Skills for Data Science. The survey data loaded in the section on comma separated files was taken from a datacarpentry.org lesson.
Additional Resources
readr / readxl cheatsheet
Tidyr cheatsheet
Stringr / regex cheatsheet