Skip to content

Data frames

Objectives
To be able to load, explore, and access data in a tabular format. To this end, students should understand the following:
1. how to import and export data
2. how to create, summarize, and subset data frames
3. what is a factor and why do we care?
4. how does a data frame differ from a matrix?

Reminder: Uploading files from RStudio Server

Any files created by you today will be erased at the end of the session. You can upload any files you downloaded from the last session using the Upload option in the Files pane.

Working with tabular data in R

In genomics, we work with a lot of tabular data (and non-tabular data). An old school method of working with this data may be to open in excel and manually work with the data. However, there are a number of reasons why this can be to your detriment. First, it is very easy to make mistakes when working with large amounts of tabular data in excel. Have you ever mistakenly left out a column while sorting data? Second, many of the files that we work with are so large (big data) that excel and your local machine do not have the bandwidth to handle them. Third, you will likely need to apply analytical techniques that are unavailable in excel. Lastly, it is difficult to keep track of any data manipulation steps or analyses in a point and click environment like excel.

R, on the other hand, can make analyzing tabular data more efficient and reproducible. But before getting into working with this data in R, let's review some best practices for data management.

Best Practices for organizing genomic data

  1. "Keep raw data separate from analyzed data" -- datacarpentry.org

    For large genomic data sets, I recommend having a project folder with two main subdirectories (i.e., raw_data and data_analysis). You may even consider changing the permissions (check out the unix command chmod) in your raw directory to make those files read only. Keeping raw data separate is not a problem in R, as one must explicitly import and export data.

  2. "Keep spreadsheet data Tidy" -- datacarpentry.org

    Data organization can be frustrating, and many scientists devote a great deal of time and energy toward this task. Keeping data tidy, which we will talk about more next lesson, can make data science more efficient, effective, and reproducible.

  3. "Trust but verify" -- datacarpentry.org

    R makes data analysis more reproducible and can eliminate some mistakes from human error. However, you should approach data analysis with a plan, and make sure you understand what a function is doing before applying it to your data. Hopefully, today's lesson will help with this. Often using small subsets of data can be used as a form of data debugging to make sure the expected result materialized.

    Some functions for creating practice data include: data.frame(), rep(), seq(), rnorm(), sample() and others. See some examples here.

Introducing the airway data

There are data sets available in R to practice with or showcase different packages. For today's lesson and the remainder of this course, we will use data from the Bioconductor package airway to showcase tools used for data wrangling and visualization. The use of this data was inspired by a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola. Code has been adapted from this workshop to explore tidyverse functionality.

The airway data is from Himes et al. (2014). These data, which are contained within a RangedSummarizedExperiment, object are from a bulk RNAseq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell samples. Each cell line includes a treated and untreated negative control. Note, current recommendations indicate that there should be 3-5 sample replicates for an RNAseq experiment.

Do not worry about the RangedSummarizedExperiment. The data we will use today and next week have been provided to you in the following files:

filtlowabund_scaledcounts_airways.txt Includes scaled transcript count data
diffexp_results_edger_airways.txt Includes results from differential expression analysis using EdgeR.

Object (.rds) files have also been included.

Note: Bioconductor will be discussed further in Lesson 4.

Importing / exporting data

Before we can do anything with our data, we need to first import it into R. There are several ways to do this.

First, the RStudio IDE has a dropdown menu for data import. Simply go to File > Import Dataset and select one of the options and follow the prompts. Note: readr is a tidyverse package but it isn't necessary for import. You can read more about readr and its advantages here.

IDE Import Let's focus on the base R import functions. These include read.csv(), read.table(), read.delim(), etc. You should examine the function arguments (e.g., ?read.delim()) to get an idea of what is happening at import and ensure that your data is being parsed correctly.

#Let's import our data and save to an object called scaled_counts
scaled_counts<-read.delim(
  "./data/filtlowabund_scaledcounts_airways.txt", as.is=TRUE)

We can now see this object in our RStudio environment pane.

This object can be viewed by clicking on it in the environment pane. Alternatively, you can use View(scaled_counts)

To import an existing object, we usereadRDS().

#Let's import our data from the .rds file
#and save to an object called scaled_counts_rds
scaled_counts_rds<-
  data.frame(readRDS("./data/filtlowabund_scaledcounts_airways.rds"))
Note: Using RStudio functionality, you can navigate to the files tab and click on the .rds file of interest. You will receive a prompt asking if you would like to load the object into R.

To export data to file, you will use similar functions (write.table(),write.csv(),saveRDS(), etc.). We will show how these work later in the lesson.

Examining and summarizing data frames

The object (scaled_counts) that we imported is a data frame. Data frames hold tabular data, like much of the data stored in spreadsheets. They are collections of vectors of the same length, but can be of different types. Because we often have data of multiple types, it is natural to examine that data in a data frame.

Summarizing data frames

Let's learn a bit more about our data frame. First, we can learn more about the structure of our data using str().

str(scaled_counts)

## 'data.frame':    127408 obs. of  18 variables:
##  $ feature      : chr  "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
##  $ sample       : int  508 508 508 508 508 508 508 508 508 508 ...
##  $ counts       : int  679 467 260 60 3251 1433 519 394 172 2112 ...
##  $ SampleName   : chr  "GSM1275862" "GSM1275862" "GSM1275862" "GSM1275862" ...
##  $ cell         : chr  "N61311" "N61311" "N61311" "N61311" ...
##  $ dex          : chr  "untrt" "untrt" "untrt" "untrt" ...
##  $ albut        : chr  "untrt" "untrt" "untrt" "untrt" ...
##  $ Run          : chr  "SRR1039508" "SRR1039508" "SRR1039508" "SRR1039508" ...
##  $ avgLength    : int  126 126 126 126 126 126 126 126 126 126 ...
##  $ Experiment   : chr  "SRX384345" "SRX384345" "SRX384345" "SRX384345" ...
##  $ Sample       : chr  "SRS508568" "SRS508568" "SRS508568" "SRS508568" ...
##  $ BioSample    : chr  "SAMN02422669" "SAMN02422669" "SAMN02422669" "SAMN02422669" ...
##  $ transcript   : chr  "TSPAN6" "DPM1" "SCYL3" "C1orf112" ...
##  $ ref_genome   : chr  "hg38" "hg38" "hg38" "hg38" ...
##  $ .abundant    : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ TMM          : num  1.06 1.06 1.06 1.06 1.06 ...
##  $ multiplier   : num  1.42 1.42 1.42 1.42 1.42 ...
##  $ counts_scaled: num  960.9 660.9 367.9 84.9 4600.7 ...
str() shows us that we are looking at a data frame object with 127,408 observations in 18 variables (or columns). The column names are to the far left preceded by a $. This is a data frame accessor, and we will see how this works later. We can also see the data type (character, integer, logical, numeric) after the column name. This will help us understand how we can transform and visualize the data in these columns.

We can also get an overview of summary statistics of this data frame using summary().

summary(scaled_counts)
##    feature              sample          counts        SampleName       
##  Length:127408      Min.   :508.0   Min.   :     0   Length:127408     
##  Class :character   1st Qu.:511.2   1st Qu.:    66   Class :character  
##  Mode  :character   Median :514.5   Median :   310   Mode  :character  
##                     Mean   :514.5   Mean   :  1376                     
##                     3rd Qu.:517.8   3rd Qu.:   960                     
##                     Max.   :521.0   Max.   :513766                     
##      cell               dex               albut               Run           
##  Length:127408      Length:127408      Length:127408      Length:127408     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    avgLength      Experiment           Sample           BioSample        
##  Min.   : 87.0   Length:127408      Length:127408      Length:127408     
##  1st Qu.:100.2   Class :character   Class :character   Class :character  
##  Median :123.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :113.8                                                           
##  3rd Qu.:126.0                                                           
##  Max.   :126.0                                                           
##   transcript         ref_genome        .abundant           TMM        
##  Length:127408      Length:127408      Mode:logical   Min.   :0.9512  
##  Class :character   Class :character   TRUE:127408    1st Qu.:0.9706  
##  Mode  :character   Mode  :character                  Median :1.0052  
##                                                       Mean   :1.0006  
##                                                       3rd Qu.:1.0257  
##                                                       Max.   :1.0553  
##    multiplier    counts_scaled     
##  Min.   :1.026   Min.   :     0.0  
##  1st Qu.:1.230   1st Qu.:    95.4  
##  Median :1.467   Median :   445.8  
##  Mean   :1.466   Mean   :  1933.7  
##  3rd Qu.:1.581   3rd Qu.:  1369.6  
##  Max.   :2.136   Max.   :632885.3

Our data frame has 18 variables, so we get 18 fields that summarize the data. Counts, avgLength, TMM, multiplier, and counts_scaled are numerical data and so we get summary statistics on the min and max values for these columns, as well as mean, median, and interquartile ranges.

What is the length of our data.frame? What are the dimensions?

#length returns the number of columns
length(scaled_counts)
## [1] 18
#dimensions
dim(scaled_counts) #dim() returns the rows and columns
## [1] 127408     18

Other useful functions for inspecting data frames

Size:
nrow() - number of rows
ncol() - number of columns

Content:
head() - returns first 6 rows by default
tail() - returns last 6 rows by default

Names:
colnames() - returns column names
rownames() - returns row names

Section content from "Starting with Data", Introduction to data analysis with R and Bioconductor.

Data frame coercion and accessors

Notice that "sample" was treated as numeric, rather than as a character vector. If we intend to work with this column, we will need to convert it or coerce it to a character vector.

We can access a column of our data frame using [], [[]], or using the $.

Let's access "sample" from scaled_counts. We use head() to limit the printed output.

#Using $
head(scaled_counts$sample)  
## [1] 508 508 508 508 508 508
#Using []  
head(scaled_counts["sample"])
##   sample
## 1    508
## 2    508
## 3    508
## 4    508
## 5    508
## 6    508
#Using [[]]  
head(scaled_counts[["sample"]])
## [1] 508 508 508 508 508 508

Let's convert the "sample" column from an integer to a character vector. This is known as coercion.

#We can see that sample is being treated as numeric
is.numeric(scaled_counts$sample) 
## [1] TRUE
#let's convert it to a character vector
scaled_counts$sample<-as.character(scaled_counts$sample)
#check this
is.character(scaled_counts$sample) 
## [1] TRUE
#check this
is.numeric(scaled_counts$sample) 

## [1] FALSE
See other related functions (e.g., as.factor(),as.numeric()). More about factors later.

Be careful with data coercion. What happens if we change a character vector into a numeric?

#A warning is thrown and the entire column is filled with NA
head(as.numeric(scaled_counts$Sample)) 
## Warning in head(as.numeric(scaled_counts$Sample)): NAs introduced by coercion
## [1] NA NA NA NA NA NA

Some helpful things to remember

  • When you explicitly coerce one data type into another (this is known as explicit coercion), be careful to check the result. Ideally, you should try to see if its possible to avoid steps in your analysis that force you to coerce.
  • R will sometimes coerce without you asking for it. This is called (appropriately) implicit coercion. For example when we tried to create a vector with multiple data types, R chose one type through implicit coercion.
  • Check the structure (str()) of your data frames before working with them! ---datacarpentry.org

Using colnames() to rename the column "Sample" to "Accession".

#if unsure of the index of the "Sample" column, you could use
which(colnames(scaled_counts)=="Sample") 

#or you could get the indices in a data frame
data.frame(colnames(scaled_counts))

#Let's rename "Sample" to "Accession"
colnames(scaled_counts)[11]<-"Accession" 
#let's recheck our column names
colnames(scaled_counts)
##  [1] "feature"       "sample"        "counts"        "SampleName"   
##  [5] "cell"          "dex"           "albut"         "Run"          
##  [9] "avgLength"     "Experiment"    "Accession"     "BioSample"    
## [13] "transcript"    "ref_genome"    ".abundant"     "TMM"          
## [17] "multiplier"    "counts_scaled"

Test your learning

Which of the following will NOT print the "Run" column from scaled_counts?
a. scaled_counts$Run
b. scaled_counts["Run"]
c. scaled_counts[8,]
d. scaled_counts[8]

Solution
C

What is the column index for "avgLength" from the scaled_counts df?
a. 3
b. 8
c. 12
d. 9

Solution
D

Subsetting data frames

Subsetting a data frame is similar to subsetting a vector; we will use bracket notation []. However, a data frame is two dimensional with both rows and columns, so we can specify either one argument or two arguments (e.g., df[row,column]) depending. If you provide one argument, columns will be assumed. This is because a data frame has characteristics of both a list and a matrix. We will discuss matrices in a bit.

For now, let's focus on providing two arguments to subset. (Note when a df structure is returned)

scaled_counts[2,4] #Returns the value in the 4th column and 2nd row

scaled_counts[2, ] #Returns a df with row two

scaled_counts[-1, ] #Returns a df without row 1

scaled_counts[1:4,1] #returns a vector of rows 1-4 of column 1

#call names of columns directly
scaled_counts[1:10,c("sample","counts")]

#use comparison annotation
scaled_counts[scaled_counts$sample == "508",] 

What happens when we provide a single argument?

#notice the difference here
scaled_counts[,2] #returns column two
#treated similar to a matrix
#does not return a df if the output is a single column

scaled_counts[2] #returns column two
#treated similar to a list; maintains the df structure

#You can preserve the structure of the data frame while subsetting
# use drop = F
scaled_counts[,2,drop=F]
Some tips to remember for subsetting:

  • Typically provide two values separated by commas: data.frame[row, column]
  • In cases where you are taking a continuous range of numbers use a colon between the numbers (start:stop, inclusive)
  • For a non continuous set of numbers, pass a vector using c()
  • Index using the name of a column(s) by passing them as vectors using c() ---datacarpentry.org

Subsetting including simplifying vs preserving can get confusing. Here is a great chapter - though, a bit more advanced - that may clear things up if you are confused.

Introducing Factors

At this point, you have seen the term "factor" pop up a few times.

Factors can be thought of as vectors which are specialized for categorical data. Given R’s specialization for statistics, this make sense since categorial and continuous variables are usually treated differently. Sometimes you may want to have data treated as a factor, but in other cases, this may be undesirable. ---datacarpentry.org

Let's go ahead and coerce some of our character vectors to factors. "sample", "dex", "cell", and "transcript" are categorical variables.

First, let's create subset our data to only include the columns sample, counts, scaled_counts, transcript, and dex.

sscaled<-
  scaled_counts[
    c("sample","cell","dex","transcript","counts","counts_scaled")] 

str(sscaled)
## 'data.frame':    127408 obs. of  6 variables:
##  $ sample       : chr  "508" "508" "508" "508" ...
##  $ cell         : chr  "N61311" "N61311" "N61311" "N61311" ...
##  $ dex          : chr  "untrt" "untrt" "untrt" "untrt" ...
##  $ transcript   : chr  "TSPAN6" "DPM1" "SCYL3" "C1orf112" ...
##  $ counts       : int  679 467 260 60 3251 1433 519 394 172 2112 ...
##  $ counts_scaled: num  960.9 660.9 367.9 84.9 4600.7 ...
sscaled$sample<-as.factor(sscaled$sample) 
sscaled$dex<-as.factor(sscaled$dex)
sscaled$transcript<-as.factor(sscaled$transcript)
sscaled$cell<-as.factor(sscaled$cell)
#note there is an easier solution using tidyverse functionality

#Now let's look at the structure of our data frame
str(sscaled)
## 'data.frame':    127408 obs. of  6 variables:
##  $ sample       : Factor w/ 8 levels "508","509","512",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cell         : Factor w/ 4 levels "N052611","N061011",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ dex          : Factor w/ 2 levels "trt","untrt": 2 2 2 2 2 2 2 2 2 2 ...
##  $ transcript   : Factor w/ 14575 levels "A1BG-AS1","A2M",..: 13065 3282 10793 1445 2179 4414 4553 7818 11940 7850 ...
##  $ counts       : int  679 467 260 60 3251 1433 519 394 172 2112 ...
##  $ counts_scaled: num  960.9 660.9 367.9 84.9 4600.7 ...
#get a summary
#notice that counts of the factor levels are returned
summary(sscaled) 

##      sample           cell          dex            transcript    
##  508    :15926   N052611:31852   trt  :63704   ABHD17AP1:    16  
##  509    :15926   N061011:31852   untrt:63704   ACBD6    :    16  
##  512    :15926   N080611:31852                 AGAP9    :    16  
##  513    :15926   N61311 :31852                 CBWD6    :    16  
##  516    :15926                                 FAM106A  :    16  
##  517    :15926                                 (Other)  :116648  
##  (Other):31852                                 NA's     : 10680  
##      counts       counts_scaled     
##  Min.   :     0   Min.   :     0.0  
##  1st Qu.:    66   1st Qu.:    95.4  
##  Median :   310   Median :   445.8  
##  Mean   :  1376   Mean   :  1933.7  
##  3rd Qu.:   960   3rd Qu.:  1369.6  
##  Max.   :513766   Max.   :632885.3  
## 
We see that "sample" is a factor with 8 levels; cell is a factor with 4 levels, dex a factor with 2 levels, and transcript a factor with 14,575 levels. The levels are the different groups or categories in those variables. R will organize the levels alphabetically by default.

For the sake of efficiency, R stores the content of a factor as a vector of integers, with an integer assigned to each of the possible levels. ---datacarpentry.org

This explains the numbers following the level names in the str() output.

This also results in some interesting behavior during variable coercion from is.factor() to is.numeric(). To coerce from a factor to a numeric, you have to first convert to a character vector. Otherwise, the numbers that you want to be numeric (the factor level names) will be returned as integers.
For example

#let's convert "sample" back to a numeric
head(as.numeric(sscaled$sample)) #notice the ones
## [1] 1 1 1 1 1 1
#instead we need to do the following
head(as.numeric(as.character(sscaled$sample)))
## [1] 508 508 508 508 508 508

Factors are necessary for plotting, and you can rename factors simply by manipulating the levels. For example

#Let's rename the levels in the dex column
#first let's see the levels
levels(sscaled$dex)
## [1] "trt"   "untrt"
#notice the column order is alphabetical
plot(sscaled$dex)

#Now, let's rename
#It is critical that the order is maintained
levels(sscaled$dex)<-c("treated","untreated")
levels(sscaled$dex) #We can see that the levels were modified
## [1] "treated"   "untreated"

To reorder the factor levels, we need to use factor() or fct_reorder() from tidyverse.

#We can explicitly state the order
sscaled$dex<-factor(sscaled$dex, levels=c("untreated","treated"))
plot(sscaled$dex) #now the order has changed

Bonus: Other factor reordering.

Bonus content

Let's say we filtered out all scaled_counts less than 10,000 and now we want to order our transcripts by transcripts found with at least 10k counts in the greatest number of samples (present at greater than 10k counts in more samples).

#let's get this filtered data set
#hopefully this notation seems familiar
transcript_f<-sscaled[sscaled$counts_scaled>10000,] 

#Let's look at this using head
head(table(transcript_f$transcript),n=10) 
## 
## A1BG-AS1      A2M  A2M-AS1   A4GALT     AAAS     AACS    AADAT    AAGAB 
##        0        7        0        0        0        0        0        0 
##     AAK1    AAMDC 
##        0        0
#We shouldn't really see zeros at this point. 
#A transcript should be found in at least one sample. 
#This is because the factor levels are maintained. 

#Let's drop the factor levels
transcript_f<-droplevels(transcript_f) 
#head again
head(table(transcript_f$transcript),n=10) 
## 
##     A2M   ABCA1  ABI3BP    ACTB   ACTG1   ACTN1   ACTN4  ADAM33   ADAM9 ADAMTS1 
##       7       6       4       8       8       8       8       3       8       7
#let's look at our current factor order
head(levels(transcript_f$transcript),n=10) 
##  [1] "A2M"     "ABCA1"   "ABI3BP"  "ACTB"    "ACTG1"   "ACTN1"   "ACTN4"  
##  [8] "ADAM33"  "ADAM9"   "ADAMTS1"
#let's reorder
ordered_factor_transcripts <- 
  factor(transcript_f$transcript,
         levels=names(sort(table(transcript_f$transcript),
                           decreasing=TRUE))) 

#Let's compare the output of sort to our new factor levels
head(sort(table(transcript_f$transcript),decreasing=TRUE), n=10) 
## 
##   ACTB  ACTG1  ACTN1  ACTN4  ADAM9  ADH1B    ADM  AHNAK AKAP12 AKR1C1 
##      8      8      8      8      8      8      8      8      8      8
head(levels(ordered_factor_transcripts), n=10)
##  [1] "ACTB"   "ACTG1"  "ACTN1"  "ACTN4"  "ADAM9"  "ADH1B"  "ADM"    "AHNAK" 
##  [9] "AKAP12" "AKR1C1"

Using tidyverse functionality (library(forcats)) we can easily reorder our factor levels by sorting along a different column using fct_reorder(). You should also look into fct_relevel().

#Let's filter our data to only include 4 transcripts of interest
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-sscaled[sscaled$transcript %in% keep_t,]
interesting_trnsc<-droplevels(interesting_trnsc)

#Look at a basic boxplot of the scaled_counts of these transcripts
boxplot(counts_scaled ~ transcript, data=interesting_trnsc)

#print levels
levels(interesting_trnsc$transcript) 
## [1] "CPD"   "EXT1"  "LASP1" "MCL1"
#Reorder the transcript factor levels by the maximum of counts_scaled
interesting_trnsc$transcript<-
  forcats::fct_reorder(interesting_trnsc$transcript,
              interesting_trnsc$counts,max) 

#plot
boxplot(counts_scaled ~ transcript, data=interesting_trnsc) 

Test your learning

How many categories or levels are there in sscaled$cell?
a. 4
b. 2
c. 7
d. 1

Solution
A

What are the dimensions of scaled_counts[scaled_counts$counts_scaled <= 500,]?
a. 127408, 6
b. 66909, 18
c. 3, 3
d. 3595, 6

Solution
B

Bonus: Find and Replace using bracket subsetting

Bonus content

There are infinite uses for find and replace functionality, and like most topics in R, there are multiple ways to search for and replace values in a data frame.

You could use bracket sub-setting. Let's say we noticed a typo in a gene annotation in our interesting_trnsc data frame. We want to search for the "MCL1" gene and replace it with "MCL2". We could do the following:

#The column interesting_trnsc$transcript is a vector of gene names 
#in the interesting_trnsc data frame.
#We can subset like we do with vectors
interesting_trnsc$transcript[interesting_trnsc$transcript=="MCL1"] 
## [1] MCL1 MCL1 MCL1 MCL1 MCL1 MCL1 MCL1 MCL1
## Levels: EXT1 MCL1 LASP1 CPD
#we found the gene of interest; now, let's replace with MCL2
interesting_trnsc$transcript[
  interesting_trnsc$transcript=="MCL1"]<-"MCL2" 
## Warning in `[<-.factor`(`*tmp*`, interesting_trnsc$transcript == "MCL1", :
## invalid factor level, NA generated
#Ah, we received a warning and the values became NAs. 
#This is because transcript is a factor with set levels.
#let's replace those NAs with our original factor level MCL1
interesting_trnsc$transcript[
  is.na(interesting_trnsc$transcript)]<-"MCL1"

#To change the MCL1 value, we can simply change the factor level
levels(interesting_trnsc$transcript)[2] <-"MCL2"

#Alternatively, we could change this factor to a character vector. 
interesting_trnsc$transcript<-
  as.character(interesting_trnsc$transcript)
#Let's change MCL2 back to MCL1
interesting_trnsc$transcript[interesting_trnsc$transcript=="MCL2"]<-
  "MCL1"

#if this typo was present in multiple columns we could use
#just creating the same column elsewhere to test
interesting_trnsc$new_transcripts<-interesting_trnsc$transcript 
interesting_trnsc[interesting_trnsc=="MCL1"]<-"MCL2"

#Be careful if these columns are factors

There are also a number of functions that have similar functionality. Check out sub() and gsub(). According to R documentation, "sub and gsub perform replacement of the first and all matches respectively". gsub() can only work on a single vector for replacement in a data frame. For global replacement across an entire data frame, you will have to get more creative if using gsub(). We will not be covering the apply functions, but they are useful and can often be used in the place of a complicated for loop. Here is a nice tutorial for those that are interested. The advantage of using functions related to pattern matching and replacement (e.g., gsub() or sub()) is that you can use regular expressions to match complicated patterns.

Save our data frame to a file

Perhaps we want to use our sscaled df in another program. Let's save this to a file.

write.table(sscaled, 
            file = "sscaled.txt",
            quote=FALSE,row.names=FALSE,sep="\t") 
#if you don't know what these arguments mean, 
#use ?write.table to get help.

Introduction to data matrices

Another important data structure in R is the data matrix. Data frames and data matrices are similar in that both are tabular in nature and are defined by dimensions (ie. rows (m) and columns (n), commonly denoted m x n). Note that a vector can be viewed as a 1 dimensional matrix.

Elements in a matrix and a data frame can be referenced by using their row and column indices (for example, a[1,1] references the element in row 1 and column 1).

However, the primary difference between a df and a matrix is that a matrix only contains values of a single type (i.e., numeric, character, logical, etc.).

Below, we create the object a1, a 3 row by 4 column matrix.

a1 <- matrix(c(3,4,2,4,6,3,8,1,7,5,3,2), ncol=4)
a1

##      [,1] [,2] [,3] [,4]
## [1,]    3    4    8    5
## [2,]    4    6    1    3
## [3,]    2    3    7    2
Using the typeof() and class() command, we see that the elements in a1 are double and a1 a matrix, respectively.

typeof(a1)
## [1] "double"
class(a1)

## [1] "matrix" "array"
Earlier, we mentioned that elements in a matrix can be referenced by their row and column number. Below, we extract the element in the 3rd row and 4th column of a1 (which is 2)

a1[3,4] ## returns 2

## [1] 2
We can assign column and row names to a matrix.

colnames(a1) <- c("control1","control2","tumor1","tumor2")
rownames(a1) <- c("ADA","AMPD2","HPRT")
a1

##       control1 control2 tumor1 tumor2
## ADA          3        4      8      5
## AMPD2        4        6      1      3
## HPRT         2        3      7      2
But, we cannot reference columns using $.

a1$control1

## Error in a1$control1: $ operator is invalid for atomic vectors
We can create matrices mixed with words and numbers (see a2).

a2 <- matrix(c("apples","pears","oranges",50,25,75), ncol=2)
a2

##      [,1]      [,2]
## [1,] "apples"  "50"
## [2,] "pears"   "25"
## [3,] "oranges" "75"
But, R will coerce all of the elements to character.

typeof(a2)
## [1] "character"
typeof(a2[,2])
## [1] "character"
class(a2)
## [1] "matrix" "array"

We can also perform mathematical operations on matrices.

a3 <- 5
a3

## [1] 5
Below we multiply every element in a1 by a3 and store in a4. Note, we are still left with a 3 by 4 matrix except the values have been multiplied by the value assigned to a3 (5).

a4 <- a1*a3
a1
##       control1 control2 tumor1 tumor2
## ADA          3        4      8      5
## AMPD2        4        6      1      3
## HPRT         2        3      7      2
a4
##       control1 control2 tumor1 tumor2
## ADA         15       20     40     25
## AMPD2       20       30      5     15
## HPRT        10       15     35     10

Here are some similarities and differences between matrices and data frames:

##                              Characteristic Matrix Data.frame
## 1                 is rectangular data table    yes        yes
## 2               can perform math operations    yes        yes
## 3                needs homogenous data type    yes         no
## 4          can have heterogeneous data type     no        yes
## 5 can reference using row and column number    yes        yes
## 6              can reference column using $     no        yes
## 7                      can use for plotting    yes        yes

Exporting files from RStudio

Remember, because we are using RStudio server through DNAnexus, any files created by you today will be erased at the end of the session.

To use the materials you generated on the RServer on DNAnexus on your local computer, you will need to export your files using More and Export in the Files pane.

Acknowledgements

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org and from a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola.

Resources

  1. BaseR cheatsheet