Introduction to R for Biologists

David Wheeler, PhD, Laboratory of Biochemistry and Molecular Biology, NCI
1/22/2016

What is R?

  • R is a statistical program that can be used interactively to explore data or used as a programming language.

    • It is extensible via libraries which can add both new functions for analysis and new data for reference.
  • Bioconductor provides the foundation for almost a thousand libraries that allow biologists to analyze high throughput data.

R-Central, and a Cheat Sheet

The Virtues of R

  • Elegant syntax for both interactive use and programming
  • Vectorized methods that minimize the need for user-specified loops
  • Simple methods to produce a wide array of stunning graphics
  • Easily extensible with hundreds of libraries, e.g. Bioconductor
  • Extensive help system

Starting R

  • When R starts, it prints the following message:
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
demo(graphics) # for a good introduction to graphics

The Global Environment

  • Starting an interactive R session puts you into the global R Environment

  • You can think of your global R Environment as the equivalent of a virtual disk

    • You can list the variables in it …
    • …and create new variables but …
    • …when you leave R, unless you save the Environment, you will lose your work

Saving Your Work

  • You can save the Environment at any time during an R session using save.image()

  • R offers to save this Environment for you whenever you quit

  • By default, R saves the Environment in the compressed file, .RData

  • This .RData image is loaded automatically the next time you start R

Project Directories

  • A common practice among R users is to take advantage of the .RData image files by making a new directory for each R project

  • In that case, each project has its own .RData file and that file is automatically loaded when you start R in the project directory

A First Look at Some Data

  • Some packages also come with sample data sets.

  • To see these, use data()

  • To load a particular data set, use e.g. data(DNase)

Loading Example Data

  • data() loads an example data set
ls()  # there are no variables
character(0)
data(DNase)  # import `DNase into your Environment

ls()  # the new data set now shows up in the listing of variables
[1] "DNase"

Seeing the Beginning of a Variable

  • head() is used to see the first lines of a variable
head(DNase)  # see the first 6 lines
  Run       conc density
1   1 0.04882812   0.017
2   1 0.04882812   0.018
3   1 0.19531250   0.121
4   1 0.19531250   0.124
5   1 0.39062500   0.206
6   1 0.39062500   0.215

Seeing the End of a Variable

  • tail() is used to see the last lines of a variable
tail(DNase)  # see the last 6 lines
    Run   conc density
171  11  3.125   0.994
172  11  3.125   0.980
173  11  6.250   1.421
174  11  6.250   1.385
175  11 12.500   1.715
176  11 12.500   1.721

Data Summaries

  • Use summary() to get a tabular summary
summary(DNase[, 2:3])  # get a statistical summary
      conc             density      
 Min.   : 0.04883   Min.   :0.0110  
 1st Qu.: 0.34180   1st Qu.:0.1978  
 Median : 1.17188   Median :0.5265  
 Mean   : 3.10669   Mean   :0.7192  
 3rd Qu.: 3.90625   3rd Qu.:1.1705  
 Max.   :12.50000   Max.   :2.0030  


Your Command History

  • You can see your history using history()
  • …or by using the up and down cursor keys.
  • You can load or save command histories.
  • Your history is saved in .Rhistory on exit.
history(2) # see your last 2 commands

# save all the commands issued during this session to "h1.Rhistory"
savehistory("h1.Rhistory") 

loadhistory("h1.Rhistory") # load a saved history

Ending a Session

  • Type quit() to exit R.
  • R will offer to save your current Environment.
  • You should generally reply “y” to this prompt.
  • A normal exit will look something like this:
quit()

Save workspace image to ~/.RData? [y/n/c]:

Listing Variables

  • R provides a few unix-like commands for manipulating variables.
ls()  # list variables and functions
[1] "DNase"
NewVar = "abc"  # create a new variable

ls()  # NewVar appears in the listing
[1] "DNase"  "NewVar"

Handling Variables

  • Type a variable's name to display it.
  • Use rm() to delete it.
NewVar  # display NewVar
[1] "abc"
rm(NewVar)  # remove NewVar

ls()  # it is gone
[1] "DNase"

apropos of Help!

  • Suppose we want to run a T-test but…
  • …have no clue as to where to begin.
# generate some possibilites...
tail(apropos("test"), 12)
 [1] "power.t.test"            "PP.test"                
 [3] "prop.test"               "prop.trend.test"        
 [5] "quade.test"              "shapiro.test"           
 [7] "t.test"                  "testInheritedMethods"   
 [9] "testPlatformEquivalence" "testVirtual"            
[11] "var.test"                "wilcox.test"            

Help!

  • help() documents a function
  • Use '?' as a shortcut, e.g. ?t.test

help(t.test)

t.test                  package:stats                  R Documentation

Student's t-Test

Description:

     Performs one and two sample t-tests on vectors of data



Getting an Example

example("t.test")

t.test> require(graphics)

t.test> t.test(1:10, y = c(7:20))      # P = .00001855

  Welch Two Sample t-test

data:  1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.052802  -4.947198
sample estimates:
mean of x mean of y 
      5.5      13.5 


Packages Extend R

  • install.packages() installs a package
  • library('packagename') loads an installed package
  • library() lists installed packages
  • search() lists the packages that are loaded

Loading a Library

library()  # see what is available

library("Biobase")  # load a library

search()  # see what is loaded
 [1] ".GlobalEnv"           "package:Biobase"      "package:BiocGenerics"
 [4] "package:parallel"     "package:knitr"        "package:stats"       
 [7] "package:graphics"     "package:grDevices"    "package:utils"       
[10] "package:datasets"     "package:methods"      "Autoloads"           
[13] "package:base"        

Logging a Session to a File

  • sink() redirects console output to a file
  • sink() is quick way to export data
  • Only output is logged, commands are not
# print to file but also to console; don't overwrite previous output
sink("logfile.log", split = TRUE, append = TRUE)


sink()  # close the logfile when you are finished

Running External R Code

ls()  # listing shows one variable
[1] "DNase"
# create an R source file
cat("M='hi there'", file = "test.R")

# read and execute the R source file
source("test.R")


ls()  # now there are two variables
[1] "DNase" "M"    

Data Types

  • Atomic Types
  • Complex Types

Atomic Types

Atomic Types are the basic elements used in R. The more complex objects (a designation that includes but is not limited to variables and functions) that you will use in your analyses are built from these.

  • numeric
  • integer
  • missing values
  • character
  • logical

Numerics

  • 10 is a numeric

  • 3.14159 is also a numeric

  • We see this using class().

class(10)
[1] "numeric"
class(3.14159)
[1] "numeric"

Integers

  • Integers are generated by range operations
range = 1:10

range
 [1]  1  2  3  4  5  6  7  8  9 10
class(range)
[1] "integer"

Missing Values

  • NA means “not available”
nana = c(1, 2, 3, 4, 5, 6, NA, NA, 9, 10)

is.na(nana)  # generates a logical value for each member of nana
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
anyNA(nana)  # return `TRUE` if any member of `nana` is NA
[1] TRUE

NA's are Problematic

  • One cannot perform statistics on a vector containing a NA
mean(nana)  # NA if any element is NA
[1] NA
sum(nana)  # same problem
[1] NA
mean(nana, na.rm = TRUE)  # ignores the NA's
[1] 5

Characters

  • Hold strings of text
  • use nchar() to get the length of a string
  • length() returns the length of the variable holding the string
a = c("ABCDEFG", "ABC")
length(a)
[1] 2
nchar(a[1])
[1] 7
nchar(a)
[1] 7 3

Built-in Character Constants

  • letters, and LETTERS.
  • Useful for labeling arrays of variables
letters[1:10]
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
LETTERS[1:10]
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

Combining Text and Numbers

  • paste combines using a separator such as a space
  • paste0 combines with no separator
  • Useful for making figure titles, axis labels
  • Useful for making arrays of labels
prefix = "I've always said that "

suffix = ", and that's a fact."

paste0(prefix, "pi,", pi, ", is a pre-defined constant in R", suffix)
[1] "I've always said that pi,3.14159265358979, is a pre-defined constant in R, and that's a fact."
paste("Label", 1:10)
 [1] "Label 1"  "Label 2"  "Label 3"  "Label 4"  "Label 5"  "Label 6" 
 [7] "Label 7"  "Label 8"  "Label 9"  "Label 10"

Logicals

  • TRUE and FALSE are constants with values
TRUE == 1  # this is TRUE
[1] TRUE
FALSE == 0  # this is also TRUE
[1] TRUE
FALSE == 1  # but this is FALSE
[1] FALSE

AND, OR, NOT

  • The AND and OR operators are & and |, respectively.
  • The NOT operator is !
(10 >= 9 & 10 < 10) | 10 == 10
[1] TRUE
!((10 >= 9 & 10 < 10) | 10 == 10)
[1] FALSE

Logicals as Subscripts

  • Logical vectors can subset other vectors
  • Only TRUE elements are kept
b = c(1:3, -5:0)  # make a little vector

b > 0  # for each 'b' we have one logical value
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
b[b > 0]  # return only the positive values in b
[1] 1 2 3

Complex Objects

Object Composition
vector an array of elements of the same type
list an ray of elements of differing types
matrix a vector with dimensions
data.frame list with dimensions and elements (the columns) of identical length
factor a vector of group identifiers
function a self-contained block of R commands

Vectors

  • The expression 1+1 is a vector of length 1
  • Elements of a vector are specified within square brackets, e.g. a[1]
  • Indices start at 1
  • c() combines objects into a vector
b = c(1, 2, -30, 40, -50)

b[c(3, 5)]
[1] -30 -50

Ranges are Vectors

  • Ranges of integers can be generated using :
  • They can be used to specify the indices of a vector
2:4  # a vector of integers
[1] 2 3 4
b[2:4]  # get items 2,3, and 4 in b
[1]   2 -30  40

Matrices

  • Matrices are restructured vectors
  • Elements are addressed by row and column e.g. M[2,5]
  • …but can still be addressed by single number, e.g. M[10]
  • Dimensions are returned and set using dim()

Matrices

M = 1:12  # a vector

# restructure N as a matrix
N = matrix(data = M, nrow = 3, ncol = 4)

dim(N)  # get its dimensions
[1] 3 4
N[1:2, ]  # see first 2 rows of N
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11

Matrices

dim(N) = c(6, 2)  # change N's dimensions

N[1:2, ]  # see first 2 rows of N
     [,1] [,2]
[1,]    1    7
[2,]    2    8
N[10]  # treat the matrix N as a vector
[1] 10

Computing Columnwise

colSums(N)  # we can take column sums
[1] 21 57
colMeans(N)  # we can take column means
[1] 3.5 9.5

Computing on Rows

What are the corresponding methods for rows?

Lists

  • A list is a vector-like object
  • A list differs from a vector in that its elements can contain a combination of (vectors, matrices, data.frames, or other lists)

  • You can address the elements of a list using two operators; [[]], and $

  • The first form is similar to a vector subscript and takes a variable between the brackets

  • The $ allows access to the elements by name

Constructing a List

L = list()  # use the constructor function

L$first = c(1, 2, 3)  # add 'first' to the list

L[[2]] = c(4, 5, 6)  # add a second un-named item

L  #take a look
$first
[1] 1 2 3

[[2]]
[1] 4 5 6

List Elements Have Names

names(L)  # the names in 'L'
[1] "first" ""     
names(L)[2] = "second"  # name this element

L  # take a look
$first
[1] 1 2 3

$second
[1] 4 5 6
L$second  # access the second element
[1] 4 5 6

Data Frames

  • A data.frame is a list consisting of elements, treated as columns, of identical length.

  • Individual elements within a data.frame can be accessed using the double subscript notation of the matrix within the double square brackets of the list

Creating and Handling Data Frames

D = data.frame(A = 1:10, twiceA = 2 * 1:10)


D[[2]]  # get element 2 which is column 2
 [1]  2  4  6  8 10 12 14 16 18 20
names(D)  # get the names of 'D'
[1] "A"      "twiceA"
D$twiceA  # get column 'twiceA'
 [1]  2  4  6  8 10 12 14 16 18 20

Custom Objects

  • Custom objects are returned by many functions in R
  • They often come in the form of lists
  • The list returned by lm() is an example

Data for Our Linear Model

  • We generate identical vectors, X and Y
  • We then use rnorm() to add noise to Y
# generate 1000 random deviates with Standard Deviation of 100
noise = rnorm(1000, mean = 0, sd = 100)

x = 1:1000  # generate a vector of 1000 integers

y = 1:1000 + noise  # same as X but with noise

Plot the Data

plot(y ~ x)  # plot y 'as.a.function.of' x

plot of chunk unnamed-chunk-30

Generate the Linear Model Object

  • The model is specified using the tilde (~)
  • The tilde can be read as is a function of
LF = lm(y ~ x)  # perform a linear fit

class(LF)  # see the class of LinFit
[1] "lm"

The Components of LF

names(LF)  # see the elements of LinFit
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
LF$coefficients
(Intercept)           x 
  -1.184735    1.009575 

Get the Slope and Intercept Terms

plot(y ~ x)  # must call plot() for setup
# draw the best fit line

abline(a = LF$coefficients[1], b = LF$coefficients[2], col = "red", lwd = 6)

plot of chunk unnamed-chunk-34

Factors

  • Factors group data for statistical comparison
  • The levels() of a factor give the number of groups
F = factor(c("A", "B", "B", "A"))  # construct a factor

F  # take a look
[1] A B B A
Levels: A B
levels(F)  # get the levels of F
[1] "A" "B"

Using Factors in a T-test

  • We will create a small data frame with two columns
  • The group column becomes a factor
# create a data.frame with two columns called 'score' and 'group'

D = data.frame(score = c(100, 110, 95, 70, 75, 65), group = c("A", "A", "A", 
    "B", "B", "B"))

class(D$group)  # what has happened?
[1] "factor"

Running the T-test

  • The ~ operator used in the t.test() function defines the model to be tested–that score is a function of group.
T = t.test(D$score ~ D$group)  # run it
names(T)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
[6] "null.value"  "alternative" "method"      "data.name"  
T$p.value
[1] 0.006133406

Changing the Levels

levels(D$group)  # remind ourselves of the levels
[1] "A" "B"
levels(D$group) = c("control", "exp1")  #change the levels

levels(D$group)  #verify the change
[1] "control" "exp1"   

A Corresponding Box Plot

boxplot(D$score ~ D$group)

plot of chunk unnamed-chunk-40

Functions

  • A convenient way to package parts of an analysis
  • Data is passed to functions, e.g. x
  • Functions also return data, e.g. y
Fsq = function(x) {

    y = x^2

    return(y)

}

Fsq(3)
[1] 9

Procedures in R

  • Writing and reading data
  • Restructuring, subsetting, summarizing and transforming data
  • Other procedures

Writing and Reading Data

  • read.table() reads tabular data
  • write.table() writes tabular data
  • Usually no need to specify data separators
  • But, data separators can be specified using the parameter sep.
  • Specialized variants such as read.csv() and read.delim() correspond to read.table() but with different default parameters.

Writing and Reading a Matrix

data(geneData)  # load `geneData`

geneData[1, 1:3]  # see a little of it
       A        B        C 
192.7420  85.7533 176.7570 
write.table(geneData, file = "gd")  # write

GD = read.table(file = "gd")  # read

GD[1, 1:3]  # take a look
                     A       B       C
AFFX-MurIL2_at 192.742 85.7533 176.757

What happened?

Exploring an Object

  • class(),str(), length(), dim(), and object.size help you take the measure of an object
M = data.frame(A = 1:5, B = 2 * (1:5), C = 2^(1:5))

class(M)  # get its class
[1] "data.frame"
object.size(M)  # find out how much memory it is using
1024 bytes

Exploring an Object

length(M)  # get the length
[1] 3
dim(M)  # get the dimensions
[1] 5 3
str(M)  # see its internal structure
'data.frame':   5 obs. of  3 variables:
 $ A: int  1 2 3 4 5
 $ B: num  2 4 6 8 10
 $ C: num  2 4 8 16 32

Exploring an Object

head(M, 2)  # see the first 2 lines
  A B C
1 1 2 2
2 2 4 4
tail(M, 2)  # see the last 2 lines
  A  B  C
4 4  8 16
5 5 10 32

Exploring an Object

names(M)  # only for lists
[1] "A" "B" "C"
colnames(M)  # get the column names
[1] "A" "B" "C"
rownames(M)  # get the rownames
[1] "1" "2" "3" "4" "5"

Exploring an Object

dimnames(M)  # get both row names and column names
[[1]]
[1] "1" "2" "3" "4" "5"

[[2]]
[1] "A" "B" "C"

Why is the length of a data.frame equal to the number of columns?

Summarizing Data

  • summary(), min(), max(), range(), mean(), sd(), and others summarize data
summary(M)  # a statistical summary
       A           B            C       
 Min.   :1   Min.   : 2   Min.   : 2.0  
 1st Qu.:2   1st Qu.: 4   1st Qu.: 4.0  
 Median :3   Median : 6   Median : 8.0  
 Mean   :3   Mean   : 6   Mean   :12.4  
 3rd Qu.:4   3rd Qu.: 8   3rd Qu.:16.0  
 Max.   :5   Max.   :10   Max.   :32.0  

Why does this happen if we try to take the mean of all columns of M?

mean(M)
[1] NA

Editing Data

  • R comes with a built-in GUI editor
fix(M) # invoke the editor, make your changes and exit

edit(M) # invoke the editor but discard any changes on exit

N=edit(M) # invoke the editor, but assign the changed version to another variable

Other Editors

  • You may also use a console editor such as pico
  • Or a full-featured graphical editor such as emacs

M=pico(M) # invoke "pico", make changes and save them into "M" again on exit


M=emacs(M) # invoke the 'emacs' editor, and save your changes on exit

Dealing with Missing Data

  • Missing data is problematic
nana = c(1, 2, 3, 4, 5, 6, NA, NA, 9, 10)  # nana has to NA's

sum(nana)  # returns NA 
[1] NA

One way to fix this, as we've seen:

sum(nana, na.rm = TRUE)  # we can ignore NA's on a per function basis
[1] 40

Replacing Missing Data

  • Another method is to put something in place of the missing data
nana[is.na(nana)] = 0  # we can also change them into something else

nana  # take a look--the NA's are now zeros
 [1]  1  2  3  4  5  6  0  0  9 10

Why does this method work?

Restructuring Data

  • We can rearrange columns of a matrix or data frame
M = data.frame(A = 1:3, B = 101:103, C = 31:33)

# rearrange the columns
M = data.frame(C = M$C, A = M$A, B = M$B)

Building by Columns

  • cbind() joins objects together by columns
N = cbind(C = M$C, A = M$A, B = M$B)  # use `cbind()` to bind the colums together

head(N)  # take a look
      C A   B
[1,] 31 1 101
[2,] 32 2 102
[3,] 33 3 103

Building by Rows

  • rbind() joins objects together by rows
N = rbind(N, c(1000, 2000, 3000))  # add another row

tail(N)
        C    A    B
[1,]   31    1  101
[2,]   32    2  102
[3,]   33    3  103
[4,] 1000 2000 3000

Removing Columns from a Matrix

  • We can delete a column in a matrix or data.frame using a negative subscript
head(N, 2)  # see the first two lines
      C A   B
[1,] 31 1 101
[2,] 32 2 102
N = N[, -1]  # delete column 1

head(N, 2)  # verify the deletion
     A   B
[1,] 1 101
[2,] 2 102

Removing Rows from a Matrix

  • We can delete multiple rows using a vector
head(N)  # take a look at 'N'
        A    B
[1,]    1  101
[2,]    2  102
[3,]    3  103
[4,] 2000 3000
N = N[c(-4, -3, -2), ]

head(N)
  A   B 
  1 101 

Labeling Data

  • names(), colnames(), and rownames() are used both to get and set names
M = data.frame(A = 1:5, B = 21:25)

rownames(M)  # the row names are now numbers
[1] "1" "2" "3" "4" "5"
rownames(M) = paste("gene", 1:5, sep = "_")

rownames(M)  # verify the replacement
[1] "gene_1" "gene_2" "gene_3" "gene_4" "gene_5"

Subsetting Data

  • There are many techniques for sub-setting data
H = head(M, 2)  # the first two rows, all columns

T = tail(M, 2)  # the last two rows, all columns


N = M[5, ]  # row 5, all columns

N = M[, 2]  # column 2, all rows

N = M[1:3, ]  # rows 1-3, all columns

N = M[c(1, 5, 9), ]  # all columns, rows 1, 5, and 9

Transforming Rows or Columns

  • The apply() function is a general way to apply any function to the rows or the columns of a matrix or data.frame

  • Similar functions lapply() and sapply() can be used on the elements of a list

GD = apply(geneData, 1, median)  # medians of the rows (second parameter=1)

GD = apply(geneData, 2, median)  # medians of the columns (second parameter=2)

Graphing Data

  • Let's draw 4 bar plots on a single page
  • We'll use a layout matrix that assigns sectors of the graphics window to graphs by number
data(geneData)
layout(matrix(c(1:4), 2, 2))  # use a layout matrix


matrix(c(1:4), 2, 2)  # here is the layout matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Set the Layout and Run the Loop

layout(matrix(c(1:4), 2, 2))  # use a layout matrix
for (j in 1:4) {
    # let j vary over the range 1 to 4

    barplot(geneData[, j], main = paste("Array", j))
}

plot of chunk unnamed-chunk-63

A More Complex Layout

  • The layout is 3x3
  • But we assign 4 of the sectors to one graph
# here is the layout matrix we'll use

matrix(c(1, 1, 2, 1, 1, 3, 4, 5, 6), 3, 3)
     [,1] [,2] [,3]
[1,]    1    1    4
[2,]    1    1    5
[3,]    2    3    6

Set the Layout and Run the Loop

layout(matrix(c(1, 1, 2, 1, 1, 3, 4, 5, 6), 3, 3))

for (j in 1:6) {
    # let j vary over the range 1 to 6

    barplot(geneData[, j], main = paste("Array", j))  # barplot the data in all the rows of column 'j'

}

plot of chunk unnamed-chunk-65

Writing Graphics to a File

  • Plots can be redirected to a PNG or PDF file and re-sized in the process if need be
png("plot1", width = 500, height = 500)  # open a graphics device that will put subsequent graphics into 'plot1'

dev.off()  # write the graphics file and close the graphics device

Project 1: A Simple Analysis of Probe Intensity Data

  • We will analyze geneData which is a little expression matrix with 26 samples and 500 probe features.
  • We'll generate a hypothesis to test by looking for groupings among the 26 samples
  • To do that we will cluster the samples on the basis of the probe intensities

  • We will use dist() to generate a distance matrix d which is the result of an all against-all-comparison between the rows of a matrix

  • The distance matrix will then be used to perform hierarchical clustering on the rows of our matrix

Generation of the Distance Matrix

  • dist() computes distances for the rows of a matrix
  • Since we want to cluster the samples, which are in the columns of geneData, we will have to feed dist() the transpose of geneData
  • To do that we will use the transpose function, t()
d = dist(t(geneData))  # transpose geneData so the columns (samples) become the rows and compute the distance matrix

Performing the Clustering

We next generate the clustering from d and plot the resulting cluster object, h

h = hclust(d)  # perform the hierarchical clustering and put the result in `h`

plot(h)  # plot the dendrogram

abline(h = 14000, col = "red")  # add a red line to the plot we've already made to show how to split the dendrogram

plot of chunk unnamed-chunk-68

Deriving the Groups from the Tree

  • cutree() cuts the dendrogram to generate groups
groups = cutree(h, h = 14000)

groups  # here are our groups
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
1 1 2 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1 
groups = factor(groups)

levels(groups)  # check the levels
[1] "1" "2"

Graphically Comparing Groups

  • The box plot for probe 1 shows a distinct difference in intensity between the two groups of arrays
boxplot(geneData[1, ] ~ groups, col = "cyan")

plot of chunk unnamed-chunk-70

Plotting more box plots using layout()

layout(matrix(c(1, 2, 3, 4, 5, 6), 2, 3))  # plot 6 graphs on one page


# loop through probes 1 to 6, boxplot, and give a title based on the probe
# names (rownames)

for (j in 1:6) {
    boxplot(geneData[j, ] ~ groups, main = rownames(geneData)[j], col = "cyan")
}

plot of chunk unnamed-chunk-72

Performing T-tests on the Groups

  • We can run a T-test on the two groups for each of the 500 probes

  • To do so we need a loop and we need to extract the P-value from each t.test() return object we get

T = vector()  # initialize an empty vector


for (j in 1:length(rownames(geneData))) {
    T[j] = t.test(geneData[j, ] ~ groups)$p.value
}

…and plot the results as a histogram of P-values

hist(T, breaks = 50, main = "P-values for Probes", xlab = "P-Value", col = "green")

plot of chunk unnamed-chunk-74

Flagging Significant P-values

  • Testing for \( P<=.05 \) yields a logical array in which the TRUE values correspond to probes with significant P-values

  • Summing over the array gives us the count of significant probes

head(T < 0.05)  # see a bit of the array
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
sum(T < 0.05)  # get the sum
[1] 181

Constructing a Heatmap

  • A good way to display the contents of a matrix
  • A good way to analyze the contents of a matrix
  • Supports clustering by both rows and columns
  • For more advanced heatmaps try heatmap3, heatmapPlus

The basic heatmap

heatmap(geneData)

plot of chunk unnamed-chunk-77

With groups color-coded

colors = c("red", "blue")
heatmap(geneData, ColSideColors = colors[groups])

plot of chunk unnamed-chunk-78

Cleaned up a bit and labeled

colors = c("red", "blue")
heatmap(geneData, ColSideColors = colors[groups], Rowv = NA, xlab = "Sample", 
    ylab = "ProbeId", labRow = FALSE)

plot of chunk unnamed-chunk-79

Unscaled

colors = c("red", "blue")

heatmap(geneData, ColSideColors = colors[groups], Rowv = NA, xlab = "Sample", 
    ylab = "ProbeId", labRow = FALSE, scale = "none")

plot of chunk unnamed-chunk-80

With custom color scheme

S = colorRampPalette(c(rgb(1, 0, 0), rgb(0, 0, 1)))(20)
colors = c("red", "blue")

heatmap(geneData, ColSideColors = colors[groups], Rowv = NA, xlab = "Sample", 
    ylab = "ProbeId", labRow = FALSE, col = S)

plot of chunk unnamed-chunk-81