David Wheeler, PhD, Laboratory of Biochemistry and Molecular Biology, NCI
1/22/2016
What is R?
R is a statistical program that can be used interactively to explore data or used as a programming language.
Bioconductor provides the foundation for almost a thousand libraries that allow biologists to analyze high throughput data.
R-Central, and a Cheat Sheet
The Central Repository for R and its Numerous Libraries:
A Quick Reference PDF:
The Virtues of R
Starting R
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
demo(graphics) # for a good introduction to graphics
The Global Environment
Starting an interactive R session puts you into the global R Environment
You can think of your global R Environment as the equivalent of a virtual disk
Saving Your Work
You can save the Environment at any time during an R session using save.image()
R offers to save this Environment for you whenever you quit
By default, R saves the Environment in the compressed file, .RData
This .RData image is loaded automatically the next time you start R
Project Directories
A common practice among R users is to take advantage of the .RData image files by making a new directory for each R project
In that case, each project has its own .RData file and that file is automatically loaded when you start R in the project directory
A First Look at Some Data
Some packages also come with sample data sets.
To see these, use data()
To load a particular data set, use e.g. data(DNase)
Loading Example Data
data()
loads an example data setls() # there are no variables
character(0)
data(DNase) # import `DNase into your Environment
ls() # the new data set now shows up in the listing of variables
[1] "DNase"
Seeing the Beginning of a Variable
head()
is used to see the first lines of a variablehead(DNase) # see the first 6 lines
Run conc density
1 1 0.04882812 0.017
2 1 0.04882812 0.018
3 1 0.19531250 0.121
4 1 0.19531250 0.124
5 1 0.39062500 0.206
6 1 0.39062500 0.215
Seeing the End of a Variable
tail()
is used to see the last lines of a variabletail(DNase) # see the last 6 lines
Run conc density
171 11 3.125 0.994
172 11 3.125 0.980
173 11 6.250 1.421
174 11 6.250 1.385
175 11 12.500 1.715
176 11 12.500 1.721
Data Summaries
summary()
to get a tabular summarysummary(DNase[, 2:3]) # get a statistical summary
conc density
Min. : 0.04883 Min. :0.0110
1st Qu.: 0.34180 1st Qu.:0.1978
Median : 1.17188 Median :0.5265
Mean : 3.10669 Mean :0.7192
3rd Qu.: 3.90625 3rd Qu.:1.1705
Max. :12.50000 Max. :2.0030
Your Command History
history()
….Rhistory
on exit.history(2) # see your last 2 commands
# save all the commands issued during this session to "h1.Rhistory"
savehistory("h1.Rhistory")
loadhistory("h1.Rhistory") # load a saved history
Ending a Session
quit()
to exit R.quit()
Save workspace image to ~/.RData? [y/n/c]:
Listing Variables
ls() # list variables and functions
[1] "DNase"
NewVar = "abc" # create a new variable
ls() # NewVar appears in the listing
[1] "DNase" "NewVar"
Handling Variables
rm()
to delete it.NewVar # display NewVar
[1] "abc"
rm(NewVar) # remove NewVar
ls() # it is gone
[1] "DNase"
apropos of Help!
# generate some possibilites...
tail(apropos("test"), 12)
[1] "power.t.test" "PP.test"
[3] "prop.test" "prop.trend.test"
[5] "quade.test" "shapiro.test"
[7] "t.test" "testInheritedMethods"
[9] "testPlatformEquivalence" "testVirtual"
[11] "var.test" "wilcox.test"
Help!
help()
documents a function?t.test
help(t.test)
t.test package:stats R Documentation
Student's t-Test
Description:
Performs one and two sample t-tests on vectors of data
Getting an Example
example("t.test")
t.test> require(graphics)
t.test> t.test(1:10, y = c(7:20)) # P = .00001855
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Packages Extend R
install.packages()
installs a packagelibrary('packagename')
loads an installed packagelibrary()
lists installed packagessearch()
lists the packages that are loadedLoading a Library
library() # see what is available
library("Biobase") # load a library
search() # see what is loaded
[1] ".GlobalEnv" "package:Biobase" "package:BiocGenerics"
[4] "package:parallel" "package:knitr" "package:stats"
[7] "package:graphics" "package:grDevices" "package:utils"
[10] "package:datasets" "package:methods" "Autoloads"
[13] "package:base"
Logging a Session to a File
sink()
redirects console output to a filesink()
is quick way to export data# print to file but also to console; don't overwrite previous output
sink("logfile.log", split = TRUE, append = TRUE)
sink() # close the logfile when you are finished
Running External R Code
ls() # listing shows one variable
[1] "DNase"
# create an R source file
cat("M='hi there'", file = "test.R")
# read and execute the R source file
source("test.R")
ls() # now there are two variables
[1] "DNase" "M"
Data Types
Atomic Types
Atomic Types are the basic elements used in R. The more complex objects (a designation that includes but is not limited to variables and functions) that you will use in your analyses are built from these.
Numerics
10
is a numeric
3.14159
is also a numeric
We see this using class()
.
class(10)
[1] "numeric"
class(3.14159)
[1] "numeric"
Integers
range = 1:10
range
[1] 1 2 3 4 5 6 7 8 9 10
class(range)
[1] "integer"
Missing Values
NA
means “not available”nana = c(1, 2, 3, 4, 5, 6, NA, NA, 9, 10)
is.na(nana) # generates a logical value for each member of nana
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
anyNA(nana) # return `TRUE` if any member of `nana` is NA
[1] TRUE
NA's are Problematic
NA
mean(nana) # NA if any element is NA
[1] NA
sum(nana) # same problem
[1] NA
mean(nana, na.rm = TRUE) # ignores the NA's
[1] 5
Characters
nchar()
to get the length of a stringlength()
returns the length of the variable holding the stringa = c("ABCDEFG", "ABC")
length(a)
[1] 2
nchar(a[1])
[1] 7
nchar(a)
[1] 7 3
Built-in Character Constants
letters
, and LETTERS
.letters[1:10]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
LETTERS[1:10]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
Combining Text and Numbers
paste
combines using a separator such as a spacepaste0
combines with no separatorprefix = "I've always said that "
suffix = ", and that's a fact."
paste0(prefix, "pi,", pi, ", is a pre-defined constant in R", suffix)
[1] "I've always said that pi,3.14159265358979, is a pre-defined constant in R, and that's a fact."
paste("Label", 1:10)
[1] "Label 1" "Label 2" "Label 3" "Label 4" "Label 5" "Label 6"
[7] "Label 7" "Label 8" "Label 9" "Label 10"
Logicals
TRUE
and FALSE
are constants with valuesTRUE == 1 # this is TRUE
[1] TRUE
FALSE == 0 # this is also TRUE
[1] TRUE
FALSE == 1 # but this is FALSE
[1] FALSE
AND, OR, NOT
AND
and OR
operators are &
and |
, respectively.!
(10 >= 9 & 10 < 10) | 10 == 10
[1] TRUE
!((10 >= 9 & 10 < 10) | 10 == 10)
[1] FALSE
Logicals as Subscripts
TRUE
elements are keptb = c(1:3, -5:0) # make a little vector
b > 0 # for each 'b' we have one logical value
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
b[b > 0] # return only the positive values in b
[1] 1 2 3
Complex Objects
Object | Composition |
---|---|
vector | an array of elements of the same type |
list | an ray of elements of differing types |
matrix | a vector with dimensions |
data.frame | list with dimensions and elements (the columns) of identical length |
factor | a vector of group identifiers |
function | a self-contained block of R commands |
Vectors
1+1
is a vector of length 1
a[1]
1
c()
combines objects into a vectorb = c(1, 2, -30, 40, -50)
b[c(3, 5)]
[1] -30 -50
Ranges are Vectors
:
2:4 # a vector of integers
[1] 2 3 4
b[2:4] # get items 2,3, and 4 in b
[1] 2 -30 40
Matrices
M[2,5]
…M[10]
dim()
Matrices
M = 1:12 # a vector
# restructure N as a matrix
N = matrix(data = M, nrow = 3, ncol = 4)
dim(N) # get its dimensions
[1] 3 4
N[1:2, ] # see first 2 rows of N
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
Matrices
dim(N) = c(6, 2) # change N's dimensions
N[1:2, ] # see first 2 rows of N
[,1] [,2]
[1,] 1 7
[2,] 2 8
N[10] # treat the matrix N as a vector
[1] 10
Computing Columnwise
colSums(N) # we can take column sums
[1] 21 57
colMeans(N) # we can take column means
[1] 3.5 9.5
Computing on Rows
What are the corresponding methods for rows?
Lists
A list differs from a vector in that its elements can contain a combination of (vectors, matrices, data.frames, or other lists)
You can address the elements of a list using two operators; [[]]
, and $
The first form is similar to a vector subscript and takes a variable between the brackets
The $
allows access to the elements by name
Constructing a List
L = list() # use the constructor function
L$first = c(1, 2, 3) # add 'first' to the list
L[[2]] = c(4, 5, 6) # add a second un-named item
L #take a look
$first
[1] 1 2 3
[[2]]
[1] 4 5 6
List Elements Have Names
names(L) # the names in 'L'
[1] "first" ""
names(L)[2] = "second" # name this element
L # take a look
$first
[1] 1 2 3
$second
[1] 4 5 6
L$second # access the second element
[1] 4 5 6
Data Frames
A data.frame is a list consisting of elements, treated as columns, of identical length.
Individual elements within a data.frame can be accessed using the double subscript notation of the matrix within the double square brackets of the list
Creating and Handling Data Frames
D = data.frame(A = 1:10, twiceA = 2 * 1:10)
D[[2]] # get element 2 which is column 2
[1] 2 4 6 8 10 12 14 16 18 20
names(D) # get the names of 'D'
[1] "A" "twiceA"
D$twiceA # get column 'twiceA'
[1] 2 4 6 8 10 12 14 16 18 20
Custom Objects
lm()
is an exampleData for Our Linear Model
rnorm()
to add noise to Y# generate 1000 random deviates with Standard Deviation of 100
noise = rnorm(1000, mean = 0, sd = 100)
x = 1:1000 # generate a vector of 1000 integers
y = 1:1000 + noise # same as X but with noise
Plot the Data
plot(y ~ x) # plot y 'as.a.function.of' x
Generate the Linear Model Object
~
) LF = lm(y ~ x) # perform a linear fit
class(LF) # see the class of LinFit
[1] "lm"
The Components of LF
names(LF) # see the elements of LinFit
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
LF$coefficients
(Intercept) x
-1.184735 1.009575
Get the Slope and Intercept Terms
plot(y ~ x) # must call plot() for setup
# draw the best fit line
abline(a = LF$coefficients[1], b = LF$coefficients[2], col = "red", lwd = 6)
Factors
levels()
of a factor give the number of groupsF = factor(c("A", "B", "B", "A")) # construct a factor
F # take a look
[1] A B B A
Levels: A B
levels(F) # get the levels of F
[1] "A" "B"
Using Factors in a T-test
group
column becomes a factor# create a data.frame with two columns called 'score' and 'group'
D = data.frame(score = c(100, 110, 95, 70, 75, 65), group = c("A", "A", "A",
"B", "B", "B"))
class(D$group) # what has happened?
[1] "factor"
Running the T-test
~
operator used in the t.test()
function defines the model to be tested–that score is a function of group.T = t.test(D$score ~ D$group) # run it
names(T)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
T$p.value
[1] 0.006133406
Changing the Levels
levels(D$group) # remind ourselves of the levels
[1] "A" "B"
levels(D$group) = c("control", "exp1") #change the levels
levels(D$group) #verify the change
[1] "control" "exp1"
A Corresponding Box Plot
boxplot(D$score ~ D$group)
Functions
x
y
Fsq = function(x) {
y = x^2
return(y)
}
Fsq(3)
[1] 9
Procedures in R
Writing and Reading Data
read.table()
reads tabular data write.table()
writes tabular dataread.csv()
and read.delim()
correspond to read.table()
but with different default parameters.Writing and Reading a Matrix
data(geneData) # load `geneData`
geneData[1, 1:3] # see a little of it
A B C
192.7420 85.7533 176.7570
write.table(geneData, file = "gd") # write
GD = read.table(file = "gd") # read
GD[1, 1:3] # take a look
A B C
AFFX-MurIL2_at 192.742 85.7533 176.757
What happened?
Exploring an Object
class()
,str()
, length()
, dim()
, and object.size
help you take the measure of an objectM = data.frame(A = 1:5, B = 2 * (1:5), C = 2^(1:5))
class(M) # get its class
[1] "data.frame"
object.size(M) # find out how much memory it is using
1024 bytes
Exploring an Object
length(M) # get the length
[1] 3
dim(M) # get the dimensions
[1] 5 3
str(M) # see its internal structure
'data.frame': 5 obs. of 3 variables:
$ A: int 1 2 3 4 5
$ B: num 2 4 6 8 10
$ C: num 2 4 8 16 32
Exploring an Object
head(M, 2) # see the first 2 lines
A B C
1 1 2 2
2 2 4 4
tail(M, 2) # see the last 2 lines
A B C
4 4 8 16
5 5 10 32
Exploring an Object
names(M) # only for lists
[1] "A" "B" "C"
colnames(M) # get the column names
[1] "A" "B" "C"
rownames(M) # get the rownames
[1] "1" "2" "3" "4" "5"
Exploring an Object
dimnames(M) # get both row names and column names
[[1]]
[1] "1" "2" "3" "4" "5"
[[2]]
[1] "A" "B" "C"
Why is the length of a data.frame equal to the number of columns?
Summarizing Data
summary()
, min()
, max()
, range()
, mean()
, sd()
, and others summarize datasummary(M) # a statistical summary
A B C
Min. :1 Min. : 2 Min. : 2.0
1st Qu.:2 1st Qu.: 4 1st Qu.: 4.0
Median :3 Median : 6 Median : 8.0
Mean :3 Mean : 6 Mean :12.4
3rd Qu.:4 3rd Qu.: 8 3rd Qu.:16.0
Max. :5 Max. :10 Max. :32.0
Why does this happen if we try to take the mean of all columns of
M
?
mean(M)
[1] NA
Editing Data
fix(M) # invoke the editor, make your changes and exit
edit(M) # invoke the editor but discard any changes on exit
N=edit(M) # invoke the editor, but assign the changed version to another variable
Other Editors
M=pico(M) # invoke "pico", make changes and save them into "M" again on exit
M=emacs(M) # invoke the 'emacs' editor, and save your changes on exit
Dealing with Missing Data
nana = c(1, 2, 3, 4, 5, 6, NA, NA, 9, 10) # nana has to NA's
sum(nana) # returns NA
[1] NA
One way to fix this, as we've seen:
sum(nana, na.rm = TRUE) # we can ignore NA's on a per function basis
[1] 40
Replacing Missing Data
nana[is.na(nana)] = 0 # we can also change them into something else
nana # take a look--the NA's are now zeros
[1] 1 2 3 4 5 6 0 0 9 10
Why does this method work?
Restructuring Data
M = data.frame(A = 1:3, B = 101:103, C = 31:33)
# rearrange the columns
M = data.frame(C = M$C, A = M$A, B = M$B)
Building by Columns
cbind()
joins objects together by columnsN = cbind(C = M$C, A = M$A, B = M$B) # use `cbind()` to bind the colums together
head(N) # take a look
C A B
[1,] 31 1 101
[2,] 32 2 102
[3,] 33 3 103
Building by Rows
rbind()
joins objects together by rowsN = rbind(N, c(1000, 2000, 3000)) # add another row
tail(N)
C A B
[1,] 31 1 101
[2,] 32 2 102
[3,] 33 3 103
[4,] 1000 2000 3000
Removing Columns from a Matrix
head(N, 2) # see the first two lines
C A B
[1,] 31 1 101
[2,] 32 2 102
N = N[, -1] # delete column 1
head(N, 2) # verify the deletion
A B
[1,] 1 101
[2,] 2 102
Removing Rows from a Matrix
head(N) # take a look at 'N'
A B
[1,] 1 101
[2,] 2 102
[3,] 3 103
[4,] 2000 3000
N = N[c(-4, -3, -2), ]
head(N)
A B
1 101
Labeling Data
names()
, colnames()
, and rownames()
are used both to get and set namesM = data.frame(A = 1:5, B = 21:25)
rownames(M) # the row names are now numbers
[1] "1" "2" "3" "4" "5"
rownames(M) = paste("gene", 1:5, sep = "_")
rownames(M) # verify the replacement
[1] "gene_1" "gene_2" "gene_3" "gene_4" "gene_5"
Subsetting Data
H = head(M, 2) # the first two rows, all columns
T = tail(M, 2) # the last two rows, all columns
N = M[5, ] # row 5, all columns
N = M[, 2] # column 2, all rows
N = M[1:3, ] # rows 1-3, all columns
N = M[c(1, 5, 9), ] # all columns, rows 1, 5, and 9
Transforming Rows or Columns
The apply()
function is a general way to apply any function to the rows or the columns of a matrix
or data.frame
Similar functions lapply()
and sapply()
can be used on the elements of a list
GD = apply(geneData, 1, median) # medians of the rows (second parameter=1)
GD = apply(geneData, 2, median) # medians of the columns (second parameter=2)
Graphing Data
data(geneData)
layout(matrix(c(1:4), 2, 2)) # use a layout matrix
matrix(c(1:4), 2, 2) # here is the layout matrix
[,1] [,2]
[1,] 1 3
[2,] 2 4
Set the Layout and Run the Loop
layout(matrix(c(1:4), 2, 2)) # use a layout matrix
for (j in 1:4) {
# let j vary over the range 1 to 4
barplot(geneData[, j], main = paste("Array", j))
}
A More Complex Layout
# here is the layout matrix we'll use
matrix(c(1, 1, 2, 1, 1, 3, 4, 5, 6), 3, 3)
[,1] [,2] [,3]
[1,] 1 1 4
[2,] 1 1 5
[3,] 2 3 6
Set the Layout and Run the Loop
layout(matrix(c(1, 1, 2, 1, 1, 3, 4, 5, 6), 3, 3))
for (j in 1:6) {
# let j vary over the range 1 to 6
barplot(geneData[, j], main = paste("Array", j)) # barplot the data in all the rows of column 'j'
}
Writing Graphics to a File
png("plot1", width = 500, height = 500) # open a graphics device that will put subsequent graphics into 'plot1'
dev.off() # write the graphics file and close the graphics device
Project 1: A Simple Analysis of Probe Intensity Data
geneData
which is a little expression matrix with 26 samples and 500 probe features.To do that we will cluster the samples on the basis of the probe intensities
We will use dist()
to generate a distance matrix d
which is the result of an all against-all-comparison between the rows of a matrix
The distance matrix will then be used to perform hierarchical clustering on the rows of our matrix
Generation of the Distance Matrix
dist()
computes distances for the rows of a matrixgeneData
, we will have to feed dist()
the transpose of geneData
t()
d = dist(t(geneData)) # transpose geneData so the columns (samples) become the rows and compute the distance matrix
Performing the Clustering
We next generate the clustering from d
and plot the resulting cluster object, h
h = hclust(d) # perform the hierarchical clustering and put the result in `h`
plot(h) # plot the dendrogram
abline(h = 14000, col = "red") # add a red line to the plot we've already made to show how to split the dendrogram
Deriving the Groups from the Tree
cutree()
cuts the dendrogram to generate groupsgroups = cutree(h, h = 14000)
groups # here are our groups
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 1 2 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1
groups = factor(groups)
levels(groups) # check the levels
[1] "1" "2"
Graphically Comparing Groups
1
shows a distinct difference in intensity between the two groups of arraysboxplot(geneData[1, ] ~ groups, col = "cyan")
Plotting more box plots using layout()
layout(matrix(c(1, 2, 3, 4, 5, 6), 2, 3)) # plot 6 graphs on one page
# loop through probes 1 to 6, boxplot, and give a title based on the probe
# names (rownames)
for (j in 1:6) {
boxplot(geneData[j, ] ~ groups, main = rownames(geneData)[j], col = "cyan")
}
Performing T-tests on the Groups
We can run a T-test on the two groups for each of the 500 probes
To do so we need a loop and we need to extract the P-value from each t.test()
return object we get
T = vector() # initialize an empty vector
for (j in 1:length(rownames(geneData))) {
T[j] = t.test(geneData[j, ] ~ groups)$p.value
}
…and plot the results as a histogram of P-values
hist(T, breaks = 50, main = "P-values for Probes", xlab = "P-Value", col = "green")
Flagging Significant P-values
Testing for \( P<=.05 \) yields a logical array in which the TRUE
values correspond to probes with significant P-values
Summing over the array gives us the count of significant probes
head(T < 0.05) # see a bit of the array
[1] TRUE FALSE TRUE FALSE FALSE TRUE
sum(T < 0.05) # get the sum
[1] 181
Constructing a Heatmap
heatmap3
, heatmapPlus
The basic heatmap
heatmap(geneData)
With groups color-coded
colors = c("red", "blue")
heatmap(geneData, ColSideColors = colors[groups])
Cleaned up a bit and labeled
colors = c("red", "blue")
heatmap(geneData, ColSideColors = colors[groups], Rowv = NA, xlab = "Sample",
ylab = "ProbeId", labRow = FALSE)
Unscaled
colors = c("red", "blue")
heatmap(geneData, ColSideColors = colors[groups], Rowv = NA, xlab = "Sample",
ylab = "ProbeId", labRow = FALSE, scale = "none")
With custom color scheme
S = colorRampPalette(c(rgb(1, 0, 0), rgb(0, 0, 1)))(20)
colors = c("red", "blue")
heatmap(geneData, ColSideColors = colors[groups], Rowv = NA, xlab = "Sample",
ylab = "ProbeId", labRow = FALSE, col = S)