David Wheeler, PhD, Laboratory of Biochemistry and Molecular Biology, NCI
11/22/2016
What is Bioconductor?
“Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages, and an active user community.”
Bioconductor Packages by Category
Software (1289) | AnnotationData (933) | ExperimentData (305) |
---|---|---|
AssayDomain (485) | ChipManufacturer (403) | Cancer (72) |
BiologicalQuestion (459) | ChipName (195) | MicrobiomeData (2) |
Infrastructure (230) | CustomArray (2) | RNASeqData (23) |
ResearchField (193) | CustomCDF (16) | ChIPseqData (8) |
StatisticalMethod (318) | Organism (600) | EColiData (2) |
Technology (696) | FunctionalAnnotation (13) | ExpressionData (7) |
WorkflowStep (675) | HapMap (7) |
Installing Bioconductor
# source the R coded installer
source("http://www.bioconductor.org/biocLite.R")
biocLite() # run the installer to get the basic library
biocLite("GenomicRanges") # install a specific library
Loading Biobase Generates a Report
library(Biobase) #load Biobase
In addition to Biobase, two other packages upon which Biobase depends are automatically loaded. These are BiocGenerics and parallel.
Some functions from other packages are “masked”. This means that a recently loaded library defines a function with the same name as that of a previously loaded library–the new function supercedes the old.
Vignettes for a package take you through a typical analysis and are a good way to get acquainted with the package.
Biobase comes with Sample Data
data(package="Biobase")
Data sets in package ‘Biobase’:
SW Class to Contain High-Throughput Assays and
Experimental Metadata
aaMap Dataset: Names and Characteristics of Amino
Acids
geneCov Sample expression matrix and phenotype
data.frames.
geneCovariate Sample expression matrix and phenotype
data.frames.
geneData Sample expression matrix and phenotype
data.frames.
reporter Example data.frame representing reporter
information
sample.ExpressionSet Dataset of class 'ExpressionSet'
sample.MultiSet Data set of class 'MultiSet'
seD Sample expression matrix and phenotype
data.frames.
Let's Load a Small Dataset...
data("aaMap") # load 'aaMap'
aaMap[1:2, ] # take a look
name let.1 let.3 scProp hyPhilic acidic
1 alanine A ala nonpolar FALSE NA
2 cysteine C cys polar NA NA
table(aaMap$scProp) # tabulate side chain properties
nonpolar polar
9 11
Partial Listing Biobase Functions
ls("package:Biobase")[c(6, 12, 125, 101, 89, 91, 71, 49, 51, 45, 41, 44)] # get a listing of some of the methods of Biobase
[1] "AnnotatedDataFrame" "assayData" "varMetadata"
[4] "sampleNames" "protocolData" "pubMedIds"
[7] "MIAME" "featureData" "featureNames"
[10] "exprs" "experimentData" "ExpressionSet"
The ExpressionSet
Required format for most functions dealing with microarray data
Objects with very similar structure used for other types of data (e.g. SummarizedExperiment
for NGS data, MRexperiment
for microbiome data)
So, let's build one from basic components
Constructing an ExpressionSet
ExpressionSet
is used to contain microarray data. assayData (exprs) | phenoData | experimentData | annotation |
---|---|---|---|
environment | AnnotatedDataFrame | MIAME object | character |
expression data | sample information | description of experiment | name the matching annotation database |
featureData | protocolData | classVersion |
---|---|---|
AnnotatedDataFrame | AnnotatedDataFrame | Versions |
feature information | protocol information |
The Plan
Biobase
to build an ExpressionSet
from scratch using 4 of the 7 slotsfeatureData
and protocolData
for the sake of brevity.classVersion
slot is filled in automatically and need not concern us. First Component: Expression Matrix
geneData
for the first component.data(geneData) # a sample expression matrix
colnames(geneData) # the sample names
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
geneData[1:2, 1:3] # the matrix itself
A B C
AFFX-MurIL2_at 192.742 85.7533 176.7570
AFFX-MurIL10_at 97.137 126.1960 77.9216
Second Component: Phenotype Data
data.frames
we use to build the AnnotatedDataFrame
of the phenoData
slot.data(geneCov) # the sample information
head(geneCov) # take a look at what we have
cov1 cov2 cov3
A 1 1 1
B 1 1 1
C 1 1 1
D 1 1 1
E 1 2 1
F 1 2 1
How can we change cov1 to 'gender', cov2 to “case”, and cov3 to “type”?
Naming the Covariates
names()
function to assign new names to the covariatesnames(geneCov)
[1] "cov1" "cov2" "cov3"
names(geneCov) = c("gender", "case", "type")
names(geneCov)
[1] "gender" "case" "type"
Annotating the Phenodata
data.frame
to the phenotype datadata.frame
constructor is adding 3 columns to the frame by name while the second is labeling the rows.# Here we describe the columns--this is the second data.frame of phenData
metadata = data.frame(labelDescription = c("patient gender", "Case/control",
"Tumor type"), row.names = c("gender", "case", "type"))
metadata
labelDescription
gender patient gender
case Case/control
type Tumor type
The Phenodata Constructor
phenoData
component which is an AnnotatedDataFrame
Annotated DataFrame
consists of one data.frame
to hold the covariate data and a second that describes the covariates. new()
function is a constructor, taking as its first argument the type of object to construct.# construct the `phenoData` component.
phenoData = new("AnnotatedDataFrame", data = geneCov, varMetadata = metadata)
phenoData
An object of class 'AnnotatedDataFrame'
rowNames: A B ... Z (26 total)
varLabels: gender case type
varMetadata: labelDescription
Third Component: Annotation
hgu95av2
annotation database.It provides mappings between probe identifiers and biological data
hgu95av2ACCNUM | hgu95av2CHR | hgu95av2CHRLOC | hgu95av2ENZYME |
hgu95av2GENENAME | hgu95av2GO | hgu95av2MAP | hgu95av2OMIM |
hgu95av2PFAM | hgu95av2PMID | hgu95av2PROSITE | hgu95av2REFSEQ |
hgu95av2UNIGENE | hgu95av2CHRLENGTHS | hgu95av2ENZYME2PROBE | hgu95av2GO2ALLPROBES |
hgu95av2PATH2PROBE | hgu95av2PMID2PROBE |
This is an Easy One
annotation = "hgu95av2" # add a reference to our annotation database
The MIAMI Component
experimentData = new("MIAME", name = "C. Darwin", lab = "CCR", contact = "email.gov",
title = "Small Experiment", abstract = "An example ExpressionSet", url = "www.lab.gov")
Putting it all Together
Now, we can build the ExpressionSet
from the 4 components using the ExpressionSet()
constructor.
eset = ExpressionSet(assayData = geneData, phenoData = phenoData, experimentData = experimentData,
annotation = annotation)
str(eset)
Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
.. .. ..@ name : chr "C. Darwin"
.. .. ..@ lab : chr "CCR"
.. .. ..@ contact : chr "email.gov"
.. .. ..@ title : chr "Small Experiment"
.. .. ..@ abstract : chr "An example ExpressionSet"
.. .. ..@ url : chr "www.lab.gov"
.. .. ..@ pubMedIds : chr ""
.. .. ..@ samples : list()
.. .. ..@ hybridizations : list()
.. .. ..@ normControls : list()
.. .. ..@ preprocessing : list()
.. .. ..@ other : list()
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 2
.. .. .. .. .. ..$ : int [1:3] 1 0 0
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ assayData :<environment: 0x7f94fbb63378>
..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 3 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr [1:3] "patient gender" "Case/control" "Tumor type"
.. .. ..@ data :'data.frame': 26 obs. of 3 variables:
.. .. .. ..$ gender: num [1:26] 1 1 1 1 1 1 1 1 1 1 ...
.. .. .. ..$ case : num [1:26] 1 1 1 1 2 2 2 2 2 2 ...
.. .. .. ..$ type : num [1:26] 1 1 1 1 1 1 1 1 2 2 ...
.. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr(0)
.. .. ..@ data :'data.frame': 500 obs. of 0 variables
.. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ annotation : chr "hgu95av2"
..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr(0)
.. .. ..@ data :'data.frame': 26 obs. of 0 variables
.. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. ..@ .Data:List of 4
.. .. .. ..$ : int [1:3] 3 3 0
.. .. .. ..$ : int [1:3] 2 32 0
.. .. .. ..$ : int [1:3] 1 3 0
.. .. .. ..$ : int [1:3] 1 0 0
Accessor Functions
ExpressionSet
is an object of a new type called 'S4' Such objects provide accessor functions to allow us to get at their data.
The data in the 7 slots is available through the accessor functions shown earlier in the table.
Note that the covariates are listed in the @data
data.frame within the @phenoData
slot and that they are prefixed with a $
sign.
These elements can be reached directly as though they were elements of a list, e.g. eset$gender
.
Analysis of the First Covariate, Gender
eset
.varMetadata
accessor function to see them.varMetadata(eset) # show the metadata about the phenotypes
labelDescription
gender patient gender
case Case/control
type Tumor type
Let's examine the effect of gender on our set of 500 probes.
The Plan
exprs()
function extracts the expression matrix of 500 probes (rows) and 26 samples (columns) from eset
exprs()
to get at the dataPartitioning by Gender
data.frame
in the phenoData
slot of eset
contains 26 entries, one per sample
$
operatorE = exprs(eset) # extract the exression data as a matrix
gender = eset$gender # we can get the gender array this way
gender # take a look
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
The Plan
gender
as a grouping vector for our 26 samplesboxplot
a model for the expression data in the first row that says:E[1,] is.a.function.of gender
The is.a.function.of operator is the tilde ~
The Boxplot
boxplot(E[1, ] ~ gender)
A T-Test by Gender
t.test
of the probe intensitiest.test(E[1, ] ~ gender)
Welch Two Sample t-test
data: E[1, ] by gender
t = -1.0582, df = 23.578, p-value = 0.3007
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-54.97275 17.73203
sample estimates:
mean in group 1 mean in group 2
105.8006 124.4210
Scale up for all 500 Probes
for
loop.T = vector() # initialize an array for P-Values
for (n in 1:500) {
T[n] = t.test(E[n, ] ~ gender)$p.value
} # do just what we did above for each probe and extract the P-value from the t.test object
Did the T-Test Loop Work?
length(T) # just check to make sure we got 500 P-vals
[1] 500
head(T) # take a look
[1] 0.3006994 0.5717812 0.9852404 0.3062817 0.6786697 0.6793440
which(T <= 0.05) #use 'which' to see the indices of the significant cases
[1] 30 85 147 199 279 286 301 376 400
How to Make 9 Boxplots: the Plan
Let's make boxplots for these 9 interesting probes.
We first divide the graphical display into 12 areas
Next, we plot 9 boxplots and label them with probe id and P-Value
The Code
layout(matrix(1:9, 3, 3))
F = featureNames(eset) # if we want to add the feature names to the graph title
for (n in which(T <= 0.05)) {
boxplot(E[n, ] ~ gender, main = paste(F[n], format(T[n], digits = 3)))
} # loop through all cases where the P-val <= 0.05
The 9 Boxplots
Adding Our Own Covariate
geneData
using hierarchical clusteringLet's add those groups to the phenoData
slot of eset
and reanalyze.
First, we'll create the distance matrix needed by hclust()
# transpose the expression matrix to cluster samples (columns) rather than probes (rows)
d=dist(t(E))
h=hclust(d) # perform the clustering
plot(h) # plot the cluster dendrogram
abline(h=14000,col="red")
Here's the Dendrogram
# transpose the expression matrix to cluster samples (columns) rather than
# probes (rows)
d = dist(t(E))
h = hclust(d) # perform the clustering
plot(h) # plot the cluster dendrogram
abline(h = 14000, col = "red")
Cutting the Dendrogram
groups = factor(cutree(h, h = 14000)) # cut the tree at 1400
groups # take a look at which samples fall into which groups
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 1 2 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1
Levels: 1 2
Adding the New Covariate
hc.group
.eset$hc.group = groups # add the new covariate
varMetadata(eset) # verify that we have done what we wanted to do
labelDescription
gender patient gender
case Case/control
type Tumor type
hc.group <NA>
Running the Loop
T = vector()
for (n in 1:500) {
T[n] = t.test(E[n, ] ~ eset$hc.group)$p.value
} # perform the 500 T-tests
T=array();for(n in 1:500){T[n]=t.test(E[n,]~groups)$p.value}
How many probes pass the test?
which(T <= 0.05) # see which pass the test
[1] 1 3 6 8 12 20 22 23 24 25 27 28 31 35 36 38 46
[18] 47 48 51 58 65 67 73 74 82 86 89 91 96 106 110 117 118
[35] 121 122 123 126 128 129 131 134 139 140 146 153 157 158 163 169 174
[52] 177 178 180 181 184 185 186 189 192 193 196 197 201 203 204 205 207
[69] 208 210 214 218 220 222 224 226 232 233 235 236 237 239 242 251 253
[86] 256 262 263 264 265 266 269 270 272 275 276 281 282 288 290 291 295
[103] 296 297 298 299 302 304 306 307 309 312 313 315 316 317 320 326 327
[120] 328 329 333 334 336 341 343 344 346 351 353 355 362 363 369 373 375
[137] 379 381 383 384 387 389 390 392 393 396 397 398 403 404 405 409 410
[154] 423 426 431 432 433 436 437 441 445 446 452 455 458 460 465 467 469
[171] 470 471 474 476 482 483 485 487 494 495 499
Getting Minimum Values
min()
which.min()
:min(T)
[1] 3.268072e-10
which.min(T) # Let's see the index of the probe with the best P-value
[1] 272
Boxplotting the Best Case
boxplot()
this best case to see how good it can get.
boxplot(E[which.min(T),]~eset$hc.group,main=paste("Probe",which.min(T),"P-val:",format(T[which.min(T)],digits=3)))
The Best Case
boxplot(E[which.min(T), ] ~ eset$hc.group, main = paste("Probe", which.min(T),
"P-val:", format(T[which.min(T)], digits = 3)))
Getting the Probe ID for the Best Case
featureNames(eset[which.min(T)])
[1] "31511_at"
Biostrings
The Biostrings library provides a container for DNA, RNA, or protein sequences called a BStringSet. These come in three varieties:
DNA | RNA | Proteins, Peptides |
---|---|---|
DNAStringSet() | RNAStringSet() | AAStringSet() |
Let's load the Biostrings library and take a look at the sample data provided:
library(Biostrings) # load the library 'Biostrings'
Biostrings
Next, let's see if there is any sample data to explore:
data(package = "Biostrings") # list the example data sets of 'Biostrings'
Data sets in package ‘Biostrings’:
BLOSUM100 Scoring matrices
BLOSUM45 Scoring matrices
BLOSUM50 Scoring matrices
BLOSUM62 Scoring matrices
BLOSUM80 Scoring matrices
HNF4alpha Known HNF4alpha binding sequences
PAM120 Scoring matrices
PAM250 Scoring matrices
PAM30 Scoring matrices
PAM40 Scoring matrices
PAM70 Scoring matrices
phiX174Phage Versions of bacteriophage phiX174
complete genome and sample short reads
...
Biostrings
To try out some functions found in Biostrings, load the phiX174Phage
stringset.
data(phiX174Phage) # load one of the sample data sets
phiX174Phage # see how it looks
A DNAStringSet instance of length 6
width seq names
[1] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Genbank
[2] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA RF70s
[3] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA SS78
[4] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Bull
[5] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA G97
[6] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA NEB03
Biostrings
Remember that you can determine an object's class using class()
.
class(phiX174Phage) # determine the object class of 'phiX174Phage'
[1] "DNAStringSet"
attr(,"package")
[1] "Biostrings"
Biostrings
names(phiX174Phage)
[1] "Genbank" "RF70s" "SS78" "Bull" "G97" "NEB03"
phiX174Phage$Genbank
5386-letter "DNAString" instance
seq: GAGTTTTATCGCTTCCATGACGCAGAAGTTAACA...TTCGATAAAAATGATTGGCGTATCCAACCTGCA
Biostrings: Creating a DNAString object
Here we make a DNAString to contain the sequence of exon 4 of the human oncogene homologue (c-abl) of the abelson murine leukemia virus.
abl1 = DNAString("GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGAACCACTTGGTGAAGGTAGCTGATTTTGGCCTGAGCAGGTTGATGACAGGGGACACCTACACAGCCCATGCTGGAGCCAAGTTCCCCATCAAATGGACTGCACCCGAGAGCCTGGCCTACAACAAGTTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC") # use a 'constructor' to make a DNAString
abl1 # take a look at the new object
192-letter "DNAString" instance
seq: GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGA...TTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC
RNAString(abl1) # Now, feed the new DNAString into RNAString to convert
192-letter "RNAString" instance
seq: GAUCUUGCUGCCCGAAACUGCCUGGUAGGGGAGA...UUCUCCAUCAAGUCCGACGUCUGGGGUAAGGGC
Biostrings: Basic Functions
reverse(abl1) # reverse the sequence
192-letter "DNAString" instance
seq: CGGGAATGGGGTCTGCAGCCTGAACTACCTCTTG...GAGGGGATGGTCCGTCAAAGCCCGTCGTTCTAG
complement(abl1) # generate the complement of the sequence
192-letter "DNAString" instance
seq: CTAGAACGACGGGCTTTGACGGACCATCCCCTCT...AAGAGGTAGTTCAGGCTGCAGACCCCATTCCCG
reverseComplement(abl1) # generate the reverse complement
192-letter "DNAString" instance
seq: GCCCTTACCCCAGACGTCGGACTTGATGGAGAAC...CTCCCCTACCAGGCAGTTTCGGGCAGCAAGATC
Biostrings: Basic Functions
abl1.aas = translate(abl1) # translate the sequence
abl1.aas
64-letter "AAString" instance
seq: DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWGKG
length(abl1.aas)
[1] 64
Biostrings: alphabetFrequency()
The function alphabetFrequency
looks interesting, so let's try it.
a = alphabetFrequency(phiX174Phage) # generate a table of base counts
a
A C G T M R W S Y K V H D B N - + .
[1,] 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Can we add rownames to this output?
Biostrings: alphabetFrequency()
names(phiX174Phage)
[1] "Genbank" "RF70s" "SS78" "Bull" "G97" "NEB03"
rownames(a) = names(phiX174Phage)
a
A C G T M R W S Y K V H D B N - + .
Genbank 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RF70s 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SS78 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Bull 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G97 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NEB03 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
How do we use this matrix?
(a[, "G"] + a[, "C"])/(a[, "G"] + a[, "C"] + a[, "A"] + a[, "T"]) # use the variable
Genbank RF70s SS78 Bull G97 NEB03
0.4476420 0.4472707 0.4472707 0.4470850 0.4472707 0.4470850
We can easily take a couple more steps.
dinucleotideFrequency(phiX174Phage) # get a table of dinucleotide counts
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,] 395 261 251 383 257 229 267 404 327 347 255 325 312 320 480 572
[2,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[3,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[4,] 395 260 252 384 257 229 266 403 327 346 254 326 313 320 480 573
[5,] 395 261 253 382 257 229 266 404 326 346 254 327 314 320 479 572
[6,] 394 260 253 384 258 227 265 405 327 347 254 325 313 321 480 572
trinucleotideFrequency(phiX174Phage) # get a table of trinucleotide counts
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA
[1,] 133 67 93 102 47 62 63 89 72 53 74 52 60 56 140 127 91
[2,] 133 66 93 102 47 62 62 89 73 53 75 52 60 56 141 127 91
[3,] 133 66 93 102 47 62 62 89 73 53 75 52 60 56 141 127 91
[4,] 133 66 93 103 47 62 62 89 73 53 74 52 60 56 140 128 91
[5,] 133 66 93 103 47 62 62 90 73 53 75 52 60 56 139 127 91
[6,] 133 66 93 102 47 62 62 89 73 53 75 52 60 56 141 127 91
CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC
[1,] 40 61 64 59 38 70 62 52 92 31 92 78 67 132 127 70 82
[2,] 40 62 64 59 38 70 62 52 92 30 91 78 67 132 127 69 82
[3,] 40 62 64 59 38 70 62 52 92 30 91 78 67 132 127 69 82
[4,] 39 62 64 58 38 71 62 52 92 30 92 78 67 132 126 70 82
[5,] 40 62 63 58 38 71 62 52 92 30 92 78 67 132 127 70 82
[6,] 40 62 64 59 36 69 63 52 92 30 91 78 68 132 127 69 82
GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG
[1,] 74 101 60 71 67 149 63 72 20 100 54 65 66 140 101 72 23
[2,] 74 102 61 71 66 149 63 72 19 100 55 65 65 140 101 72 24
[3,] 74 102 61 71 66 149 63 72 19 100 55 65 65 140 101 72 24
[4,] 74 101 61 71 66 148 62 72 20 100 54 65 67 140 101 73 23
[5,] 74 100 61 71 66 148 62 72 19 101 55 65 67 140 101 73 24
[6,] 74 102 61 71 66 149 63 72 19 100 55 65 65 140 101 72 24
TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
[1,] 116 91 58 67 104 139 130 130 81 120 132 142 178
[2,] 116 91 58 67 104 138 130 130 82 120 132 142 178
[3,] 116 91 58 67 104 138 130 130 82 120 132 142 178
[4,] 116 91 58 67 104 139 129 130 82 121 132 141 179
[5,] 116 91 58 67 104 138 129 130 82 121 132 141 178
[6,] 116 91 58 68 104 138 130 130 82 120 132 142 178
O = oligonucleotideFrequency(phiX174Phage, 5)
hist(O)
BSgenome
The BSgenome library provides a container for genomes. A number of genomes have been packaged for Bioconductor.
library(BSgenome) # load the library 'BSgenome'
available.genomes()[1:20] # get a list of genomes you can install
[1] "BSgenome.Alyrata.JGI.v1"
[2] "BSgenome.Amellifera.BeeBase.assembly4"
[3] "BSgenome.Amellifera.UCSC.apiMel2"
[4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
[5] "BSgenome.Athaliana.TAIR.04232008"
[6] "BSgenome.Athaliana.TAIR.TAIR9"
[7] "BSgenome.Btaurus.UCSC.bosTau3"
[8] "BSgenome.Btaurus.UCSC.bosTau3.masked"
[9] "BSgenome.Btaurus.UCSC.bosTau4"
[10] "BSgenome.Btaurus.UCSC.bosTau4.masked"
[11] "BSgenome.Btaurus.UCSC.bosTau6"
[12] "BSgenome.Btaurus.UCSC.bosTau6.masked"
[13] "BSgenome.Btaurus.UCSC.bosTau8"
[14] "BSgenome.Celegans.UCSC.ce10"
[15] "BSgenome.Celegans.UCSC.ce11"
[16] "BSgenome.Celegans.UCSC.ce2"
[17] "BSgenome.Celegans.UCSC.ce6"
[18] "BSgenome.Cfamiliaris.UCSC.canFam2"
[19] "BSgenome.Cfamiliaris.UCSC.canFam2.masked"
[20] "BSgenome.Cfamiliaris.UCSC.canFam3"
Let's load the C. elegans genome to experiment.
library(BSgenome.Celegans.UCSC.ce6) # load your local copy of the C. elegans genome
What can we do with BSgenomes?
Use ls("package:BSgenome") to see
ls("package:BSgenome")[1:20] # list the methods of 'BSgenome'
[1] "alphabetFrequency" "as.data.frame"
[3] "as.data.frame.BSgenomeViews" "as.list"
[5] "available.genomes" "available.SNPs"
[7] "blocksizes" "breakpoints"
[9] "bsapply" "BSgenome"
[11] "BSgenomeViews" "coerce"
[13] "colnames" "commonName"
[15] "compatibleGenomes" "consensusMatrix"
[17] "consensusString" "countPWM"
[19] "elementNROWS" "end"
BSgenome.Celegans.UCSC.ce6 # see some header information
Worm genome:
# organism: Caenorhabditis elegans (Worm)
# provider: UCSC
# provider version: ce6
# release date: May 2008
# release name: WormBase v. WS190
# 7 sequences:
# chrI chrII chrIII chrIV chrV chrX chrM
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
# operator to access a given sequence)
BSgenome.Celegans.UCSC.ce6[["chrI"]] # display chromosome I
15072421-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
seqinfo(BSgenome.Celegans.UCSC.ce6) # get a summary of the sequences included
Seqinfo object with 7 sequences (1 circular) from ce6 genome:
seqnames seqlengths isCircular genome
chrI 15072421 FALSE ce6
chrII 15279323 FALSE ce6
chrIII 13783681 FALSE ce6
chrIV 17493785 FALSE ce6
chrV 20919568 FALSE ce6
chrX 17718854 FALSE ce6
chrM 13794 TRUE ce6
seqnames(BSgenome.Celegans.UCSC.ce6) # get only the seq names
[1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrX" "chrM"
seqlengths(BSgenome.Celegans.UCSC.ce6) # get an array of the sequence lengths
chrI chrII chrIII chrIV chrV chrX chrM
15072421 15279323 13783681 17493785 20919568 17718854 13794
barplot(seqlengths(BSgenome.Celegans.UCSC.ce6)) # generate a quick barplot
What is the GC content of the C. elegans genome chromosome I?
F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[["chrI"]]) # get the frequency table
F # take look at the table
A C G T M R W S Y
4835939 2695879 2692150 4848453 0 0 0 0 0
K V H D B N - + .
0 0 0 0 0 0 0 0 0
sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100 # compute % GC
[1] 35.7476
Exercise: How can we comput %GC for all chromosomes?
Fr = vector() # initialize a vector to hold the %GC values
for (s in seqnames(BSgenome.Celegans.UCSC.ce6)) {
# loop through each of the 'seqnames'--these are the chromosomes
F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[[s]])
# get a table for one chromosome,compute %GC and store this in 'Fr[s]'
Fr[s] = sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100
}
Fr # display Fr to see if we got it right
chrI chrII chrIII chrIV chrV chrX chrM
35.74760 36.20196 35.66144 34.59386 35.42942 35.20301 23.77845
barplot(Fr, ylab = "%GC", xlab = "C. elegans Chromosome", main = "GC Percentage by Chromosome") # barplot it!
GenomicFeatures
The GenomicFeatures library provides the functions for the extraction of the sequences of features such as genes, exons, introns, and promoters from BSgenomes.
Let's take a look at the C. elegans genome to start.
library(TxDb.Celegans.UCSC.ce6.ensGene) # load the C. elegans transcript database
By loading this transcript database, we have also loaded a new library called GenomicFeatures
that is required to use the database. What new functions have we acquired?
ls("package:GenomicFeatures")[1:40] # list the methods of 'GenomicFeatures'
[1] "as.list" "asBED"
[3] "asGFF" "cds"
[5] "cdsBy" "cdsByOverlaps"
[7] "coverageByTranscript" "DEFAULT_CIRC_SEQS"
[9] "disjointExons" "distance"
[11] "exons" "exonsBy"
[13] "exonsByOverlaps" "extractTranscriptSeqs"
[15] "extractUpstreamSeqs" "features"
[17] "fiveUTRsByTranscript" "genes"
[19] "getChromInfoFromBiomart" "getChromInfoFromUCSC"
[21] "getPromoterSeq" "id2name"
[23] "intronsByTranscript" "isActiveSeq"
[25] "isActiveSeq<-" "makeFDbPackageFromUCSC"
[27] "makeFeatureDbFromUCSC" "makePackageName"
[29] "makeTxDb" "makeTxDbFromBiomart"
[31] "makeTxDbFromGFF" "makeTxDbFromGRanges"
[33] "makeTxDbFromUCSC" "makeTxDbPackage"
[35] "makeTxDbPackageFromBiomart" "makeTxDbPackageFromUCSC"
[37] "mapFromTranscripts" "mapIdsToRanges"
[39] "mapRangesToIds" "mapToTranscripts"
The functions exons()
, genes()
, and exons.By()
look interesting so let's try them out:
ce.exons = exons(TxDb.Celegans.UCSC.ce6.ensGene) # get the exons coordinates
head(ce.exons, 2) # take a look
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <integer>
[1] chrI [11495, 11561] + | 1
[2] chrI [11499, 11557] + | 2
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
ce.genes = genes(TxDb.Celegans.UCSC.ce6.ensGene) # get the gene coordinates
head(ce.genes, 2) #take a look
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
2L52.1 chrII [ 1867, 4663] + | 2L52.1
2RSSE.1 chrII [15268114, 15273216] + | 2RSSE.1
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
The gene_ids in the first metadata column are ENSEMBL ids and can be used to search in the UCSC Genome Browser.
# get the exons with their gene assignments
ce.exons.bygene = exonsBy(TxDb.Celegans.UCSC.ce6.ensGene, by = c("gene"))
head(ce.exons.bygene, 1) # take a look
GRangesList object of length 1:
$2L52.1
GRanges object with 7 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chrII [1867, 1911] + | 21558 <NA>
[2] chrII [2506, 2694] + | 21559 <NA>
[3] chrII [2738, 2888] + | 21560 <NA>
[4] chrII [2931, 3036] + | 21561 <NA>
[5] chrII [3406, 3552] + | 21562 <NA>
[6] chrII [3802, 3984] + | 21563 <NA>
[7] chrII [4201, 4663] + | 21564 <NA>
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
GenomicRanges
The GenomicRanges library provides the framework for feature extraction and analysis. Among its most powerful functions are those used ot determine feature overlap. We load the library as usual:
library("GenomicRanges") # load the library 'GenomicRanges'
And we list the newly acquired functions in the usual manner:
ls("package:GenomicRanges")[1:20] # list the methods of 'GenomicRanges'
[1] "absoluteRanges" "as.data.frame"
[3] "as.factor" "bindAsGRanges"
[5] "binnedAverage" "checkConstraint"
[7] "coerce" "countOverlaps"
[9] "coverage" "disjoin"
[11] "disjointBins" "distance"
[13] "distanceToNearest" "duplicated"
[15] "duplicated.GenomicRanges" "elementMetadata"
[17] "elementMetadata<-" "end"
[19] "end<-" "findOverlaps"
Let's try out a few
length(ce.exons) # how many exons are there in C. elegans?
[1] 146833
head(width(ce.exons)) # get the lengths for the first 6 exons
[1] 67 59 63 57 72 67
Take a free histogram
hist(width(ce.exons), main = "Distribution of Exon Lengths for C. Elegans")
The histogram looks a bit cramped. How can we correct this?
hist(log10(width(ce.exons)), main = "Distribution of Exon Lengths for C. Elegans")
GenomicAlignments
The GenomicAlignments library provides containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide sequences of the alignments.
library(GenomicAlignments)
Exercise: What new functions have we acquired?
bam <- readGAlignments(system.file("extdata", "ex1.bam", package = "Rsamtools"))
bam[1:2]
GAlignments object with 2 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
njunc
<integer>
[1] 0
[2] 0
-------
seqinfo: 2 sequences from an unspecified genome
B7_591:4:96:693:509 73 seq1 1 99 36M * 0 0
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
MF:i:18 Aq:i:73 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_65:7:152:368:113 73 seq1 3 99 35M * 0 0
CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): MF:i:18 Aq:i:66 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
Cigar Strings
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G G C T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M
sort(table(cigar(bam)), decreasing = TRUE)
35M 36M 40M 34M 33M 14M4I17M 15M4I16M 16M4I15M
2804 283 112 37 6 4 4 3
7M4I24M 17M4I14M 9M2I24M 10M4I21M 11M5I19M 12M4I19M 13M1D22M 13M4I18M
3 2 2 1 1 1 1 1
14M5I17M 18M4I13M 18M5I12M 7M2I31M 8M4I24M 9M1D26M
1 1 1 1 1 1
Counting Overlaps Between Features
findOverlaps
. ce.exons
object to illustrate basic principlesOverlap the first exon with the others
ce.exons[1:5]
GRanges object with 5 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <integer>
[1] chrI [11495, 11561] + | 1
[2] chrI [11499, 11557] + | 2
[3] chrI [11499, 11561] + | 3
[4] chrI [11505, 11561] + | 4
[5] chrI [11618, 11689] + | 5
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
findOverlaps(ce.exons[1], ce.exons[-1], ignore.strand = TRUE)
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 1 3
-------
queryLength: 1 / subjectLength: 146832
Distribution of # overlaps for first 100 exons
E = array()
for (i in 1:100) {
F = findOverlaps(ce.exons[i], ce.exons[-i], ignore.strand = TRUE)
E[i] = length(queryHits(F))
}
hist(E)
Palindromes witin exons: the Plan
findOverlaps
P = findPalindromes(BSgenome.Celegans.UCSC.ce6$chrI)
length(P)
[1] 213154
The palindromes
P
Views on a 15072421-letter DNAString subject
subject: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC...CTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
views:
start end width
[1] 460 468 9 [TTTTCAAAA]
[2] 488 497 10 [AAAACGTTTT]
[3] 507 514 8 [GAAGCTTC]
[4] 573 582 10 [TTAGGCCTAA]
[5] 856 863 8 [GTAGCTAC]
... ... ... ... ...
[213150] 15071791 15071798 8 [AAAATTTT]
[213151] 15071985 15071993 9 [GTCTGAGAC]
[213152] 15072046 15072053 8 [TAGTACTA]
[213153] 15072053 15072064 12 [AGTGCCGGCACT]
[213154] 15072085 15072093 9 [GAATGATTC]
class(ce.exons)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
Converting to GRanges
Pg = GRanges(seq = "chrI", IRanges(start = start(P), end = end(P)))
Pg
GRanges object with 213154 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrI [460, 468] *
[2] chrI [488, 497] *
[3] chrI [507, 514] *
[4] chrI [573, 582] *
[5] chrI [856, 863] *
... ... ... ...
[213150] chrI [15071791, 15071798] *
[213151] chrI [15071985, 15071993] *
[213152] chrI [15072046, 15072053] *
[213153] chrI [15072053, 15072064] *
[213154] chrI [15072085, 15072093] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Finding the overlaps
F = findOverlaps(Pg, ce.exons)
length(F)
[1] 46436
Selecting palindromes by width
Maybe longer palindromes are underrepresented within exons
The histogram suggests a cutoff of 101.2 (15) might define a subset of long palindromes
hist(log10(width(P)))
There are 5889 long palindromes on chr1
P15 = P[width(P) > 15]
Pg15 = GRanges(seq = "chrI", IRanges(start = start(P15), end = end(P15)))
length(Pg15)
[1] 5889
Long palidromes seem to avoid exons
F15 = findOverlaps(Pg15, ce.exons)
length(F15)
[1] 310
length(F15)/length(Pg15)
[1] 0.05264052
length(F)/length(Pg)
[1] 0.2178519