David Wheeler, PhD, Laboratory of Biochemistry and Molecular Biology, NCI
11/22/2016
What is Bioconductor?“Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages, and an active user community.”
Bioconductor Packages by Category| Software (1289) | AnnotationData (933) | ExperimentData (305) |
|---|---|---|
| AssayDomain (485) | ChipManufacturer (403) | Cancer (72) |
| BiologicalQuestion (459) | ChipName (195) | MicrobiomeData (2) |
| Infrastructure (230) | CustomArray (2) | RNASeqData (23) |
| ResearchField (193) | CustomCDF (16) | ChIPseqData (8) |
| StatisticalMethod (318) | Organism (600) | EColiData (2) |
| Technology (696) | FunctionalAnnotation (13) | ExpressionData (7) |
| WorkflowStep (675) | HapMap (7) |
Installing Bioconductor# source the R coded installer
source("http://www.bioconductor.org/biocLite.R")
biocLite() # run the installer to get the basic library
biocLite("GenomicRanges") # install a specific library
Loading Biobase Generates a Reportlibrary(Biobase) #load Biobase
In addition to Biobase, two other packages upon which Biobase depends are automatically loaded. These are BiocGenerics and parallel.
Some functions from other packages are “masked”. This means that a recently loaded library defines a function with the same name as that of a previously loaded library–the new function supercedes the old.
Vignettes for a package take you through a typical analysis and are a good way to get acquainted with the package.
Biobase comes with Sample Data
data(package="Biobase")
Data sets in package ‘Biobase’:
SW Class to Contain High-Throughput Assays and
Experimental Metadata
aaMap Dataset: Names and Characteristics of Amino
Acids
geneCov Sample expression matrix and phenotype
data.frames.
geneCovariate Sample expression matrix and phenotype
data.frames.
geneData Sample expression matrix and phenotype
data.frames.
reporter Example data.frame representing reporter
information
sample.ExpressionSet Dataset of class 'ExpressionSet'
sample.MultiSet Data set of class 'MultiSet'
seD Sample expression matrix and phenotype
data.frames.
Let's Load a Small Dataset...data("aaMap") # load 'aaMap'
aaMap[1:2, ] # take a look
name let.1 let.3 scProp hyPhilic acidic
1 alanine A ala nonpolar FALSE NA
2 cysteine C cys polar NA NA
table(aaMap$scProp) # tabulate side chain properties
nonpolar polar
9 11
Partial Listing Biobase Functionsls("package:Biobase")[c(6, 12, 125, 101, 89, 91, 71, 49, 51, 45, 41, 44)] # get a listing of some of the methods of Biobase
[1] "AnnotatedDataFrame" "assayData" "varMetadata"
[4] "sampleNames" "protocolData" "pubMedIds"
[7] "MIAME" "featureData" "featureNames"
[10] "exprs" "experimentData" "ExpressionSet"
The ExpressionSetRequired format for most functions dealing with microarray data
Objects with very similar structure used for other types of data (e.g. SummarizedExperiment for NGS data, MRexperiment for microbiome data)
So, let's build one from basic components
Constructing an ExpressionSetExpressionSet is used to contain microarray data. | assayData (exprs) | phenoData | experimentData | annotation |
|---|---|---|---|
| environment | AnnotatedDataFrame | MIAME object | character |
| expression data | sample information | description of experiment | name the matching annotation database |
| featureData | protocolData | classVersion |
|---|---|---|
| AnnotatedDataFrame | AnnotatedDataFrame | Versions |
| feature information | protocol information |
The PlanBiobase to build an ExpressionSet from scratch using 4 of the 7 slotsfeatureData and protocolData for the sake of brevity.classVersion slot is filled in automatically and need not concern us. First Component: Expression MatrixgeneData for the first component.data(geneData) # a sample expression matrix
colnames(geneData) # the sample names
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
geneData[1:2, 1:3] # the matrix itself
A B C
AFFX-MurIL2_at 192.742 85.7533 176.7570
AFFX-MurIL10_at 97.137 126.1960 77.9216
Second Component: Phenotype Datadata.frames we use to build the AnnotatedDataFrame of the phenoData slot.data(geneCov) # the sample information
head(geneCov) # take a look at what we have
cov1 cov2 cov3
A 1 1 1
B 1 1 1
C 1 1 1
D 1 1 1
E 1 2 1
F 1 2 1
How can we change cov1 to 'gender', cov2 to “case”, and cov3 to “type”?
Naming the Covariatesnames() function to assign new names to the covariatesnames(geneCov)
[1] "cov1" "cov2" "cov3"
names(geneCov) = c("gender", "case", "type")
names(geneCov)
[1] "gender" "case" "type"
Annotating the Phenodatadata.frame to the phenotype datadata.frame constructor is adding 3 columns to the frame by name while the second is labeling the rows.# Here we describe the columns--this is the second data.frame of phenData
metadata = data.frame(labelDescription = c("patient gender", "Case/control",
"Tumor type"), row.names = c("gender", "case", "type"))
metadata
labelDescription
gender patient gender
case Case/control
type Tumor type
The Phenodata ConstructorphenoData component which is an AnnotatedDataFrameAnnotated DataFrame consists of one data.frame to hold the covariate data and a second that describes the covariates. new() function is a constructor, taking as its first argument the type of object to construct.# construct the `phenoData` component.
phenoData = new("AnnotatedDataFrame", data = geneCov, varMetadata = metadata)
phenoData
An object of class 'AnnotatedDataFrame'
rowNames: A B ... Z (26 total)
varLabels: gender case type
varMetadata: labelDescription
Third Component: Annotationhgu95av2 annotation database.It provides mappings between probe identifiers and biological data
| hgu95av2ACCNUM | hgu95av2CHR | hgu95av2CHRLOC | hgu95av2ENZYME |
| hgu95av2GENENAME | hgu95av2GO | hgu95av2MAP | hgu95av2OMIM |
| hgu95av2PFAM | hgu95av2PMID | hgu95av2PROSITE | hgu95av2REFSEQ |
| hgu95av2UNIGENE | hgu95av2CHRLENGTHS | hgu95av2ENZYME2PROBE | hgu95av2GO2ALLPROBES |
| hgu95av2PATH2PROBE | hgu95av2PMID2PROBE |
This is an Easy Oneannotation = "hgu95av2" # add a reference to our annotation database
The MIAMI ComponentexperimentData = new("MIAME", name = "C. Darwin", lab = "CCR", contact = "email.gov",
title = "Small Experiment", abstract = "An example ExpressionSet", url = "www.lab.gov")
Putting it all TogetherNow, we can build the ExpressionSet from the 4 components using the ExpressionSet() constructor.
eset = ExpressionSet(assayData = geneData, phenoData = phenoData, experimentData = experimentData,
annotation = annotation)
str(eset)
Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
.. .. ..@ name : chr "C. Darwin"
.. .. ..@ lab : chr "CCR"
.. .. ..@ contact : chr "email.gov"
.. .. ..@ title : chr "Small Experiment"
.. .. ..@ abstract : chr "An example ExpressionSet"
.. .. ..@ url : chr "www.lab.gov"
.. .. ..@ pubMedIds : chr ""
.. .. ..@ samples : list()
.. .. ..@ hybridizations : list()
.. .. ..@ normControls : list()
.. .. ..@ preprocessing : list()
.. .. ..@ other : list()
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 2
.. .. .. .. .. ..$ : int [1:3] 1 0 0
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ assayData :<environment: 0x7f94fbb63378>
..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 3 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr [1:3] "patient gender" "Case/control" "Tumor type"
.. .. ..@ data :'data.frame': 26 obs. of 3 variables:
.. .. .. ..$ gender: num [1:26] 1 1 1 1 1 1 1 1 1 1 ...
.. .. .. ..$ case : num [1:26] 1 1 1 1 2 2 2 2 2 2 ...
.. .. .. ..$ type : num [1:26] 1 1 1 1 1 1 1 1 2 2 ...
.. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr(0)
.. .. ..@ data :'data.frame': 500 obs. of 0 variables
.. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ annotation : chr "hgu95av2"
..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
.. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
.. .. .. ..$ labelDescription: chr(0)
.. .. ..@ data :'data.frame': 26 obs. of 0 variables
.. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
.. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. .. .. ..@ .Data:List of 1
.. .. .. .. .. ..$ : int [1:3] 1 1 0
..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
.. .. ..@ .Data:List of 4
.. .. .. ..$ : int [1:3] 3 3 0
.. .. .. ..$ : int [1:3] 2 32 0
.. .. .. ..$ : int [1:3] 1 3 0
.. .. .. ..$ : int [1:3] 1 0 0
Accessor FunctionsExpressionSet is an object of a new type called 'S4' Such objects provide accessor functions to allow us to get at their data.
The data in the 7 slots is available through the accessor functions shown earlier in the table.
Note that the covariates are listed in the @data data.frame within the @phenoData slot and that they are prefixed with a $ sign.
These elements can be reached directly as though they were elements of a list, e.g. eset$gender.
Analysis of the First Covariate, Gendereset.varMetadata accessor function to see them.varMetadata(eset) # show the metadata about the phenotypes
labelDescription
gender patient gender
case Case/control
type Tumor type
Let's examine the effect of gender on our set of 500 probes.
The Planexprs() function extracts the expression matrix of 500 probes (rows) and 26 samples (columns) from eset exprs() to get at the dataPartitioning by Genderdata.frame in the phenoData slot of eset contains 26 entries, one per sample
$ operatorE = exprs(eset) # extract the exression data as a matrix
gender = eset$gender # we can get the gender array this way
gender # take a look
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
The Plangender as a grouping vector for our 26 samplesboxplot a model for the expression data in the first row that says:E[1,] is.a.function.of gender
The is.a.function.of operator is the tilde ~
The Boxplotboxplot(E[1, ] ~ gender)
A T-Test by Gendert.test of the probe intensitiest.test(E[1, ] ~ gender)
Welch Two Sample t-test
data: E[1, ] by gender
t = -1.0582, df = 23.578, p-value = 0.3007
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-54.97275 17.73203
sample estimates:
mean in group 1 mean in group 2
105.8006 124.4210
Scale up for all 500 Probesfor loop.T = vector() # initialize an array for P-Values
for (n in 1:500) {
T[n] = t.test(E[n, ] ~ gender)$p.value
} # do just what we did above for each probe and extract the P-value from the t.test object
Did the T-Test Loop Work?length(T) # just check to make sure we got 500 P-vals
[1] 500
head(T) # take a look
[1] 0.3006994 0.5717812 0.9852404 0.3062817 0.6786697 0.6793440
which(T <= 0.05) #use 'which' to see the indices of the significant cases
[1] 30 85 147 199 279 286 301 376 400
How to Make 9 Boxplots: the PlanLet's make boxplots for these 9 interesting probes.
We first divide the graphical display into 12 areas
Next, we plot 9 boxplots and label them with probe id and P-Value
The Codelayout(matrix(1:9, 3, 3))
F = featureNames(eset) # if we want to add the feature names to the graph title
for (n in which(T <= 0.05)) {
boxplot(E[n, ] ~ gender, main = paste(F[n], format(T[n], digits = 3)))
} # loop through all cases where the P-val <= 0.05
The 9 BoxplotsAdding Our Own CovariategeneData using hierarchical clusteringLet's add those groups to the phenoData slot of eset and reanalyze.
First, we'll create the distance matrix needed by hclust()
# transpose the expression matrix to cluster samples (columns) rather than probes (rows)
d=dist(t(E))
h=hclust(d) # perform the clustering
plot(h) # plot the cluster dendrogram
abline(h=14000,col="red")
Here's the Dendrogram# transpose the expression matrix to cluster samples (columns) rather than
# probes (rows)
d = dist(t(E))
h = hclust(d) # perform the clustering
plot(h) # plot the cluster dendrogram
abline(h = 14000, col = "red")
Cutting the Dendrogramgroups = factor(cutree(h, h = 14000)) # cut the tree at 1400
groups # take a look at which samples fall into which groups
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 1 2 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1
Levels: 1 2
Adding the New Covariatehc.group.eset$hc.group = groups # add the new covariate
varMetadata(eset) # verify that we have done what we wanted to do
labelDescription
gender patient gender
case Case/control
type Tumor type
hc.group <NA>
Running the LoopT = vector()
for (n in 1:500) {
T[n] = t.test(E[n, ] ~ eset$hc.group)$p.value
} # perform the 500 T-tests
T=array();for(n in 1:500){T[n]=t.test(E[n,]~groups)$p.value}
How many probes pass the test?which(T <= 0.05) # see which pass the test
[1] 1 3 6 8 12 20 22 23 24 25 27 28 31 35 36 38 46
[18] 47 48 51 58 65 67 73 74 82 86 89 91 96 106 110 117 118
[35] 121 122 123 126 128 129 131 134 139 140 146 153 157 158 163 169 174
[52] 177 178 180 181 184 185 186 189 192 193 196 197 201 203 204 205 207
[69] 208 210 214 218 220 222 224 226 232 233 235 236 237 239 242 251 253
[86] 256 262 263 264 265 266 269 270 272 275 276 281 282 288 290 291 295
[103] 296 297 298 299 302 304 306 307 309 312 313 315 316 317 320 326 327
[120] 328 329 333 334 336 341 343 344 346 351 353 355 362 363 369 373 375
[137] 379 381 383 384 387 389 390 392 393 396 397 398 403 404 405 409 410
[154] 423 426 431 432 433 436 437 441 445 446 452 455 458 460 465 467 469
[171] 470 471 474 476 482 483 485 487 494 495 499
Getting Minimum Valuesmin()which.min():min(T)
[1] 3.268072e-10
which.min(T) # Let's see the index of the probe with the best P-value
[1] 272
Boxplotting the Best Caseboxplot() this best case to see how good it can get.
boxplot(E[which.min(T),]~eset$hc.group,main=paste("Probe",which.min(T),"P-val:",format(T[which.min(T)],digits=3)))
The Best Caseboxplot(E[which.min(T), ] ~ eset$hc.group, main = paste("Probe", which.min(T),
"P-val:", format(T[which.min(T)], digits = 3)))
Getting the Probe ID for the Best CasefeatureNames(eset[which.min(T)])
[1] "31511_at"
BiostringsThe Biostrings library provides a container for DNA, RNA, or protein sequences called a BStringSet. These come in three varieties:
| DNA | RNA | Proteins, Peptides |
|---|---|---|
| DNAStringSet() | RNAStringSet() | AAStringSet() |
Let's load the Biostrings library and take a look at the sample data provided:
library(Biostrings) # load the library 'Biostrings'
BiostringsNext, let's see if there is any sample data to explore:
data(package = "Biostrings") # list the example data sets of 'Biostrings'
Data sets in package ‘Biostrings’:
BLOSUM100 Scoring matrices
BLOSUM45 Scoring matrices
BLOSUM50 Scoring matrices
BLOSUM62 Scoring matrices
BLOSUM80 Scoring matrices
HNF4alpha Known HNF4alpha binding sequences
PAM120 Scoring matrices
PAM250 Scoring matrices
PAM30 Scoring matrices
PAM40 Scoring matrices
PAM70 Scoring matrices
phiX174Phage Versions of bacteriophage phiX174
complete genome and sample short reads
...
BiostringsTo try out some functions found in Biostrings, load the phiX174Phage stringset.
data(phiX174Phage) # load one of the sample data sets
phiX174Phage # see how it looks
A DNAStringSet instance of length 6
width seq names
[1] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Genbank
[2] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA RF70s
[3] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA SS78
[4] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Bull
[5] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA G97
[6] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA NEB03
BiostringsRemember that you can determine an object's class using class().
class(phiX174Phage) # determine the object class of 'phiX174Phage'
[1] "DNAStringSet"
attr(,"package")
[1] "Biostrings"
Biostringsnames(phiX174Phage)
[1] "Genbank" "RF70s" "SS78" "Bull" "G97" "NEB03"
phiX174Phage$Genbank
5386-letter "DNAString" instance
seq: GAGTTTTATCGCTTCCATGACGCAGAAGTTAACA...TTCGATAAAAATGATTGGCGTATCCAACCTGCA
Biostrings: Creating a DNAString objectHere we make a DNAString to contain the sequence of exon 4 of the human oncogene homologue (c-abl) of the abelson murine leukemia virus.
abl1 = DNAString("GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGAACCACTTGGTGAAGGTAGCTGATTTTGGCCTGAGCAGGTTGATGACAGGGGACACCTACACAGCCCATGCTGGAGCCAAGTTCCCCATCAAATGGACTGCACCCGAGAGCCTGGCCTACAACAAGTTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC") # use a 'constructor' to make a DNAString
abl1 # take a look at the new object
192-letter "DNAString" instance
seq: GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGA...TTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC
RNAString(abl1) # Now, feed the new DNAString into RNAString to convert
192-letter "RNAString" instance
seq: GAUCUUGCUGCCCGAAACUGCCUGGUAGGGGAGA...UUCUCCAUCAAGUCCGACGUCUGGGGUAAGGGC
Biostrings: Basic Functionsreverse(abl1) # reverse the sequence
192-letter "DNAString" instance
seq: CGGGAATGGGGTCTGCAGCCTGAACTACCTCTTG...GAGGGGATGGTCCGTCAAAGCCCGTCGTTCTAG
complement(abl1) # generate the complement of the sequence
192-letter "DNAString" instance
seq: CTAGAACGACGGGCTTTGACGGACCATCCCCTCT...AAGAGGTAGTTCAGGCTGCAGACCCCATTCCCG
reverseComplement(abl1) # generate the reverse complement
192-letter "DNAString" instance
seq: GCCCTTACCCCAGACGTCGGACTTGATGGAGAAC...CTCCCCTACCAGGCAGTTTCGGGCAGCAAGATC
Biostrings: Basic Functionsabl1.aas = translate(abl1) # translate the sequence
abl1.aas
64-letter "AAString" instance
seq: DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWGKG
length(abl1.aas)
[1] 64
Biostrings: alphabetFrequency()The function alphabetFrequency looks interesting, so let's try it.
a = alphabetFrequency(phiX174Phage) # generate a table of base counts
a
A C G T M R W S Y K V H D B N - + .
[1,] 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Can we add rownames to this output?
Biostrings: alphabetFrequency()names(phiX174Phage)
[1] "Genbank" "RF70s" "SS78" "Bull" "G97" "NEB03"
rownames(a) = names(phiX174Phage)
a
A C G T M R W S Y K V H D B N - + .
Genbank 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RF70s 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SS78 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Bull 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G97 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NEB03 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
How do we use this matrix?(a[, "G"] + a[, "C"])/(a[, "G"] + a[, "C"] + a[, "A"] + a[, "T"]) # use the variable
Genbank RF70s SS78 Bull G97 NEB03
0.4476420 0.4472707 0.4472707 0.4470850 0.4472707 0.4470850
We can easily take a couple more steps.dinucleotideFrequency(phiX174Phage) # get a table of dinucleotide counts
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,] 395 261 251 383 257 229 267 404 327 347 255 325 312 320 480 572
[2,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[3,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[4,] 395 260 252 384 257 229 266 403 327 346 254 326 313 320 480 573
[5,] 395 261 253 382 257 229 266 404 326 346 254 327 314 320 479 572
[6,] 394 260 253 384 258 227 265 405 327 347 254 325 313 321 480 572
trinucleotideFrequency(phiX174Phage) # get a table of trinucleotide counts
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA
[1,] 133 67 93 102 47 62 63 89 72 53 74 52 60 56 140 127 91
[2,] 133 66 93 102 47 62 62 89 73 53 75 52 60 56 141 127 91
[3,] 133 66 93 102 47 62 62 89 73 53 75 52 60 56 141 127 91
[4,] 133 66 93 103 47 62 62 89 73 53 74 52 60 56 140 128 91
[5,] 133 66 93 103 47 62 62 90 73 53 75 52 60 56 139 127 91
[6,] 133 66 93 102 47 62 62 89 73 53 75 52 60 56 141 127 91
CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC
[1,] 40 61 64 59 38 70 62 52 92 31 92 78 67 132 127 70 82
[2,] 40 62 64 59 38 70 62 52 92 30 91 78 67 132 127 69 82
[3,] 40 62 64 59 38 70 62 52 92 30 91 78 67 132 127 69 82
[4,] 39 62 64 58 38 71 62 52 92 30 92 78 67 132 126 70 82
[5,] 40 62 63 58 38 71 62 52 92 30 92 78 67 132 127 70 82
[6,] 40 62 64 59 36 69 63 52 92 30 91 78 68 132 127 69 82
GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG
[1,] 74 101 60 71 67 149 63 72 20 100 54 65 66 140 101 72 23
[2,] 74 102 61 71 66 149 63 72 19 100 55 65 65 140 101 72 24
[3,] 74 102 61 71 66 149 63 72 19 100 55 65 65 140 101 72 24
[4,] 74 101 61 71 66 148 62 72 20 100 54 65 67 140 101 73 23
[5,] 74 100 61 71 66 148 62 72 19 101 55 65 67 140 101 73 24
[6,] 74 102 61 71 66 149 63 72 19 100 55 65 65 140 101 72 24
TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
[1,] 116 91 58 67 104 139 130 130 81 120 132 142 178
[2,] 116 91 58 67 104 138 130 130 82 120 132 142 178
[3,] 116 91 58 67 104 138 130 130 82 120 132 142 178
[4,] 116 91 58 67 104 139 129 130 82 121 132 141 179
[5,] 116 91 58 67 104 138 129 130 82 121 132 141 178
[6,] 116 91 58 68 104 138 130 130 82 120 132 142 178
O = oligonucleotideFrequency(phiX174Phage, 5)
hist(O)
BSgenomeThe BSgenome library provides a container for genomes. A number of genomes have been packaged for Bioconductor.
library(BSgenome) # load the library 'BSgenome'
available.genomes()[1:20] # get a list of genomes you can install
[1] "BSgenome.Alyrata.JGI.v1"
[2] "BSgenome.Amellifera.BeeBase.assembly4"
[3] "BSgenome.Amellifera.UCSC.apiMel2"
[4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
[5] "BSgenome.Athaliana.TAIR.04232008"
[6] "BSgenome.Athaliana.TAIR.TAIR9"
[7] "BSgenome.Btaurus.UCSC.bosTau3"
[8] "BSgenome.Btaurus.UCSC.bosTau3.masked"
[9] "BSgenome.Btaurus.UCSC.bosTau4"
[10] "BSgenome.Btaurus.UCSC.bosTau4.masked"
[11] "BSgenome.Btaurus.UCSC.bosTau6"
[12] "BSgenome.Btaurus.UCSC.bosTau6.masked"
[13] "BSgenome.Btaurus.UCSC.bosTau8"
[14] "BSgenome.Celegans.UCSC.ce10"
[15] "BSgenome.Celegans.UCSC.ce11"
[16] "BSgenome.Celegans.UCSC.ce2"
[17] "BSgenome.Celegans.UCSC.ce6"
[18] "BSgenome.Cfamiliaris.UCSC.canFam2"
[19] "BSgenome.Cfamiliaris.UCSC.canFam2.masked"
[20] "BSgenome.Cfamiliaris.UCSC.canFam3"
Let's load the C. elegans genome to experiment.
library(BSgenome.Celegans.UCSC.ce6) # load your local copy of the C. elegans genome
What can we do with BSgenomes?
Use ls("package:BSgenome") to seels("package:BSgenome")[1:20] # list the methods of 'BSgenome'
[1] "alphabetFrequency" "as.data.frame"
[3] "as.data.frame.BSgenomeViews" "as.list"
[5] "available.genomes" "available.SNPs"
[7] "blocksizes" "breakpoints"
[9] "bsapply" "BSgenome"
[11] "BSgenomeViews" "coerce"
[13] "colnames" "commonName"
[15] "compatibleGenomes" "consensusMatrix"
[17] "consensusString" "countPWM"
[19] "elementNROWS" "end"
BSgenome.Celegans.UCSC.ce6 # see some header information
Worm genome:
# organism: Caenorhabditis elegans (Worm)
# provider: UCSC
# provider version: ce6
# release date: May 2008
# release name: WormBase v. WS190
# 7 sequences:
# chrI chrII chrIII chrIV chrV chrX chrM
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
# operator to access a given sequence)
BSgenome.Celegans.UCSC.ce6[["chrI"]] # display chromosome I
15072421-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
seqinfo(BSgenome.Celegans.UCSC.ce6) # get a summary of the sequences included
Seqinfo object with 7 sequences (1 circular) from ce6 genome:
seqnames seqlengths isCircular genome
chrI 15072421 FALSE ce6
chrII 15279323 FALSE ce6
chrIII 13783681 FALSE ce6
chrIV 17493785 FALSE ce6
chrV 20919568 FALSE ce6
chrX 17718854 FALSE ce6
chrM 13794 TRUE ce6
seqnames(BSgenome.Celegans.UCSC.ce6) # get only the seq names
[1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrX" "chrM"
seqlengths(BSgenome.Celegans.UCSC.ce6) # get an array of the sequence lengths
chrI chrII chrIII chrIV chrV chrX chrM
15072421 15279323 13783681 17493785 20919568 17718854 13794
barplot(seqlengths(BSgenome.Celegans.UCSC.ce6)) # generate a quick barplot
What is the GC content of the C. elegans genome chromosome I?
F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[["chrI"]]) # get the frequency table
F # take look at the table
A C G T M R W S Y
4835939 2695879 2692150 4848453 0 0 0 0 0
K V H D B N - + .
0 0 0 0 0 0 0 0 0
sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100 # compute % GC
[1] 35.7476
Exercise: How can we comput %GC for all chromosomes?
Fr = vector() # initialize a vector to hold the %GC values
for (s in seqnames(BSgenome.Celegans.UCSC.ce6)) {
# loop through each of the 'seqnames'--these are the chromosomes
F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[[s]])
# get a table for one chromosome,compute %GC and store this in 'Fr[s]'
Fr[s] = sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100
}
Fr # display Fr to see if we got it right
chrI chrII chrIII chrIV chrV chrX chrM
35.74760 36.20196 35.66144 34.59386 35.42942 35.20301 23.77845
barplot(Fr, ylab = "%GC", xlab = "C. elegans Chromosome", main = "GC Percentage by Chromosome") # barplot it!
GenomicFeaturesThe GenomicFeatures library provides the functions for the extraction of the sequences of features such as genes, exons, introns, and promoters from BSgenomes.
Let's take a look at the C. elegans genome to start.
library(TxDb.Celegans.UCSC.ce6.ensGene) # load the C. elegans transcript database
By loading this transcript database, we have also loaded a new library called GenomicFeatures that is required to use the database. What new functions have we acquired?
ls("package:GenomicFeatures")[1:40] # list the methods of 'GenomicFeatures'
[1] "as.list" "asBED"
[3] "asGFF" "cds"
[5] "cdsBy" "cdsByOverlaps"
[7] "coverageByTranscript" "DEFAULT_CIRC_SEQS"
[9] "disjointExons" "distance"
[11] "exons" "exonsBy"
[13] "exonsByOverlaps" "extractTranscriptSeqs"
[15] "extractUpstreamSeqs" "features"
[17] "fiveUTRsByTranscript" "genes"
[19] "getChromInfoFromBiomart" "getChromInfoFromUCSC"
[21] "getPromoterSeq" "id2name"
[23] "intronsByTranscript" "isActiveSeq"
[25] "isActiveSeq<-" "makeFDbPackageFromUCSC"
[27] "makeFeatureDbFromUCSC" "makePackageName"
[29] "makeTxDb" "makeTxDbFromBiomart"
[31] "makeTxDbFromGFF" "makeTxDbFromGRanges"
[33] "makeTxDbFromUCSC" "makeTxDbPackage"
[35] "makeTxDbPackageFromBiomart" "makeTxDbPackageFromUCSC"
[37] "mapFromTranscripts" "mapIdsToRanges"
[39] "mapRangesToIds" "mapToTranscripts"
The functions exons(), genes(), and exons.By() look interesting so let's try them out:
ce.exons = exons(TxDb.Celegans.UCSC.ce6.ensGene) # get the exons coordinates
head(ce.exons, 2) # take a look
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <integer>
[1] chrI [11495, 11561] + | 1
[2] chrI [11499, 11557] + | 2
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
ce.genes = genes(TxDb.Celegans.UCSC.ce6.ensGene) # get the gene coordinates
head(ce.genes, 2) #take a look
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
2L52.1 chrII [ 1867, 4663] + | 2L52.1
2RSSE.1 chrII [15268114, 15273216] + | 2RSSE.1
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
The gene_ids in the first metadata column are ENSEMBL ids and can be used to search in the UCSC Genome Browser.
# get the exons with their gene assignments
ce.exons.bygene = exonsBy(TxDb.Celegans.UCSC.ce6.ensGene, by = c("gene"))
head(ce.exons.bygene, 1) # take a look
GRangesList object of length 1:
$2L52.1
GRanges object with 7 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chrII [1867, 1911] + | 21558 <NA>
[2] chrII [2506, 2694] + | 21559 <NA>
[3] chrII [2738, 2888] + | 21560 <NA>
[4] chrII [2931, 3036] + | 21561 <NA>
[5] chrII [3406, 3552] + | 21562 <NA>
[6] chrII [3802, 3984] + | 21563 <NA>
[7] chrII [4201, 4663] + | 21564 <NA>
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
GenomicRangesThe GenomicRanges library provides the framework for feature extraction and analysis. Among its most powerful functions are those used ot determine feature overlap. We load the library as usual:
library("GenomicRanges") # load the library 'GenomicRanges'
And we list the newly acquired functions in the usual manner:ls("package:GenomicRanges")[1:20] # list the methods of 'GenomicRanges'
[1] "absoluteRanges" "as.data.frame"
[3] "as.factor" "bindAsGRanges"
[5] "binnedAverage" "checkConstraint"
[7] "coerce" "countOverlaps"
[9] "coverage" "disjoin"
[11] "disjointBins" "distance"
[13] "distanceToNearest" "duplicated"
[15] "duplicated.GenomicRanges" "elementMetadata"
[17] "elementMetadata<-" "end"
[19] "end<-" "findOverlaps"
Let's try out a fewlength(ce.exons) # how many exons are there in C. elegans?
[1] 146833
head(width(ce.exons)) # get the lengths for the first 6 exons
[1] 67 59 63 57 72 67
Take a free histogramhist(width(ce.exons), main = "Distribution of Exon Lengths for C. Elegans")
The histogram looks a bit cramped. How can we correct this?
hist(log10(width(ce.exons)), main = "Distribution of Exon Lengths for C. Elegans")
GenomicAlignmentsThe GenomicAlignments library provides containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide sequences of the alignments.
library(GenomicAlignments)
Exercise: What new functions have we acquired?
bam <- readGAlignments(system.file("extdata", "ex1.bam", package = "Rsamtools"))
bam[1:2]
GAlignments object with 2 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
njunc
<integer>
[1] 0
[2] 0
-------
seqinfo: 2 sequences from an unspecified genome
B7_591:4:96:693:509 73 seq1 1 99 36M * 0 0
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
MF:i:18 Aq:i:73 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_65:7:152:368:113 73 seq1 3 99 35M * 0 0
CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): MF:i:18 Aq:i:66 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
Cigar StringsRefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G G C T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M
sort(table(cigar(bam)), decreasing = TRUE)
35M 36M 40M 34M 33M 14M4I17M 15M4I16M 16M4I15M
2804 283 112 37 6 4 4 3
7M4I24M 17M4I14M 9M2I24M 10M4I21M 11M5I19M 12M4I19M 13M1D22M 13M4I18M
3 2 2 1 1 1 1 1
14M5I17M 18M4I13M 18M5I12M 7M2I31M 8M4I24M 9M1D26M
1 1 1 1 1 1
Counting Overlaps Between FeaturesfindOverlaps. ce.exons object to illustrate basic principlesOverlap the first exon with the othersce.exons[1:5]
GRanges object with 5 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <integer>
[1] chrI [11495, 11561] + | 1
[2] chrI [11499, 11557] + | 2
[3] chrI [11499, 11561] + | 3
[4] chrI [11505, 11561] + | 4
[5] chrI [11618, 11689] + | 5
-------
seqinfo: 7 sequences (1 circular) from ce6 genome
findOverlaps(ce.exons[1], ce.exons[-1], ignore.strand = TRUE)
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 1 3
-------
queryLength: 1 / subjectLength: 146832
Distribution of # overlaps for first 100 exonsE = array()
for (i in 1:100) {
F = findOverlaps(ce.exons[i], ce.exons[-i], ignore.strand = TRUE)
E[i] = length(queryHits(F))
}
hist(E)
Palindromes witin exons: the PlanfindOverlapsP = findPalindromes(BSgenome.Celegans.UCSC.ce6$chrI)
length(P)
[1] 213154
The palindromesP
Views on a 15072421-letter DNAString subject
subject: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC...CTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
views:
start end width
[1] 460 468 9 [TTTTCAAAA]
[2] 488 497 10 [AAAACGTTTT]
[3] 507 514 8 [GAAGCTTC]
[4] 573 582 10 [TTAGGCCTAA]
[5] 856 863 8 [GTAGCTAC]
... ... ... ... ...
[213150] 15071791 15071798 8 [AAAATTTT]
[213151] 15071985 15071993 9 [GTCTGAGAC]
[213152] 15072046 15072053 8 [TAGTACTA]
[213153] 15072053 15072064 12 [AGTGCCGGCACT]
[213154] 15072085 15072093 9 [GAATGATTC]
class(ce.exons)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
Converting to GRangesPg = GRanges(seq = "chrI", IRanges(start = start(P), end = end(P)))
Pg
GRanges object with 213154 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrI [460, 468] *
[2] chrI [488, 497] *
[3] chrI [507, 514] *
[4] chrI [573, 582] *
[5] chrI [856, 863] *
... ... ... ...
[213150] chrI [15071791, 15071798] *
[213151] chrI [15071985, 15071993] *
[213152] chrI [15072046, 15072053] *
[213153] chrI [15072053, 15072064] *
[213154] chrI [15072085, 15072093] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Finding the overlapsF = findOverlaps(Pg, ce.exons)
length(F)
[1] 46436
Selecting palindromes by widthMaybe longer palindromes are underrepresented within exons
The histogram suggests a cutoff of 101.2 (15) might define a subset of long palindromes
hist(log10(width(P)))
There are 5889 long palindromes on chr1P15 = P[width(P) > 15]
Pg15 = GRanges(seq = "chrI", IRanges(start = start(P15), end = end(P15)))
length(Pg15)
[1] 5889
Long palidromes seem to avoid exonsF15 = findOverlaps(Pg15, ce.exons)
length(F15)
[1] 310
length(F15)/length(Pg15)
[1] 0.05264052
length(F)/length(Pg)
[1] 0.2178519