Introduction to Bioconductor for Biologists

David Wheeler, PhD, Laboratory of Biochemistry and Molecular Biology, NCI
11/22/2016

`What is Bioconductor?`

“Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages, and an active user community.”

Browse packages at:

http://bioconductor.org/packages/release/BiocViews.html

`Bioconductor Packages by Category`

Software (1289)	AnnotationData (933)	ExperimentData (305)
AssayDomain (485)	ChipManufacturer (403)	Cancer (72)
BiologicalQuestion (459)	ChipName (195)	MicrobiomeData (2)
Infrastructure (230)	CustomArray (2)	RNASeqData (23)
ResearchField (193)	CustomCDF (16)	ChIPseqData (8)
StatisticalMethod (318)	Organism (600)	EColiData (2)
Technology (696)	FunctionalAnnotation (13)	ExpressionData (7)
WorkflowStep (675)		HapMap (7)

`Installing Bioconductor`

# source the R coded installer
source("http://www.bioconductor.org/biocLite.R")

biocLite() # run the installer to get the basic library

biocLite("GenomicRanges") # install a specific library

`Loading Biobase Generates a Report`

library(Biobase)  #load Biobase

In addition to Biobase, two other packages upon which Biobase depends are automatically loaded. These are BiocGenerics and parallel.
Some functions from other packages are “masked”. This means that a recently loaded library defines a function with the same name as that of a previously loaded library–the new function supercedes the old.
Vignettes for a package take you through a typical analysis and are a good way to get acquainted with the package.

`Biobase comes with Sample Data`


data(package="Biobase")  

Data sets in package ‘Biobase’:

SW                      Class to Contain High-Throughput Assays and
                        Experimental Metadata
aaMap                   Dataset: Names and Characteristics of Amino
                        Acids
geneCov                 Sample expression matrix and phenotype
                        data.frames.
geneCovariate           Sample expression matrix and phenotype
                        data.frames.
geneData                Sample expression matrix and phenotype
                        data.frames.
reporter                Example data.frame representing reporter
                        information
sample.ExpressionSet    Dataset of class 'ExpressionSet'
sample.MultiSet         Data set of class 'MultiSet'
seD                     Sample expression matrix and phenotype
                        data.frames.

`Let's Load a Small Dataset...`

data("aaMap")  # load 'aaMap'

aaMap[1:2, ]  # take a look

      name let.1 let.3   scProp hyPhilic acidic
1  alanine     A   ala nonpolar    FALSE     NA
2 cysteine     C   cys    polar       NA     NA

table(aaMap$scProp)  # tabulate side chain properties


nonpolar    polar 
       9       11

`Partial Listing Biobase Functions`

There are over 120 function listed and many of them have to do with microarray analysis.

ls("package:Biobase")[c(6, 12, 125, 101, 89, 91, 71, 49, 51, 45, 41, 44)]  # get a listing of some of the methods of Biobase

 [1] "AnnotatedDataFrame" "assayData"          "varMetadata"       
 [4] "sampleNames"        "protocolData"       "pubMedIds"         
 [7] "MIAME"              "featureData"        "featureNames"      
[10] "exprs"              "experimentData"     "ExpressionSet"

`The ExpressionSet`

Required format for most functions dealing with microarray data
Objects with very similar structure used for other types of data (e.g. SummarizedExperiment for NGS data, MRexperiment for microbiome data)
So, let's build one from basic components

`Constructing an ExpressionSet`

The ExpressionSet is used to contain microarray data.
It is comprised of 7 components, called slots, of the types and content listed.
The table headings give the accessor functions used to access the data in the slots.

assayData (exprs)	phenoData	experimentData	annotation
environment	AnnotatedDataFrame	MIAME object	character
expression data	sample information	description of experiment	name the matching annotation database

featureData	protocolData	classVersion
AnnotatedDataFrame	AnnotatedDataFrame	Versions
feature information	protocol information

`The Plan`

We'll draw from the sample data available within Biobase to build an ExpressionSet from scratch using 4 of the 7 slots
We will omit the optional featureData and protocolData for the sake of brevity.
The classVersion slot is filled in automatically and need not concern us.

`First Component: Expression Matrix`

We will use geneData for the first component.

data(geneData)  # a sample expression matrix
colnames(geneData)  #  the sample names

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

geneData[1:2, 1:3]  # the matrix itself

                      A        B        C
AFFX-MurIL2_at  192.742  85.7533 176.7570
AFFX-MurIL10_at  97.137 126.1960  77.9216

`Second Component: Phenotype Data`

We need some phenotype data that we can use to classify the samples.
This will be the first of two data.frames we use to build the AnnotatedDataFrame of the phenoData slot.

data(geneCov)  # the sample information

head(geneCov)  # take a look at what we have

  cov1 cov2 cov3
A    1    1    1
B    1    1    1
C    1    1    1
D    1    1    1
E    1    2    1
F    1    2    1

How can we change cov1 to 'gender', cov2 to “case”, and cov3 to “type”?

`Naming the Covariates`

We can use the names() function to assign new names to the covariates

names(geneCov)

[1] "cov1" "cov2" "cov3"

names(geneCov) = c("gender", "case", "type")
names(geneCov)

[1] "gender" "case"   "type"

`Annotating the Phenodata`

We can describe the columns by adding a second data.frame to the phenotype data
The first argument to the data.frame constructor is adding 3 columns to the frame by name while the second is labeling the rows.

# Here we describe the columns--this is the second data.frame of phenData

metadata = data.frame(labelDescription = c("patient gender", "Case/control", 
    "Tumor type"), row.names = c("gender", "case", "type"))

metadata

       labelDescription
gender   patient gender
case       Case/control
type         Tumor type

`The Phenodata Constructor`

Now we can construct the phenoData component which is an AnnotatedDataFrame
An Annotated DataFrame consists of one data.frame to hold the covariate data and a second that describes the covariates.
The new() function is a constructor, taking as its first argument the type of object to construct.

# construct the `phenoData` component.

phenoData = new("AnnotatedDataFrame", data = geneCov, varMetadata = metadata)
phenoData

An object of class 'AnnotatedDataFrame'
  rowNames: A B ... Z (26 total)
  varLabels: gender case type
  varMetadata: labelDescription

`Third Component: Annotation`

This is simply a reference to the hgu95av2 annotation database.
The database used is the Affymetrix Human Genome U95 Set annotation data assembled from public repositories

It provides mappings between probe identifiers and biological data


hgu95av2ACCNUM	hgu95av2CHR	hgu95av2CHRLOC	hgu95av2ENZYME
hgu95av2GENENAME	hgu95av2GO	hgu95av2MAP	hgu95av2OMIM
hgu95av2PFAM	hgu95av2PMID	hgu95av2PROSITE	hgu95av2REFSEQ
hgu95av2UNIGENE	hgu95av2CHRLENGTHS	hgu95av2ENZYME2PROBE	hgu95av2GO2ALLPROBES
hgu95av2PATH2PROBE	hgu95av2PMID2PROBE

`This is an Easy One`

annotation = "hgu95av2"  # add a reference to our annotation database

`The MIAMI Component`

The final component is a description of the experiment, following the MIAMI (Minimum Information About a Microarray Experiment) specification.

experimentData = new("MIAME", name = "C. Darwin", lab = "CCR", contact = "email.gov", 
    title = "Small Experiment", abstract = "An example ExpressionSet", url = "www.lab.gov")

`Putting it all Together`

Now, we can build the ExpressionSet from the 4 components using the ExpressionSet() constructor.

eset = ExpressionSet(assayData = geneData, phenoData = phenoData, experimentData = experimentData, 
    annotation = annotation)

str(eset)

Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
  ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
  .. .. ..@ name             : chr "C. Darwin"
  .. .. ..@ lab              : chr "CCR"
  .. .. ..@ contact          : chr "email.gov"
  .. .. ..@ title            : chr "Small Experiment"
  .. .. ..@ abstract         : chr "An example ExpressionSet"
  .. .. ..@ url              : chr "www.lab.gov"
  .. .. ..@ pubMedIds        : chr ""
  .. .. ..@ samples          : list()
  .. .. ..@ hybridizations   : list()
  .. .. ..@ normControls     : list()
  .. .. ..@ preprocessing    : list()
  .. .. ..@ other            : list()
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 2
  .. .. .. .. .. ..$ : int [1:3] 1 0 0
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ assayData        :<environment: 0x7f94fbb63378> 
  ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 3 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr [1:3] "patient gender" "Case/control" "Tumor type"
  .. .. ..@ data             :'data.frame': 26 obs. of  3 variables:
  .. .. .. ..$ gender: num [1:26] 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ case  : num [1:26] 1 1 1 1 2 2 2 2 2 2 ...
  .. .. .. ..$ type  : num [1:26] 1 1 1 1 1 1 1 1 2 2 ...
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame': 500 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ annotation       : chr "hgu95av2"
  ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame': 26 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. ..@ .Data:List of 4
  .. .. .. ..$ : int [1:3] 3 3 0
  .. .. .. ..$ : int [1:3] 2 32 0
  .. .. .. ..$ : int [1:3] 1 3 0
  .. .. .. ..$ : int [1:3] 1 0 0

`Accessor Functions`

The ExpressionSet is an object of a new type called 'S4'
Such objects provide accessor functions to allow us to get at their data.
The data in the 7 slots is available through the accessor functions shown earlier in the table.
Note that the covariates are listed in the @data data.frame within the @phenoData slot and that they are prefixed with a $ sign.
These elements can be reached directly as though they were elements of a list, e.g. eset$gender.

`Analysis of the First Covariate, Gender`

There are three covariates in the ExpressionSet eset.
We'll use the varMetadata accessor function to see them.

varMetadata(eset)  # show the metadata about the phenotypes

       labelDescription
gender   patient gender
case       Case/control
type         Tumor type

Let's examine the effect of gender on our set of 500 probes.

`The Plan`

First, we will explore the data interactively using box plots and running a T-test on a single row of data (one probe).
When we have that working, we will scale up to analyze all 500 probes.
The exprs() function extracts the expression matrix of 500 probes (rows) and 26 samples (columns) from eset
We'll be using exprs() to get at the data

`Partitioning by Gender`

The gender covarate column in the first data.frame in the phenoData slot of eset contains 26 entries, one per sample
- We can access gender using the $ operator
The contents of gender can be used to partition the samples into two groups.

E = exprs(eset)  # extract the exression data as a matrix

gender = eset$gender  # we can get the gender array this way

gender  # take a look

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2

`The Plan`

We will use gender as a grouping vector for our 26 samples
We will feed boxplot a model for the expression data in the first row that says:

E[1,] is.a.function.of gender

The is.a.function.of operator is the tilde ~

`The Boxplot`

Group the samples by gender and create the boxplot
Use the ~ “is.a.function.of” symbol to define the model

boxplot(E[1, ] ~ gender)

plot of chunk unnamed-chunk-9

`A T-Test by Gender`

Use the same model for a t.test of the probe intensities

t.test(E[1, ] ~ gender)


    Welch Two Sample t-test

data:  E[1, ] by gender
t = -1.0582, df = 23.578, p-value = 0.3007
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -54.97275  17.73203
sample estimates:
mean in group 1 mean in group 2 
       105.8006        124.4210

`Scale up for all 500 Probes`

Now that we have an analysis that works for one probe, we can apply it to all 500 probes using a for loop.
This, by the way, is not the preferred method of making such comparisons but it is simple and illustrates the basic principles.

T = vector()  # initialize an array for P-Values

for (n in 1:500) {
    T[n] = t.test(E[n, ] ~ gender)$p.value
}  # do just what we did above for each probe and extract the P-value from the t.test object

`Did the T-Test Loop Work?`

length(T)  # just check to make sure we got 500 P-vals

[1] 500

head(T)  # take a look

[1] 0.3006994 0.5717812 0.9852404 0.3062817 0.6786697 0.6793440

which(T <= 0.05)  #use 'which' to see the indices of the significant cases

[1]  30  85 147 199 279 286 301 376 400

We see 9 probes with significant P-values

`How to Make 9 Boxplots: the Plan`

Let's make boxplots for these 9 interesting probes.
We first divide the graphical display into 12 areas
Next, we plot 9 boxplots and label them with probe id and P-Value

`The Code`

layout(matrix(1:9, 3, 3))

F = featureNames(eset)  # if we want to add the feature names to the graph title

for (n in which(T <= 0.05)) {
    boxplot(E[n, ] ~ gender, main = paste(F[n], format(T[n], digits = 3)))
}  # loop through all cases where the P-val <= 0.05

plot of chunk unnamed-chunk-14

`The 9 Boxplots`

plot of chunk unnamed-chunk-15

`Adding Our Own Covariate`

During the introductory session we analyzed geneData using hierarchical clustering
We found that it broke down into two distinct groups
Let's add those groups to the phenoData slot of eset and reanalyze.
First, we'll create the distance matrix needed by hclust()

# transpose the expression matrix to cluster samples (columns) rather than probes (rows)
d=dist(t(E)) 
h=hclust(d) # perform the clustering
plot(h) # plot the cluster dendrogram
abline(h=14000,col="red")

`Here's the Dendrogram`

# transpose the expression matrix to cluster samples (columns) rather than
# probes (rows)
d = dist(t(E))
h = hclust(d)  # perform the clustering
plot(h)  # plot the cluster dendrogram
abline(h = 14000, col = "red")

plot of chunk unnamed-chunk-17

`Cutting the Dendrogram`

We can cut the dendrogram at a height of 14000 to generate two groups.

groups = factor(cutree(h, h = 14000))  # cut the tree at 1400
groups  # take a look at which samples fall into which groups

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
1 1 2 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1 
Levels: 1 2

`Adding the New Covariate`

Now we are ready to add a new covariate to eset–we'll call it hc.group.

eset$hc.group = groups  # add the new covariate

varMetadata(eset)  # verify that we have done what we wanted to do

         labelDescription
gender     patient gender
case         Case/control
type           Tumor type
hc.group             <NA>

`Running the Loop`

We can now run our T-tests.

T = vector()
for (n in 1:500) {
    T[n] = t.test(E[n, ] ~ eset$hc.group)$p.value
}  # perform the 500 T-tests

Is the following code equivalent the code above?

T=array();for(n in 1:500){T[n]=t.test(E[n,]~groups)$p.value}

`How many probes pass the test?`

which(T <= 0.05)  # see which pass the test

  [1]   1   3   6   8  12  20  22  23  24  25  27  28  31  35  36  38  46
 [18]  47  48  51  58  65  67  73  74  82  86  89  91  96 106 110 117 118
 [35] 121 122 123 126 128 129 131 134 139 140 146 153 157 158 163 169 174
 [52] 177 178 180 181 184 185 186 189 192 193 196 197 201 203 204 205 207
 [69] 208 210 214 218 220 222 224 226 232 233 235 236 237 239 242 251 253
 [86] 256 262 263 264 265 266 269 270 272 275 276 281 282 288 290 291 295
[103] 296 297 298 299 302 304 306 307 309 312 313 315 316 317 320 326 327
[120] 328 329 333 334 336 341 343 344 346 351 353 355 362 363 369 373 375
[137] 379 381 383 384 387 389 390 392 393 396 397 398 403 404 405 409 410
[154] 423 426 431 432 433 436 437 441 445 446 452 455 458 460 465 467 469
[171] 470 471 474 476 482 483 485 487 494 495 499

`Getting Minimum Values`

If we want to see the minimum P-Value, we can use min()
If we want to know which probe has the lowest P-value, we can use which.min():

min(T)

[1] 3.268072e-10

which.min(T)  # Let's see the index of the probe with the best P-value

[1] 272

`Boxplotting the Best Case`

We can boxplot() this best case to see how good it can get.
We are using both the rowname and P-Value in the title
We are formatting the P=value to remove extraneous decimal places


boxplot(E[which.min(T),]~eset$hc.group,main=paste("Probe",which.min(T),"P-val:",format(T[which.min(T)],digits=3)))

`The Best Case`

boxplot(E[which.min(T), ] ~ eset$hc.group, main = paste("Probe", which.min(T), 
    "P-val:", format(T[which.min(T)], digits = 3)))

plot of chunk unnamed-chunk-19

`Getting the Probe ID for the Best Case`

What is the the probe id corresponding to the minimum P-value?

featureNames(eset[which.min(T)])

[1] "31511_at"

`Biostrings`

The Biostrings library provides a container for DNA, RNA, or protein sequences called a BStringSet. These come in three varieties:

DNA	RNA	Proteins, Peptides
DNAStringSet()	RNAStringSet()	AAStringSet()

Let's load the Biostrings library and take a look at the sample data provided:

library(Biostrings)  # load the library 'Biostrings'

`Biostrings`

Next, let's see if there is any sample data to explore:

data(package = "Biostrings")  # list the example data sets of 'Biostrings'

Data sets in package ‘Biostrings’:

BLOSUM100           Scoring matrices
BLOSUM45            Scoring matrices
BLOSUM50            Scoring matrices
BLOSUM62            Scoring matrices
BLOSUM80            Scoring matrices
HNF4alpha           Known HNF4alpha binding sequences
PAM120              Scoring matrices
PAM250              Scoring matrices
PAM30               Scoring matrices
PAM40               Scoring matrices
PAM70               Scoring matrices
phiX174Phage        Versions of bacteriophage phiX174
                    complete genome and sample short reads
...

`Biostrings`

To try out some functions found in Biostrings, load the phiX174Phage stringset.

data(phiX174Phage)  # load one of the sample data sets
phiX174Phage  # see how it looks

  A DNAStringSet instance of length 6
    width seq                                          names               
[1]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Genbank
[2]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA RF70s
[3]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA SS78
[4]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Bull
[5]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA G97
[6]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA NEB03

`Biostrings`

Remember that you can determine an object's class using class().

class(phiX174Phage)  # determine the object class of 'phiX174Phage'

[1] "DNAStringSet"
attr(,"package")
[1] "Biostrings"

`Biostrings`

names(phiX174Phage)

[1] "Genbank" "RF70s"   "SS78"    "Bull"    "G97"     "NEB03"

phiX174Phage$Genbank

  5386-letter "DNAString" instance
seq: GAGTTTTATCGCTTCCATGACGCAGAAGTTAACA...TTCGATAAAAATGATTGGCGTATCCAACCTGCA

`Biostrings: Creating a DNAString object`

Here we make a DNAString to contain the sequence of exon 4 of the human oncogene homologue (c-abl) of the abelson murine leukemia virus.

abl1 = DNAString("GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGAACCACTTGGTGAAGGTAGCTGATTTTGGCCTGAGCAGGTTGATGACAGGGGACACCTACACAGCCCATGCTGGAGCCAAGTTCCCCATCAAATGGACTGCACCCGAGAGCCTGGCCTACAACAAGTTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC")  # use a 'constructor' to make a DNAString
abl1  # take a look at the new object

  192-letter "DNAString" instance
seq: GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGA...TTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC

RNAString(abl1)  # Now, feed the new DNAString into RNAString to convert

  192-letter "RNAString" instance
seq: GAUCUUGCUGCCCGAAACUGCCUGGUAGGGGAGA...UUCUCCAUCAAGUCCGACGUCUGGGGUAAGGGC

`Biostrings: Basic Functions`

reverse(abl1)  # reverse the sequence

  192-letter "DNAString" instance
seq: CGGGAATGGGGTCTGCAGCCTGAACTACCTCTTG...GAGGGGATGGTCCGTCAAAGCCCGTCGTTCTAG

complement(abl1)  # generate the complement of the sequence

  192-letter "DNAString" instance
seq: CTAGAACGACGGGCTTTGACGGACCATCCCCTCT...AAGAGGTAGTTCAGGCTGCAGACCCCATTCCCG

reverseComplement(abl1)  # generate the reverse complement

  192-letter "DNAString" instance
seq: GCCCTTACCCCAGACGTCGGACTTGATGGAGAAC...CTCCCCTACCAGGCAGTTTCGGGCAGCAAGATC

`Biostrings: Basic Functions`

abl1.aas = translate(abl1)  # translate the sequence
abl1.aas

  64-letter "AAString" instance
seq: DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWGKG

length(abl1.aas)

[1] 64

`Biostrings: alphabetFrequency()`

The function alphabetFrequency looks interesting, so let's try it.

a = alphabetFrequency(phiX174Phage)  # generate a table of base counts
a

        A    C    G    T M R W S Y K V H D B N - + .
[1,] 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Can we add rownames to this output?

`Biostrings: alphabetFrequency()`

names(phiX174Phage)

[1] "Genbank" "RF70s"   "SS78"    "Bull"    "G97"     "NEB03"

rownames(a) = names(phiX174Phage)

a

           A    C    G    T M R W S Y K V H D B N - + .
Genbank 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RF70s   1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SS78    1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Bull    1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G97     1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NEB03   1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0

`How do we use this matrix?`

We can comput %GC for these 6 sequences as follows:

(a[, "G"] + a[, "C"])/(a[, "G"] + a[, "C"] + a[, "A"] + a[, "T"])  # use the variable

  Genbank     RF70s      SS78      Bull       G97     NEB03 
0.4476420 0.4472707 0.4472707 0.4470850 0.4472707 0.4470850

The sequences seem to be very similar in composition. How similar?

`We can easily take a couple more steps.`

Dinucleotide counts

dinucleotideFrequency(phiX174Phage)  # get a table of dinucleotide counts

      AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT
[1,] 395 261 251 383 257 229 267 404 327 347 255 325 312 320 480 572
[2,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[3,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[4,] 395 260 252 384 257 229 266 403 327 346 254 326 313 320 480 573
[5,] 395 261 253 382 257 229 266 404 326 346 254 327 314 320 479 572
[6,] 394 260 253 384 258 227 265 405 327 347 254 325 313 321 480 572

Trinucleotide Counts

trinucleotideFrequency(phiX174Phage)  # get a table of trinucleotide counts

     AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA
[1,] 133  67  93 102  47  62  63  89  72  53  74  52  60  56 140 127  91
[2,] 133  66  93 102  47  62  62  89  73  53  75  52  60  56 141 127  91
[3,] 133  66  93 102  47  62  62  89  73  53  75  52  60  56 141 127  91
[4,] 133  66  93 103  47  62  62  89  73  53  74  52  60  56 140 128  91
[5,] 133  66  93 103  47  62  62  90  73  53  75  52  60  56 139 127  91
[6,] 133  66  93 102  47  62  62  89  73  53  75  52  60  56 141 127  91
     CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC
[1,]  40  61  64  59  38  70  62  52  92  31  92  78  67 132 127  70  82
[2,]  40  62  64  59  38  70  62  52  92  30  91  78  67 132 127  69  82
[3,]  40  62  64  59  38  70  62  52  92  30  91  78  67 132 127  69  82
[4,]  39  62  64  58  38  71  62  52  92  30  92  78  67 132 126  70  82
[5,]  40  62  63  58  38  71  62  52  92  30  92  78  67 132 127  70  82
[6,]  40  62  64  59  36  69  63  52  92  30  91  78  68 132 127  69  82
     GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG
[1,]  74 101  60  71  67 149  63  72  20 100  54  65  66 140 101  72  23
[2,]  74 102  61  71  66 149  63  72  19 100  55  65  65 140 101  72  24
[3,]  74 102  61  71  66 149  63  72  19 100  55  65  65 140 101  72  24
[4,]  74 101  61  71  66 148  62  72  20 100  54  65  67 140 101  73  23
[5,]  74 100  61  71  66 148  62  72  19 101  55  65  67 140 101  73  24
[6,]  74 102  61  71  66 149  63  72  19 100  55  65  65 140 101  72  24
     TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
[1,] 116  91  58  67 104 139 130 130  81 120 132 142 178
[2,] 116  91  58  67 104 138 130 130  82 120 132 142 178
[3,] 116  91  58  67 104 138 130 130  82 120 132 142 178
[4,] 116  91  58  67 104 139 129 130  82 121 132 141 179
[5,] 116  91  58  67 104 138 129 130  82 121 132 141 178
[6,] 116  91  58  68 104 138 130 130  82 120 132 142 178

Oligonucleotide counts

O = oligonucleotideFrequency(phiX174Phage, 5)
hist(O)

plot of chunk unnamed-chunk-34

`BSgenome`

The BSgenome library provides a container for genomes. A number of genomes have been packaged for Bioconductor.

library(BSgenome)  # load the library 'BSgenome'
available.genomes()[1:20]  # get a list of genomes you can install

 [1] "BSgenome.Alyrata.JGI.v1"                 
 [2] "BSgenome.Amellifera.BeeBase.assembly4"   
 [3] "BSgenome.Amellifera.UCSC.apiMel2"        
 [4] "BSgenome.Amellifera.UCSC.apiMel2.masked" 
 [5] "BSgenome.Athaliana.TAIR.04232008"        
 [6] "BSgenome.Athaliana.TAIR.TAIR9"           
 [7] "BSgenome.Btaurus.UCSC.bosTau3"           
 [8] "BSgenome.Btaurus.UCSC.bosTau3.masked"    
 [9] "BSgenome.Btaurus.UCSC.bosTau4"           
[10] "BSgenome.Btaurus.UCSC.bosTau4.masked"    
[11] "BSgenome.Btaurus.UCSC.bosTau6"           
[12] "BSgenome.Btaurus.UCSC.bosTau6.masked"    
[13] "BSgenome.Btaurus.UCSC.bosTau8"           
[14] "BSgenome.Celegans.UCSC.ce10"             
[15] "BSgenome.Celegans.UCSC.ce11"             
[16] "BSgenome.Celegans.UCSC.ce2"              
[17] "BSgenome.Celegans.UCSC.ce6"              
[18] "BSgenome.Cfamiliaris.UCSC.canFam2"       
[19] "BSgenome.Cfamiliaris.UCSC.canFam2.masked"
[20] "BSgenome.Cfamiliaris.UCSC.canFam3"

Let's load the C. elegans genome to experiment.

library(BSgenome.Celegans.UCSC.ce6)  # load your local copy of the C. elegans genome

What can we do with BSgenomes?

`Use ls("package:BSgenome") to see`

ls("package:BSgenome")[1:20]  # list the methods of 'BSgenome'

 [1] "alphabetFrequency"           "as.data.frame"              
 [3] "as.data.frame.BSgenomeViews" "as.list"                    
 [5] "available.genomes"           "available.SNPs"             
 [7] "blocksizes"                  "breakpoints"                
 [9] "bsapply"                     "BSgenome"                   
[11] "BSgenomeViews"               "coerce"                     
[13] "colnames"                    "commonName"                 
[15] "compatibleGenomes"           "consensusMatrix"            
[17] "consensusString"             "countPWM"                   
[19] "elementNROWS"                "end"

BSgenome.Celegans.UCSC.ce6  # see some header information

Worm genome:
# organism: Caenorhabditis elegans (Worm)
# provider: UCSC
# provider version: ce6
# release date: May 2008
# release name: WormBase v. WS190
# 7 sequences:
#   chrI   chrII  chrIII chrIV  chrV   chrX   chrM                       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
# operator to access a given sequence)

BSgenome.Celegans.UCSC.ce6[["chrI"]]  # display chromosome I

  15072421-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC

seqinfo(BSgenome.Celegans.UCSC.ce6)  # get a summary of the sequences included

Seqinfo object with 7 sequences (1 circular) from ce6 genome:
  seqnames seqlengths isCircular genome
  chrI       15072421      FALSE    ce6
  chrII      15279323      FALSE    ce6
  chrIII     13783681      FALSE    ce6
  chrIV      17493785      FALSE    ce6
  chrV       20919568      FALSE    ce6
  chrX       17718854      FALSE    ce6
  chrM          13794       TRUE    ce6

seqnames(BSgenome.Celegans.UCSC.ce6)  # get only the seq names

[1] "chrI"   "chrII"  "chrIII" "chrIV"  "chrV"   "chrX"   "chrM"

seqlengths(BSgenome.Celegans.UCSC.ce6)  # get an array of the sequence lengths

    chrI    chrII   chrIII    chrIV     chrV     chrX     chrM 
15072421 15279323 13783681 17493785 20919568 17718854    13794

barplot(seqlengths(BSgenome.Celegans.UCSC.ce6))  # generate a quick barplot

plot of chunk unnamed-chunk-37

What is the GC content of the C. elegans genome chromosome I?

F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[["chrI"]])  # get the frequency table

F  # take look at the table

      A       C       G       T       M       R       W       S       Y 
4835939 2695879 2692150 4848453       0       0       0       0       0 
      K       V       H       D       B       N       -       +       . 
      0       0       0       0       0       0       0       0       0

sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100  # compute % GC

[1] 35.7476

Exercise: How can we comput %GC for all chromosomes?

Fr = vector()  # initialize a vector to hold the %GC values

for (s in seqnames(BSgenome.Celegans.UCSC.ce6)) {
    # loop through each of the 'seqnames'--these are the chromosomes
    F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[[s]])
    # get a table for one chromosome,compute %GC and store this in 'Fr[s]'
    Fr[s] = sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100
}


Fr  # display Fr to see if we got it right

    chrI    chrII   chrIII    chrIV     chrV     chrX     chrM 
35.74760 36.20196 35.66144 34.59386 35.42942 35.20301 23.77845

barplot(Fr, ylab = "%GC", xlab = "C. elegans Chromosome", main = "GC Percentage by Chromosome")  # barplot it!

plot of chunk unnamed-chunk-40

`GenomicFeatures`

The GenomicFeatures library provides the functions for the extraction of the sequences of features such as genes, exons, introns, and promoters from BSgenomes.

Let's take a look at the C. elegans genome to start.

library(TxDb.Celegans.UCSC.ce6.ensGene)  # load the C. elegans transcript database

By loading this transcript database, we have also loaded a new library called GenomicFeatures that is required to use the database. What new functions have we acquired?

ls("package:GenomicFeatures")[1:40]  # list the methods of 'GenomicFeatures'

 [1] "as.list"                    "asBED"                     
 [3] "asGFF"                      "cds"                       
 [5] "cdsBy"                      "cdsByOverlaps"             
 [7] "coverageByTranscript"       "DEFAULT_CIRC_SEQS"         
 [9] "disjointExons"              "distance"                  
[11] "exons"                      "exonsBy"                   
[13] "exonsByOverlaps"            "extractTranscriptSeqs"     
[15] "extractUpstreamSeqs"        "features"                  
[17] "fiveUTRsByTranscript"       "genes"                     
[19] "getChromInfoFromBiomart"    "getChromInfoFromUCSC"      
[21] "getPromoterSeq"             "id2name"                   
[23] "intronsByTranscript"        "isActiveSeq"               
[25] "isActiveSeq<-"              "makeFDbPackageFromUCSC"    
[27] "makeFeatureDbFromUCSC"      "makePackageName"           
[29] "makeTxDb"                   "makeTxDbFromBiomart"       
[31] "makeTxDbFromGFF"            "makeTxDbFromGRanges"       
[33] "makeTxDbFromUCSC"           "makeTxDbPackage"           
[35] "makeTxDbPackageFromBiomart" "makeTxDbPackageFromUCSC"   
[37] "mapFromTranscripts"         "mapIdsToRanges"            
[39] "mapRangesToIds"             "mapToTranscripts"

The functions exons(), genes(), and exons.By() look interesting so let's try them out:

ce.exons = exons(TxDb.Celegans.UCSC.ce6.ensGene)  # get the exons coordinates
head(ce.exons, 2)  # take a look

GRanges object with 2 ranges and 1 metadata column:
      seqnames         ranges strand |   exon_id
         <Rle>      <IRanges>  <Rle> | <integer>
  [1]     chrI [11495, 11561]      + |         1
  [2]     chrI [11499, 11557]      + |         2
  -------
  seqinfo: 7 sequences (1 circular) from ce6 genome

ce.genes = genes(TxDb.Celegans.UCSC.ce6.ensGene)  # get the gene coordinates
head(ce.genes, 2)  #take a look

GRanges object with 2 ranges and 1 metadata column:
          seqnames               ranges strand |     gene_id
             <Rle>            <IRanges>  <Rle> | <character>
   2L52.1    chrII [    1867,     4663]      + |      2L52.1
  2RSSE.1    chrII [15268114, 15273216]      + |     2RSSE.1
  -------
  seqinfo: 7 sequences (1 circular) from ce6 genome

The gene_ids in the first metadata column are ENSEMBL ids and can be used to search in the UCSC Genome Browser.

# get the exons with their gene assignments
ce.exons.bygene = exonsBy(TxDb.Celegans.UCSC.ce6.ensGene, by = c("gene"))
head(ce.exons.bygene, 1)  # take a look

GRangesList object of length 1:
$2L52.1 
GRanges object with 7 ranges and 2 metadata columns:
      seqnames       ranges strand |   exon_id   exon_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]    chrII [1867, 1911]      + |     21558        <NA>
  [2]    chrII [2506, 2694]      + |     21559        <NA>
  [3]    chrII [2738, 2888]      + |     21560        <NA>
  [4]    chrII [2931, 3036]      + |     21561        <NA>
  [5]    chrII [3406, 3552]      + |     21562        <NA>
  [6]    chrII [3802, 3984]      + |     21563        <NA>
  [7]    chrII [4201, 4663]      + |     21564        <NA>

-------
seqinfo: 7 sequences (1 circular) from ce6 genome

`GenomicRanges`

The GenomicRanges library provides the framework for feature extraction and analysis. Among its most powerful functions are those used ot determine feature overlap. We load the library as usual:

library("GenomicRanges")  # load the library 'GenomicRanges'

`And we list the newly acquired functions in the usual manner:`

ls("package:GenomicRanges")[1:20]  # list the methods of 'GenomicRanges'

 [1] "absoluteRanges"           "as.data.frame"           
 [3] "as.factor"                "bindAsGRanges"           
 [5] "binnedAverage"            "checkConstraint"         
 [7] "coerce"                   "countOverlaps"           
 [9] "coverage"                 "disjoin"                 
[11] "disjointBins"             "distance"                
[13] "distanceToNearest"        "duplicated"              
[15] "duplicated.GenomicRanges" "elementMetadata"         
[17] "elementMetadata<-"        "end"                     
[19] "end<-"                    "findOverlaps"

`Let's try out a few`

length(ce.exons)  # how many exons are there in C. elegans?

[1] 146833

head(width(ce.exons))  # get the lengths for the first 6 exons

[1] 67 59 63 57 72 67

`Take a free histogram`

hist(width(ce.exons), main = "Distribution of Exon Lengths for C. Elegans")

plot of chunk unnamed-chunk-46

The histogram looks a bit cramped. How can we correct this?

hist(log10(width(ce.exons)), main = "Distribution of Exon Lengths for C. Elegans")

plot of chunk unnamed-chunk-47

`GenomicAlignments`

The GenomicAlignments library provides containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide sequences of the alignments.

library(GenomicAlignments)

Exercise: What new functions have we acquired?

bam <- readGAlignments(system.file("extdata", "ex1.bam", package = "Rsamtools"))

bam[1:2]

GAlignments object with 2 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     seq1      +         36M        36         1        36        36
  [2]     seq1      +         35M        35         3        37        35
          njunc
      <integer>
  [1]         0
  [2]         0
  -------
  seqinfo: 2 sequences from an unspecified genome

B7_591:4:96:693:509     73      seq1    1       99      36M     *       0       0
       CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG    <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
MF:i:18 Aq:i:73 NM:i:0  UQ:i:0  H0:i:1  H1:i:0
EAS54_65:7:152:368:113  73      seq1    3       99      35M     *       0       0
       CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT     <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):     MF:i:18 Aq:i:66 NM:i:0  UQ:i:0  H0:i:1  H1:i:0

`Cigar Strings`

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M

sort(table(cigar(bam)), decreasing = TRUE)


     35M      36M      40M      34M      33M 14M4I17M 15M4I16M 16M4I15M 
    2804      283      112       37        6        4        4        3 
 7M4I24M 17M4I14M  9M2I24M 10M4I21M 11M5I19M 12M4I19M 13M1D22M 13M4I18M 
       3        2        2        1        1        1        1        1 
14M5I17M 18M4I13M 18M5I12M  7M2I31M  8M4I24M  9M1D26M 
       1        1        1        1        1        1

http://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F

`Counting Overlaps Between Features`

Overlaps between genomic features can be computed using findOverlaps.
Let's first count overlaps between exons in our ce.exons object to illustrate basic principles

`Overlap the first exon with the others`

ce.exons[1:5]

GRanges object with 5 ranges and 1 metadata column:
      seqnames         ranges strand |   exon_id
         <Rle>      <IRanges>  <Rle> | <integer>
  [1]     chrI [11495, 11561]      + |         1
  [2]     chrI [11499, 11557]      + |         2
  [3]     chrI [11499, 11561]      + |         3
  [4]     chrI [11505, 11561]      + |         4
  [5]     chrI [11618, 11689]      + |         5
  -------
  seqinfo: 7 sequences (1 circular) from ce6 genome

findOverlaps(ce.exons[1], ce.exons[-1], ignore.strand = TRUE)

Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         1           3
  -------
  queryLength: 1 / subjectLength: 146832

`Distribution of # overlaps for first 100 exons`

E = array()
for (i in 1:100) {
    F = findOverlaps(ce.exons[i], ce.exons[-i], ignore.strand = TRUE)
    E[i] = length(queryHits(F))
}
hist(E)

plot of chunk unnamed-chunk-52

`Palindromes witin exons: the Plan`

Find palindromes on chr1
Format result for findOverlaps
Get overlaps with exons

P = findPalindromes(BSgenome.Celegans.UCSC.ce6$chrI)
length(P)

[1] 213154

`The palindromes`

Are the two objects compatible?

  Views on a 15072421-letter DNAString subject
subject: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC...CTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
views:
            start      end width
     [1]      460      468     9 [TTTTCAAAA]
     [2]      488      497    10 [AAAACGTTTT]
     [3]      507      514     8 [GAAGCTTC]
     [4]      573      582    10 [TTAGGCCTAA]
     [5]      856      863     8 [GTAGCTAC]
     ...      ...      ...   ... ...
[213150] 15071791 15071798     8 [AAAATTTT]
[213151] 15071985 15071993     9 [GTCTGAGAC]
[213152] 15072046 15072053     8 [TAGTACTA]
[213153] 15072053 15072064    12 [AGTGCCGGCACT]
[213154] 15072085 15072093     9 [GAATGATTC]

class(ce.exons)

[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

`Converting to GRanges`

Use the GRanges constructor function

Pg = GRanges(seq = "chrI", IRanges(start = start(P), end = end(P)))

Pg

GRanges object with 213154 ranges and 0 metadata columns:
           seqnames               ranges strand
              <Rle>            <IRanges>  <Rle>
       [1]     chrI           [460, 468]      *
       [2]     chrI           [488, 497]      *
       [3]     chrI           [507, 514]      *
       [4]     chrI           [573, 582]      *
       [5]     chrI           [856, 863]      *
       ...      ...                  ...    ...
  [213150]     chrI [15071791, 15071798]      *
  [213151]     chrI [15071985, 15071993]      *
  [213152]     chrI [15072046, 15072053]      *
  [213153]     chrI [15072053, 15072064]      *
  [213154]     chrI [15072085, 15072093]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

`Finding the overlaps`

F = findOverlaps(Pg, ce.exons)
length(F)

[1] 46436

`Selecting palindromes by width`

Maybe longer palindromes are underrepresented within exons
The histogram suggests a cutoff of 10^1.2 (15) might define a subset of long palindromes

hist(log10(width(P)))

plot of chunk unnamed-chunk-57

`There are 5889 long palindromes on chr1`

Here we subset the P “XStringViews” object
Then, create a new GenomicRanges object from its coordinate list

P15 = P[width(P) > 15]
Pg15 = GRanges(seq = "chrI", IRanges(start = start(P15), end = end(P15)))
length(Pg15)

[1] 5889

`Long palidromes seem to avoid exons`

5% of long palindromes overlap with exons
22% of all palindromes overlap with exons

F15 = findOverlaps(Pg15, ce.exons)
length(F15)

[1] 310

length(F15)/length(Pg15)

[1] 0.05264052

length(F)/length(Pg)

[1] 0.2178519