Introduction to Bioconductor for Biologists

David Wheeler, PhD, Laboratory of Biochemistry and Molecular Biology, NCI
11/22/2016

What is Bioconductor?

“Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages, and an active user community.”

Browse packages at:

http://bioconductor.org/packages/release/BiocViews.html

Bioconductor Packages by Category

Software (1289) AnnotationData (933) ExperimentData (305)
AssayDomain (485) ChipManufacturer (403) Cancer (72)
BiologicalQuestion (459) ChipName (195) MicrobiomeData (2)
Infrastructure (230) CustomArray (2) RNASeqData (23)
ResearchField (193) CustomCDF (16) ChIPseqData (8)
StatisticalMethod (318) Organism (600) EColiData (2)
Technology (696) FunctionalAnnotation (13) ExpressionData (7)
WorkflowStep (675) HapMap (7)

Installing Bioconductor

# source the R coded installer
source("http://www.bioconductor.org/biocLite.R")

biocLite() # run the installer to get the basic library

biocLite("GenomicRanges") # install a specific library

Loading Biobase Generates a Report

library(Biobase)  #load Biobase
  • In addition to Biobase, two other packages upon which Biobase depends are automatically loaded. These are BiocGenerics and parallel.

  • Some functions from other packages are “masked”. This means that a recently loaded library defines a function with the same name as that of a previously loaded library–the new function supercedes the old.

  • Vignettes for a package take you through a typical analysis and are a good way to get acquainted with the package.

Biobase comes with Sample Data


data(package="Biobase")  

Data sets in package ‘Biobase’:

SW                      Class to Contain High-Throughput Assays and
                        Experimental Metadata
aaMap                   Dataset: Names and Characteristics of Amino
                        Acids
geneCov                 Sample expression matrix and phenotype
                        data.frames.
geneCovariate           Sample expression matrix and phenotype
                        data.frames.
geneData                Sample expression matrix and phenotype
                        data.frames.
reporter                Example data.frame representing reporter
                        information
sample.ExpressionSet    Dataset of class 'ExpressionSet'
sample.MultiSet         Data set of class 'MultiSet'
seD                     Sample expression matrix and phenotype
                        data.frames.

Let's Load a Small Dataset...

data("aaMap")  # load 'aaMap'

aaMap[1:2, ]  # take a look
      name let.1 let.3   scProp hyPhilic acidic
1  alanine     A   ala nonpolar    FALSE     NA
2 cysteine     C   cys    polar       NA     NA
table(aaMap$scProp)  # tabulate side chain properties

nonpolar    polar 
       9       11 

Partial Listing Biobase Functions

  • There are over 120 function listed and many of them have to do with microarray analysis.
ls("package:Biobase")[c(6, 12, 125, 101, 89, 91, 71, 49, 51, 45, 41, 44)]  # get a listing of some of the methods of Biobase
 [1] "AnnotatedDataFrame" "assayData"          "varMetadata"       
 [4] "sampleNames"        "protocolData"       "pubMedIds"         
 [7] "MIAME"              "featureData"        "featureNames"      
[10] "exprs"              "experimentData"     "ExpressionSet"     

The ExpressionSet

  • Required format for most functions dealing with microarray data

  • Objects with very similar structure used for other types of data (e.g. SummarizedExperiment for NGS data, MRexperiment for microbiome data)

  • So, let's build one from basic components

Constructing an ExpressionSet

  • The ExpressionSet is used to contain microarray data.
  • It is comprised of 7 components, called slots, of the types and content listed.
  • The table headings give the accessor functions used to access the data in the slots.
assayData (exprs) phenoData experimentData annotation
environment AnnotatedDataFrame MIAME object character
expression data sample information description of experiment name the matching annotation database
featureData protocolData classVersion
AnnotatedDataFrame AnnotatedDataFrame Versions
feature information protocol information

The Plan

  • We'll draw from the sample data available within Biobase to build an ExpressionSet from scratch using 4 of the 7 slots
  • We will omit the optional featureData and protocolData for the sake of brevity.
  • The classVersion slot is filled in automatically and need not concern us.

First Component: Expression Matrix

  • We will use geneData for the first component.
data(geneData)  # a sample expression matrix
colnames(geneData)  #  the sample names
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
geneData[1:2, 1:3]  # the matrix itself
                      A        B        C
AFFX-MurIL2_at  192.742  85.7533 176.7570
AFFX-MurIL10_at  97.137 126.1960  77.9216

Second Component: Phenotype Data

  • We need some phenotype data that we can use to classify the samples.
  • This will be the first of two data.frames we use to build the AnnotatedDataFrame of the phenoData slot.
data(geneCov)  # the sample information

head(geneCov)  # take a look at what we have
  cov1 cov2 cov3
A    1    1    1
B    1    1    1
C    1    1    1
D    1    1    1
E    1    2    1
F    1    2    1

How can we change cov1 to 'gender', cov2 to “case”, and cov3 to “type”?

Naming the Covariates

  • We can use the names() function to assign new names to the covariates
names(geneCov)
[1] "cov1" "cov2" "cov3"
names(geneCov) = c("gender", "case", "type")
names(geneCov)
[1] "gender" "case"   "type"  

Annotating the Phenodata

  • We can describe the columns by adding a second data.frame to the phenotype data
  • The first argument to the data.frame constructor is adding 3 columns to the frame by name while the second is labeling the rows.
# Here we describe the columns--this is the second data.frame of phenData

metadata = data.frame(labelDescription = c("patient gender", "Case/control", 
    "Tumor type"), row.names = c("gender", "case", "type"))

metadata
       labelDescription
gender   patient gender
case       Case/control
type         Tumor type

The Phenodata Constructor

  • Now we can construct the phenoData component which is an AnnotatedDataFrame
  • An Annotated DataFrame consists of one data.frame to hold the covariate data and a second that describes the covariates.
  • The new() function is a constructor, taking as its first argument the type of object to construct.
# construct the `phenoData` component.

phenoData = new("AnnotatedDataFrame", data = geneCov, varMetadata = metadata)
phenoData
An object of class 'AnnotatedDataFrame'
  rowNames: A B ... Z (26 total)
  varLabels: gender case type
  varMetadata: labelDescription

Third Component: Annotation

  • This is simply a reference to the hgu95av2 annotation database.
  • The database used is the Affymetrix Human Genome U95 Set annotation data assembled from public repositories
  • It provides mappings between probe identifiers and biological data

    hgu95av2ACCNUM hgu95av2CHR hgu95av2CHRLOC hgu95av2ENZYME
    hgu95av2GENENAME hgu95av2GO hgu95av2MAP hgu95av2OMIM
    hgu95av2PFAM hgu95av2PMID hgu95av2PROSITE hgu95av2REFSEQ
    hgu95av2UNIGENE hgu95av2CHRLENGTHS hgu95av2ENZYME2PROBE hgu95av2GO2ALLPROBES
    hgu95av2PATH2PROBE hgu95av2PMID2PROBE

This is an Easy One

annotation = "hgu95av2"  # add a reference to our annotation database

The MIAMI Component

  • The final component is a description of the experiment, following the MIAMI (Minimum Information About a Microarray Experiment) specification.
experimentData = new("MIAME", name = "C. Darwin", lab = "CCR", contact = "email.gov", 
    title = "Small Experiment", abstract = "An example ExpressionSet", url = "www.lab.gov")

Putting it all Together

Now, we can build the ExpressionSet from the 4 components using the ExpressionSet() constructor.

eset = ExpressionSet(assayData = geneData, phenoData = phenoData, experimentData = experimentData, 
    annotation = annotation)
str(eset)

Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
  ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
  .. .. ..@ name             : chr "C. Darwin"
  .. .. ..@ lab              : chr "CCR"
  .. .. ..@ contact          : chr "email.gov"
  .. .. ..@ title            : chr "Small Experiment"
  .. .. ..@ abstract         : chr "An example ExpressionSet"
  .. .. ..@ url              : chr "www.lab.gov"
  .. .. ..@ pubMedIds        : chr ""
  .. .. ..@ samples          : list()
  .. .. ..@ hybridizations   : list()
  .. .. ..@ normControls     : list()
  .. .. ..@ preprocessing    : list()
  .. .. ..@ other            : list()
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 2
  .. .. .. .. .. ..$ : int [1:3] 1 0 0
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ assayData        :<environment: 0x7f94fbb63378> 
  ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 3 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr [1:3] "patient gender" "Case/control" "Tumor type"
  .. .. ..@ data             :'data.frame': 26 obs. of  3 variables:
  .. .. .. ..$ gender: num [1:26] 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ case  : num [1:26] 1 1 1 1 2 2 2 2 2 2 ...
  .. .. .. ..$ type  : num [1:26] 1 1 1 1 1 1 1 1 2 2 ...
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame': 500 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ annotation       : chr "hgu95av2"
  ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame': 26 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. ..@ .Data:List of 4
  .. .. .. ..$ : int [1:3] 3 3 0
  .. .. .. ..$ : int [1:3] 2 32 0
  .. .. .. ..$ : int [1:3] 1 3 0
  .. .. .. ..$ : int [1:3] 1 0 0

Accessor Functions

  • The ExpressionSet is an object of a new type called 'S4'
  • Such objects provide accessor functions to allow us to get at their data.

  • The data in the 7 slots is available through the accessor functions shown earlier in the table.

  • Note that the covariates are listed in the @data data.frame within the @phenoData slot and that they are prefixed with a $ sign.

  • These elements can be reached directly as though they were elements of a list, e.g. eset$gender.

Analysis of the First Covariate, Gender

  • There are three covariates in the ExpressionSet eset.
  • We'll use the varMetadata accessor function to see them.
varMetadata(eset)  # show the metadata about the phenotypes
       labelDescription
gender   patient gender
case       Case/control
type         Tumor type

Let's examine the effect of gender on our set of 500 probes.

The Plan

  • First, we will explore the data interactively using box plots and running a T-test on a single row of data (one probe).
  • When we have that working, we will scale up to analyze all 500 probes.
  • The exprs() function extracts the expression matrix of 500 probes (rows) and 26 samples (columns) from eset
  • We'll be using exprs() to get at the data

Partitioning by Gender

  • The gender covarate column in the first data.frame in the phenoData slot of eset contains 26 entries, one per sample
    • We can access gender using the $ operator
  • The contents of gender can be used to partition the samples into two groups.
E = exprs(eset)  # extract the exression data as a matrix

gender = eset$gender  # we can get the gender array this way

gender  # take a look
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2

The Plan

  • We will use gender as a grouping vector for our 26 samples
  • We will feed boxplot a model for the expression data in the first row that says:

E[1,] is.a.function.of gender

The is.a.function.of operator is the tilde ~

The Boxplot

  • Group the samples by gender and create the boxplot
  • Use the ~ “is.a.function.of” symbol to define the model
boxplot(E[1, ] ~ gender)

plot of chunk unnamed-chunk-9

A T-Test by Gender

  • Use the same model for a t.test of the probe intensities
t.test(E[1, ] ~ gender)

    Welch Two Sample t-test

data:  E[1, ] by gender
t = -1.0582, df = 23.578, p-value = 0.3007
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -54.97275  17.73203
sample estimates:
mean in group 1 mean in group 2 
       105.8006        124.4210 

Scale up for all 500 Probes

  • Now that we have an analysis that works for one probe, we can apply it to all 500 probes using a for loop.
  • This, by the way, is not the preferred method of making such comparisons but it is simple and illustrates the basic principles.
T = vector()  # initialize an array for P-Values

for (n in 1:500) {
    T[n] = t.test(E[n, ] ~ gender)$p.value
}  # do just what we did above for each probe and extract the P-value from the t.test object

Did the T-Test Loop Work?

length(T)  # just check to make sure we got 500 P-vals
[1] 500
head(T)  # take a look
[1] 0.3006994 0.5717812 0.9852404 0.3062817 0.6786697 0.6793440
which(T <= 0.05)  #use 'which' to see the indices of the significant cases
[1]  30  85 147 199 279 286 301 376 400
  • We see 9 probes with significant P-values

How to Make 9 Boxplots: the Plan

  • Let's make boxplots for these 9 interesting probes.

  • We first divide the graphical display into 12 areas

  • Next, we plot 9 boxplots and label them with probe id and P-Value

The Code

layout(matrix(1:9, 3, 3))

F = featureNames(eset)  # if we want to add the feature names to the graph title

for (n in which(T <= 0.05)) {
    boxplot(E[n, ] ~ gender, main = paste(F[n], format(T[n], digits = 3)))
}  # loop through all cases where the P-val <= 0.05

plot of chunk unnamed-chunk-14

The 9 Boxplots

plot of chunk unnamed-chunk-15

Adding Our Own Covariate

  • During the introductory session we analyzed geneData using hierarchical clustering
  • We found that it broke down into two distinct groups
  • Let's add those groups to the phenoData slot of eset and reanalyze.

  • First, we'll create the distance matrix needed by hclust()

# transpose the expression matrix to cluster samples (columns) rather than probes (rows)
d=dist(t(E)) 
h=hclust(d) # perform the clustering
plot(h) # plot the cluster dendrogram
abline(h=14000,col="red")

Here's the Dendrogram

# transpose the expression matrix to cluster samples (columns) rather than
# probes (rows)
d = dist(t(E))
h = hclust(d)  # perform the clustering
plot(h)  # plot the cluster dendrogram
abline(h = 14000, col = "red")

plot of chunk unnamed-chunk-17

Cutting the Dendrogram

  • We can cut the dendrogram at a height of 14000 to generate two groups.
groups = factor(cutree(h, h = 14000))  # cut the tree at 1400
groups  # take a look at which samples fall into which groups
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
1 1 2 1 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1 
Levels: 1 2

Adding the New Covariate

  • Now we are ready to add a new covariate to eset–we'll call it hc.group.
eset$hc.group = groups  # add the new covariate

varMetadata(eset)  # verify that we have done what we wanted to do
         labelDescription
gender     patient gender
case         Case/control
type           Tumor type
hc.group             <NA>

Running the Loop

  • We can now run our T-tests.
T = vector()
for (n in 1:500) {
    T[n] = t.test(E[n, ] ~ eset$hc.group)$p.value
}  # perform the 500 T-tests
  • Is the following code equivalent the code above?
T=array();for(n in 1:500){T[n]=t.test(E[n,]~groups)$p.value}

How many probes pass the test?

which(T <= 0.05)  # see which pass the test
  [1]   1   3   6   8  12  20  22  23  24  25  27  28  31  35  36  38  46
 [18]  47  48  51  58  65  67  73  74  82  86  89  91  96 106 110 117 118
 [35] 121 122 123 126 128 129 131 134 139 140 146 153 157 158 163 169 174
 [52] 177 178 180 181 184 185 186 189 192 193 196 197 201 203 204 205 207
 [69] 208 210 214 218 220 222 224 226 232 233 235 236 237 239 242 251 253
 [86] 256 262 263 264 265 266 269 270 272 275 276 281 282 288 290 291 295
[103] 296 297 298 299 302 304 306 307 309 312 313 315 316 317 320 326 327
[120] 328 329 333 334 336 341 343 344 346 351 353 355 362 363 369 373 375
[137] 379 381 383 384 387 389 390 392 393 396 397 398 403 404 405 409 410
[154] 423 426 431 432 433 436 437 441 445 446 452 455 458 460 465 467 469
[171] 470 471 474 476 482 483 485 487 494 495 499

Getting Minimum Values

  • If we want to see the minimum P-Value, we can use min()
  • If we want to know which probe has the lowest P-value, we can use which.min():
min(T)
[1] 3.268072e-10
which.min(T)  # Let's see the index of the probe with the best P-value
[1] 272

Boxplotting the Best Case

  • We can boxplot() this best case to see how good it can get.
  • We are using both the rowname and P-Value in the title
  • We are formatting the P=value to remove extraneous decimal places

boxplot(E[which.min(T),]~eset$hc.group,main=paste("Probe",which.min(T),"P-val:",format(T[which.min(T)],digits=3)))

The Best Case

boxplot(E[which.min(T), ] ~ eset$hc.group, main = paste("Probe", which.min(T), 
    "P-val:", format(T[which.min(T)], digits = 3)))

plot of chunk unnamed-chunk-19

Getting the Probe ID for the Best Case

  • What is the the probe id corresponding to the minimum P-value?
featureNames(eset[which.min(T)])
[1] "31511_at"

Biostrings

The Biostrings library provides a container for DNA, RNA, or protein sequences called a BStringSet. These come in three varieties:

DNA RNA Proteins, Peptides
DNAStringSet() RNAStringSet() AAStringSet()

Let's load the Biostrings library and take a look at the sample data provided:

library(Biostrings)  # load the library 'Biostrings'

Biostrings

Next, let's see if there is any sample data to explore:

data(package = "Biostrings")  # list the example data sets of 'Biostrings'
Data sets in package ‘Biostrings’:

BLOSUM100           Scoring matrices
BLOSUM45            Scoring matrices
BLOSUM50            Scoring matrices
BLOSUM62            Scoring matrices
BLOSUM80            Scoring matrices
HNF4alpha           Known HNF4alpha binding sequences
PAM120              Scoring matrices
PAM250              Scoring matrices
PAM30               Scoring matrices
PAM40               Scoring matrices
PAM70               Scoring matrices
phiX174Phage        Versions of bacteriophage phiX174
                    complete genome and sample short reads
...

Biostrings

To try out some functions found in Biostrings, load the phiX174Phage stringset.

data(phiX174Phage)  # load one of the sample data sets
phiX174Phage  # see how it looks
  A DNAStringSet instance of length 6
    width seq                                          names               
[1]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Genbank
[2]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA RF70s
[3]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA SS78
[4]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Bull
[5]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA G97
[6]  5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA NEB03

Biostrings

Remember that you can determine an object's class using class().

class(phiX174Phage)  # determine the object class of 'phiX174Phage'
[1] "DNAStringSet"
attr(,"package")
[1] "Biostrings"

Biostrings

names(phiX174Phage)
[1] "Genbank" "RF70s"   "SS78"    "Bull"    "G97"     "NEB03"  
phiX174Phage$Genbank
  5386-letter "DNAString" instance
seq: GAGTTTTATCGCTTCCATGACGCAGAAGTTAACA...TTCGATAAAAATGATTGGCGTATCCAACCTGCA

Biostrings: Creating a DNAString object

Here we make a DNAString to contain the sequence of exon 4 of the human oncogene homologue (c-abl) of the abelson murine leukemia virus.

abl1 = DNAString("GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGAACCACTTGGTGAAGGTAGCTGATTTTGGCCTGAGCAGGTTGATGACAGGGGACACCTACACAGCCCATGCTGGAGCCAAGTTCCCCATCAAATGGACTGCACCCGAGAGCCTGGCCTACAACAAGTTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC")  # use a 'constructor' to make a DNAString
abl1  # take a look at the new object
  192-letter "DNAString" instance
seq: GATCTTGCTGCCCGAAACTGCCTGGTAGGGGAGA...TTCTCCATCAAGTCCGACGTCTGGGGTAAGGGC
RNAString(abl1)  # Now, feed the new DNAString into RNAString to convert
  192-letter "RNAString" instance
seq: GAUCUUGCUGCCCGAAACUGCCUGGUAGGGGAGA...UUCUCCAUCAAGUCCGACGUCUGGGGUAAGGGC

Biostrings: Basic Functions

reverse(abl1)  # reverse the sequence
  192-letter "DNAString" instance
seq: CGGGAATGGGGTCTGCAGCCTGAACTACCTCTTG...GAGGGGATGGTCCGTCAAAGCCCGTCGTTCTAG
complement(abl1)  # generate the complement of the sequence
  192-letter "DNAString" instance
seq: CTAGAACGACGGGCTTTGACGGACCATCCCCTCT...AAGAGGTAGTTCAGGCTGCAGACCCCATTCCCG
reverseComplement(abl1)  # generate the reverse complement
  192-letter "DNAString" instance
seq: GCCCTTACCCCAGACGTCGGACTTGATGGAGAAC...CTCCCCTACCAGGCAGTTTCGGGCAGCAAGATC

Biostrings: Basic Functions

abl1.aas = translate(abl1)  # translate the sequence
abl1.aas
  64-letter "AAString" instance
seq: DLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWGKG
length(abl1.aas)
[1] 64

Biostrings: alphabetFrequency()

The function alphabetFrequency looks interesting, so let's try it.

a = alphabetFrequency(phiX174Phage)  # generate a table of base counts
a
        A    C    G    T M R W S Y K V H D B N - + .
[1,] 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Can we add rownames to this output?

Biostrings: alphabetFrequency()

names(phiX174Phage)
[1] "Genbank" "RF70s"   "SS78"    "Bull"    "G97"     "NEB03"  
rownames(a) = names(phiX174Phage)

a
           A    C    G    T M R W S Y K V H D B N - + .
Genbank 1291 1157 1254 1684 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RF70s   1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SS78    1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Bull    1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G97     1292 1156 1253 1685 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NEB03   1292 1155 1253 1686 0 0 0 0 0 0 0 0 0 0 0 0 0 0

How do we use this matrix?

  • We can comput %GC for these 6 sequences as follows:
(a[, "G"] + a[, "C"])/(a[, "G"] + a[, "C"] + a[, "A"] + a[, "T"])  # use the variable
  Genbank     RF70s      SS78      Bull       G97     NEB03 
0.4476420 0.4472707 0.4472707 0.4470850 0.4472707 0.4470850 
  • The sequences seem to be very similar in composition. How similar?

We can easily take a couple more steps.

  • Dinucleotide counts
dinucleotideFrequency(phiX174Phage)  # get a table of dinucleotide counts
      AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT
[1,] 395 261 251 383 257 229 267 404 327 347 255 325 312 320 480 572
[2,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[3,] 394 260 253 384 258 229 265 404 327 347 254 325 313 320 480 572
[4,] 395 260 252 384 257 229 266 403 327 346 254 326 313 320 480 573
[5,] 395 261 253 382 257 229 266 404 326 346 254 327 314 320 479 572
[6,] 394 260 253 384 258 227 265 405 327 347 254 325 313 321 480 572
  • Trinucleotide Counts
trinucleotideFrequency(phiX174Phage)  # get a table of trinucleotide counts
     AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA
[1,] 133  67  93 102  47  62  63  89  72  53  74  52  60  56 140 127  91
[2,] 133  66  93 102  47  62  62  89  73  53  75  52  60  56 141 127  91
[3,] 133  66  93 102  47  62  62  89  73  53  75  52  60  56 141 127  91
[4,] 133  66  93 103  47  62  62  89  73  53  74  52  60  56 140 128  91
[5,] 133  66  93 103  47  62  62  90  73  53  75  52  60  56 139 127  91
[6,] 133  66  93 102  47  62  62  89  73  53  75  52  60  56 141 127  91
     CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC
[1,]  40  61  64  59  38  70  62  52  92  31  92  78  67 132 127  70  82
[2,]  40  62  64  59  38  70  62  52  92  30  91  78  67 132 127  69  82
[3,]  40  62  64  59  38  70  62  52  92  30  91  78  67 132 127  69  82
[4,]  39  62  64  58  38  71  62  52  92  30  92  78  67 132 126  70  82
[5,]  40  62  63  58  38  71  62  52  92  30  92  78  67 132 127  70  82
[6,]  40  62  64  59  36  69  63  52  92  30  91  78  68 132 127  69  82
     GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG
[1,]  74 101  60  71  67 149  63  72  20 100  54  65  66 140 101  72  23
[2,]  74 102  61  71  66 149  63  72  19 100  55  65  65 140 101  72  24
[3,]  74 102  61  71  66 149  63  72  19 100  55  65  65 140 101  72  24
[4,]  74 101  61  71  66 148  62  72  20 100  54  65  67 140 101  73  23
[5,]  74 100  61  71  66 148  62  72  19 101  55  65  67 140 101  73  24
[6,]  74 102  61  71  66 149  63  72  19 100  55  65  65 140 101  72  24
     TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
[1,] 116  91  58  67 104 139 130 130  81 120 132 142 178
[2,] 116  91  58  67 104 138 130 130  82 120 132 142 178
[3,] 116  91  58  67 104 138 130 130  82 120 132 142 178
[4,] 116  91  58  67 104 139 129 130  82 121 132 141 179
[5,] 116  91  58  67 104 138 129 130  82 121 132 141 178
[6,] 116  91  58  68 104 138 130 130  82 120 132 142 178
  • Oligonucleotide counts
O = oligonucleotideFrequency(phiX174Phage, 5)
hist(O)

plot of chunk unnamed-chunk-34

BSgenome

The BSgenome library provides a container for genomes. A number of genomes have been packaged for Bioconductor.

library(BSgenome)  # load the library 'BSgenome'
available.genomes()[1:20]  # get a list of genomes you can install
 [1] "BSgenome.Alyrata.JGI.v1"                 
 [2] "BSgenome.Amellifera.BeeBase.assembly4"   
 [3] "BSgenome.Amellifera.UCSC.apiMel2"        
 [4] "BSgenome.Amellifera.UCSC.apiMel2.masked" 
 [5] "BSgenome.Athaliana.TAIR.04232008"        
 [6] "BSgenome.Athaliana.TAIR.TAIR9"           
 [7] "BSgenome.Btaurus.UCSC.bosTau3"           
 [8] "BSgenome.Btaurus.UCSC.bosTau3.masked"    
 [9] "BSgenome.Btaurus.UCSC.bosTau4"           
[10] "BSgenome.Btaurus.UCSC.bosTau4.masked"    
[11] "BSgenome.Btaurus.UCSC.bosTau6"           
[12] "BSgenome.Btaurus.UCSC.bosTau6.masked"    
[13] "BSgenome.Btaurus.UCSC.bosTau8"           
[14] "BSgenome.Celegans.UCSC.ce10"             
[15] "BSgenome.Celegans.UCSC.ce11"             
[16] "BSgenome.Celegans.UCSC.ce2"              
[17] "BSgenome.Celegans.UCSC.ce6"              
[18] "BSgenome.Cfamiliaris.UCSC.canFam2"       
[19] "BSgenome.Cfamiliaris.UCSC.canFam2.masked"
[20] "BSgenome.Cfamiliaris.UCSC.canFam3"       

Let's load the C. elegans genome to experiment.

library(BSgenome.Celegans.UCSC.ce6)  # load your local copy of the C. elegans genome

What can we do with BSgenomes?

Use ls("package:BSgenome") to see

ls("package:BSgenome")[1:20]  # list the methods of 'BSgenome'
 [1] "alphabetFrequency"           "as.data.frame"              
 [3] "as.data.frame.BSgenomeViews" "as.list"                    
 [5] "available.genomes"           "available.SNPs"             
 [7] "blocksizes"                  "breakpoints"                
 [9] "bsapply"                     "BSgenome"                   
[11] "BSgenomeViews"               "coerce"                     
[13] "colnames"                    "commonName"                 
[15] "compatibleGenomes"           "consensusMatrix"            
[17] "consensusString"             "countPWM"                   
[19] "elementNROWS"                "end"                        
BSgenome.Celegans.UCSC.ce6  # see some header information
Worm genome:
# organism: Caenorhabditis elegans (Worm)
# provider: UCSC
# provider version: ce6
# release date: May 2008
# release name: WormBase v. WS190
# 7 sequences:
#   chrI   chrII  chrIII chrIV  chrV   chrX   chrM                       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
# operator to access a given sequence)
BSgenome.Celegans.UCSC.ce6[["chrI"]]  # display chromosome I
  15072421-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
seqinfo(BSgenome.Celegans.UCSC.ce6)  # get a summary of the sequences included
Seqinfo object with 7 sequences (1 circular) from ce6 genome:
  seqnames seqlengths isCircular genome
  chrI       15072421      FALSE    ce6
  chrII      15279323      FALSE    ce6
  chrIII     13783681      FALSE    ce6
  chrIV      17493785      FALSE    ce6
  chrV       20919568      FALSE    ce6
  chrX       17718854      FALSE    ce6
  chrM          13794       TRUE    ce6
seqnames(BSgenome.Celegans.UCSC.ce6)  # get only the seq names
[1] "chrI"   "chrII"  "chrIII" "chrIV"  "chrV"   "chrX"   "chrM"  
seqlengths(BSgenome.Celegans.UCSC.ce6)  # get an array of the sequence lengths
    chrI    chrII   chrIII    chrIV     chrV     chrX     chrM 
15072421 15279323 13783681 17493785 20919568 17718854    13794 
barplot(seqlengths(BSgenome.Celegans.UCSC.ce6))  # generate a quick barplot

plot of chunk unnamed-chunk-37

What is the GC content of the C. elegans genome chromosome I?

F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[["chrI"]])  # get the frequency table

F  # take look at the table 
      A       C       G       T       M       R       W       S       Y 
4835939 2695879 2692150 4848453       0       0       0       0       0 
      K       V       H       D       B       N       -       +       . 
      0       0       0       0       0       0       0       0       0 
sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100  # compute % GC 
[1] 35.7476

Exercise: How can we comput %GC for all chromosomes?

Fr = vector()  # initialize a vector to hold the %GC values

for (s in seqnames(BSgenome.Celegans.UCSC.ce6)) {
    # loop through each of the 'seqnames'--these are the chromosomes
    F = alphabetFrequency(BSgenome.Celegans.UCSC.ce6[[s]])
    # get a table for one chromosome,compute %GC and store this in 'Fr[s]'
    Fr[s] = sum(F[c("G", "C")])/sum(F[c("A", "T", "C", "G")]) * 100
}


Fr  # display Fr to see if we got it right
    chrI    chrII   chrIII    chrIV     chrV     chrX     chrM 
35.74760 36.20196 35.66144 34.59386 35.42942 35.20301 23.77845 
barplot(Fr, ylab = "%GC", xlab = "C. elegans Chromosome", main = "GC Percentage by Chromosome")  # barplot it!

plot of chunk unnamed-chunk-40

GenomicFeatures

The GenomicFeatures library provides the functions for the extraction of the sequences of features such as genes, exons, introns, and promoters from BSgenomes.

Let's take a look at the C. elegans genome to start.

library(TxDb.Celegans.UCSC.ce6.ensGene)  # load the C. elegans transcript database

By loading this transcript database, we have also loaded a new library called GenomicFeatures that is required to use the database. What new functions have we acquired?

ls("package:GenomicFeatures")[1:40]  # list the methods of 'GenomicFeatures'
 [1] "as.list"                    "asBED"                     
 [3] "asGFF"                      "cds"                       
 [5] "cdsBy"                      "cdsByOverlaps"             
 [7] "coverageByTranscript"       "DEFAULT_CIRC_SEQS"         
 [9] "disjointExons"              "distance"                  
[11] "exons"                      "exonsBy"                   
[13] "exonsByOverlaps"            "extractTranscriptSeqs"     
[15] "extractUpstreamSeqs"        "features"                  
[17] "fiveUTRsByTranscript"       "genes"                     
[19] "getChromInfoFromBiomart"    "getChromInfoFromUCSC"      
[21] "getPromoterSeq"             "id2name"                   
[23] "intronsByTranscript"        "isActiveSeq"               
[25] "isActiveSeq<-"              "makeFDbPackageFromUCSC"    
[27] "makeFeatureDbFromUCSC"      "makePackageName"           
[29] "makeTxDb"                   "makeTxDbFromBiomart"       
[31] "makeTxDbFromGFF"            "makeTxDbFromGRanges"       
[33] "makeTxDbFromUCSC"           "makeTxDbPackage"           
[35] "makeTxDbPackageFromBiomart" "makeTxDbPackageFromUCSC"   
[37] "mapFromTranscripts"         "mapIdsToRanges"            
[39] "mapRangesToIds"             "mapToTranscripts"          

The functions exons(), genes(), and exons.By() look interesting so let's try them out:

ce.exons = exons(TxDb.Celegans.UCSC.ce6.ensGene)  # get the exons coordinates
head(ce.exons, 2)  # take a look
GRanges object with 2 ranges and 1 metadata column:
      seqnames         ranges strand |   exon_id
         <Rle>      <IRanges>  <Rle> | <integer>
  [1]     chrI [11495, 11561]      + |         1
  [2]     chrI [11499, 11557]      + |         2
  -------
  seqinfo: 7 sequences (1 circular) from ce6 genome
ce.genes = genes(TxDb.Celegans.UCSC.ce6.ensGene)  # get the gene coordinates
head(ce.genes, 2)  #take a look
GRanges object with 2 ranges and 1 metadata column:
          seqnames               ranges strand |     gene_id
             <Rle>            <IRanges>  <Rle> | <character>
   2L52.1    chrII [    1867,     4663]      + |      2L52.1
  2RSSE.1    chrII [15268114, 15273216]      + |     2RSSE.1
  -------
  seqinfo: 7 sequences (1 circular) from ce6 genome

The gene_ids in the first metadata column are ENSEMBL ids and can be used to search in the UCSC Genome Browser.

# get the exons with their gene assignments
ce.exons.bygene = exonsBy(TxDb.Celegans.UCSC.ce6.ensGene, by = c("gene"))
head(ce.exons.bygene, 1)  # take a look
GRangesList object of length 1:
$2L52.1 
GRanges object with 7 ranges and 2 metadata columns:
      seqnames       ranges strand |   exon_id   exon_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]    chrII [1867, 1911]      + |     21558        <NA>
  [2]    chrII [2506, 2694]      + |     21559        <NA>
  [3]    chrII [2738, 2888]      + |     21560        <NA>
  [4]    chrII [2931, 3036]      + |     21561        <NA>
  [5]    chrII [3406, 3552]      + |     21562        <NA>
  [6]    chrII [3802, 3984]      + |     21563        <NA>
  [7]    chrII [4201, 4663]      + |     21564        <NA>

-------
seqinfo: 7 sequences (1 circular) from ce6 genome

GenomicRanges

The GenomicRanges library provides the framework for feature extraction and analysis. Among its most powerful functions are those used ot determine feature overlap. We load the library as usual:

library("GenomicRanges")  # load the library 'GenomicRanges'

And we list the newly acquired functions in the usual manner:

ls("package:GenomicRanges")[1:20]  # list the methods of 'GenomicRanges'
 [1] "absoluteRanges"           "as.data.frame"           
 [3] "as.factor"                "bindAsGRanges"           
 [5] "binnedAverage"            "checkConstraint"         
 [7] "coerce"                   "countOverlaps"           
 [9] "coverage"                 "disjoin"                 
[11] "disjointBins"             "distance"                
[13] "distanceToNearest"        "duplicated"              
[15] "duplicated.GenomicRanges" "elementMetadata"         
[17] "elementMetadata<-"        "end"                     
[19] "end<-"                    "findOverlaps"            

Let's try out a few

length(ce.exons)  # how many exons are there in C. elegans?
[1] 146833
head(width(ce.exons))  # get the lengths for the first 6 exons
[1] 67 59 63 57 72 67

Take a free histogram

hist(width(ce.exons), main = "Distribution of Exon Lengths for C. Elegans")

plot of chunk unnamed-chunk-46

The histogram looks a bit cramped. How can we correct this?

hist(log10(width(ce.exons)), main = "Distribution of Exon Lengths for C. Elegans")

plot of chunk unnamed-chunk-47

GenomicAlignments

The GenomicAlignments library provides containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide sequences of the alignments.

library(GenomicAlignments)

Exercise: What new functions have we acquired?

bam <- readGAlignments(system.file("extdata", "ex1.bam", package = "Rsamtools"))

bam[1:2]
GAlignments object with 2 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     seq1      +         36M        36         1        36        36
  [2]     seq1      +         35M        35         3        37        35
          njunc
      <integer>
  [1]         0
  [2]         0
  -------
  seqinfo: 2 sequences from an unspecified genome
B7_591:4:96:693:509     73      seq1    1       99      36M     *       0       0
       CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG    <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
MF:i:18 Aq:i:73 NM:i:0  UQ:i:0  H0:i:1  H1:i:0
EAS54_65:7:152:368:113  73      seq1    3       99      35M     *       0       0
       CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT     <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):     MF:i:18 Aq:i:66 NM:i:0  UQ:i:0  H0:i:1  H1:i:0


Cigar Strings

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M
sort(table(cigar(bam)), decreasing = TRUE)

     35M      36M      40M      34M      33M 14M4I17M 15M4I16M 16M4I15M 
    2804      283      112       37        6        4        4        3 
 7M4I24M 17M4I14M  9M2I24M 10M4I21M 11M5I19M 12M4I19M 13M1D22M 13M4I18M 
       3        2        2        1        1        1        1        1 
14M5I17M 18M4I13M 18M5I12M  7M2I31M  8M4I24M  9M1D26M 
       1        1        1        1        1        1 

Counting Overlaps Between Features

  • Overlaps between genomic features can be computed using findOverlaps.
  • Let's first count overlaps between exons in our ce.exons object to illustrate basic principles

Overlap the first exon with the others

ce.exons[1:5]
GRanges object with 5 ranges and 1 metadata column:
      seqnames         ranges strand |   exon_id
         <Rle>      <IRanges>  <Rle> | <integer>
  [1]     chrI [11495, 11561]      + |         1
  [2]     chrI [11499, 11557]      + |         2
  [3]     chrI [11499, 11561]      + |         3
  [4]     chrI [11505, 11561]      + |         4
  [5]     chrI [11618, 11689]      + |         5
  -------
  seqinfo: 7 sequences (1 circular) from ce6 genome
findOverlaps(ce.exons[1], ce.exons[-1], ignore.strand = TRUE)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         1           3
  -------
  queryLength: 1 / subjectLength: 146832

Distribution of # overlaps for first 100 exons

E = array()
for (i in 1:100) {
    F = findOverlaps(ce.exons[i], ce.exons[-i], ignore.strand = TRUE)
    E[i] = length(queryHits(F))
}
hist(E)

plot of chunk unnamed-chunk-52

Palindromes witin exons: the Plan

  • Find palindromes on chr1
  • Format result for findOverlaps
  • Get overlaps with exons
P = findPalindromes(BSgenome.Celegans.UCSC.ce6$chrI)
length(P)
[1] 213154

The palindromes

  • Are the two objects compatible?
P
  Views on a 15072421-letter DNAString subject
subject: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC...CTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
views:
            start      end width
     [1]      460      468     9 [TTTTCAAAA]
     [2]      488      497    10 [AAAACGTTTT]
     [3]      507      514     8 [GAAGCTTC]
     [4]      573      582    10 [TTAGGCCTAA]
     [5]      856      863     8 [GTAGCTAC]
     ...      ...      ...   ... ...
[213150] 15071791 15071798     8 [AAAATTTT]
[213151] 15071985 15071993     9 [GTCTGAGAC]
[213152] 15072046 15072053     8 [TAGTACTA]
[213153] 15072053 15072064    12 [AGTGCCGGCACT]
[213154] 15072085 15072093     9 [GAATGATTC]
class(ce.exons)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

Converting to GRanges

  • Use the GRanges constructor function
Pg = GRanges(seq = "chrI", IRanges(start = start(P), end = end(P)))

Pg
GRanges object with 213154 ranges and 0 metadata columns:
           seqnames               ranges strand
              <Rle>            <IRanges>  <Rle>
       [1]     chrI           [460, 468]      *
       [2]     chrI           [488, 497]      *
       [3]     chrI           [507, 514]      *
       [4]     chrI           [573, 582]      *
       [5]     chrI           [856, 863]      *
       ...      ...                  ...    ...
  [213150]     chrI [15071791, 15071798]      *
  [213151]     chrI [15071985, 15071993]      *
  [213152]     chrI [15072046, 15072053]      *
  [213153]     chrI [15072053, 15072064]      *
  [213154]     chrI [15072085, 15072093]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finding the overlaps

F = findOverlaps(Pg, ce.exons)
length(F)
[1] 46436

Selecting palindromes by width

  • Maybe longer palindromes are underrepresented within exons

  • The histogram suggests a cutoff of 101.2 (15) might define a subset of long palindromes

hist(log10(width(P)))

plot of chunk unnamed-chunk-57

There are 5889 long palindromes on chr1

  • Here we subset the P “XStringViews” object
  • Then, create a new GenomicRanges object from its coordinate list
P15 = P[width(P) > 15]
Pg15 = GRanges(seq = "chrI", IRanges(start = start(P15), end = end(P15)))
length(Pg15)
[1] 5889

Long palidromes seem to avoid exons

  • 5% of long palindromes overlap with exons
  • 22% of all palindromes overlap with exons
F15 = findOverlaps(Pg15, ce.exons)
length(F15)
[1] 310
length(F15)/length(Pg15)
[1] 0.05264052
length(F)/length(Pg)
[1] 0.2178519