Lesson 7: Introduction to Bioconductor -omics classes (containers)
Objectives
- To explore Bioconductor, a repository for R packages related to biological data analysis.
- To better understand S4 objects as they relate to the Bioconductor core infrastructure.
- To learn more about a popular Bioconductor S4 class,
SummarizedExperiment
.
What is Bioconductor?
Bioconductor is a repository for R packages related to biological data analysis, primarily bioinformatics and computational biology, and as such it is a great place to search for -omics packages and pipelines.
Bioconductor packages fit within four main categories:
- software
- annotation data
- experiment data
- workflows.
As of November 2023, Bioconductor v3.18 included 2,266 software packages, 429 experiment data packages, 920 annotation packages, 30 workflows, and 4 books.
To browse these packages, use BiocViews.
Important things to know about Bioconductor
- Bioconductor is a package repository, like CRAN
- All Bioconductor packages must be installed following the instructions here: https://bioconductor.org/install
- Bioconductor packages are linked in their versions, both to each other and to the version of R
- Bioconductor’s installation function will look up your version of R and give you the appropriate versions of Bioconductor packages
- If you want the latest version of Bioconductor, you need to use the latest version of R
--- Michael Love
Core infrastructure
An important goal of the Bioconductor project is interoperability, or the ability of packages to work together using shared data classes and methods. This is achieved through the use of common data structures. These common data structures are "implemented as classes in the S4 object-oriented system of the R language".
What is object oriented programming?
R is generally thought of as a functional programming language, where the focus is on the functions rather than the object, and the output of the function is always the same given the same inputs. This is a particular useful way of approaching problems, especially when the focus is on data analysis. This differs from an object oriented language, where the functions are built around an object (e.g., a data structure), and the output of a function can change based on the attributes of the object. This can also be a useful way to analyze data. As you may have guessed, these programming paradigms are not mutually exclusive and both are used with R programming.
Comparing functional programming vs object oriented programming
For more information comparing these paradigms, check out this brief article from Medium.
Object-oriented programming allows the same function to be used for many different types of input (e.g., summary()
) (Advanced R).
Note
We work with S3 objects all the time in R and these are prominent when using base R programming.
Terms to know for object oriented programming with R
- class - object type
- method - implementation for a class; describes what an object can do
- method dispatch - finds the correct method for a class
Classes are organised in a hierarchy so that if a method does not exist for one class, its parent’s method is used, and the child is said to inherit behaviour. For example, in R, an ordered factor inherits from a regular factor, and a generalised linear model inherits from a linear model. --- Advanced R
S4 objects and Bioconductor
-omics data can be fairly complex, and data structures are a useful way of organizing and working with complex data. Many data structures are used or extended across multiple Bioconductor packages, thereby allowing users to approach and use new packages with less of a learning hurdle.
Understanding S4 classes
This lesson will not go into too much detail regarding the S4 class system. You do not need to know that much about S4 systems to use S4 objects. However, some of the basics will be described below. If you would like to learn more about S4 classes, see this lesson with The Carpentries and this chapter of Advanced R.
What is the S4 system?
S4 classes create the data structures that store complex information (e.g., biological assay data and metadata) in one or more slots. The entire structure can then be assigned to an R object. The types of information in each slot of the object is tightly controlled. S4 generics and methods define functions that can be applied to these objects. See this lesson from The Carpentries, authored by the Bioconductor Project community, for more information.
In brief, S4 objects have slots that store defined types of information. The use of the object can be extended by defining a new class, which will inherit the attributes of the old class, including all of the slots, and add new slots.
Info
You can find a list of common classes used by the Bioconductor Project here.
Let's take a look at the SummarizedExperiment
, a commonly used class in many Bioconductor packages.
Introducing the SummarizedExperiment
The SummarizedExperiment
class and the inherited class RangedSummarizedExperiment
are available in the R package SummarizedExperiment
.
SummarizedExperiment is a matrix-like container where rows represent features of interest (e.g. genes, transcripts, exons, etc.) and columns represent samples. The objects contain one or more assays, each represented by a matrix-like object of numeric or other mode. --- SummarizedExperiment vignette
Image from the SummarizedExperiment vignette.
The advantage of using a SummarizedExperiment
is that when manipulating the object, the data elements in each slot maintain consistency, meaning if a sample is removed, the corresponding sample column in the expression data would also be removed. This keeps you from accidentally including data that should have been excluded or vice versa.
Working with a summarized experiment.
We can use the airway
package to see how this container works, including how to access and subset the data.
What is the airway
package?
There are data sets available in R to practice with or showcase different packages. The airway data is from Himes et al. (2014). These data, which are contained within a RangedSummarizedExperiment
object, are from a bulk RNAseq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell samples.
The airway
data is packaged in a RangedSummarizedExperiment
. A RangedSummarizedExperiment
inherits the features of a SummarizedExperiment
but instead of only including feature metadata for the rows, it also includes genomic position information.
Note
Other popular classes that inherit from the SummarizedExperiment
include the DESeqDataSet
and the SingleCellExperiment
.
library(SummarizedExperiment)
library(airway)
data(airway, package="airway")
se <- airway
se
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Accessors
As you can see from the image, there are several accessor functions to access the data from the object:
-
assays()
- access matrix-like experimental data (e.g., count data). Rows are genomic features (e.g., transcripts) and columns are samples.This can hold more than one assay, and each assay should be called directly with the
$
accessor.assays(se) #we see one assay listed
## List of length 1 ## names(1): counts
head(assays(se)$counts) #calling the counts with $
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 ## ENSG00000000003 679 448 873 408 1138 ## ENSG00000000005 0 0 0 0 0 ## ENSG00000000419 467 515 621 365 587 ## ENSG00000000457 260 211 263 164 245 ## ENSG00000000460 60 55 40 35 78 ## ENSG00000000938 0 0 2 0 1 ## SRR1039517 SRR1039520 SRR1039521 ## ENSG00000000003 1047 770 572 ## ENSG00000000005 0 0 0 ## ENSG00000000419 799 417 508 ## ENSG00000000457 331 233 229 ## ENSG00000000460 63 76 60 ## ENSG00000000938 0 0 0
-
colData()
- access sample metadata, as aDataFrame
colData(se)
## DataFrame with 8 rows and 9 columns ## SampleName cell dex albut Run avgLength ## <factor> <factor> <factor> <factor> <factor> <integer> ## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## <factor> <factor> <factor> ## SRR1039508 SRX384345 SRS508568 SAMN02422669 ## SRR1039509 SRX384346 SRS508567 SAMN02422675 ## SRR1039512 SRX384349 SRS508571 SAMN02422678 ## SRR1039513 SRX384350 SRS508572 SAMN02422670 ## SRR1039516 SRX384353 SRS508575 SAMN02422682 ## SRR1039517 SRX384354 SRS508576 SAMN02422673 ## SRR1039520 SRX384357 SRS508579 SAMN02422683 ## SRR1039521 SRX384358 SRS508580 SAMN02422677
-
rowData()
- access feature metadata (e.g., differential expression results)rowData(se)
## DataFrame with 63677 rows and 10 columns ## gene_id gene_name entrezid gene_biotype ## <character> <character> <integer> <character> ## ENSG00000000003 ENSG00000000003 TSPAN6 NA protein_coding ## ENSG00000000005 ENSG00000000005 TNMD NA protein_coding ## ENSG00000000419 ENSG00000000419 DPM1 NA protein_coding ## ENSG00000000457 ENSG00000000457 SCYL3 NA protein_coding ## ENSG00000000460 ENSG00000000460 C1orf112 NA protein_coding ## ... ... ... ... ... ## ENSG00000273489 ENSG00000273489 RP11-180C16.1 NA antisense ## ENSG00000273490 ENSG00000273490 TSEN34 NA protein_coding ## ENSG00000273491 ENSG00000273491 RP11-138A9.2 NA lincRNA ## ENSG00000273492 ENSG00000273492 AP000230.1 NA lincRNA ## ENSG00000273493 ENSG00000273493 RP11-80H18.4 NA lincRNA ## gene_seq_start gene_seq_end seq_name seq_strand ## <integer> <integer> <character> <integer> ## ENSG00000000003 99883667 99894988 X -1 ## ENSG00000000005 99839799 99854882 X 1 ## ENSG00000000419 49551404 49575092 20 -1 ## ENSG00000000457 169818772 169863408 1 -1 ## ENSG00000000460 169631245 169823221 1 1 ## ... ... ... ... ... ## ENSG00000273489 131178723 131182453 7 -1 ## ENSG00000273490 54693789 54697585 HSCHR19LRC_LRC_J_CTG1 1 ## ENSG00000273491 130600118 130603315 HG1308_PATCH 1 ## ENSG00000273492 27543189 27589700 21 1 ## ENSG00000273493 58315692 58315845 3 1 ## seq_coord_system symbol ## <integer> <character> ## ENSG00000000003 NA TSPAN6 ## ENSG00000000005 NA TNMD ## ENSG00000000419 NA DPM1 ## ENSG00000000457 NA SCYL3 ## ENSG00000000460 NA C1orf112 ## ... ... ... ## ENSG00000273489 NA RP11-180C16.1 ## ENSG00000273490 NA TSEN34 ## ENSG00000273491 NA RP11-138A9.2 ## ENSG00000273492 NA AP000230.1 ## ENSG00000273493 NA RP11-80H18.4
-
rowRanges(se)
- because this is aRangedSummarizedExperiment
, there is more information about the genomic ranges spanning the exons of each transcriptrowRanges(se)$ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] X 99883667-99884983 - | 667145 ENSE00001459322 ## [2] X 99885756-99885863 - | 667146 ENSE00000868868 ## [3] X 99887482-99887565 - | 667147 ENSE00000401072 ## [4] X 99887538-99887565 - | 667148 ENSE00001849132 ## [5] X 99888402-99888536 - | 667149 ENSE00003554016 ## ... ... ... ... . ... ... ## [13] X 99890555-99890743 - | 667156 ENSE00003512331 ## [14] X 99891188-99891686 - | 667158 ENSE00001886883 ## [15] X 99891605-99891803 - | 667159 ENSE00001855382 ## [16] X 99891790-99892101 - | 667160 ENSE00001863395 ## [17] X 99894942-99894988 - | 667161 ENSE00001828996 ## ------- ## seqinfo: 722 sequences (1 circular) from an unspecified genome
-
-
metadata()
- access experiment wide unstructured metadata (e.g., experimental methods, publication references)metadata(se)
## [[1]] ## Experiment data ## Experimenter name: Himes BE ## Laboratory: NA ## Contact information: ## Title: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. ## URL: http://www.ncbi.nlm.nih.gov/pubmed/24926665 ## PMIDs: 24926665 ## ## Abstract: A 226 word abstract is available. Use 'abstract' method.
```
Note
You can use str()
to get an idea of how to access information from each slot. Notice the use of @
.
Now that we know how to access the information stored in the object se
, how can we subset it?
Subsetting and manipulating SummarizedExperiments
First, notice that you can easily access columns from the sample metadata (colData()
) using $
.
Using brackets to subset:
se$SampleName
## [1] GSM1275862 GSM1275863 GSM1275866 GSM1275867 GSM1275870 GSM1275871 GSM1275874
## [8] GSM1275875
## 8 Levels: GSM1275862 GSM1275863 GSM1275866 GSM1275867 ... GSM1275875
se$dex
## [1] untrt trt untrt trt untrt trt untrt trt
## Levels: trt untrt
levels(se$dex)
## [1] "trt" "untrt"
We can also use 2D subsetting like we would for a data frame (i.e., []
notation), where left of the comma applies to the rows (features) and right of the comma applies to the columns (samples).
If we only wanted the first 10 transcripts and the first three samples, we could use
se[1:10,1:3]
## class: RangedSummarizedExperiment
## dim: 10 3
## metadata(1): ''
## assays(1): counts
## rownames(10): ENSG00000000003 ENSG00000000005 ... ENSG00000001084
## ENSG00000001167
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(3): SRR1039508 SRR1039509 SRR1039512
## colData names(9): SampleName cell ... Sample BioSample
[]
.
For example, if we only want to include treated samples.
se[ ,se$dex=="trt"] #notice the dimensions of the data have changed
## class: RangedSummarizedExperiment
## dim: 63677 4
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
se[ ,se$dex=="trt"]$dex #we can check to see that only trt samples remain
## [1] trt trt trt trt
## Levels: trt untrt
We can do the same with feature information. For example, if we only want to include protein_coding
transcripts:
se[rowData(se)$gene_biotype == "protein_coding",]
## class: RangedSummarizedExperiment
## dim: 22810 8
## metadata(1): ''
## assays(1): counts
## rownames(22810): ENSG00000000003 ENSG00000000005 ... ENSG00000273482
## ENSG00000273490
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
OR
We could filter using the count assay. For example, keeping only transcripts that have a count of at least 10 across a minimum of 4 samples.
transcripts_to_keep <- rowSums(assays(se)$counts >= 10) >= 4
se[transcripts_to_keep,]
## class: RangedSummarizedExperiment
## dim: 16139 8
## metadata(1): ''
## assays(1): counts
## rownames(16139): ENSG00000000003 ENSG00000000419 ... ENSG00000273486
## ENSG00000273487
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Check out the SummarizedExperiment
vignette for more information!
Using tidySummarizedExperiment
Why did we focus so heavily on the tidyverse if it can't be used to manipulate Bioconductor objects? Well, for one, regardless of whether you are a user of Bioconductor packages, you will often manipulate data frames, which is where the tidyverse collection of packages is super useful. Second, you can actually use tidyverse commands in certain contexts. The R package, tidySummarizedExperiment
allows you to view a SummarizedExperiment
object as a tibble (a tidyverse data frame) and provides other tidyverse compatible functions from the packages dplyr
, tidyr
, ggplot
, and plotly
.
library(tidySummarizedExperiment)
# filter to samples in the untreated condition
se %>% filter(dex == "untrt")
## # A SummarizedExperiment-tibble abstraction: 254,708 × 23
## # [90mFeatures=63677 | Samples=4 | Assays=counts[0m
## .feature .sample counts SampleName cell dex albut Run avgLength
## <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1… 126
## 2 ENSG00000000005 SRR10395… 0 GSM1275862 N613… untrt untrt SRR1… 126
## 3 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1… 126
## 4 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1… 126
## 5 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1… 126
## 6 ENSG00000000938 SRR10395… 0 GSM1275862 N613… untrt untrt SRR1… 126
## 7 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1… 126
## 8 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1… 126
## 9 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1… 126
## 10 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1… 126
## # ℹ 40 more rows
## # ℹ 14 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## # gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## # gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
## # seq_coord_system <int>, symbol <chr>, GRangesList <list>
# total over all genes, per condition
se %>%
group_by(dex) %>%
summarize(total=sum(counts))
## # A tibble: 2 × 2
## dex total
## <fct> <int>
## 1 trt 85955244
## 2 untrt 89561179
# the features where mean expression > 0
se %>%
group_by(.feature) %>%
mutate(mean_exprs = mean(counts)) %>%
filter(mean_exprs > 0)
## # A tibble: 267,752 × 24
## # Groups: .feature [33,469]
## .feature .sample counts SampleName cell dex albut Run avgLength
## <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1… 126
## 2 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1… 126
## 3 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1… 126
## 4 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1… 126
## 5 ENSG00000000938 SRR10395… 0 GSM1275862 N613… untrt untrt SRR1… 126
## 6 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1… 126
## 7 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1… 126
## 8 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1… 126
## 9 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1… 126
## 10 ENSG00000001460 SRR10395… 172 GSM1275862 N613… untrt untrt SRR1… 126
## # ℹ 267,742 more rows
## # ℹ 15 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## # gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## # gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
## # seq_coord_system <int>, symbol <chr>, GRangesList <list>, mean_exprs <dbl>
Other packages uniting bioinformatics and the tidyverse
The following is not a comprehensive list:
tidySingleCellExperiment
- forSingleCellExperiment
objects-
tidyseurat
- forSeurat
objectsNote
Seurat
is not a Bioconductor package. It is a CRAN package.tidyseurat
is also a CRAN package. -
tidybulk
- for analyzing RNA-seq data tidyHeatmap
- make heatmaps from tidy data - a CRAN package
Saving an R object
S4 objects store complex information that isn't necessarily simple to save. If you intend to work with the object further, try using saveRDS
.
saveRDS(se, "airways.rds") #save the object
Note
saveRDS()
is used to save a single object. To save multiple objects, use save()
. To save your entire R environment, use save.image()
.
To load the object, back to your R environment, use readRDS()
.
rm(se) # remove the object from the environment
readRDS("./airways.rds") #load the object from file
## # A SummarizedExperiment-tibble abstraction: 509,416 × 23
## # [90mFeatures=63677 | Samples=8 | Assays=counts[0m
## .feature .sample counts SampleName cell dex albut Run avgLength
## <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1… 126
## 2 ENSG00000000005 SRR10395… 0 GSM1275862 N613… untrt untrt SRR1… 126
## 3 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1… 126
## 4 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1… 126
## 5 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1… 126
## 6 ENSG00000000938 SRR10395… 0 GSM1275862 N613… untrt untrt SRR1… 126
## 7 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1… 126
## 8 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1… 126
## 9 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1… 126
## 10 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1… 126
## # ℹ 40 more rows
## # ℹ 14 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## # gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## # gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
## # seq_coord_system <int>, symbol <chr>, GRangesList <list>
Acknowledgements
Content from this lesson was inspired by or taken from the following sources:
- https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html
- https://carpentries-incubator.github.io/bioc-project/05-s4.html#s4-classes-and-methods
- https://carpentries-incubator.github.io/bioc-intro/60-next-steps.html
- https://biodatascience.github.io/compbio/bioc/objects.html