Lesson 7: Introduction to Bioconductor -omics classes (containers)

Objectives

To explore Bioconductor, a repository for R packages related to biological data analysis.
To better understand S4 objects as they relate to the Bioconductor core infrastructure.
To learn more about a popular Bioconductor S4 class, SummarizedExperiment.

What is Bioconductor?

Bioconductor is a repository for R packages related to biological data analysis, primarily bioinformatics and computational biology, and as such it is a great place to search for -omics packages and pipelines.

Bioconductor packages fit within four main categories:

software
annotation data
experiment data
workflows.

As of November 2023, Bioconductor v3.18 included 2,266 software packages, 429 experiment data packages, 920 annotation packages, 30 workflows, and 4 books.
To browse these packages, use BiocViews.

Important things to know about Bioconductor

Bioconductor is a package repository, like CRAN

All Bioconductor packages must be installed following the instructions here: https://bioconductor.org/install

Bioconductor packages are linked in their versions, both to each other and to the version of R

Bioconductor’s installation function will look up your version of R and give you the appropriate versions of Bioconductor packages

If you want the latest version of Bioconductor, you need to use the latest version of R
--- Michael Love

Core infrastructure

An important goal of the Bioconductor project is interoperability, or the ability of packages to work together using shared data classes and methods. This is achieved through the use of common data structures. These common data structures are "implemented as classes in the S4 object-oriented system of the R language".

What is object oriented programming?

R is generally thought of as a functional programming language, where the focus is on the functions rather than the object, and the output of the function is always the same given the same inputs. This is a particular useful way of approaching problems, especially when the focus is on data analysis. This differs from an object oriented language, where the functions are built around an object (e.g., a data structure), and the output of a function can change based on the attributes of the object. This can also be a useful way to analyze data. As you may have guessed, these programming paradigms are not mutually exclusive and both are used with R programming.

Comparing functional programming vs object oriented programming

For more information comparing these paradigms, check out this brief article from Medium.

Object-oriented programming allows the same function to be used for many different types of input (e.g., summary()) (Advanced R).

Note

We work with S3 objects all the time in R and these are prominent when using base R programming.

Terms to know for object oriented programming with R

class - object type
method - implementation for a class; describes what an object can do
method dispatch - finds the correct method for a class

Classes are organised in a hierarchy so that if a method does not exist for one class, its parent’s method is used, and the child is said to inherit behaviour. For example, in R, an ordered factor inherits from a regular factor, and a generalised linear model inherits from a linear model. --- Advanced R

S4 objects and Bioconductor

-omics data can be fairly complex, and data structures are a useful way of organizing and working with complex data. Many data structures are used or extended across multiple Bioconductor packages, thereby allowing users to approach and use new packages with less of a learning hurdle.

Understanding S4 classes

This lesson will not go into too much detail regarding the S4 class system. You do not need to know that much about S4 systems to use S4 objects. However, some of the basics will be described below. If you would like to learn more about S4 classes, see this lesson with The Carpentries and this chapter of Advanced R.

What is the S4 system?

S4 is a rigorous system that forces you to think carefully about program design. It’s particularly well-suited for building large systems that evolve over time and will receive contributions from many programmers. --- Advanced R

S4 classes create the data structures that store complex information (e.g., biological assay data and metadata) in one or more slots. The entire structure can then be assigned to an R object. The types of information in each slot of the object is tightly controlled. S4 generics and methods define functions that can be applied to these objects. See this lesson from The Carpentries, authored by the Bioconductor Project community, for more information.

In brief, S4 objects have slots that store defined types of information. The use of the object can be extended by defining a new class, which will inherit the attributes of the old class, including all of the slots, and add new slots.

Info

You can find a list of common classes used by the Bioconductor Project here.

Let's take a look at the SummarizedExperiment, a commonly used class in many Bioconductor packages.

Introducing the SummarizedExperiment

The SummarizedExperiment class and the inherited class RangedSummarizedExperiment are available in the R package SummarizedExperiment.

SummarizedExperiment is a matrix-like container where rows represent features of interest (e.g. genes, transcripts, exons, etc.) and columns represent samples. The objects contain one or more assays, each represented by a matrix-like object of numeric or other mode. --- SummarizedExperiment vignette

Image from the SummarizedExperiment vignette.

The advantage of using a SummarizedExperiment is that when manipulating the object, the data elements in each slot maintain consistency, meaning if a sample is removed, the corresponding sample column in the expression data would also be removed. This keeps you from accidentally including data that should have been excluded or vice versa.

Working with a summarized experiment.

We can use the airway package to see how this container works, including how to access and subset the data.

What is the airway package?

There are data sets available in R to practice with or showcase different packages. The airway data is from Himes et al. (2014). These data, which are contained within a RangedSummarizedExperiment object, are from a bulk RNAseq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell samples.

The airway data is packaged in a RangedSummarizedExperiment. A RangedSummarizedExperiment inherits the features of a SummarizedExperiment but instead of only including feature metadata for the rows, it also includes genomic position information.

Note

Other popular classes that inherit from the SummarizedExperiment include the DESeqDataSet and the SingleCellExperiment.

library(SummarizedExperiment)
library(airway)
data(airway, package="airway")
se <- airway
se

## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

Accessors

As you can see from the image, there are several accessor functions to access the data from the object:

assays() - access matrix-like experimental data (e.g., count data). Rows are genomic features (e.g., transcripts) and columns are samples.

This can hold more than one assay, and each assay should be called directly with the $ accessor.

assays(se) #we see one assay listed

## List of length 1
## names(1): counts

head(assays(se)$counts) #calling the counts with $

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0

colData() - access sample metadata, as a DataFrame

colData(se)

## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677

rowData() - access feature metadata (e.g., differential expression results)

rowData(se)

## DataFrame with 63677 rows and 10 columns
##                         gene_id     gene_name  entrezid   gene_biotype
##                     <character>   <character> <integer>    <character>
## ENSG00000000003 ENSG00000000003        TSPAN6        NA protein_coding
## ENSG00000000005 ENSG00000000005          TNMD        NA protein_coding
## ENSG00000000419 ENSG00000000419          DPM1        NA protein_coding
## ENSG00000000457 ENSG00000000457         SCYL3        NA protein_coding
## ENSG00000000460 ENSG00000000460      C1orf112        NA protein_coding
## ...                         ...           ...       ...            ...
## ENSG00000273489 ENSG00000273489 RP11-180C16.1        NA      antisense
## ENSG00000273490 ENSG00000273490        TSEN34        NA protein_coding
## ENSG00000273491 ENSG00000273491  RP11-138A9.2        NA        lincRNA
## ENSG00000273492 ENSG00000273492    AP000230.1        NA        lincRNA
## ENSG00000273493 ENSG00000273493  RP11-80H18.4        NA        lincRNA
##                 gene_seq_start gene_seq_end              seq_name seq_strand
##                      <integer>    <integer>           <character>  <integer>
## ENSG00000000003       99883667     99894988                     X         -1
## ENSG00000000005       99839799     99854882                     X          1
## ENSG00000000419       49551404     49575092                    20         -1
## ENSG00000000457      169818772    169863408                     1         -1
## ENSG00000000460      169631245    169823221                     1          1
## ...                        ...          ...                   ...        ...
## ENSG00000273489      131178723    131182453                     7         -1
## ENSG00000273490       54693789     54697585 HSCHR19LRC_LRC_J_CTG1          1
## ENSG00000273491      130600118    130603315          HG1308_PATCH          1
## ENSG00000273492       27543189     27589700                    21          1
## ENSG00000273493       58315692     58315845                     3          1
##                 seq_coord_system        symbol
##                        <integer>   <character>
## ENSG00000000003               NA        TSPAN6
## ENSG00000000005               NA          TNMD
## ENSG00000000419               NA          DPM1
## ENSG00000000457               NA         SCYL3
## ENSG00000000460               NA      C1orf112
## ...                          ...           ...
## ENSG00000273489               NA RP11-180C16.1
## ENSG00000273490               NA        TSEN34
## ENSG00000273491               NA  RP11-138A9.2
## ENSG00000273492               NA    AP000230.1
## ENSG00000273493               NA  RP11-80H18.4

rowRanges(se) - because this is a RangedSummarizedExperiment, there is more information about the genomic ranges spanning the exons of each transcript

rowRanges(se)$ENSG00000000003

## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome

metadata() - access experiment wide unstructured metadata (e.g., experimental methods, publication references)

metadata(se)

## [[1]]
## Experiment data
##   Experimenter name: Himes BE 
##   Laboratory: NA 
##   Contact information:  
##   Title: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. 
##   URL: http://www.ncbi.nlm.nih.gov/pubmed/24926665 
##   PMIDs: 24926665 
## 
##   Abstract: A 226 word abstract is available. Use 'abstract' method.

```

Note

You can use str() to get an idea of how to access information from each slot. Notice the use of @.

Now that we know how to access the information stored in the object se, how can we subset it?

Subsetting and manipulating `SummarizedExperiments`

First, notice that you can easily access columns from the sample metadata (colData()) using $.

Using brackets to subset:

se$SampleName

## [1] GSM1275862 GSM1275863 GSM1275866 GSM1275867 GSM1275870 GSM1275871 GSM1275874
## [8] GSM1275875
## 8 Levels: GSM1275862 GSM1275863 GSM1275866 GSM1275867 ... GSM1275875

se$dex

## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: trt untrt

levels(se$dex)

## [1] "trt"   "untrt"

We can also use 2D subsetting like we would for a data frame (i.e., [] notation), where left of the comma applies to the rows (features) and right of the comma applies to the columns (samples).

If we only wanted the first 10 transcripts and the first three samples, we could use

se[1:10,1:3]

## class: RangedSummarizedExperiment 
## dim: 10 3 
## metadata(1): ''
## assays(1): counts
## rownames(10): ENSG00000000003 ENSG00000000005 ... ENSG00000001084
##   ENSG00000001167
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(3): SRR1039508 SRR1039509 SRR1039512
## colData names(9): SampleName cell ... Sample BioSample

We can also apply conditions using [].

For example, if we only want to include treated samples.

se[ ,se$dex=="trt"] #notice the dimensions of the data have changed

## class: RangedSummarizedExperiment 
## dim: 63677 4 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

se[ ,se$dex=="trt"]$dex #we can check to see that only trt samples remain

## [1] trt trt trt trt
## Levels: trt untrt

We can do the same with feature information. For example, if we only want to include protein_coding transcripts:

se[rowData(se)$gene_biotype == "protein_coding",]

## class: RangedSummarizedExperiment 
## dim: 22810 8 
## metadata(1): ''
## assays(1): counts
## rownames(22810): ENSG00000000003 ENSG00000000005 ... ENSG00000273482
##   ENSG00000273490
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

OR

We could filter using the count assay. For example, keeping only transcripts that have a count of at least 10 across a minimum of 4 samples.

transcripts_to_keep <- rowSums(assays(se)$counts >= 10) >= 4
se[transcripts_to_keep,]

## class: RangedSummarizedExperiment 
## dim: 16139 8 
## metadata(1): ''
## assays(1): counts
## rownames(16139): ENSG00000000003 ENSG00000000419 ... ENSG00000273486
##   ENSG00000273487
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

Check out the SummarizedExperiment vignette for more information!

Using `tidySummarizedExperiment`

Why did we focus so heavily on the tidyverse if it can't be used to manipulate Bioconductor objects? Well, for one, regardless of whether you are a user of Bioconductor packages, you will often manipulate data frames, which is where the tidyverse collection of packages is super useful. Second, you can actually use tidyverse commands in certain contexts. The R package, tidySummarizedExperiment allows you to view a SummarizedExperiment object as a tibble (a tidyverse data frame) and provides other tidyverse compatible functions from the packages dplyr, tidyr, ggplot, and plotly.

library(tidySummarizedExperiment)

# filter to samples in the untreated condition  
se %>% filter(dex == "untrt")

## # A SummarizedExperiment-tibble abstraction: 254,708 × 23
## # [90mFeatures=63677 | Samples=4 | Assays=counts[0m
##    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
##    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
##  1 ENSG00000000003 SRR10395…    679 GSM1275862 N613… untrt untrt SRR1…       126
##  2 ENSG00000000005 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
##  3 ENSG00000000419 SRR10395…    467 GSM1275862 N613… untrt untrt SRR1…       126
##  4 ENSG00000000457 SRR10395…    260 GSM1275862 N613… untrt untrt SRR1…       126
##  5 ENSG00000000460 SRR10395…     60 GSM1275862 N613… untrt untrt SRR1…       126
##  6 ENSG00000000938 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
##  7 ENSG00000000971 SRR10395…   3251 GSM1275862 N613… untrt untrt SRR1…       126
##  8 ENSG00000001036 SRR10395…   1433 GSM1275862 N613… untrt untrt SRR1…       126
##  9 ENSG00000001084 SRR10395…    519 GSM1275862 N613… untrt untrt SRR1…       126
## 10 ENSG00000001167 SRR10395…    394 GSM1275862 N613… untrt untrt SRR1…       126
## # ℹ 40 more rows
## # ℹ 14 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## #   gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## #   gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
## #   seq_coord_system <int>, symbol <chr>, GRangesList <list>

# total over all genes, per condition
se %>%
  group_by(dex) %>%
  summarize(total=sum(counts))

## # A tibble: 2 × 2
##   dex      total
##   <fct>    <int>
## 1 trt   85955244
## 2 untrt 89561179

# the features where mean expression > 0
se %>% 
  group_by(.feature) %>%
  mutate(mean_exprs = mean(counts)) %>%
  filter(mean_exprs > 0)

## # A tibble: 267,752 × 24
## # Groups:   .feature [33,469]
##    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
##    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
##  1 ENSG00000000003 SRR10395…    679 GSM1275862 N613… untrt untrt SRR1…       126
##  2 ENSG00000000419 SRR10395…    467 GSM1275862 N613… untrt untrt SRR1…       126
##  3 ENSG00000000457 SRR10395…    260 GSM1275862 N613… untrt untrt SRR1…       126
##  4 ENSG00000000460 SRR10395…     60 GSM1275862 N613… untrt untrt SRR1…       126
##  5 ENSG00000000938 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
##  6 ENSG00000000971 SRR10395…   3251 GSM1275862 N613… untrt untrt SRR1…       126
##  7 ENSG00000001036 SRR10395…   1433 GSM1275862 N613… untrt untrt SRR1…       126
##  8 ENSG00000001084 SRR10395…    519 GSM1275862 N613… untrt untrt SRR1…       126
##  9 ENSG00000001167 SRR10395…    394 GSM1275862 N613… untrt untrt SRR1…       126
## 10 ENSG00000001460 SRR10395…    172 GSM1275862 N613… untrt untrt SRR1…       126
## # ℹ 267,742 more rows
## # ℹ 15 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## #   gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## #   gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
## #   seq_coord_system <int>, symbol <chr>, GRangesList <list>, mean_exprs <dbl>

Other packages uniting bioinformatics and the tidyverse

The following is not a comprehensive list:

tidySingleCellExperiment - for SingleCellExperiment objects
tidyseurat - for Seurat objects

Note

Seurat is not a Bioconductor package. It is a CRAN package.tidyseurat is also a CRAN package.
tidybulk - for analyzing RNA-seq data
tidyHeatmap - make heatmaps from tidy data - a CRAN package

Saving an R object

S4 objects store complex information that isn't necessarily simple to save. If you intend to work with the object further, try using saveRDS.

saveRDS(se, "airways.rds") #save the object

Note

saveRDS() is used to save a single object. To save multiple objects, use save(). To save your entire R environment, use save.image().

To load the object, back to your R environment, use readRDS().

rm(se) # remove the object from the environment
readRDS("./airways.rds") #load the object from file

## # A SummarizedExperiment-tibble abstraction: 509,416 × 23
## # [90mFeatures=63677 | Samples=8 | Assays=counts[0m
##    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
##    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
##  1 ENSG00000000003 SRR10395…    679 GSM1275862 N613… untrt untrt SRR1…       126
##  2 ENSG00000000005 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
##  3 ENSG00000000419 SRR10395…    467 GSM1275862 N613… untrt untrt SRR1…       126
##  4 ENSG00000000457 SRR10395…    260 GSM1275862 N613… untrt untrt SRR1…       126
##  5 ENSG00000000460 SRR10395…     60 GSM1275862 N613… untrt untrt SRR1…       126
##  6 ENSG00000000938 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
##  7 ENSG00000000971 SRR10395…   3251 GSM1275862 N613… untrt untrt SRR1…       126
##  8 ENSG00000001036 SRR10395…   1433 GSM1275862 N613… untrt untrt SRR1…       126
##  9 ENSG00000001084 SRR10395…    519 GSM1275862 N613… untrt untrt SRR1…       126
## 10 ENSG00000001167 SRR10395…    394 GSM1275862 N613… untrt untrt SRR1…       126
## # ℹ 40 more rows
## # ℹ 14 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## #   gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## #   gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
## #   seq_coord_system <int>, symbol <chr>, GRangesList <list>

Acknowledgements

Content from this lesson was inspired by or taken from the following sources:

Lesson 7: Introduction to Bioconductor -omics classes (containers)

Objectives

What is Bioconductor?

Important things to know about Bioconductor

Core infrastructure

What is object oriented programming?

Terms to know for object oriented programming with R

S4 objects and Bioconductor

What is the S4 system?

Introducing the SummarizedExperiment

Working with a summarized experiment.

Accessors

Subsetting and manipulating SummarizedExperiments

Using tidySummarizedExperiment

Other packages uniting bioinformatics and the tidyverse

Saving an R object

Acknowledgements

Subsetting and manipulating `SummarizedExperiments`

Using `tidySummarizedExperiment`