Introduction to Bioconductor and report generation with R
Objectives
- To explore Bioconductor, a repository for R packages related to biological data analysis.
- To learn about options for report generation with R: R Markdown and Quarto.
Introducing Bioconductor
Bioconductor is both an open-source project and a repository for R packages related to the analysis of biological data, primarily in bioinformatics and computational biology. As such, it is a great place to search for -omics packages and pipelines. Read more about the goals of the Bioconductor project here.
Since its inception in 2001, the Bioconductor project has kept pace with emerging technologies from microarrays to spatial transcriptomics.
The current release of Bioconductor (v 3.18) contains:
- 2,266 software packages
- 429 experiment data packages
- 920 annotation packages
- 30 workflows
- 4 books
What types of packages are available in Bioconductor?
Bioconductor packages are divided into four types:
- software
- annotation data
- experiment data
- workflows
Software packages themselves can be subdivided into packages that provide infrastructure (i.e., classes) to store and access data, and packages that provide methodological tools to process data stored in those data structures. This separation of structure and analysis is at the core of the Bioconductor project, encouraging developers of new methodological software packages to thoughtfully re-use existing data containers where possible, and reducing the cognitive burden imposed on users who can more easily experiment with alternative workflows without the need to learn and convert between different data structures.
Annotation data packages provide self-contained databases of diverse genomic annotations (e.g., gene identifiers, biological pathways). Different collections of annotation packages can be found in the Bioconductor project. They are identifiable by their respective naming pattern, and the information that they contain. For instance, the so-called OrgDb packages (e.g., the org.Hs.eg.db package) provide information mapping different types of gene identifiers and pathway databases; the so-called EnsDb packages (e.g., EnsDb.Hsapiens.v86) encapsulate individual versions of the Ensembl annotations in Bioconductor packages; and the so-called TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene) encapsulate individual versions of UCSC gene annotation tables.
Experiment data packages provide self-contained datasets that are often used by software package developers to demonstrate the use of their package on well-known standard datasets in their package vignettes.
Finally, workflow packages exclusively provide collections of vignettes that demonstrate the combined usage of several other packages as a coherent workflow, but do not provide any new source code or functionality themselves.
--- Introduction to Bioconductor from The Bioconductor Project, a lesson in the Carpentries Incubator
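As a rough illustration of how an annotation package is used, the sketch below maps gene symbols to Entrez identifiers with the org.Hs.eg.db package named above. The gene symbols are arbitrary examples, and the code assumes that org.Hs.eg.db and AnnotationDbi are already installed; otherwise it only prints a hint.

```r
# Gene symbols chosen purely for illustration.
symbols <- c("TP53", "BRCA1")

if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  # mapIds() looks up one column (ENTREZID) for each key (gene SYMBOL).
  entrez <- AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys = symbols,
    column = "ENTREZID",
    keytype = "SYMBOL"
  )
  print(entrez)
} else {
  message("Install org.Hs.eg.db with BiocManager to run this example.")
}
```

The same pattern should apply to other OrgDb packages by swapping the database object.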
For a comprehensive list of packages ranked by number of downloads, click here.
Bioconductor versions and install
Bioconductor release schedule
New versions of Bioconductor are released every 6 months and work with a specific version of R.
Because of this release schedule and associated automated testing, "each Bioconductor release provides a suite of packages that are mutually compatible, traceable, and guaranteed to function for the associated version of R." --- Introduction to Bioconductor from The Bioconductor Project.
The latest version of Bioconductor (Bioconductor 3.18) works with R version 4.3. You may need to update your R installation.
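To check whether your R installation matches, a minimal sketch along these lines compares the running R version with the series expected by the release. The helper name `r_series` is made up for this example, and "4.3" is taken from the sentence above.

```r
# Hypothetical helper: reduce a version such as "4.3.2" to its series "4.3".
r_series <- function(v = getRversion()) {
  parts <- strsplit(as.character(v), ".", fixed = TRUE)[[1]]
  paste(parts[1:2], collapse = ".")
}

expected_series <- "4.3"  # R series paired with Bioconductor 3.18 (see text)

if (r_series() == expected_series) {
  message("R ", getRversion(), " matches the expected ", expected_series, " series.")
} else {
  message("R ", getRversion(), " is not in the ", expected_series,
          " series; a different R or Bioconductor version may be needed.")
}
```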
How to install a Bioconductor package?
To install a Bioconductor package, you will first need to install BiocManager, a CRAN package. You can then use BiocManager to install the Bioconductor core packages and specific packages.
To install the Bioconductor core packages, use the following:
# install core packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.18")
To install a specific package:
BiocManager::install("tidybulk") #replace tidybulk with the name of
#the package that interests you.
To update installed Bioconductor packages, use:
BiocManager::install()
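After installing or updating, it can help to confirm which Bioconductor version is in use and whether the installed packages are consistent with it. The sketch below uses `BiocManager::version()` and `BiocManager::valid()`; both may need internet access, so errors are reported rather than treated as fatal.

```r
# Only run the checks if BiocManager is available.
has_biocmanager <- requireNamespace("BiocManager", quietly = TRUE)

if (has_biocmanager) {
  tryCatch({
    # Bioconductor version currently in use.
    print(BiocManager::version())
    # Reports packages that are out of date or too new for this version.
    print(BiocManager::valid())
  }, error = function(e) {
    message("Check failed: ", conditionMessage(e))
  })
} else {
  message("BiocManager is not installed; run install.packages(\"BiocManager\").")
}
```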
How to find Bioconductor packages of interest?
The easiest way to search Bioconductor for a topic-specific package is to use the BiocViews search. BiocViews includes a controlled vocabulary to categorize Bioconductor packages. Because packages are tagged using this vocabulary, they can be grouped and searched by topic.
Packages are ranked by number of downloads. The more popular the package, the lower its rank number.
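Packages can also be searched from within R. The sketch below wraps `BiocManager::available()`, which matches a pattern against package names (not BiocViews terms), so it complements rather than replaces the BiocViews search. The wrapper name `search_bioc` is hypothetical.

```r
# Hypothetical wrapper around BiocManager::available().
search_bioc <- function(pattern) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    stop("Install BiocManager first: install.packages(\"BiocManager\")")
  }
  # Returns the names of available packages that match the pattern.
  BiocManager::available(pattern)
}

# Example call (requires internet access):
# search_bioc("tidy")
```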
Bioconductor education and communication
Resources for learning
There are a number of Bioconductor events/conferences throughout the year including the annual BioC conference in North America and similar regional conferences throughout the world (e.g., BioC Asia, BioC Europe). Upcoming events (e.g., conferences, workshops, courses, summer schools, etc.) can be found at the bottom of the home page or in the Events Calendar.
Upcoming Events on the Bioconductor homepage
See the "Learn" tab or card on the Bioconductor website to find additional resources such as course materials, presentations, and vignettes.
You can also use browseVignettes() to browse the vignettes of installed packages directly from R.
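A small sketch of looking up vignettes for a single package follows. It uses the base package grid only as a stand-in that is present in most R installations; in practice you would substitute an installed Bioconductor package such as tidybulk.

```r
# List the vignettes shipped with one installed package.
vig <- vignette(package = "grid")

# The results matrix holds one row per vignette.
print(vig$results[, c("Item", "Title"), drop = FALSE])

# Open the same list in a web browser, but only in an interactive session.
if (interactive()) {
  browseVignettes(package = "grid")
}
```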
Communication
For package support and questions on related topics, there is an active Bioconductor support site that operates similarly to other forums (e.g., Biostars).
There is also a Slack workspace for general community interaction with a range of channels. For example, important announcements are posted to the #general channel in Slack.
Introduction to report generation with R
Reproducibility in science means being able to generate the same experimental / analytical results with a high degree of reliability. This is necessary for research validation, scientific and public trust, innovation, and collaboration.
Reproducibility is not possible without complete transparency and exceptional documentation of all research steps (i.e., from the lab bench to the computer).
On the other hand, reusability refers to the reuse of data, methods, or workflows either for validation or new purposes. Reusability is important for applying methods to new problems, standardizing methodologies, and advancing discovery.
Reusability is also not possible without exceptional documentation.
We can make our research more reproducible and our data and methods more reusable by documenting, documenting, and documenting more...along with other steps (e.g., version control, containerization, etc.).
There are two report-generating systems built into RStudio: R Markdown and Quarto.
Both R Markdown and Quarto support dozens of static and interactive output formats and allow the user to execute code within a larger narrative. Because Quarto is the next generation of R Markdown, it will be the focus here.
What is Quarto?
Quarto® is an open-source scientific and technical publishing system built on Pandoc --- https://quarto.org/
What does this mean? Quarto allows you to combine code, commentary, and other features to tell a story about your data or data analysis using articles, presentations, dashboards, websites, blogs, or books. Click here for a list of supported Pandoc output formats.
Unlike R Markdown, Quarto is NOT an R package but instead is a command line tool.
Why use Quarto
Quarto helps you tell others exactly what you did and how you derived your conclusions - code, results, and conclusions wrapped up in a single document.
Advantages of Quarto compared with other publishing systems:
- Can use with the IDE or editor of your choice: Visual Studio Code, RStudio, JupyterLab/Jupyter notebook, other.
- Does not require R / RStudio.
- Can use directly from the command line.
- Language agnostic; can use the language of your choice (R, Python, Julia, Bash, Observable) and can mix languages in a single document (R, Python, Bash, Observable).
- Easy to share with collaborators who prefer a different language or for mixed language projects.
- Better defaults; consistent syntax and approach across languages.
- Similar to R Markdown but with fewer dependencies, greater consistency, and more flexibility.
Note
Quarto can render most R Markdown (.Rmd) and Jupyter notebook (.ipynb) files out of the box. No edits necessary. This makes it an excellent tool for collaboration.
If you are already invested in R Markdown, you may want to stick with it. For now, there is no plan to discontinue R Markdown, but new feature development is focused on Quarto. However, if you are just getting started with documenting your data analyses and/or you are working on a highly collaborative project, Quarto is a good choice.
Gallery of examples
Let's check out some examples.
Examples of Quarto report types from quarto.org
The Quarto gallery includes many examples of various documentation types. Click on the link to explore more!
Getting Started
Quarto is installed with recent versions of RStudio. When a Quarto document (.qmd file) is rendered, the code blocks are first executed using either knitr or jupyter, and the document, including the code output, is converted to markdown. That markdown is then converted to the final format using pandoc.
Image modified from quarto.org
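Rendering is usually triggered with the Render button in RStudio, but the same pipeline can be started from R code. The sketch below assumes the quarto R package (an interface to the Quarto command line tool) is installed; the wrapper name `render_report` and the file name in the commented call are hypothetical.

```r
# Hypothetical wrapper around quarto::quarto_render().
render_report <- function(input, format = "html") {
  if (!requireNamespace("quarto", quietly = TRUE)) {
    stop("Install the quarto R package first: install.packages(\"quarto\")")
  }
  # Executes the code blocks, then converts the result to the chosen format.
  quarto::quarto_render(input, output_format = format)
}

# Example call, assuming a file named analysis.qmd exists in the working directory:
# render_report("analysis.qmd")
```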
Markdown
Quarto uses markdown for formatting text, images, links, code, and other components in plain text documents. It is helpful to know some markdown to get started, but Quarto can also be used much like a word processor (using a visual editor).
Open a new .qmd file
To get started with Quarto in RStudio, navigate to:
File > New File > Quarto Document
This will open a window to easily modify initial options. Here, we can select a Quarto output type such as a document, presentation, or interactive document, and the output format (e.g., for a document: HTML, PDF, or Word). We can also adjust the engine (knitr or jupyter) and our choice of editor (source vs visual editor).
Don't know markdown? No problem. Use the Visual editor.
One of the great things about Quarto is that you do not really need to know markdown to use it. You can use a "What you see is what you mean (WYSIWYM)" editing interface. This provides an editor toolbar along with other shortcuts to enhance the editing process.
Note
The visual editor can be used along with markdown syntax. They do not need to be mutually exclusive.
You can switch between the visual editor and the source editor at the top of the document.
A new Quarto document in RStudio will include example text to help get you started.
Anatomy of Quarto document
Once you have initiated your document, you can get started documenting your analysis.
There are three basic components to a Quarto document:
- yaml header (bracketed by ---)
  The yaml header or file allows us to control document level or project level options. Here, we can specify formats, themes, executable options, and others.
- markdown text (images, tables, text, etc.)
  Your narrative including images, tables, text, and other elements can be added using markdown syntax or using the visual editor.
- code chunks (bracketed by ```)
  Code blocks can be added using ```{r}```, ```{python}```, or ```{bash}```. Python requires the reticulate package when using knitr. How code blocks and associated output behave can be modified using code chunk options denoted by #|. Many options can also be applied to all code chunks in the yaml header.
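To make the three components concrete, the sketch below writes a minimal .qmd file from R and prints it. The title, file name, chunk label, and chosen options are illustrative assumptions, not requirements; `cars` is a dataset that ships with R.

```r
# Lines of a minimal Quarto document: YAML header, markdown text, one code chunk.
qmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "format: html",
  "execute:",
  "  warning: false",   # document-level option applied to every chunk
  "---",
  "",
  "## Summary",
  "",
  "The `cars` dataset relates speed to stopping distance.",
  "",
  "```{r}",
  "#| label: cars-plot",
  "#| echo: false",
  "plot(cars)",
  "```"
)

# Write the document to a temporary location and show its contents.
qmd_file <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_file)
cat(readLines(qmd_file), sep = "\n")
```

Opening the resulting file in RStudio and rendering it should produce an HTML page in which the plot appears without its source code, because of the `#| echo: false` chunk option.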
Happy documenting!
Additional Resources
If interested in Quarto, check out our recent Coding Club session or navigate to quarto.org and check out the documentation.
Acknowledgements
Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org and The Bioconductor Project: Introduction to Bioconductor from the Carpentries Incubator.