Bioconductor and R Markdown
Objectives
- To explore Bioconductor, a repository for R packages related to biological data analysis.
- To generate high quality data reports using R Markdown to make data analysis more reproducible.
Reminder: Uploading files to RStudio Server
Any files you create today will be erased at the end of the session. You can upload any files you downloaded during the last session using the Upload option in the Files pane.
Introducing Bioconductor
Bioconductor is a repository for R packages related to biological data analysis, primarily bioinformatics and computational biology, and as such it is a great place to search for -omics packages and pipelines.
Package types
Bioconductor packages are divided into four types:
- software
- annotation data
- experiment data
- workflows
Software packages themselves can be subdivided into packages that provide infrastructure (i.e., classes) to store and access data, and packages that provide methodological tools to process data stored in those data structures. This separation of structure and analysis is at the core of the Bioconductor project, encouraging developers of new methodological software packages to thoughtfully re-use existing data containers where possible, and reducing the cognitive burden imposed on users who can more easily experiment with alternative workflows without the need to learn and convert between different data structures.
Annotation data packages provide self-contained databases of diverse genomic annotations (e.g., gene identifiers, biological pathways). Different collections of annotation packages can be found in the Bioconductor project. They are identifiable by their respective naming pattern, and the information that they contain. For instance, the so-called OrgDb packages (e.g., the org.Hs.eg.db package) provide information mapping different types of gene identifiers and pathway databases; the so-called EnsDb packages (e.g., EnsDb.Hsapiens.v86) encapsulate individual versions of the Ensembl annotations in Bioconductor packages; and the so-called TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene) encapsulate individual versions of UCSC gene annotation tables.
Experiment data packages provide self-contained datasets that are often used by software package developers to demonstrate the use of their package on well-known standard datasets in their package vignettes.
Finally, workflow packages exclusively provide collections of vignettes that demonstrate the combined usage of several other packages as a coherent workflow, but do not provide any new source code or functionality themselves.
--- Introduction to Bioconductor from The Bioconductor Project, a lesson in the Carpentries Incubator
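As a brief illustration of how an annotation package is queried, the sketch below maps gene symbols to other identifier types using the org.Hs.eg.db package mentioned above. It assumes that the AnnotationDbi and org.Hs.eg.db packages are already installed. The gene symbols are arbitrary examples chosen for illustration, and the lookup is skipped if either package is missing.

```r
# Example gene symbols (chosen for illustration only).
symbols <- c("TP53", "BRCA1")

# Run the lookup only if the required packages are installed.
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  # select() returns a data frame pairing each key with the requested columns.
  ids <- AnnotationDbi::select(
    org.Hs.eg.db::org.Hs.eg.db,
    keys = symbols,
    keytype = "SYMBOL",
    columns = c("ENTREZID", "ENSEMBL")
  )
  print(ids)
} else {
  message("Install AnnotationDbi and org.Hs.eg.db to run this example.")
}
```

A symbol can map to more than one identifier, so the result may contain more rows than there are input symbols.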
A comprehensive list of packages ranked by number of downloads is available on the Bioconductor website.
Bioconductor versions and install
Bioconductor release schedule
New versions of Bioconductor are released every 6 months and work with a specific version of R.
Because of this release schedule and associated automated testing, "each Bioconductor release provides a suite of packages that are mutually compatible, traceable, and guaranteed to function for the associated version of R." --- Introduction to Bioconductor from The Bioconductor Project.
The latest version of Bioconductor (Bioconductor 3.16) is designed to work with R version 4.2, so you may need to update your R installation before installing it.
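To check whether a given R installation matches this release, something like the following sketch could be used. The pairing of Bioconductor 3.16 with R 4.2 comes from the text above; the helper name bioc_3_16_supported is hypothetical and only illustrates the comparison.

```r
# Assumption taken from the text: Bioconductor 3.16 pairs with R 4.2.x.
# bioc_3_16_supported() is a hypothetical helper, not part of any package.
bioc_3_16_supported <- function(r_version = getRversion()) {
  r_version <- as.package_version(r_version)
  r_version >= "4.2.0" && r_version < "4.3.0"
}

cat("Running R", as.character(getRversion()), "\n")
cat("Matches Bioconductor 3.16:", bioc_3_16_supported(), "\n")

# With BiocManager installed, BiocManager::version() reports the
# Bioconductor version in use (not run here).
```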
How to install a Bioconductor package?
To install a Bioconductor package, you will first need to install BiocManager, a CRAN package. You can then use BiocManager to install the Bioconductor core packages or any specific package.
To install the Bioconductor core packages, use the following:
# Install the Bioconductor core packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# If you just want to install BiocManager, use:
install.packages("BiocManager")
To install a specific package:
# Replace tidybulk with the name of the package that interests you.
BiocManager::install("tidybulk")
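When several packages are needed, it can be convenient to install only those that are missing. The following is a minimal sketch of that idea, not an official Bioconductor recipe. The helper names find_missing_packages and install_if_missing are hypothetical, and DESeq2 is used purely as a second example package name.

```r
# Return the requested packages that are not yet installed.
# (Hypothetical helper; `installed` can be overridden for testing.)
find_missing_packages <- function(pkgs,
                                  installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Install only what is missing, installing BiocManager first if needed.
install_if_missing <- function(pkgs) {
  missing_pkgs <- find_missing_packages(pkgs)
  if (length(missing_pkgs) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(missing_pkgs)
  }
  invisible(missing_pkgs)
}

# Usage (not run here because it downloads packages):
# install_if_missing(c("tidybulk", "DESeq2"))
```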
The easiest way to search Bioconductor for a topic-specific package is to use the BiocViews search. BiocViews provides a controlled vocabulary for categorizing Bioconductor packages. Because packages are tagged using this vocabulary, they can be grouped and searched by topic. For example, searching for RNA-seq related packages returns a list in which the most popular packages appear first.
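Package names can also be searched from within R. The sketch below assumes BiocManager is installed and relies on BiocManager::available(), which lists packages available for installation. The wrapper name search_bioc_names is hypothetical. It matches package names only, not BiocViews terms, so it complements rather than replaces the BiocViews search.

```r
# Case-insensitive search over package names (hypothetical helper).
# By default the names come from BiocManager::available(), which needs
# BiocManager and an internet connection; a character vector can be
# supplied instead.
search_bioc_names <- function(pattern, pkgs = BiocManager::available()) {
  grep(pattern, pkgs, ignore.case = TRUE, value = TRUE)
}

# Offline demonstration with a small, hand-picked set of package names.
example_names <- c("scRNAseq", "DESeq2", "RNAinteract")
print(search_bioc_names("rna", example_names))

# Usage against the live repositories (not run):
# search_bioc_names("rnaseq")
```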
Bioconductor education and communication
Resources for learning
There are a number of Bioconductor events/conferences throughout the year including the annual BioC conference in North America and similar regional conferences throughout the world (e.g., BioC Asia, BioC Europe). Upcoming events (e.g., conferences, workshops, courses, summer schools, etc.) can be found in the Events Calendar.
See the "Learn" card or Help on the Bioconductor website to find additional resources such as course materials, presentations, and vignettes.
Communication
For package support and questions on related topics, there is an active Bioconductor support site that operates similarly to other forums (e.g., Biostars).
There is also a Slack workspace for general community interaction with a range of channels. For example, important announcements are posted to the #general channel in Slack.
Introducing R Markdown
For the purposes of reproducibility or collaboration, it is good practice to generate a report summarizing what has been done along with the output results. This saves collaborators, or your future self, from trying to figure out how results were generated or which script generated them. Fortunately, the rmarkdown package makes it easy to report R code, results, and interpretation together. R Markdown is integrated within RStudio, and the rmarkdown package can be installed using the following:
install.packages("rmarkdown")
In addition, R Markdown reports are dynamic: as code is modified, a new report can easily be generated using the knitr package, which is also integrated into RStudio. The key to knitr is that it mixes explanatory text with code chunks that are executed with each "knit" of the document.
install.packages("knitr")
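To make the mixture of text and code chunks concrete, the sketch below writes a very small R Markdown file from R and prints its contents. The file name, title, and chunk label are made up for illustration. The code fence is built with strrep() only so that the example can be displayed cleanly; in a real .Rmd file you would simply type three backticks.

```r
# Build a three-backtick fence programmatically.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                   # YAML header: document metadata
  'title: "Example report"',
  "output: html_document",
  "---",
  "",
  "Explanatory text is written in Markdown, like this sentence.",
  "",
  paste0(fence, "{r cars-summary}"),       # start of an R code chunk
  "summary(cars)",                         # code run on every knit
  fence                                    # end of the chunk
)

# Write the document to a temporary location and display it.
rmd_file <- file.path(tempdir(), "example-report.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```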
Creating an R Markdown file
To create an R Markdown file, select the new file icon and then R Markdown.
A box will appear prompting for an author, title, and output format. Give your document an initial title and select the output that you want. Note: this information can be modified at any time.
Select OK. A new R Markdown document should have been created.
Now you can begin generating a report. Thankfully, the document you just created includes some information to get you started, including some initial code chunks. I am NOT going to provide more detail on report generation here, as extensive documentation is only a Google search away. See the Resources section of this document for help. When you are ready to generate the output, whether HTML, Word, or PDF, simply select the "Knit" button at the top of the page.
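Knitting can also be done from the R console with rmarkdown::render(), which is useful in scripts. The sketch below is one possible wrapper, not a required workflow. The helper names output_format_for and knit_report and the file name my-report.Rmd are hypothetical; html_document, word_document, and pdf_document are output formats provided by the rmarkdown package.

```r
# Map the short format names used in the text to rmarkdown output formats.
output_format_for <- function(format = c("html", "doc", "pdf")) {
  format <- match.arg(format)
  switch(format,
    html = "html_document",
    doc  = "word_document",
    pdf  = "pdf_document"
  )
}

# Render a report without the Knit button (hypothetical helper).
knit_report <- function(input, format = "html") {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is required.")
  }
  rmarkdown::render(input, output_format = output_format_for(format))
}

# Usage (not run): knit_report("my-report.Rmd", format = "pdf")
```

PDF output additionally requires a LaTeX installation on the machine doing the rendering.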
What is Quarto?
You may have noticed the option for Quarto documents and presentations in new versions of RStudio. "Quarto® is an open-source scientific and technical publishing system built on Pandoc" (quarto.org). It is more flexible than R Markdown, especially in its integration of different programming languages and report output formats.
Quarto is worth learning and will likely replace R Markdown in the future.
Exporting files from RStudio
Remember, because we are using RStudio Server through DNAnexus, any files you create today will be erased at the end of the session.
To use the materials you generated in RStudio Server on DNAnexus on your local computer, you will need to export your files using More and Export in the Files pane.
Acknowledgements
Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org and The Bioconductor Project: Introduction to Bioconductor from the Carpentries Incubator.
Resources
- R Markdown documentation from RStudio
- Other helpful resources, including comprehensive guides and cheatsheets, can be accessed from here.