Introduction to R and RStudio IDE
Objectives
To understand:
1. the difference between R and the RStudio IDE.
2. how to work within the RStudio environment including:
- creating an R project and R script
- navigating between directories
- using functions
- obtaining help
3. how R can enhance data analysis reproducibility
By the end of this section, you should be able to easily navigate and explore your RStudio environment.
What is R?
R is both a computational language and an environment for statistical computing and graphics. It is open-source and widely used by scientists, not just bioinformaticians. Base packages of R are built into your initial installation, but R functionality is greatly extended by installing other packages. R as a programming language is based on the S language, developed at Bell Laboratories. R is maintained by a network of collaborators from around the world, and the core contributors are known as the R Core Team (the term used for citations). However, R is also a resource for and by scientists, and R functionality makes it easy to develop and share packages on any topic. Check out more about R on The R Project for Statistical Computing website.
Why R?
R is a particularly great resource for statistical analyses, plotting, and report generation. The fact that it is widely used means that users do not need to reinvent the wheel. There is a package available for most types of analyses, and if users need help, it is only a Google search away. At the time of writing, CRAN houses 18,944 available packages. There are also many field-specific packages, including those useful in the -omics (genomics, transcriptomics, metabolomics, etc.). For example, the latest version of Bioconductor (v 3.16) includes 2,183 software packages, 416 experiment data packages, 909 annotation packages, 28 workflows, and 8 books.
Where do we get R packages?
To take full advantage of R, you need to install R packages. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages("packageName"). GitHub is another common source of R packages; however, these packages do not necessarily meet CRAN standards, so approach them with caution. To install a GitHub package, use library(devtools) followed by install_github(). Many genomics and other packages useful to biologists / molecular biologists can be found on Bioconductor - more on this later.
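As a rough sketch of the three installation routes mentioned above, the snippet below wraps the install calls in a function so that nothing is downloaded until you call it yourself. The helper names is_installed() and install_examples() are made up for this illustration, "owner/repository" is a placeholder rather than a real GitHub package, and Biostrings is just one example of a Bioconductor package.

```r
# TRUE when a package is already installed (this does not attach it).
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: call install_examples() only when you want to install.
install_examples <- function() {
  # CRAN: install devtools and BiocManager only if they are missing
  if (!is_installed("devtools")) install.packages("devtools")
  if (!is_installed("BiocManager")) install.packages("BiocManager")

  # GitHub: replace the placeholder with a real "owner/repository" string
  devtools::install_github("owner/repository")

  # Bioconductor packages are installed through BiocManager
  BiocManager::install("Biostrings")
}

is_installed("stats")  # the stats package ships with R
```

Checking with requireNamespace() first avoids reinstalling a package that is already present.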
METACRAN is a useful database that allows you to search and browse CRAN/R packages.
Ways to run R
R can be used via command line interactively, command line using a script, or interactively through an environment. This course will demonstrate the utility of the RStudio integrated development environment (IDE).
What is RStudio?
RStudio is an integrated development environment for R, and now Python. RStudio includes a console, editor, and tools for plotting, history, debugging, and workspace management. It provides a graphical user interface for working with R, thereby making R more user-friendly. RStudio is open-source and can be installed locally or used through a browser (RStudio Server). We will be showcasing RStudio Server, but we highly encourage new users to install R and RStudio locally on their PC or Mac.
Note: RStudio the company is now Posit.
**Installing R and RStudio**
Detailed instructions for installing R and RStudio can be found [here](https://btep.ccr.cancer.gov/docs/rtools/).
Getting Started with R and RStudio
This tutorial closely follows the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org.
Creating an R project
Because we are working on DNAnexus, and our files will not remain at the end of each class, we aren't going to use an R project for all lessons. However, it is worth creating an R project and discussing the benefits here.
Creating an R project for each project you are working on facilitates organization and scientific reproducibility.
An RStudio project allows you to more easily:
- Save data, files, variables, packages, etc. related to a specific analysis project
- Restart work where you left off
- Collaborate, especially if you are using version control such as git. ---datacarpentry.org
R projects simplify data reproducibility by allowing us to use relative file paths that will translate well when sharing the project.
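To make the idea of relative paths concrete, here is a small sketch. The data/counts.csv file is hypothetical; the point is only that the relative path is identical on every machine, while the absolute path depends on where the project folder lives.

```r
# A path relative to the project (working) directory; the file is hypothetical
relative_path <- file.path("data", "counts.csv")
relative_path   # "data/counts.csv"

# The absolute path depends on the machine and on where the project was copied
absolute_path <- file.path(getwd(), relative_path)
absolute_path

# A script that refers to relative_path keeps working when the whole project
# folder is shared; a script that hard-codes absolute_path would not.
```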
To start a new R project, select File > New Project... or use the R project button (see image below).
A New Project wizard will appear. Click New Directory and New Project. Choose a new directory name, perhaps "LearningR". To make your project more reproducible, consider clicking the option box for renv. The R project file ends in .Rproj.
One of the most wonderful and also frustrating aspects of working with R is managing packages. Unfortunately it is very common that you may run into versions of R and/or R packages that are not compatible. This may make it difficult for someone to run your R script using their version of R or a given R package, and/or make it more difficult to run their scripts on your machine. renv is an RStudio add-on that will associate your packages and project so that your work is more portable and reproducible. To turn on renv click on the Tools menu and select Project Options. Under Environments check off “Use renv with this project” and follow any installation instructions. ---datacarpentry.org
Read more about renv here.
Creating an R script
As we learn more about R and start learning our first commands, we will keep a record of our commands using an R script. Remember, good annotation is key to reproducible data analysis. An R script can also be run on its own without user interaction, from the R console using source() and from the Linux command line using Rscript.
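As a minimal sketch of running a script without typing its commands one by one, the example below writes a two-line script to a temporary folder and runs it with source(). The script contents and the object name total are invented for the illustration.

```r
# Write a two-line script to a temporary location (a stand-in for a real script)
script_path <- file.path(tempdir(), "LearningR_intro.R")
writeLines(c("# add two numbers", "total <- 1 + 1"), script_path)

# source() runs every line of the script in the current R session
source(script_path)
total

# From a terminal (not the R console), the same script could be run with:
#   Rscript LearningR_intro.R
```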
To create an R script, click File > New File > R Script. You can save your script by clicking on the floppy disk icon. You can name your script whatever you want, perhaps "LearningR_intro". R scripts end in .R. Save your R script to your working directory, which will be the default location on RStudio Server.
Introduction to the RStudio layout
Let's look a bit into our RStudio layout. (demonstrate minimize / maximize utility)
Source: This pane is where you will write/view R scripts. Some outputs (such as if you view a dataset using View()) will appear as a tab here.
Console/Terminal/Jobs: This is actually where you see the execution of commands. This is the same display you would see if you were using R at the command line without RStudio. You can work interactively (i.e. enter R commands here), but for the most part we will run a script (or lines in a script) in the source pane and watch their execution and output here. The “Terminal” tab gives you access to the BASH terminal (the Linux operating system, unrelated to R). RStudio also allows you to run jobs (analyses) in the background. This is useful if some analysis will take a while to run. You can see the status of those jobs in the background.
Environment/History: Here, RStudio will show you what datasets and objects (variables) you have created and which are defined in memory. You can also see some properties of objects/datasets such as their type and dimensions. The “History” tab contains a history of the R commands you’ve executed in R.
Files/Plots/Packages/Help/Viewer: This multipurpose pane will show you the contents of directories on your computer. You can also use the “Files” tab to navigate and set the working directory. The “Plots” tab will show the output of any plots generated. In “Packages” you will see what packages are actively loaded, or you can attach installed packages. “Help” will display help files for R functions and packages. “Viewer” will allow you to view local web content (e.g. HTML outputs).
---datacarpentry.org
Note: you can already see our R project and R script file in our project directory under the Files tab. If you chose to use renv, you will also see some files and directories related to that.
Additional panes may show up depending on what you are doing in RStudio. For example, you may notice a Render tab in the Console/Terminal/Jobs pane when working with R Markdown files (.Rmd).
Also, you can change your RStudio layout. See this blog if you are interested. For simplicity, please do NOT change the layout during this course.
When to use Source vs Console?
We will use the Source pane to keep a record of the code that we run. However, at times, we may want to do quick testing without keeping a record. This is the scenario in which you would use the Console.
Uploading and exporting files from RStudio Server
RStudio Server works via a web browser, and so you see an additional Upload option in the Files pane. If you select this option, you can upload files from your local computer into the server environment. If you select More, you will also see an Export option. You can use this to export the files created in the RStudio environment.
Data Management
Data organization is extremely important to reproducible science. Consider organizing your project directory in a way that facilitates reproducibility. For example, you may want directories for data, drafts_documents, outputs, and scripts. See additional details in this lesson from Data Carpentry. How you organize project directories is up to you, but consistency is important for reproducibility. We will discuss more on this subject when introducing data frames.
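One way to set up such a layout from R is sketched below. It builds the example directories inside R's temporary folder so that nothing in a real project is touched; in practice you would replace project_dir with your own project directory.

```r
# Build the example layout inside a temporary folder
project_dir <- file.path(tempdir(), "LearningR")
sub_dirs <- c("data", "drafts_documents", "outputs", "scripts")

for (d in sub_dirs) {
  # recursive = TRUE also creates project_dir itself if it does not exist yet
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Confirm that the directories were created
list.files(project_dir)
```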
Saving your R environment (.RData)
When exiting RStudio, you will be prompted to save your R workspace or .RData. The .RData file saves the objects generated in your R environment. You can also save the .RData at any time using the floppy disk icon just below the Environment tab. You may also save your R workspace from the console using save.image(). .RData files are often not visible in a directory because their names begin with a dot. You can see them using ls -a from the terminal. An .RData file within a working directory associated with a given project will load automatically under the default option Restore .RData into workspace at startup. You may also load an .RData file using load().
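The save-and-restore cycle can be sketched as follows. To keep the example self-contained it saves a single, made-up object to a temporary file with save(); save.image() works the same way but writes every object in your workspace.

```r
# Save to a temporary file rather than the project's .RData
workspace_file <- file.path(tempdir(), "example.RData")

my_numbers <- c(1, 2, 3)
save(my_numbers, file = workspace_file)   # save one named object
# save.image(file = workspace_file)       # would save the whole workspace instead

rm(my_numbers)                            # remove the object from memory
exists("my_numbers")                      # check whether it is still defined

load(workspace_file)                      # restore the saved object
my_numbers
```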
If you are working with significantly large datasets, you may not want to automatically save and restore .RData. To turn this off, go to Tools -> Global Options -> deselect "Restore .RData into workspace at startup" and choose "Never" for "Save workspace to .RData on exit".
Navigating directories
Now we are ready to work with some of our first R commands. We are going to run commands directly from our R script rather than typing into the R console.
Our first command will be getwd(). This simply prints your working directory and is the R equivalent of pwd (if you are familiar with Unix commands).
#print our working directory
getwd()
To run the command, click the Run button above the script. This will run highlighted or selected code. You may also use the Source button to run your entire script. My preferred method is to use keyboard shortcuts. Move your cursor to the code of interest and use command + return on Macs or control + enter on PCs. If a command is taking a long time to run and you need to cancel it, use control + c from the command line or escape in RStudio. Once you run the command, you will see the command print to the console in blue followed by the output:
[1] "/home/rstudio/LearningR"
Text following # is a comment and is not run by R.
We set our working directory when we created our R project, but if for some reason we needed to change it, we could do so with setwd(). There is no need to run it now. However, if you were to run it, you would use the following notation:
setwd("/home/rstudio/Rlearning")
The path should be in quotes. You can use tab completion to fill in the path.
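If you would like to try setwd() without losing your place, the sketch below moves to R's temporary directory and then returns to where it started. The variable name old_wd is just an illustration.

```r
old_wd <- getwd()      # remember where we started

setwd(tempdir())       # move to R's temporary directory
getwd()                # confirm the change

setwd(old_wd)          # go back to the original working directory
getwd()
```

Storing the original location first means the rest of your script still finds its files afterwards.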
What is a path?
According to Wikipedia, a path is "a string of characters used to uniquely identify a location in a directory structure."
Therefore, a file path simply tells us where a file or files are located. You will need to direct R to the location of files that you want to work with or output that you create.
The working directory is the location in your file system that you are currently working in. In other words, it is the default location that R will look for input files and write output files.
Note: R accepts Unix-style paths with forward slashes (/), so regardless of whether you have a Windows computer or a Mac, you can enter directory information the same way. You can use tab completion to help you fill in directory information.
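A few base R functions make it easier to build and inspect paths. The sketch below uses a hypothetical output file; file.path() joins the pieces with forward slashes on every operating system.

```r
# A hypothetical output file inside a project
plot_path <- file.path("outputs", "figures", "volcano_plot.png")
plot_path

basename(plot_path)     # file name only: "volcano_plot.png"
dirname(plot_path)      # containing directory: "outputs/figures"
file.exists(plot_path)  # TRUE only if the file is really there
```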
Using functions
A function in R (or any computing language) is a short program that takes some input and returns some output.
An R function has three key properties:
- Functions have a name (e.g. dir, getwd); note that functions are case sensitive!
- Following the name, functions have a pair of ()
- Inside the parentheses, a function may take 0 or more arguments --- datacarpentry.org
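The three properties listed above can be checked quickly in the console. This is only an illustrative sketch; the functions were chosen because they ship with base R.

```r
# Names are case sensitive: getwd exists, Getwd does not
exists("getwd")
exists("Getwd")

# Functions are followed by () and take zero or more arguments
Sys.Date()                           # no arguments
nchar("RStudio")                     # one argument, returns 7
paste("Learning", "R", sep = "_")    # several arguments, returns "Learning_R"
```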
We have already used some R functions (e.g. getwd() and setwd())! Let's look at another example using the round() function. round() "rounds the values in its first argument to the specified number of decimal places (default 0)" --- R help.
Consider
round(5.65) #can provide a single number
## [1] 6
round(c(5.65,7.68,8.23)) #can provide a vector
## [1] 6 8 8
round(5.65,digits=1) #provide an additional argument rounding to the tenths place
## [1] 5.7
At times a function may be masked by another function. This can happen if two functions are named the same (e.g., dplyr::filter() vs stats::filter()). We can get around this by explicitly calling a function from the correct package using the following syntax: package::function().
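Masking can be reproduced without installing anything. In the sketch below we deliberately define our own median(), which hides the one from the stats package, and then reach the original with the package::function() syntax. The replacement function is invented purely for the demonstration.

```r
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
median(values)               # finds stats::median

# Define our own median(); it now masks the one from the stats package
median <- function(x) "masked!"
median(values)               # our version runs instead

# The package::function() syntax still reaches the original
stats::median(values)

rm(median)                   # remove our version so median() behaves normally again
```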
Getting help
Now we know a bit about using functions, but what if we had no idea what the function round() was used for or what arguments to provide?
Getting help in R is fairly easy. In the pane to the bottom right, you should see a Help tab. You can search for help regarding a specific topic using the search field (look for the magnifying glass).
Alternatively, you can search directly for help in the console using ?round() or ??round(). help.search() or ?? can be used to search for a function using a keyword and will also work for unloaded packages; for example, you may try help.search("anova").
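The help commands above can be collected into a short sketch. The help calls are wrapped in interactive() so that pages only open when you are working in a live session. apropos() is not covered in this lesson; it is included as one more base R way to find functions by name.

```r
if (interactive()) {
  ?round                 # help page for a function you can name
  help("round")          # the same request written as a function call
  ??anova                # keyword search across installed packages
  help.search("anova")   # the same search written as a function call
}

# apropos() lists objects whose names match a pattern
round_matches <- apropos("^round")
round_matches
```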
R help pages provide a lot of information. The description and argument sections are likely where you will want to start. If you are still unsure how to use the function, scroll down and check out the examples section of the documentation. Consider testing some of the examples yourself and applying to your own data.
Many R packages also include more detailed help documentation known as a vignette. To see a package vignette, use browseVignettes() (e.g., browseVignettes(package="dplyr")).
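If you prefer to stay in the console, vignette() can list what is available for a package. The sketch below uses grid because it ships with R; depending on how R was installed, the list may be empty.

```r
# List the vignettes that come with an installed package
grid_vignettes <- vignette(package = "grid")
grid_vignettes$results[, c("Item", "Title")]

if (interactive()) {
  browseVignettes(package = "grid")  # opens the same list in a web browser
}
```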
To see a function's arguments, you can use args().
args(round)
## function (x, digits = 0)
## NULL
round() takes two arguments, x, which is the number to be rounded, and a digits argument. The = sign indicates that a default (in this case 0) is already set. Since x is not set, round() requires we provide it, in contrast to digits where R will use the default value 0 unless you explicitly provide a different value. --- datacarpentry.org
R arguments are also positional, so instead of including digits=1 in our above use of round(), we could instead do the following:
round(5.65, 1)
## [1] 5.7
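Defaults, positional matching, and named matching can be seen side by side with a small function of our own. scale_value() is a made-up function written only for this sketch; it is not part of R or any package.

```r
# x is required; factor has a default of 10
scale_value <- function(x, factor = 10) {
  x * factor
}

scale_value(2)                  # uses the default factor, returns 20
scale_value(2, 3)               # positional matching: x = 2, factor = 3, returns 6
scale_value(factor = 3, x = 2)  # named arguments can be given in any order, returns 6
```

Naming arguments is usually easier to read than relying on position, especially when a function takes many arguments.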
Additional sources for help
Try googling your problem or using some other search engine. rseek is an R-specific search engine that searches several R-related sites. If using Google directly, make sure you include R in your search terms.
Stack Overflow is a particularly great resource for finding help. If you post a question, you will need to make a reproducible example (reprex) and be as descriptive as possible regarding the problem. For this purpose, you may find the reprex package particularly useful.
To provide details about your R session, use
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3
## [5] evaluate_0.15 stringi_1.7.6 rlang_1.0.2 cli_3.3.0
## [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.14
## [13] tools_4.1.2 stringr_1.4.0 xfun_0.36 yaml_2.3.5
## [17] fastmap_1.1.0 compiler_4.1.2 htmltools_0.5.2 knitr_1.41
## [21] sass_0.4.1
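When only part of this information is needed, for example in a help request, individual pieces can be pulled out directly. This is a small sketch using base R; the output will differ from machine to machine.

```r
info <- sessionInfo()

info$R.version$version.string   # the R version line
names(info$otherPkgs)           # packages attached with library(), if any
R.version.string                # shortcut for the version line
packageVersion("stats")         # version of a single installed package
```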
Test your learning
- Which of the following functions is used to print your working directory in R?
a. pwd
b. Setwd()
c. getwd()
d. wkdir()
Solution
C
- Which of the following can be used to learn more regarding an R function?
a. ?function()
b. ??function()
c. args(function)
d. All of the above
Solution
D
Acknowledgments
Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Material was also inspired by content from Introduction to data analysis with R and Bioconductor, which is part of the Carpentries Incubator.