Introduction to R and RStudio

Introduction to R and RStudio IDE

Objectives
To understand:
1. the difference between R and the RStudio IDE.
2. how to work within the RStudio environment, including:
   a. creating an R project and R script
   b. navigating between directories
   c. using functions
   d. obtaining help
3. how R can enhance data analysis reproducibility.

By the end of this section, you should be able to easily navigate and explore your RStudio environment.

What is R?

R is both a programming language and an environment for statistical computing and graphics. It is open-source and widely used by scientists, not just bioinformaticians. Base packages are built into your initial installation, but R's functionality is greatly extended by installing other packages. As a programming language, R is based on the S language, which was developed at Bell Laboratories. R is maintained by a network of collaborators from around the world; the core contributors are known as the R Core Team (the name used when citing R). R is also a resource built for and by scientists, and it is easy to develop and share packages on any topic. Learn more about R on The R Project for Statistical Computing website.

Why R?

R is a particularly great resource for statistical analyses, plotting, and report generation. Because it is widely used, users rarely need to reinvent the wheel: there is a package available for most types of analyses, and if users need help, it is only a Google search away. As of this writing, CRAN houses 18,944 available packages. There are also many field-specific packages, including those useful in the -omics (genomics, transcriptomics, metabolomics, etc.). For example, Bioconductor v 3.16, the latest version as of this writing, includes 2,183 software packages, 416 experiment data packages, 909 annotation packages, 28 workflows, and 8 books.

Where do we get R packages?

To take full advantage of R, you need to install R packages. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized, shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages("packageName"). GitHub is another common source of R packages; however, these packages do not necessarily meet CRAN standards, so approach them with caution. To install a GitHub package, load devtools with library(devtools) and then call install_github() with the repository name. Many genomics and other packages useful to biologists / molecular biologists can be found on Bioconductor; more on this later.
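As a minimal sketch of the installation commands described above, the snippet below wraps install.packages() in a small helper that only installs a package when it is missing. The helper name install_if_missing() is hypothetical (it is not part of R), and "owner/repository" in the commented GitHub call is a placeholder, not a real repository.

```r
# Hypothetical helper (not part of base R): install a CRAN package only if it
# is not already available in the current library.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation, so nothing is downloaded here.
stats_available <- install_if_missing("stats")

# A package hosted on GitHub could be installed like this (placeholder name):
# library(devtools)
# install_github("owner/repository")
```

Checking with requireNamespace() first avoids reinstalling a package every time the script is run.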

METACRAN is a useful database that allows you to search and browse CRAN/R packages.

Ways to run R

R can be used via command line interactively, command line using a script, or interactively through an environment. This course will demonstrate the utility of the RStudio integrated development environment (IDE).

What is RStudio?

RStudio is an integrated development environment for R, and now Python. RStudio includes a console, editor, and tools for plotting, history, debugging, and workspace management. It provides a graphical user interface for working with R, thereby making R more user friendly. RStudio is open-source and can be installed locally or used through a browser (RStudio Server). We will be showcasing RStudio Server, but we highly encourage new users to install R and RStudio locally on their PC or Mac.

Note: RStudio the company is now Posit.

**Installing R and RStudio**

Detailed instructions for installing R and RStudio can be found [here](https://btep.ccr.cancer.gov/docs/rtools/).

Getting Started with R and RStudio

This tutorial closely follows the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org.

Creating an R project

Because we are working on DNAnexus, and our files will not remain at the end of each class, we aren't going to use an R project for all lessons. However, it is worth creating an R project and discussing the benefits here.

Creating an R project for each project you are working on facilitates organization and scientific reproducibility.

An RStudio project allows you to more easily:

Save data, files, variables, packages, etc. related to a specific analysis project

Restart work where you left off

Collaborate, especially if you are using version control such as git. ---datacarpentry.org

R projects simplify data reproducibility by allowing us to use relative file paths that will translate well when sharing the project.

To start a new R project, select File > New Project... or use the R project button (See image below)

A New Project wizard will appear. Click New Directory and then New Project. Choose a new directory name, perhaps "LearningR". To make your project more reproducible, consider checking the option box for renv. The R project file ends in .Rproj.

One of the most wonderful and also frustrating aspects of working with R is managing packages. Unfortunately it is very common that you may run into versions of R and/or R packages that are not compatible. This may make it difficult for someone to run your R script using their version of R or a given R package, and/or make it more difficult to run their scripts on your machine. renv is an RStudio add-on that will associate your packages and project so that your work is more portable and reproducible. To turn on renv click on the Tools menu and select Project Options. Under Environments check off “Use renv with this project” and follow any installation instructions. ---datacarpentry.org

Creating an R script

As we learn more about R and start learning our first commands, we will keep a record of our commands using an R script. Remember, good annotation is key to reproducible data analysis. An R script can also be run on its own without user interaction, either from the R console using source() or from the Linux command line using Rscript.
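To illustrate running a script without user interaction, here is a small sketch that writes a two-line script to a temporary location and runs it with source(). The file name and the object name answer are made up for the example.

```r
# Write a tiny annotated script to a temporary file (illustrative file name)
script_path <- file.path(tempdir(), "LearningR_intro.R")
writeLines(
  c("# round a number to the nearest integer",
    "answer <- round(5.65)"),
  script_path
)

# Run every line of the script from the R console
source(script_path)
answer
## [1] 6

# From a Linux terminal, the same file could be run with:
#   Rscript LearningR_intro.R
```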

To create an R script, click File > New File > R Script. You can save your script by clicking on the floppy disk icon. You can name your script whatever you want, perhaps "LearningR_intro". R scripts end in .R. Save your R script to your working directory, which will be the default location on RStudio Server.

Introduction to the RStudio layout

Let's look a bit into our RStudio layout. (demonstrate minimize / maximize utility)

Source: This pane is where you will write/view R scripts. Some outputs (such as if you view a dataset using View()) will appear as a tab here.
Console/Terminal/Jobs: This is actually where you see the execution of commands. This is the same display you would see if you were using R at the command line without RStudio. You can work interactively (i.e. enter R commands here), but for the most part we will run a script (or lines in a script) in the source pane and watch their execution and output here. The “Terminal” tab gives you access to the Bash terminal (the Linux operating system, unrelated to R). RStudio also allows you to run jobs (analyses) in the background. This is useful if some analysis will take a while to run. You can see the status of those jobs in the background.
Environment/History: Here, RStudio will show you what datasets and objects (variables) you have created and which are defined in memory. You can also see some properties of objects/datasets such as their type and dimensions. The “History” tab contains a history of the R commands you’ve executed in R.
Files/Plots/Packages/Help/Viewer: This multipurpose pane will show you the contents of directories on your computer. You can also use the “Files” tab to navigate and set the working directory. The “Plots” tab will show the output of any plots generated. In “Packages” you will see what packages are actively loaded, or you can attach installed packages. “Help” will display help files for R functions and packages. “Viewer” will allow you to view local web content (e.g. HTML outputs).
---datacarpentry.org

Note: you can already see our R project and R script file in our project directory under the Files tab. If you chose to use renv you will also see some files and directories related to that.

Additional panes may show up depending on what you are doing in RStudio. For example, you may notice a Render tab in the Console/Terminal/Jobs pane when working with R Markdown files (.Rmd).

Also, you can change your RStudio layout. See this blog if you are interested. For simplicity, please do NOT change the layout during this course.

When to use `Source` vs `Console`?

We will use the Source pane to keep a record of the code that we run. However, at times, we may want to do quick testing without keeping a record. This is the scenario in which you would use the Console.

Uploading and exporting files from RStudio Server

RStudio Server works via a web browser, and so you see this additional Upload option in the Files pane. If you select this option, you can upload files from your local computer into the server environment. If you select More, you will also see an Export option. You can use this to export the files created in the RStudio environment.

Upload/Export

Data Management

Data organization is extremely important to reproducible science. Consider organizing your project directory in a way that facilitates reproducibility. For example, you may want directories for data, drafts_documents, outputs, and scripts. See additional details in this lesson from Data Carpentry. How you organize project directories is up to you, but consistency is important for reproducibility. We will discuss more on this subject when introducing data frames.
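The directory layout suggested above can also be created from R. The sketch below builds it inside a temporary folder so that it does not touch your real project; in practice you would likely use paths relative to your project directory instead of tempdir().

```r
# Build the example layout inside a temporary folder (stand-in for a project)
project_dir <- file.path(tempdir(), "LearningR")
subdirs <- c("data", "drafts_documents", "outputs", "scripts")

for (d in subdirs) {
  # recursive = TRUE also creates project_dir itself if it does not exist yet
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```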

Saving your R environment (.RData)

When exiting RStudio, you will be prompted to save your R workspace, or .RData. The .RData file saves the objects generated in your R environment. You can also save the .RData at any time using the floppy disk icon just below the Environment tab, or from the console using save.image(). Because the file name begins with a dot, .RData files are hidden in most directory listings; you can see them using ls -a from the terminal. An .RData file in the working directory of a given project is loaded automatically at startup under the default option Restore .RData into workspace at startup. You may also load an .RData file manually using load().

If you are working with very large datasets, you may not want to automatically save and restore .RData. To turn this off, go to Tools -> Global Options, deselect "Restore .RData into workspace at startup", and choose "Never" for "Save workspace to .RData on exit".
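As a rough sketch of saving and restoring objects by hand, the example below uses save() and load() on a single object. save() stores only the objects you name, while save.image() works the same way but stores the whole workspace. The object name and its values are made up for illustration.

```r
# Illustrative object with made-up values
gene_counts <- c(geneA = 10, geneB = 25)

# Save just this object to a temporary .RData file
rdata_path <- file.path(tempdir(), "example.RData")
save(gene_counts, file = rdata_path)

# Remove the object, then restore it from the file
rm(gene_counts)
load(rdata_path)
gene_counts
## geneA geneB 
##    10    25
```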

Navigating directories

Now we are ready to work with some of our first R commands. We are going to run commands directly from our R script rather than typing into the R console.

Our first command will be getwd(). This simply prints your working directory and is the R equivalent of pwd (if you are familiar with the Unix command line).

#print our working directory
getwd()

To run this command, we have a number of options. First, you can use the Run button at the top of the Source pane, which runs the current line or the selected code. You may also use the Source button to run your entire script. My preferred method is to use keyboard shortcuts. Move your cursor to the code of interest and use command + return on a Mac or control + enter on a PC. If a command is taking a long time to run and you need to cancel it, use control + c from the command line or escape in RStudio. Once you run the command, you will see the command print to the console in blue followed by the output.

[1] "/home/rstudio/LearningR"

It is good practice to annotate your code using a comment. We can denote comments with #.

We set our working directory when we created our R project, but if for some reason we needed to change it, we could do this with setwd(). There is no need to run it now. However, if you were to run it, you would use the following notation:

setwd("/home/rstudio/LearningR")

The path should be in quotes. You can use tab completion to fill in the path.
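A small sketch of changing the working directory and then putting it back: setwd() invisibly returns the previous working directory, so it can be stored and restored. A temporary folder is used here only so the example runs anywhere.

```r
# setwd() returns the previous working directory, so keep it for later
old_dir <- setwd(tempdir())

# We are now working inside the temporary folder
temp_wd <- getwd()

# Switch back to where we started
setwd(old_dir)
getwd()
```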

What is a path?

According to Wikipedia, a path is "a string of characters used to uniquely identify a location in a directory structure."

Therefore, a file path simply tells us where a file or files are located. You will need to direct R to the location of files that you want to work with or output that you create.

The working directory is the location in your file system that you are currently working in. In other words, it is the default location that R will look for input files and write output files.

Note: R accepts Unix-style paths, with forward slashes (/) separating directories, on every operating system. So regardless of whether you have a Windows computer or a Mac, the way you enter the directory information will be the same. You can use tab completion to help you fill in directory information.
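To make the idea of a path concrete, the sketch below builds a relative path with file.path() and pulls it apart again. The directory and file names are hypothetical and no file needs to exist for these functions to work.

```r
# Build a relative path from its parts (hypothetical directory and file names)
rel_path <- file.path("data", "counts.csv")
rel_path
## [1] "data/counts.csv"

# Split the path back into its directory and file name
dirname(rel_path)
## [1] "data"
basename(rel_path)
## [1] "counts.csv"
```

Because the path is relative, R would look for it starting from the working directory.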

Using functions

A function in R (or any computing language) is a short program that takes some input and returns some output.

An R function has three key properties:

Functions have a name (e.g. dir, getwd); note that functions are case sensitive!

Following the name, functions have a pair of ()

Inside the parentheses, a function may take 0 or more arguments --- datacarpentry.org
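The three properties listed above can be seen in a few short calls. This is only a sketch using base R functions; the variable names are made up.

```r
# A function called with zero arguments
current_dir <- getwd()

# A function called with one argument
temp_files <- dir(tempdir())

# Function names are case sensitive: getwd exists, Getwd does not
lower_exists <- exists("getwd")
upper_exists <- exists("Getwd")
lower_exists
## [1] TRUE
upper_exists
## [1] FALSE
```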

We have already used some R functions (e.g. getwd() and setwd())! Let's look at another example using the round() function. round() "rounds the values in its first argument to the specified number of decimal places (default 0)" --- R help.

Consider

round(5.65) #can provide a single number

## [1] 6

round(c(5.65,7.68,8.23)) #can provide a vector

## [1] 6 8 8

In this example, we provided only the required argument, which can be any numeric or complex vector. The contextual prompt that appears while typing shows that round() accepts two arguments (see the image below). The optional second argument, digits, indicates the number of decimal places to round to. Contextual help is generally provided as you type the name of a function. We will discuss other types of help in a moment.

Contextual help

round(5.65,digits=1) #provide an additional argument rounding to the tenths place

## [1] 5.7

At times a function may be masked by another function. This can happen when two loaded packages contain functions with the same name (e.g., dplyr::filter() vs stats::filter()); the most recently loaded package takes precedence. We can get around this by explicitly calling a function from the correct package using the following syntax: package::function().
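Masking can be reproduced without installing anything. In this sketch, a user-defined function named sd() (a deliberately unhelpful stand-in) masks the sd() function from the stats package, and the package::function() syntax is used to reach the original.

```r
# Define our own sd(), which masks stats::sd() for the rest of the session
sd <- function(x) "my custom sd"

masked_result <- sd(c(1, 2, 3))          # calls our masking function
explicit_result <- stats::sd(c(1, 2, 3)) # calls the version in the stats package

masked_result
## [1] "my custom sd"
explicit_result
## [1] 1

# Remove our version so that sd() refers to stats::sd() again
rm(sd)
```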

Getting help

Now we know a bit about using functions, but what if I had no idea what the function round() was used for or what arguments to provide?

Getting help in R is fairly easy. In the pane to the bottom right, you should see a Help tab. You can search for help regarding a specific topic using the search field (look for the magnifying glass).

Alternatively, you can open a function's help page directly from the console using ?round or help(round). To search the documentation by keyword, use ??round or help.search(); a keyword search covers all installed packages, including those that are not currently loaded. For example, you may try help.search("anova").
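The help operators also have function forms, which are convenient inside scripts. The sketch below stores the results rather than displaying them; in an interactive session, printing either object opens the help page or the list of matches. Limiting the search to the stats package is a choice made here only to keep the search quick.

```r
# Function form of ?round
round_help <- help("round")

# Function form of ??anova, restricted here to the stats package
anova_hits <- help.search("anova", package = "stats")

# Check what kind of objects were returned
class(round_help)
class(anova_hits)
```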

R help pages provide a lot of information. The description and argument sections are likely where you will want to start. If you are still unsure how to use the function, scroll down and check out the examples section of the documentation. Consider testing some of the examples yourself and applying to your own data.

Many R packages also include more detailed help documentation known as a vignette. To see a package vignette, use browseVignettes() (e.g., browseVignettes(package="dplyr")).

To see a function's arguments, you can use args().

args(round)

## function (x, digits = 0) 
## NULL

round() takes two arguments, x, which is the number to be rounded, and a digits argument. The = sign indicates that a default (in this case 0) is already set. Since x is not set, round() requires we provide it, in contrast to digits where R will use the default value 0 unless you explicitly provide a different value. --- datacarpentry.org

R arguments are also positional, so instead of including digits=1 in our above use of round(), we could instead do the following:

round(5.65, 1)

## [1] 5.7
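As a brief sketch of the difference between positional and named arguments: when arguments are named, their order no longer matters, whereas unnamed arguments are matched in the order given by args(). The number used here is arbitrary.

```r
# Positional: first value is x, second is digits
by_position <- round(3.14159, 2)

# Named: the order can be swapped without changing the result
by_name <- round(digits = 2, x = 3.14159)

by_position
## [1] 3.14
identical(by_position, by_name)
## [1] TRUE
```

Naming arguments makes code easier to read, especially for functions with many optional arguments.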

Additional sources for help

Try googling your problem or using some other search engine. rseek is an R-specific search engine that searches several R-related sites. If using Google directly, make sure you include R in your search terms.

Stack Overflow is a particularly great resource for finding help. If you post a question, you will need to make a reproducible example (reprex) and be as descriptive as possible regarding the problem. For this purpose, you may find the reprex package particularly useful.

To provide details about your R session, use

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3 
##  [5] evaluate_0.15   stringi_1.7.6   rlang_1.0.2     cli_3.3.0      
##  [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.14 
## [13] tools_4.1.2     stringr_1.4.0   xfun_0.36       yaml_2.3.5     
## [17] fastmap_1.1.0   compiler_4.1.2  htmltools_0.5.2 knitr_1.41     
## [21] sass_0.4.1
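When only part of the session information is needed, for example the R version to include in a question, individual pieces can be pulled out of the object that sessionInfo() returns. This is a minimal sketch; the values you see will depend on your own installation.

```r
# Store the session details instead of printing everything
info <- sessionInfo()

# R version string, for example "R version 4.1.2 (2021-11-01)"
r_version <- info$R.version$version.string
r_version

# Names of attached non-base packages (NULL if none are attached)
names(info$otherPkgs)

# Version of a single package
packageVersion("stats")
```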

Test your learning

Which of the following functions is used to print your working directory in R?
a. pwd
b. Setwd()
c. getwd()
d. wkdir()

Solution

C

Which of the following can be used to learn more regarding an R function?
a. ?function()
b. ??function()
c. args(function)
d. All of the above

Solution

D

Acknowledgments

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Material was also inspired by content from Introduction to data analysis with R and Bioconductor, which is part of the Carpentries Incubator.
