Lesson 1: Introduction to R and RStudio IDE

Learning Objectives

To understand:
1. the difference between R and the RStudio IDE.
2. how to work within the RStudio environment, including:

creating an R project and R script
navigating between directories
using functions
obtaining help

By the end of this section, you should be able to easily navigate and explore your RStudio environment.

What is R?

R is both a programming language and an environment for statistical computing and graphics. It is open-source and widely used by scientists and non-scientists alike, not just bioinformaticians. Base packages are built into your initial installation, but R's functionality is greatly extended by installing other packages. As a programming language, R is based on the S language, which was developed at Bell Laboratories. R is maintained by a network of collaborators from around the world; the core contributors are known as the R Core Team (the name used when citing R). R is also a resource for and by scientists, and it is easy to develop and share packages on any topic. Check out more about R on The R Project for Statistical Computing website.

Why R?

R is a particularly great resource for statistical analyses, plotting, and report generation. Because it is widely used, users do not need to reinvent the wheel: there is a package available for most types of analyses, and if users need help, it is only a Google search away. As of now, CRAN houses more than 22,000 available packages. There are also many field-specific packages, including those useful in the -omics (genomics, transcriptomics, metabolomics, etc.). For example, the latest version of Bioconductor (v 3.20) includes 2,289 software packages, 431 experiment data packages, 928 annotation packages, 30 workflows, and 5 books.

Where do we get R packages?

To take full advantage of R, you need to install R packages. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized, easy-to-share format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages("packageName"). GitHub is another common source of R packages; however, these packages do not necessarily meet CRAN standards, so approach them with caution. To install a package from GitHub, load devtools with library(devtools) and then call install_github(). Many genomics and other packages useful to biologists / molecular biologists can be found on Bioconductor. Bioconductor packages are installed with BiocManager; see here.
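The three installation routes described above can be sketched roughly as follows. This is only an illustration: the package names (ggplot2, tidyverse/dplyr, DESeq2) are example choices, and is_installed() is a hypothetical helper written for this sketch, not a function provided by R. The install calls are commented out so that nothing is downloaded until you decide to run them.

```r
# Hypothetical helper: TRUE if a package is already installed.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this returns TRUE.
is_installed("stats")

# CRAN: install a package by name (example package: ggplot2).
# install.packages("ggplot2")

# GitHub: install_github() takes "owner/repository"
# (example repository: tidyverse/dplyr).
# library(devtools)
# install_github("tidyverse/dplyr")

# Bioconductor: install BiocManager from CRAN first, then use it
# to install Bioconductor packages (example package: DESeq2).
# install.packages("BiocManager")
# BiocManager::install("DESeq2")
```

Checking with a helper like this before installing avoids re-downloading a package that is already present.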

METACRAN is a useful database that allows you to search and browse CRAN/R packages.

Ways to run R

R is a programming language, and it "comes with an environment or console that can read and execute your code". R can be run interactively from the command line, from the command line using a script, or interactively through a development environment. This course will demonstrate the utility of the RStudio integrated development environment (IDE).

What is RStudio?

RStudio is an integrated development environment for R, and now Python. RStudio includes a console, an editor, and tools for plotting, history, debugging, and workspace management. It provides a graphical user interface for working with R, thereby making R more user friendly. RStudio is open-source and can be installed locally or used through a browser (RStudio Server or Posit Cloud). We will be showcasing RStudio Server on Biowulf via HPC Open OnDemand, but we highly encourage new users to install R and RStudio locally on their PC or Mac.

What is Posit?

Posit is a company that creates and maintains a variety of software products (some free and others proprietary) including the RStudio IDE.

Installing R and RStudio

Macbook: Follow these instructions.
Windows: Request installation from service.cancer.gov.

Check out this blog for information related to updating R and RStudio.

There is also an RStudio User Guide.

Getting Started with R and RStudio

This tutorial closely follows the "Intro to R and RStudio for Genomics" lesson provided by datacarpentry.org.

Connect to RStudio on NIH HPC Open OnDemand

NIH HPC Open OnDemand provides an online dashboard for users to easily access command line interactive sessions, graphical Linux desktop environments, and interactive applications including RStudio, MATLAB, IGV, iDEP, VS Code, and Jupyter Notebook. To use NIH HPC Open OnDemand, you must have an NIH HPC account. If you are interested in bioinformatics, an NIH HPC account is highly recommended. These accounts are available for a nominal fee of $40 per month.

To connect to Open OnDemand make sure you are on the NIH Network and click on the following link: https://hpcondemand.nih.gov. This will take you to the HPC Open OnDemand dashboard.

From there you will need to:

1. Select RStudio Server from the selection of pinned applications.
2. Select parameters for your RStudio session, including the version of R you want to use. Alter any other job parameters as you see fit.
3. Click "Launch" to start the session. Your session will be queued, and it may take a few minutes to shift to "Running".
4. When the session switches to "Running", click "Connect to RStudio Server".

Congratulations! You are now connected.

RStudio Server on Biowulf

Using RStudio Server on Biowulf will allow you to (1) interact with your files on Biowulf, (2) use HPC resources (CPUs, RAM, etc.), and (3) interact with local files.

Creating an R project

If you intend to use R for upcoming analysis projects, you will want to create R projects. R projects automatically set your working directory to the directory specified for a given project. R projects are beneficial because they "keep all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory".

Creating an R project for each project you are working on facilitates organization and scientific reproducibility.

An RStudio project allows you to more easily:

Save data, files, variables, packages, etc. related to a specific analysis project

Restart work where you left off

Collaborate, especially if you are using version control such as git. --- datacarpentry.org

R projects simplify data reproducibility by allowing us to use relative file paths that will translate well when sharing the project.

To start a new R project, select File > New Project... or use the R project button (See image below).

R project

A New Project wizard will appear. Click New Directory and then New Project. Choose a new directory name, perhaps "Getting_Started_with_R".

Selecting the renv option in this wizard will make a project more reproducible (see below). We will not select renv today, but consider checking this option box for your own projects.

The R project file ends in .Rproj. "This file contains various project options and can also be used as a shortcut for opening the project directly from the filesystem."

Why renv?

R projects allow us to easily share data, code, and other related information, but this only scratches the surface of what is required for true data analysis reproducibility.

Too often an R script will fail simply due to a clash in package dependencies. Versions are important. R versions change over time; Bioconductor versions evolve, and R packages change. While we can include session info using the sessionInfo() function (more on functions later) at the end of a script or markdown file, this in no way facilitates our ability to truly replicate the infrastructure surrounding our code. Thankfully, there are R packages available that help us do just that.

"The renv package helps you create reproducible environments for your R projects", primarily by tracking and managing package dependencies.

Creating an R script

As we learn more about R and start learning our first commands, we will keep a record of our commands using an R script. Remember, good annotation is key to reproducible data analysis. An R script can also be run on its own without user interaction, either from the R console using source() or from the Linux command line using Rscript.
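To make the idea of running a script without interaction concrete, here is a minimal sketch using source(). The file name example_script.R and the object names are hypothetical, and the script is written to a temporary directory only so that the example is self-contained.

```r
# Write a small example script to a temporary location.
script_path <- file.path(tempdir(), "example_script.R")
writeLines(
  c("# add two numbers and store the result",
    "total <- 2 + 3"),
  script_path
)

# Run every line of the script. Objects it creates are placed in the
# environment passed to `local` (here, a new one called script_env).
script_env <- new.env()
source(script_path, local = script_env)

# The object created by the script is now available.
script_env$total
```

By default source() evaluates the script in your global environment; passing a separate environment, as here, is just one way to keep the example from touching objects you already have.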

To create an R script, click File > New File > R Script. You can save your script by clicking on the floppy disk icon. You can name your script whatever you want, perhaps "Lesson_1". R scripts end in .R. Save your R script to your working directory, which will be the default location on RStudio Server.

Introduction to the RStudio layout

Let's look a bit into our RStudio layout.

Source: This pane is where you will write/view R scripts. Some outputs (such as if you view a dataset using View()) will appear as a tab here.
Console/Terminal/Jobs: This is actually where you see the execution of commands. This is the same display you would see if you were using R at the command line without RStudio. You can work interactively (i.e. enter R commands here), but for the most part we will run a script (or lines in a script) in the source pane and watch their execution and output here. The “Terminal” tab gives you access to the BASH terminal (the Linux operating system, unrelated to R). RStudio also allows you to run jobs (analyses) in the background. This is useful if some analysis will take a while to run. You can see the status of those jobs in the background.
Environment/History: Here, RStudio will show you what datasets and objects (variables) you have created and which are defined in memory. You can also see some properties of objects/datasets such as their type and dimensions. The “History” tab contains a history of the R commands you’ve executed.
Files/Plots/Packages/Help/Viewer: This multi-purpose pane will show you the contents of directories on your computer. You can also use the “Files” tab to navigate and set the working directory. The “Plots” tab will show the output of any plots generated. In “Packages” you will see what packages are actively loaded, or you can attach installed packages. “Help” will display help files for R functions and packages. “Viewer” will allow you to view local web content (e.g. HTML outputs).
---datacarpentry.org

Look under the files tab

You can already see our R project and R script file in our project directory under the Files tab. If you chose to use renv you will also see some files and directories related to that.

Additional panes may show up depending on what you are doing in RStudio. For example, you may notice a Render tab in the Console/Terminal/Jobs pane when working with R Markdown (.Rmd) or Quarto (.qmd) files.

Also, you can change your RStudio layout. See this blog if you are interested. For simplicity, please do NOT change the layout during this course.

When to use Source vs Console?

We will use the Source pane to keep a record of the code that we run. However, at times, we may want to do quick testing without keeping a record. This is the scenario in which you would use the Console.

Uploading and exporting files from RStudio Server

RStudio Server works via a web browser, and so you see this additional Upload option in the Files pane. If you select this option, you can upload files from your local computer into the server environment. If you select More, you will also see an Export option. You can use this to export files to your local computer.

Upload/Export

Data Management

Data organization is extremely important to reproducible science. Consider organizing your project directory in a way that facilitates reproducibility. All inputs and outputs (where possible) should be contained within the project directory, and a consistent directory structure should be created. For example, you may want directories for data, docs, outputs, figures, and scripts. See additional details here. How you organize project directories is up to you, but consistency is fairly important for reproducibility. We will discuss more on this subject when introducing data frames.
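One possible way to set up such a directory structure from R is sketched below. The subdirectory names come from the example above; a temporary folder stands in for the project directory here, which is an assumption made only so the sketch can be run anywhere.

```r
# Use a temporary folder as a stand-in for the project directory.
project_dir <- file.path(tempdir(), "Getting_Started_with_R")
dir.create(project_dir, showWarnings = FALSE)

# Create one subdirectory per type of content.
sub_dirs <- c("data", "docs", "outputs", "figures", "scripts")
for (d in sub_dirs) {
  dir.create(file.path(project_dir, d), showWarnings = FALSE)
}

# List what was created.
list.files(project_dir)
```

Inside a real R project you would typically drop project_dir and create the subdirectories relative to the working directory instead.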

Use relative file paths

Do not use absolute file paths in scripts. An absolute path points to a location on one specific computer, so a script that relies on one will fail for other users.
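As a small illustration, a relative path can be built with file.path(). The directory and file names used here (data, counts.csv) are hypothetical.

```r
# Build a path relative to the working directory (the project root).
counts_path <- file.path("data", "counts.csv")
counts_path   # "data/counts.csv"

# A relative path can be expanded to see where it points on the
# current machine; the result differs from computer to computer,
# which is exactly why it should not be hard-coded in a script.
file.path(getwd(), counts_path)
```

Because the relative path is resolved against the working directory, the same script works for anyone who opens the R project, wherever the project folder lives on their system.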

Saving your R environment (.RData)

When exiting RStudio, you will be prompted to save your R workspace, or .RData. The .RData file saves the objects generated in your R environment. You can also save the .RData at any time using the floppy disk icon just below the Environment tab, or from the console using save.image(). Because their names begin with a dot, .RData files are hidden and are often not visible in a directory; you can see them using ls -a from the terminal. An .RData file in the working directory of a given project will be loaded automatically at startup under the default option "Restore .RData into workspace at startup". You may also load an .RData file manually using load().
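The save-and-restore cycle can be sketched as follows. Note one deliberate substitution: this sketch uses save(), which writes only the objects you name, rather than save.image(), which writes the entire workspace. The object x and the file name example.RData are hypothetical.

```r
# Create an example object.
x <- 1:5

# Save it to an .RData file (a temporary location is used here).
rdata_file <- file.path(tempdir(), "example.RData")
save(x, file = rdata_file)

# Remove the object, then restore it from the file.
rm(x)
load(rdata_file)
x
```

load() recreates the saved objects under their original names, so it can silently overwrite an existing object with the same name.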

Restoring your R environment

If you are working with very large datasets, you may not want to automatically save and restore .RData. To turn this off, go to Tools -> Global Options, deselect "Restore .RData into workspace at startup", and choose "Never" for "Save workspace to .RData on exit". It is usually recommended not to restore the .RData file at the beginning of a session.

Another file to be aware of is the .Rhistory file. The R history file contains a list of commands from your previous R sessions.

What is a function?

Now we are ready to work with some of our first R commands. In R, commands are generally called functions.

A function in R (or any computing language) is a short program that takes some input and returns some output.

An R function has three key properties:

Functions have a name (e.g. dir, getwd); note that functions are case sensitive!

Following the name, functions have a pair of ()

Inside the parentheses, a function may take 0 or more arguments --- datacarpentry.org.

There are thousands of available functions to use in R, and if there isn't a function available for a specific task, you can write your own. We will be using many more functions, so there will be many more opportunities to learn the syntax.
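The three properties listed above, and the idea of writing your own function, might look like this in practice. The function fahrenheit_to_celsius() is a hypothetical example written for this sketch.

```r
# A function called with zero arguments.
getwd()

# A function called with two arguments.
round(3.14159, digits = 2)

# Function names are case sensitive: getwd exists, GetWD does not.
exists("getwd")
exists("GetWD")

# Writing your own function: a name, parentheses holding the
# arguments, and a body that computes the returned value.
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}
fahrenheit_to_celsius(212)
```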

We are going to run commands directly from our R script rather than typing into the R console.

Our first function will be getwd(). This simply prints your working directory and is the R equivalent of the Unix command pwd.

#print our working directory
getwd()

To run this function, we have a number of options. First, you can use the Run button above. This will run highlighted or selected code. You may also use the source button to run your entire script. My preferred method is to use keyboard shortcuts. Move your cursor to the code of interest and use command + return for macs or control + enter for PCs. If a command is taking a long time to run and you need to cancel it, use control + c from the command line or escape in RStudio. Once you run the command, you will see the command print to the console in blue followed by the output.

[1] "/vf/users/emmonsal/Getting_Started_with_R"

It is good practice to annotate your code using a comment. We can denote comments with #.

We set our working directory when we created our R project, but if for some reason we needed to change it, we could do so with setwd(). There is no need to run it now. However, if you were to run it, you would use the following notation:

setwd("path_to_your_directory")

The path should be in quotes. You can use tab completion to fill in the path.
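If you ever do need to change the working directory inside a script, one cautious pattern is to remember the original location and switch back afterwards. This is only a sketch; tempdir() is used as a stand-in for a real directory.

```r
# Remember the current working directory.
original_dir <- getwd()

# Change to another directory (tempdir() is a stand-in here).
setwd(tempdir())
getwd()

# Change back so the rest of the script is unaffected.
setwd(original_dir)
getwd()
```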

What is a path?

According to Wikipedia, a path is "a string of characters used to uniquely identify a location in a directory structure."

Therefore, a file path simply tells us where a file or files are located. You will need to direct R to the location of files that you want to work with or output that you create.

The working directory is the location in your file system that you are currently working in. In other words, it is the default location that R will look for input files and write output files.

Note

R uses Unix formatting for directories, so regardless of whether you have a Windows computer or a mac, the way you enter the directory information will be the same. You can use tab completion to help you fill in directory information.
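A few base R functions for inspecting paths are sketched below. The path data/raw/counts.csv is hypothetical and does not need to exist for basename() and dirname() to work.

```r
# A hypothetical relative path, written with forward slashes.
example_path <- "data/raw/counts.csv"

basename(example_path)   # the file name: "counts.csv"
dirname(example_path)    # the containing directory: "data/raw"

# Expand "." (the working directory) to a full path, using forward
# slashes even on Windows.
normalizePath(".", winslash = "/")

# file.exists() checks whether a path points at something real.
file.exists(getwd())
```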

Getting help

Now we know a bit about using functions, but what if I had no idea what the function setwd() was used for or what arguments to provide?

Getting help in R is fairly easy. In the pane to the bottom right, you should see a Help tab. You can search for help regarding a specific topic using the search field (look for the magnifying glass).

Help

Alternatively, you can look up a function directly from the console using ?setwd (equivalent to help(setwd)). To search by keyword instead, use ??setwd or help.search("setwd"); this searches the help pages of all installed packages, including those that are not currently loaded. For example, you may try help.search("anova").
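The console-based help lookups can be sketched as follows. apropos() is not mentioned above; it is a base R function included here as an additional suggestion. The interactive calls are commented out because they open help pages.

```r
# help() is the function behind the ? shortcut. Storing the result
# avoids displaying the help page immediately.
setwd_help <- help("setwd")
class(setwd_help)

# apropos() lists objects whose names match a pattern, which is
# handy when you only remember part of a function name.
apropos("^setwd$")

# Interactive use (uncomment to try):
# ?setwd
# help.search("anova")
```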

R help pages provide a lot of information. The description and argument sections are likely where you will want to start. If you are still unsure how to use the function, scroll down and check out the examples section of the documentation. Consider testing some of the examples yourself and applying them to your own data.

Many R packages also include more detailed help documentation known as a vignette. To see a package vignette, use browseVignettes() (e.g., browseVignettes(package="dplyr")).
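Vignettes can also be listed from the console with vignette(). In this sketch the grid package is used only because it is included with R; whether any vignettes are listed depends on how R was installed.

```r
# List the vignettes that ship with an installed package.
grid_vignettes <- vignette(package = "grid")
grid_vignettes$results[, c("Item", "Title")]

# Interactive use (uncomment to try): opens a list in the browser.
# browseVignettes(package = "grid")
```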

To see a function's arguments, you can use args().

args(setwd)

function (dir) 
NULL

Because setwd(dir) is used to set the working directory to dir, it requires only a single argument (dir).

Note

R arguments can be specified by name (argument_name = value), by position, or by partial name. More on this later.
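The three ways of supplying arguments can be compared using matrix(), whose first two arguments are data and nrow. This is a small illustrative sketch; the object names are hypothetical.

```r
# By name: each argument is labelled explicitly.
by_name <- matrix(data = 1:6, nrow = 2)

# By position: values are matched to arguments in order.
by_position <- matrix(1:6, 2)

# By partial name: "nr" is enough to identify nrow uniquely.
by_partial <- matrix(1:6, nr = 2)

# All three calls build the same 2 x 3 matrix.
identical(by_name, by_position)
identical(by_name, by_partial)
```

Partial matching works, but full argument names are usually easier to read and less likely to break if a function gains new arguments.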

Additional Sources for help

Try googling your problem or using some other search engine. rseek is an R-specific search engine that searches several R-related sites. If you are using Google or another major search engine directly, make sure to include "R" in your search terms.

Stack Overflow is a particularly great resource for finding help. If you post a question, you will need to make a reproducible example (reprex) and be as descriptive as possible regarding the problem. For this purpose, you may find the reprex package particularly useful.

To provide details about your R session, use

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.5.0    fastmap_1.2.0     cli_3.6.4         tools_4.5.0      
 [5] htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10       rmarkdown_2.29   
 [9] knitr_1.50        jsonlite_2.0.0    xfun_0.52         digest_0.6.37    
[13] rlang_1.1.6       evaluate_1.0.3
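If you want to keep this information alongside your results, one option is to write it to a text file. The sketch below assumes a hypothetical file name, session_info.txt, and writes to a temporary directory only so that it can be run anywhere.

```r
# Capture the session details as an object.
session <- sessionInfo()

# Individual pieces can be extracted, such as the R version string.
session$R.version$version.string

# Write the printed output to a text file.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(print(session)), session_file)
file.exists(session_file)
```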

Acknowledgments

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Material was also inspired by content from Introduction to data analysis with R and Bioconductor, which is part of the Carpentries Incubator.
