R Crash Course: A few things to know before diving into wrangling

Learning the Basics

Objectives
1. Learn about R objects
2. Learn how to recognize and use R functions
3. Learn about data types and accessors

Console vs. Script

We are going to begin by working in our console. In general, the console is used to run R code. If we want to run code quickly or test code, the console is the place to do this. If we want to keep our code and have a record of what we have been running to rerun or reference in the future, we should use the code editor to build an R script.

R can be used like a calculator

Let's use the console to run some basic mathematical operations.

398 + 783

## [1] 1181

475 / 5

## [1] 95

2 * 8906

## [1] 17812

(1 + (5 ** 0.5))/2

## [1] 1.618034

As you can see, ** was used for exponentiation; R also accepts ^, which is the more common form. There are a number of other special operators used to perform math in R. Refer to this chart from datacarpentry.org for an overview of mathematical operators used in R.
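As a quick sketch of a few other arithmetic operators (the numbers are arbitrary and not part of the original lesson), we can compare the two spellings of exponentiation and try the remainder and integer-division operators:

```r
# Exponentiation: ^ and ** are equivalent in R
power_caret <- 2 ^ 10
power_stars <- 2 ** 10

# Remainder (modulo) and integer division
remainder <- 17 %% 5    # 2
quotient  <- 17 %/% 5   # 3

# Square root, equivalent to raising to the power 0.5
root <- sqrt(16)        # 4

print(c(power_caret, power_stars, remainder, quotient, root))
```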

Creating an R Script

If all R could do was function as a calculator, it wouldn't be very useful. R can be used for powerful analyses and visualizations.

As we learn more about R and begin implementing our first commands, we will keep a record of our commands using an R script. Remember, good annotation is key to reproducible data analysis. An R script can also be generated to run on its own without user interaction.

To create an R script, click File > New File > R Script or click on the new document icon (paper with a +) and select R script. You can save your now Untitled script by selecting the floppy disk icon. Give your script a meaningful name so that you can identify what it contains when returning to it later. R scripts end in .R. Save your R script to your working directory, which will be the default location on RStudio Server.

Important

Scripts are ordered. Running commands out of order will cause confusion later when you try to reproduce a given analysis step.

R Objects

Now that we have an R script, let's begin to work with R objects. Everything assigned a value in R is technically an object. Mostly we think of R objects as something that a method (or function) can act on; however, R functions, too, are R objects. R objects are what gets assigned to memory in R and are of a specific type or class. Objects include things like vectors, lists, matrices, arrays, factors, and data frames. Don't get too bogged down by terminology. Many of these terms will become clear as we begin to use them in our code. In order to be assigned to memory, an R object must be created.

Therefore, objects are data structures with specific attributes and methods that can be applied to them.

Creating and deleting objects

To create an R object, you need a name, a value, and an assignment operator (e.g., <- or =). R is case sensitive, so an object with the name "FOO" is not the same as "foo".

Let's create a simple object and run our code.

To run our code, we have a number of options. First, you can use the Run button above. This will run highlighted or selected code. You may also use the source button to run your entire script. My preferred method is to use keyboard shortcuts. Move your cursor to the code of interest and use command + enter for macs or control + enter for PCs. If a command is taking a long time to run and you need to cancel it, use control + c from the command line or escape in RStudio. Once you run the command, you will see the command print to the console in blue followed by the output. You do not need to highlight code to run it. If you do highlight code, make sure you are highlighting everything you plan to run. Highlighting can be a great way to only test small sections of nested code.

a<-1 #You can and should annotate your code with comments for better reproducibility. 
a #Simply call the name of the object to print the value to the screen

## [1] 1

In this example, "a" is the name of the object, 1 is the value, and <- is the assignment operator. We inspect objects simply by typing or running their name.
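To make the point about case sensitivity concrete, here is a minimal sketch using two throwaway objects (the names foo and FOO are only illustrative). It also shows that = works for assignment, although <- is the more conventional choice:

```r
# Both assignment operators create an object; <- is the conventional choice
foo <- 1
FOO = 2

# R is case sensitive, so these are two separate objects
foo
FOO
identical(foo, FOO)   # FALSE
```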

Naming conventions and reproducibility

There are rules regarding the naming of objects.

Avoid spaces or special characters EXCEPT '_' and '.'

No numbers or symbols at the beginning of an object name.

For example:

1a<-"apples" # this will throw an error
1a

## Error: <text>:1:2: unexpected symbol
## 1: 1a
##      ^

In contrast:

a<-"apples" #this works fine
a

## [1] "apples"

What do you think would have happened if we didn't put 'apples' in quotes? Try it.

a<-apples

## Error in eval(expr, envir, enclos): object 'apples' not found

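Avoid common names that have special meanings or are already assigned to existing functions (for example, c, mean, or data). RStudio will offer these as autocomplete suggestions as you type, which is a quick way to spot them.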

See the tidyverse style guide for more information on naming conventions.
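One way to check whether a candidate name follows the rules above is the base R function make.names(), which rewrites a string into a syntactically valid name. This is only a sketch, and the candidate names below are made up for illustration:

```r
# make.names() converts arbitrary strings into syntactically valid names
candidates <- c("1a", "my gene", "gene_name", "gene.name")
valid <- make.names(candidates)
valid

# A name is already valid if make.names() leaves it unchanged
candidates == valid
```

Names that come back changed (here, the one starting with a number and the one containing a space) would not work as object names as written.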

How do I know what objects have been created?

To view a list of the objects you have created, use ls() or look at your global environment pane.

Reassigning and deleting objects

To reassign an object, simply overwrite the object.

#object with gene named 'tp53'
gene_name<-"tp53"
gene_name

## [1] "tp53"

#if instead we want to reassign gene_name to a different gene, we would use:
gene_name<-"GH1"
gene_name

## [1] "GH1"

Warning

R will not warn you when objects are being overwritten, so use caution.

To delete an object from memory:

# delete the object 'gene_name'
rm(gene_name)

#the object no longer exists, so calling it will result in an error
gene_name

## Error in eval(expr, envir, enclos): object 'gene_name' not found
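If we want to check whether an object exists without triggering an error, base R provides exists(). The sketch below uses two illustrative objects (gene_a and gene_b are made-up names) together with ls() and rm():

```r
# Create two throwaway objects (names are illustrative)
gene_a <- "tp53"
gene_b <- "GH1"

# ls() returns the names of the objects in the current environment
ls()

# exists() checks for an object by name without raising an error
exists("gene_a")   # TRUE

# Remove one object; only that name is deleted
rm(gene_a)
exists("gene_a")   # FALSE
exists("gene_b")   # TRUE
```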

Things to note

R largely ignores spaces in your code. However, including them can vastly improve readability. For example, "thisissohardtoread" but "this is fine".

You can use tab completion to quickly type object or function names.

For example:

```r
clifford<-"a big red dog"
```
Now, type "clif" into the console and hit tab.

Quotes are used anytime you are entering character string values; without them, R will think you are calling an object. Either single or double quotes can be used.
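A short sketch of how the two kinds of quotes behave (the label text is made up for illustration): both produce the same string, and using one kind lets us embed the other inside the string.

```r
# Single and double quotes create the same character string
identical('tp53', "tp53")   # TRUE

# Use one kind of quote to embed the other inside a string
label <- 'the "GH1" gene'
nchar(label)
```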

Using functions

A function in R (or any computing language) is a short program that takes some input and returns some output.

An R function has three key properties:

Functions have a name (e.g. dir, getwd); note that functions are case sensitive!

Following the name, functions have a pair of ()

Inside the parentheses, a function may take 0 or more arguments --- datacarpentry.org

To create a function, you can use the following syntax:

function_name <- function(arg_1, arg_2, ...) {
   # function body: the code that runs when the function is called
}
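As a concrete instance of this syntax, here is a minimal sketch of a user-defined function. The function name counts_per_million and its scale argument are hypothetical and chosen only for illustration:

```r
# Hypothetical helper: rescale raw counts so that they sum to `scale`
counts_per_million <- function(counts, scale = 1e6) {
  # The value of the last expression is what the function returns
  counts / sum(counts) * scale
}

cpm <- counts_per_million(c(10, 30, 60))
cpm
```

Because scale has a default value, we only need to supply counts when calling the function.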

Navigating directories

Now that we know what a function is, let's use them to navigate our directories.

Our first function will be getwd(). This simply prints your working directory (our default directory for saving files) and is the R equivalent of pwd (if you know unix coding).

#print our working directory
getwd()

## [1] "/home/rstudio/"

How can we find out what arguments a function takes?

For details on function arguments and examples of how to use the function, we should check the package / function documentation. We can get help by preceding a function name with ?, or with ?? to search the help pages of all installed packages, including those that have not been loaded with library(). We can also use the function args().

Let's see this in action with setwd(). setwd() is used to change our working directory. If we want to know what argument it takes, we can try the help documentation.

?setwd

As we can see from the help documentation, setwd() requires the argument dir, which requires a character string pointing to the correct directory. The path should be in quotes, and you can use tab completion to fill in the path as needed.
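The same information can be inspected from code. The sketch below uses args() and formals() on setwd(), then changes to a temporary directory and back so that nothing permanent is altered (using tempdir() here is our choice for a safe example, not part of the original lesson):

```r
# args() prints a function's argument list without opening the help page
args(setwd)

# formals() returns the arguments as a named list
names(formals(setwd))   # "dir"

# Change directory and return; setwd() invisibly returns the previous directory
old_dir <- setwd(tempdir())
getwd()
setwd(old_dir)
```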

Note

R accepts unix-style paths that use forward slashes (/), so regardless of whether you have a Windows computer or a Mac, the way you enter the directory information will be the same.

Function arguments are positional

Function arguments are positional, meaning the order matters unless each argument is explicitly named. Let's see this in practice with the function round().

round()

rounds the values in its first argument to the specified number of decimal places (default 0) --- R help.

This implies that the first argument should be the number you want to round.

Let's see an example:

round(17.664, 2) #round 17.664 to 17.66

## [1] 17.66

round(2, 17.664) #rounds 2, using 17.664 as the digits argument; 2 is unchanged

## [1] 2

round(digits=2, 17.664) #explicitly state one of the arguments

## [1] 17.66
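The same behavior can be seen with other functions. As a sketch, seq() takes its first three arguments in the order from, to, by, so a positional call and a fully named call in a different order give the same result:

```r
# Positional: arguments are matched in order (from, to, by)
by_position <- seq(1, 10, 2)

# Named: order no longer matters
by_name <- seq(by = 2, to = 10, from = 1)

by_position                       # 1 3 5 7 9
identical(by_position, by_name)   # TRUE
```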

Some common functions

There are several functions that you will see repeatedly as you use R more and more.

One of those is c(), which is used to combine its arguments to form a vector.

Vectors are probably the most commonly used object type in R. A vector is a collection of values that are all of the same type (numbers, characters, etc.). The columns that make up a data frame, for example, are vectors of the same length. --- datacarpentry.org.

Let's create some vectors.

transcript_names<-c("TSPAN6","TNMD","SCYL3","GCLC") #We can create a character vector  

transcript_names<-c(TSPAN6,"TNMD",SCYL3,"GCLC") #Why doesn't this work?

## Error in eval(expr, envir, enclos): object 'TSPAN6' not found

transcript_counts <- c(679, 0, 467, 260, 60, 0) #combine numbers

sample_names<-c("1","B","3","D") #This is poor practice; stay consistent

sample_names<-c("Sample1","Sample2","Sample3")

more_samps<-c("Sample4","Sample5")

sample_names<-c(sample_names,more_samps) #combine two vectors

Here is a short list of functions that are commonly used and good to keep in mind:

rbind(), cbind() - Combine vectors by row/column

grep() - search for matches to a regular expression

identical() - test if 2 objects are exactly equal

length() - no. of elements in vector

ls() - list objects in current environment

rep(x,n) - repeat the number x, n times

rev(x) - elements of x in reverse order

seq(x,y,n) - sequence (x to y, spaced by n)

sort(x) - sort the vector x

order(x) - list the sorted element numbers of x

tolower(),toupper() - Convert string to lower/upper case letters

unique(x) - remove duplicate entries from vector

round(x), signif(x), trunc(x) - rounding functions

month.abb/month.name - abbreviated and full names for months

pi, letters, LETTERS - built-in constants (e.g. letters[7] is "g")

lm - fit linear model

mean(x), weighted.mean(x), median(x), min(x), max(x), quantile(x)

sd() - standard deviation

summary(x) - a summary of x (mean, min, max)
--- Charles Dimaggio, columbia.edu
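To get a feel for several of the functions listed above, here is a small sketch that applies them to one short vector (the values are arbitrary):

```r
x <- c(3, 1, 2, 3)

length(x)          # 4
sort(x)            # 1 2 3 3
order(x)           # 2 3 1 4 (positions of the sorted elements)
rev(x)             # 3 2 1 3
unique(x)          # 3 1 2

rep(7, 3)          # 7 7 7
seq(0, 1, 0.25)    # 0.00 0.25 0.50 0.75 1.00

toupper("tp53")    # "TP53"

# Indexing with order() gives the same result as sort()
identical(sort(x), x[order(x)])   # TRUE
```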

You may also find the R reference card listed under Additional Resources valuable.

Explicitly calling a function

At times a function may be masked by another function. This can happen if two functions are named the same (e.g., dplyr::filter() vs stats::filter()). We can get around this by explicitly calling a function from the correct package using the following syntax: package::function().
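As a sketch of the package::function() syntax, the call below explicitly requests filter() from the stats package, which ships with R. It will therefore run the stats version even if another loaded package defines its own filter(). The input series and the 3-point moving average are made up for illustration:

```r
# stats::filter() applies a linear filter to a numeric series.
# Here it computes a centered 3-point moving average.
series <- c(1, 2, 3, 4, 5)
smoothed <- stats::filter(series, rep(1/3, 3))

# The ends have no complete window, so they are NA
moving_avg <- as.numeric(smoothed)
moving_avg
```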

A quick look at data frames

We will mostly be working with data frames throughout this course. Data frames hold tabular data, and as such are collections of vectors that share the same length but can differ in type.

Example data frame:

#create a data frame using data.frame()
df<-data.frame(
  id=paste("Sample",1:10,sep="_"), #10 IDs, recycled to fill all 20 rows
  cell=rep(factor(c("cell_line_A","cell_line_B")),each=10),
  counts=sample(1:1000,20,replace=TRUE) #random, so your values will differ
)

What do we mean by data types?

The data type of an R object affects how that object can be used or will behave. Examples of base R data types include numeric, integer, complex, character, and logical. R objects can also have certain assigned attributes (related to class), and these attributes will be important for how they interact with certain methods / functions. Ultimately, understanding the mode / type and class of an object will be important for how an object can be used in R. When the mode of an object is changed, we call this "coercion". You may see a coercion warning pop up when working with objects in the future.
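Here is a minimal sketch of coercion in action, both implicit (when mixing types in a vector) and explicit (using as.*() functions). The values are arbitrary:

```r
# Mixing types in c() coerces everything to the most flexible type
mixed <- c(1, "B", TRUE)
typeof(mixed)          # "character"

# Explicit coercion with as.*() functions
as.numeric("42")       # 42
as.integer(3.9)        # 3 (truncates toward zero)

# Values that cannot be converted become NA, with a coercion warning
# (suppressed here so the example runs quietly)
bad <- suppressWarnings(as.numeric("1001701bp"))
is.na(bad)             # TRUE
```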

Here are some of the most notable modes: [Table: Modes (from datacarpentry.org)]

The mode or type of an object can be examined using mode() or typeof(). Its class can be examined using class(). Unlike modes and types, classes are unlimited and can be user defined. Classes can often be more informative. For example, class() may return data.frame, array, matrix, or factor.

od_600_value <- 0.47
typeof(od_600_value)
## [1] "double"

chr_position <- '1001701bp'
typeof(chr_position)
## [1] "character"

spock <- TRUE
typeof(spock)
## [1] "logical"

typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"

Because data frames are made up of columns that can store different types of data, we can examine the overall structure of a data frame using str().

str(df)

## 'data.frame':    20 obs. of  3 variables:
##  $ id    : chr  "Sample_1" "Sample_2" "Sample_3" "Sample_4" ...
##  $ cell  : Factor w/ 2 levels "cell_line_A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ counts: int  502 514 30 648 618 325 417 204 753 600 ...

There are functions that can test types and classes directly, for example, is.numeric(), is.character(), is.logical(), is.data.frame(), is.matrix(), is.factor().
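A brief sketch of these is.*() checks on a few illustrative objects (the object names and values are made up):

```r
counts <- c(679, 0, 467)
genes  <- c("TSPAN6", "TNMD", "SCYL3")

is.numeric(counts)      # TRUE
is.character(counts)    # FALSE
is.character(genes)     # TRUE
is.logical(counts > 0)  # TRUE

# A matrix is not a data frame
m <- matrix(1:6, nrow = 2)
is.matrix(m)            # TRUE
is.data.frame(m)        # FALSE
```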

What are factors?

Factors can be thought of as vectors which are specialized for categorical data. Given R's specialization for statistics, this makes sense since categorical and continuous variables are usually treated differently. Sometimes you may want to have data treated as a factor, but in other cases, this may be undesirable.

We will discuss factors more later when plotting. Functions most relevant to factors include factor() and levels().
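As a preview, here is a small sketch of factor() and levels() using a made-up treatment variable. It shows the default alphabetical ordering of levels and how to set the order ourselves:

```r
treatment <- c("low", "high", "medium", "low")

# By default, levels are sorted alphabetically
f_default <- factor(treatment)
levels(f_default)      # "high" "low" "medium"

# Supply levels explicitly to control their order
f_custom <- factor(treatment, levels = c("low", "medium", "high"))
levels(f_custom)       # "low" "medium" "high"

# Internally, each value is stored as an integer code into the levels
as.integer(f_custom)   # 1 3 2 1
```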

Data frame accessors

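We can access a column of our data frame using [], [[]], or $. Note that [] returns a one-column data frame, while [[]] and $ return the column itself as a vector (or factor). We can use colnames() and rownames() to access the column names and row names of a data frame.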

For example:

df[["cell"]]

##  [1] cell_line_A cell_line_A cell_line_A cell_line_A cell_line_A cell_line_A
##  [7] cell_line_A cell_line_A cell_line_A cell_line_A cell_line_B cell_line_B
## [13] cell_line_B cell_line_B cell_line_B cell_line_B cell_line_B cell_line_B
## [19] cell_line_B cell_line_B
## Levels: cell_line_A cell_line_B

df["cell"]

##           cell
## 1  cell_line_A
## 2  cell_line_A
## 3  cell_line_A
## 4  cell_line_A
## 5  cell_line_A
## 6  cell_line_A
## 7  cell_line_A
## 8  cell_line_A
## 9  cell_line_A
## 10 cell_line_A
## 11 cell_line_B
## 12 cell_line_B
## 13 cell_line_B
## 14 cell_line_B
## 15 cell_line_B
## 16 cell_line_B
## 17 cell_line_B
## 18 cell_line_B
## 19 cell_line_B
## 20 cell_line_B

df$cell

##  [1] cell_line_A cell_line_A cell_line_A cell_line_A cell_line_A cell_line_A
##  [7] cell_line_A cell_line_A cell_line_A cell_line_A cell_line_B cell_line_B
## [13] cell_line_B cell_line_B cell_line_B cell_line_B cell_line_B cell_line_B
## [19] cell_line_B cell_line_B
## Levels: cell_line_A cell_line_B

colnames(df)

## [1] "id"     "cell"   "counts"

rownames(df)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20"

Subsetting

Check out this chapter of Advanced R for more on subsetting operators.
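In the meantime, here is a minimal sketch of the most common ways to subset a data frame in base R. The toy data frame and its values are made up for illustration:

```r
# Small illustrative data frame (values are made up)
toy <- data.frame(
  id     = c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
  cell   = c("cell_line_A", "cell_line_A", "cell_line_B", "cell_line_B"),
  counts = c(502, 30, 648, 204),
  stringsAsFactors = FALSE
)

# [rows, columns]: leave a position empty to keep everything
toy[1:2, ]                    # first two rows, all columns
toy[, c("id", "counts")]      # all rows, two columns

# Logical subsetting: keep rows where a condition is TRUE
high <- toy[toy$counts > 200, ]
high$id                       # "Sample_1" "Sample_3" "Sample_4"

# [ ] keeps the data frame; [[ ]] and $ extract the column itself
class(toy["counts"])          # "data.frame"
class(toy[["counts"]])        # "numeric"
```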

Uploading and exporting files from RStudio Server

RStudio Server works via a web browser, and so you see this additional Upload option in the Files pane. If you select this option, you can upload files from your local computer into the server environment. If you select More, you will also see an Export option. You can use this to export the files created in the RStudio environment.

Saving your R environment (.RData)

When exiting RStudio, you will be prompted to save your R workspace or .RData. The .RData file saves the objects generated in your R environment. You can also save the .RData at any time using the floppy disk icon just below the Environment tab. You may also save your R workspace from the console using save.image(). You may load an .RData file by using load().
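The round trip of saving and loading can be sketched as follows. To keep the example small, it uses save(), which writes only the objects we name, whereas save.image() writes the whole workspace. The temporary file and the object od_values are illustrative choices:

```r
# Write to a temporary file so nothing in the working directory is touched
rdata_file <- tempfile(fileext = ".RData")

od_values <- c(0.47, 0.52, 0.61)    # illustrative object
save(od_values, file = rdata_file)  # save.image(file = ...) would save everything

# Remove the object, then restore it from disk
rm(od_values)
loaded_names <- load(rdata_file)    # load() returns the names it restored
loaded_names                        # "od_values"
od_values
```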

Additional tips, tricks, and things to know

You can use the up arrow on your keyboard when using the console to pull up previously used commands.

Certain symbols in R always come in pairs, for example, parentheses and quotation marks. If you do not provide a pair, R will return a continuation character +, meaning it is waiting for more input to complete the command. You can either provide the missing information or press Escape to cancel.

You can create an object and print it in one step by wrapping the assignment in parentheses.

For example,

(coolfeature<-"printing an object automatically")

## [1] "printing an object automatically"

Acknowledgments

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org.

Additional Resources

Hands-on Programming with R

R reference card

R specific search engine, rseek