Lesson 3: Basics of R Programming: Vectors

Objectives

To understand some of the most basic features of the R language including creating, modifying, sub-setting, and exporting vectors.

As with previous lessons, to get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.

Vectors

Vectors are probably the most commonly used object type in R. A vector is a collection of values that are all of the same type (numbers, characters, etc.). The columns that make up a data frame are vectors. One of the most common ways to create a vector is to use the c() function - the “concatenate” or “combine” function. Inside the function you may enter one or more values; for multiple values, separate each value with a comma. --- datacarpentry.org.

Creating vectors

#create a vector of gene names
transcript_names <- c("TSPAN6", "TNMD", "SCYL3", "GCLC")
transcript_names

[1] "TSPAN6" "TNMD"   "SCYL3"  "GCLC"

Let's check out the type of data within the vector. What do you think?

typeof(transcript_names)

[1] "character"

Another property of vectors worth exploring is their length. Try length().

length(transcript_names)

[1] 4

In addition, you can assess the underlying structure of the object (vector in this case) by using str(). str() will be invaluable for understanding more complicated data structures such as matrices and data frames, which will be discussed later.

# this will return properties of the object's underlying structure
# in this case, the length and type
str(transcript_names)

 chr [1:4] "TSPAN6" "TNMD" "SCYL3" "GCLC"

Here, the length and type of data in the vector are returned, as well as a summary of the data.

#The str() output already suggests this is a vector, but you can always check
is.vector(transcript_names)

[1] TRUE

Vectors can also have a names attribute.

counts<-c("TSPAN6"= 679, "TNMD" = 0, "SCYL3" = 467)
counts

TSPAN6   TNMD  SCYL3 
   679      0    467

names(counts)

[1] "TSPAN6" "TNMD"   "SCYL3"
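
Names do not have to be supplied when the vector is created. As a minimal sketch (the object name counts_unnamed is made up for this illustration), names can also be attached afterwards by assigning to names():

```r
# A hypothetical unnamed vector of counts
counts_unnamed <- c(679, 0, 467)

# names() returns NULL when no names attribute is set
names(counts_unnamed)

# Attach names after the fact; the length should match the vector
names(counts_unnamed) <- c("TSPAN6", "TNMD", "SCYL3")
counts_unnamed
names(counts_unnamed)
```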

Creating, modifying, sub-setting, and exporting

Let's learn how to further work with vectors, including creating, sub-setting, modifying, and saving. First, we will create a few vectors. Again, the c() function is necessary for this task.

#Some possible RNASeq data
cell_line<- c("N052611", "N061011", "N080611", "N61311" )
sample_id <- c("SRR1039508", "SRR1039509", "SRR1039512", 
               "SRR1039513", "SRR1039516", "SRR1039517", 
               "SRR1039520", "SRR1039521")
transcript_counts <- c(679, 0, 467, 260,  60,   0)

Creating vectors with functions

Vectors can also be created with different functions. Some common functions used to create vectors include seq() and rep().
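
Here is a short sketch of what seq() and rep() can do. The object names (evens, fractions, treatment, treatment_each) are made up for this example and are not used elsewhere in the lesson.

```r
# seq() builds a regular sequence: from, to, and a step size (by)
evens <- seq(2, 10, by = 2)
evens       # 2 4 6 8 10

# ...or a fixed number of evenly spaced values (length.out)
fractions <- seq(0, 1, length.out = 5)
fractions   # 0.00 0.25 0.50 0.75 1.00

# rep() repeats values: 'times' repeats the whole vector,
# 'each' repeats every element before moving to the next
treatment <- rep(c("treated", "untreated"), times = 2)
treatment       # "treated" "untreated" "treated" "untreated"

treatment_each <- rep(c("treated", "untreated"), each = 2)
treatment_each  # "treated" "treated" "untreated" "untreated"
```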

Vector operations

If our vectors are numeric, we can apply mathematical operations and arithmetic expressions.

# Apply some basic math
transcript_counts + 10

[1] 689  10 477 270  70  10

transcript_counts^2 +100

[1] 461141    100 218189  67700   3700    100

# Transform the data using a log 10 transformation
log10(transcript_counts + 1)

[1] 2.832509 0.000000 2.670246 2.416641 1.785330 0.000000

# Add two vectors together
transcript_counts + rep(2,times=6)
## [1] 681   2 469 262  62   2

#Add different sized vectors
transcript_counts + c(0,1)
## [1] 679   1 467 261  60   1

transcript_counts + c(0,1,0,1)
## Warning in transcript_counts + c(0, 1, 0, 1): longer object length is not a
## multiple of shorter object length
## [1] 679   1 467 261  60   1

Some things to note here:

With vectors of the same length, we can add, subtract, multiply, etc., but operations are performed on elements in the same position of each vector.
With vectors of different lengths, the shorter vector will be recycled until the operation is complete. If the length of the longer vector is not a multiple of the length of the shorter vector, a warning will be thrown.
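
The examples above only use addition. As a small sketch with two made-up vectors of equal length (counts_a and counts_b are hypothetical), the other arithmetic operators work position by position in the same way:

```r
# Two hypothetical numeric vectors of the same length
counts_a <- c(10, 20, 30)
counts_b <- c(1, 2, 3)

# Each operation pairs up elements in the same position
difference <- counts_a - counts_b
difference   # 9 18 27

product <- counts_a * counts_b
product      # 10 40 90

ratio <- counts_a / counts_b
ratio        # 10 10 10
```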

Vector sub-setting

There may be moments where you want to retrieve a specific value or values from a vector. To do this, we use bracket notation sub-setting ([]). In bracket notation, you call the name of the vector followed by brackets. The brackets contain an index for the value that we want. The index is the numerical position of the value in the vector. For example, take a look at cell_line.

cell_line

[1] "N052611" "N061011" "N080611" "N61311"

The first position [1] is held by "N052611". The next position is 2 followed by 3, etc.

With numerical indexing, we can access a given value from the vector using name[index], where name is the name of the vector, and index is the numerical position within the vector.

Let's get the second value from cell_line.

cell_line[2]

[1] "N061011"

In R, vector indices start with 1 and end with length(vector). This is important because the starting index differs between programming languages.

For example:

Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.---bioc-intro.

So to extract the last element in a vector, you could use the following notation:

#retrieve the last element in the sample_id vector
sample_id[length(sample_id)]

[1] "SRR1039521"

This is the same as:

#retrieve the last element in the sample_id vector
sample_id[8]

[1] "SRR1039521"

You may also want to subset a range of values. In R, use a colon (:) to represent a range.

#Retrieve the 2nd and 3rd value from cell_line
cell_line[2:3]

[1] "N061011" "N080611"

#Retrieve the 1st, 4th, 5th, and 6th values from transcript_counts
transcript_counts[c(1,4:6)]

[1] 679 260  60   0

The combine function c() can also be used to add one or more elements to a vector. To keep the change, the result has to be assigned back to the object.

#Let's add two genes to transcript_names
transcript_names <- c(transcript_names, "ANAPC10P1", "ABCD1") 
transcript_names
## [1] "TSPAN6"    "TNMD"      "SCYL3"     "GCLC"      "ANAPC10P1" "ABCD1"

A negative index can be used to remove a value.

#Let's remove "SCYL3"
transcript_names <- transcript_names[-3]
transcript_names

[1] "TSPAN6"    "TNMD"      "GCLC"      "ANAPC10P1" "ABCD1"

We can rename a value by assigning a new value to its index.

#Let's rename "GCLC"
transcript_names[3] <- "NNAME"
transcript_names

[1] "TSPAN6"    "TNMD"      "NNAME"     "ANAPC10P1" "ABCD1"

We can use the names attribute to query or subset a vector.

counts["SCYL3"]

SCYL3 
  467

We can also refer to a value directly by matching on its content; more on this below.

#Rename "ABCD1" to "NEW"
transcript_names[transcript_names == "ABCD1"]  <- "NEW" 
transcript_names

[1] "TSPAN6"    "TNMD"      "NNAME"     "ANAPC10P1" "NEW"

Logical subsetting

It is also possible to subset in R using logical evaluation or numerical comparison. To do this, we use comparison operators, as we did in the last example. See the table below for a list of operators.

Comparison Operator	Description
>	greater than
>=	greater than or equal to
<	less than
<=	less than or equal to
!=	Not equal
==	equal
a | b	a or b
a & b	a and b

So if, for example, we wanted a subset of all transcript counts greater than 260, we could use indexing combined with a comparison operator:

transcript_counts[transcript_counts > 260]

[1] 679 467

Why does this work? Let's break down the code.

transcript_counts > 260

[1]  TRUE FALSE  TRUE FALSE FALSE FALSE

This returns a logical vector. We can see that positions 1 and 3 are TRUE, meaning they are greater than 260. Therefore, the initial sub-setting above is asking for a subset based on TRUE values. Here is the equivalent:

transcript_counts[c( TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)]

[1] 679 467
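
The & and | operators from the table can combine several comparisons into one logical vector. The sketch below works on a separate copy of the data (example_counts is a name made up for this illustration), so transcript_counts itself is left untouched.

```r
# A hypothetical copy of the counts used above
example_counts <- c(679, 0, 467, 260, 60, 0)

# & keeps positions where BOTH conditions are TRUE
low_nonzero <- example_counts[example_counts > 0 & example_counts < 300]
low_nonzero      # 260 60

# | keeps positions where AT LEAST ONE condition is TRUE
zero_or_high <- example_counts[example_counts == 0 | example_counts > 400]
zero_or_high     # 679 0 467 0
```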

You can also use this functionality to do a kind of find and replace. Perhaps we want to find zero values and replace them with NAs. We could use:

transcript_counts[transcript_counts==0]<-NA

Note

If you instead ran transcript_counts[transcript_counts==0]<-"NA", you would coerce this vector to a character vector.
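
To see the coercion described in the note without altering transcript_counts, here is a small sketch on a throwaway vector (counts_copy is a hypothetical name):

```r
# A hypothetical numeric vector
counts_copy <- c(679, 0, 467)
type_before <- typeof(counts_copy)
type_before          # "double"

# Assigning the quoted string "NA" forces every element to character
counts_copy[counts_copy == 0] <- "NA"
type_after <- typeof(counts_copy)
type_after           # "character"
counts_copy          # "679" "NA" "467"

# The string "NA" is not a missing value, so is.na() does not flag it
is.na(counts_copy)   # FALSE FALSE FALSE
```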

Now, if we want to return only values that aren't NAs, we can use

transcript_counts[!is.na(transcript_counts)] #values that aren't NAs

[1] 679 467 260  60

is.na(transcript_counts) #if you simply want to know if there are NAs

[1] FALSE  TRUE FALSE FALSE FALSE  TRUE

which(is.na(transcript_counts)) #if you want the indices of those NAs

[1] 2 6

Other ways to handle missing data

Other functions you may find useful when working with NAs include na.omit() and complete.cases().

na.omit() removes the NAs from a vector.

na.omit(transcript_counts)

[1] 679 467 260  60
attr(,"na.action")
[1] 2 6
attr(,"class")
[1] "omit"

complete.cases() creates a logical vector that you can use for sub-setting based on the absence of NAs.

transcript_counts[complete.cases(transcript_counts)]

[1] 679 467 260  60

Many functions will also have an na.rm argument. For example, see ?mean.
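
As a quick sketch of what na.rm does, using a made-up vector (counts_with_na is a hypothetical name): by default, a single NA makes the result of mean() or sum() NA, while na.rm = TRUE drops the missing values before calculating.

```r
# A hypothetical vector containing missing values
counts_with_na <- c(679, NA, 467, 260, 60, NA)

# Without na.rm, the NAs propagate to the result
mean_default <- mean(counts_with_na)
mean_default      # NA

# With na.rm = TRUE, NAs are removed first
mean_no_na <- mean(counts_with_na, na.rm = TRUE)
mean_no_na        # 366.5

sum_no_na <- sum(counts_with_na, na.rm = TRUE)
sum_no_na         # 1466
```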

Using objects to store thresholds

To make scripting reproducible, you could avoid calling a specific number directly and use objects in logical evaluations like those above. If we use an object, the value itself could easily be replaced with whatever value is needed. For example:

trnsc_cutoff <- 260
#note: this will also include NAs in the output
transcript_counts[transcript_counts>trnsc_cutoff]

[1] 679  NA 467  NA

#if we want to exclude possible NAs, something like this will work
transcript_counts[!is.na(transcript_counts) & transcript_counts>trnsc_cutoff]

[1] 679 467

Using the %in% operator

There may be a time you want to know whether there are specific values in your vector. To do this, we can use the %in% operator (see ?match). For each element of the vector on the left, this operator returns TRUE if that element is found in the vector on the right, so the result can be used for sub-setting. It makes more sense to use this with data frames, but we can see how this works here.

For example:

# have a look at transcript_names
transcript_names

[1] "TSPAN6"    "TNMD"      "NNAME"     "ANAPC10P1" "NEW"

# test to see if "NNAME" and "ANAPC10P1" are in this vector
# if you are looking for more than one value, you must pass this as a vector

c("NNAME","ANAPC10P1") %in% transcript_names

[1] TRUE TRUE

#We could also save the search vector to an object and search that way.
find_transcripts<-c("NNAME","ANAPC10P1")
find_transcripts %in% transcript_names

[1] TRUE TRUE

#to use this for subsetting, put the vector being subset on the left so the logical result matches its length
transcript_names[transcript_names %in% find_transcripts]

[1] "NNAME"     "ANAPC10P1"
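
The direction of %in% matters, because the result always has the length of the left-hand vector. The sketch below uses made-up objects (gene_names, wanted, and the value "MISSING" are hypothetical) to show both directions, plus negation with ! to keep everything that is not in the search vector.

```r
# Hypothetical vectors for illustration
gene_names <- c("TSPAN6", "TNMD", "NNAME", "ANAPC10P1", "NEW")
wanted <- c("NNAME", "ANAPC10P1", "MISSING")

# One TRUE/FALSE per element of 'wanted' (length 3)
found <- wanted %in% gene_names
found        # TRUE TRUE FALSE

# One TRUE/FALSE per element of 'gene_names' (length 5)
is_wanted <- gene_names %in% wanted
is_wanted    # FALSE FALSE TRUE TRUE FALSE

# Negate to keep the names that are NOT in 'wanted'
not_wanted <- gene_names[!(gene_names %in% wanted)]
not_wanted   # "TSPAN6" "TNMD" "NEW"
```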

Saving and loading objects

We discussed saving the R workspace (.RData), but what if we simply want to save a single object? In such a case, we can use saveRDS().

Let's save our transcript_counts vector to our working directory.

saveRDS(transcript_counts,"transcript_counts.rds")

Check the Files pane for your newly created file. Make sure you are viewing the contents of your working directory (getwd()).

To load the object back into your R workspace, use readRDS().
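
A minimal round-trip sketch is shown below. To avoid overwriting anything in your working directory, it writes to a temporary location; the names example_counts, rds_path, and restored_counts are made up for this illustration. Note that readRDS() returns the object, so the result must be assigned to a name of your choice.

```r
# A hypothetical vector to save
example_counts <- c(679, NA, 467, 260, 60, NA)

# Build a file path inside R's temporary directory
rds_path <- file.path(tempdir(), "transcript_counts.rds")

# Save the single object, then read it back under a new name
saveRDS(example_counts, rds_path)
restored_counts <- readRDS(rds_path)

# The restored object should match the original
identical(example_counts, restored_counts)
```

In the lesson itself, you would pass "transcript_counts.rds" as the path to read the file from your working directory.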

Acknowledgments

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Material was also inspired by content from Introduction to data analysis with R and Bioconductor, which is part of the Carpentries Incubator.