Lesson 1: Introduction to Biowulf, Unix, and R

Learning Objectives

Learn about why you may want to use R on Biowulf.
Refresh Unix and R skills.

This lesson will not be hands-on.

Why use R for bioinformatics?

R is both a computational language and an environment for statistical computing and graphics. It is open-source and widely used, not just by bioinformaticians. R is a particularly great resource for statistical analysis, plotting, and report generation, and it has become a powerhouse for biological assay data analysis (e.g., RNA-seq, scRNA-seq, ChIP-seq, population genomics). Package repositories like Bioconductor have influenced the rise of R programming in the -omics fields.

What is Bioconductor?

Bioconductor is an R package repository for free open-source software that "facilitates rigorous and reproducible analysis of data from current and emerging biological assays". Bioconductor is released semi-annually, with two Bioconductor releases for every release of R. Packages in Bioconductor undergo rigorous testing to ensure the interoperability of included software.

Bioconductor provides not only methodological software packages (packages that offer new methods for the analysis of specific data types) but also software focused on core infrastructure. Package developers are encouraged to use existing Bioconductor infrastructure for storing and accessing data, which increases the usability of packages by minimizing the time spent learning new data structures for different workflows. This emphasis on common infrastructure classes makes Bioconductor software scalable to emerging data types and methods. Developers can build on existing infrastructure and methods to rapidly deploy new packages as technology advances in the molecular sciences. Beyond software, Bioconductor offers other types of packages, including annotation packages that provide access to well-known databases such as Entrez Gene, Ensembl, UCSC, the Gene Ontology Consortium, KEGG, etc. In addition, there are experimental data packages that provide datasets for package validation or package tutorials, and workflow packages that combine aspects of multiple Bioconductor packages to complete a particular type of analysis.

The latest version of Bioconductor (v 3.17, compatible with R v.4.3) includes 2,230 software packages, 419 experiment data packages, 912 annotation packages, 27 workflows, and 3 books. The Bioconductor project strives to "further scientific understanding" through extensive documentation and training opportunities. Each package includes one or more quality vignettes outlining the use of included functions.

What is Biowulf, and why use R on Biowulf?

Biological datasets can be massive. Often our local computers (laptops, desktops) do not have the storage space or computational power to analyze these datasets. Biowulf is the NIH high-performance computing cluster. It has more than 90k processors and can easily run large numbers of simultaneous jobs. Biowulf also includes more than 600 preinstalled scientific software packages and databases.

You should use Biowulf when: software is unavailable or difficult to install on your local computer and is available on Biowulf, you are working with large amounts of data that can be parallelized to shorten computational time, or you are performing computational tasks that are memory intensive.

Many of the initial data processing steps for most data types will be performed with Unix-based bioinformatics software, often requiring one to use Biowulf, especially in the case of Windows users. Users may want to further analyze the output from these initial workflows, which can still include "large data", using Bioconductor or other R packages. Instead of transferring data from Biowulf to your local computer, it may be easier to use R directly on Biowulf compute nodes.

Warning

Never run computational tasks on the login node. Computational tasks on Biowulf should be submitted as a job (sbatch, swarm) or run through an interactive session (sinteractive).
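As a rough illustration of what "submitted as a job" can look like, the sketch below only writes a small batch script to a scratch directory; it does not submit anything. The script name (run_analysis.sh), the R script name (my_analysis.R), and the resource values are hypothetical placeholders, so check the NIH HPC documentation for values appropriate to your work.

```shell
# Write a minimal Slurm batch script into a throwaway directory.
# Nothing is submitted here; all names and resource values are placeholders.
job_dir="$(mktemp -d)"
cd "$job_dir"

cat > run_analysis.sh <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=02:00:00

# Load R on the compute node, then run a (hypothetical) R script.
module load R
Rscript my_analysis.R
EOF

# Show the script that was written.
cat run_analysis.sh

# On a Biowulf login node you would then submit it with:
#   sbatch run_analysis.sh
# or request an interactive session instead with:
#   sinteractive --cpus-per-task=4 --mem=16g
```

Either route moves the actual computation onto a compute node, which is the point of the warning above.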

Danger

Do not put data with PII (personally identifiable information), patient data for example, on Biowulf.

Getting a Biowulf account

If you do not already have a Biowulf account, you can obtain one by following the instructions here. NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory. Obtaining an account requires PI approval and a nominal fee of $35 per month. Accounts are renewed annually, contingent upon PI approval.

When you apply for a Biowulf account you will be issued two primary storage spaces:
/home/$USER (16 GB)
/data/$USER (100 GB)

You may request more space in /data/$USER by filing an online storage request.

NIH HPC Documentation

The NIH HPC systems are well-documented at hpc.nih.gov. The User guides, Training documentation, and How To docs are fantastic resources for getting help with most HPC tasks.

Additional help

Contact staff@hpc.nih.gov
The HPC team welcomes questions and is happy to offer guidance to address your concerns.
Monthly Zoom consult sessions
The HPC team offers monthly Zoom consult sessions. ["All problems and concerns are welcome, from scripting problems to node allocation, to strategies for a particular project, to anything that is affecting your use of the HPC systems. The Zoom details are emailed to all Biowulf users the week of the consult."](https://hpc.nih.gov/training/){target=_blank}
Bioinformatics Training and Education Program
BTEP is here to help with all training needs. We are happy to help you get started with Biowulf and begin analyzing your data. If you experience any difficulties or challenges, especially with different bioinformatics applications, please do not hesitate to email us.

Unix Refresher

Biowulf compute nodes use a Unix-like (Linux) operating system (distributions RHEL8/Rocky8). Unix is a proprietary operating system, like Windows or macOS (which is itself Unix-based). There are many Unix and Unix-like operating systems, including open-source Linux and its multiple distributions. Working on Biowulf requires using the command line interface (shell): we issue Unix commands to tell the computer what we want it to do.

A basic foundation of Unix is advantageous for most scientists, as many bioinformatics open-source tools are available or accessible by command line on Unix-like systems.

How much Unix do I need to know to work on Biowulf?

As with any language, the learning curve for Unix can be quite steep. However, to work on Biowulf you really need to understand the following:

Directory navigation: what the directory tree is, how to navigate and move around with cd

Absolute and relative paths: how to access files located in directories

What simple Unix commands do: ls, mv, rm, mkdir, cat, man

Getting help: how to find out more on what a unix command does

What are “flags”: how to customize typical unix programs ls vs ls -l

Shell redirection: what is the standard input and output, how to “pipe” or redirect the output of one program into the input of the other --- Biostar Handbook
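The first few items in the list above (navigation, absolute versus relative paths, and flags) can be tried safely in a scratch directory. The following is a minimal sketch rather than part of the original lesson; the directory and file names (unix_demo, results_dir, exp1, notes.txt) are hypothetical.

```shell
# Work inside a throwaway directory so nothing important is touched.
# All names below (unix_demo, results_dir, exp1, notes.txt) are made up.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
mkdir -p unix_demo/results_dir/exp1
touch unix_demo/results_dir/exp1/notes.txt

# Relative path: interpreted from the current working directory.
cd unix_demo/results_dir
pwd

# Absolute path: starts at the root (/) and works from anywhere.
cd "$demo_dir/unix_demo/results_dir/exp1"
pwd

# The same program with and without a flag.
ls        # names only
ls -l     # long listing: permissions, owner, size, date

# Go up one level, then report the name of the directory we are in.
cd ..
current_dir="$(basename "$PWD")"
echo "$current_dir"
```

Running pwd after each cd is a cheap habit that prevents most "file not found" surprises.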

Accessing your local terminal or command prompt

Mac OS

Press Cmd + Space and search for "terminal". Once it is open, right-click the app icon in the Dock, then select Options and Keep in Dock.

Windows 10 or greater

You can start an SSH session in your command prompt by executing ssh user@machine and you will be prompted to enter your password. ---Windows documentation

To find the Command Prompt, type cmd in the search box (lower left), then press Enter to open the highlighted Command Prompt shortcut.

If this yields any major issues, try installing PuTTY, Solar-PuTTY, or MobaXterm.

Unix commands to know

The following list is not comprehensive. Only select commands are included.

Navigating the file system

pwd (print working directory)
ls (list)
cd (change directory): cd by itself takes you home, cd .. takes you up one directory, and cd /results_dir/exp1 takes you directly to that directory

File management

touch creates an empty file
nano basic editor for creating small text files
rm remove files or directories. Be careful!
mkdir make a directory and rmdir (remove a directory with NO files)
mv rename or move files and directories
less and more view files; less can also be used to view zipped files on Biowulf. Use q to escape.
cp copy files or directories
cat, head, and tail - print to screen, print first few lines to the screen, print last few lines to the screen
zcat viewing zipped files
chmod, chown modify file / directory permissions and ownership
wc number of lines (-l), words (-w), and bytes (-c, usually one byte per character); for number of characters use -m.
grep search files using regular expressions
cut cuts selected portions of a file (e.g., column selection)
sed and awk - file editing (find and replace, column selection, filtering, etc.)
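To see several of the file management commands above working together, here is a small sketch on a made-up tab-delimited table. The file names (counts.tsv, raw_data) and the table contents are invented for illustration only.

```shell
# Build a tiny tab-delimited table to practice on.
# File names and contents are made up for illustration.
work_dir="$(mktemp -d)"
cd "$work_dir"
printf 'gene\tcount\nTP53\t10\nBRCA1\t25\nMYC\t7\n' > counts.tsv

cp counts.tsv counts_backup.tsv      # copy a file
mkdir raw_data                       # make a directory
mv counts_backup.tsv raw_data/       # move the copy into it

head -n 2 counts.tsv                 # header plus the first data row
wc -l counts.tsv                     # number of lines, header included
cut -f 1 counts.tsv                  # first column only
grep 'BRCA' counts.tsv               # lines matching a pattern

# Print the table with the header "count" renamed to "n_reads".
# Without -i, sed leaves the original file unchanged.
sed 's/count/n_reads/' counts.tsv

rm raw_data/counts_backup.tsv        # remove a file (there is no undo)
rmdir raw_data                       # works only because it is now empty
```

Note that rm has no trash or undo, which is why practicing on disposable files first is worthwhile.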

Obtaining help

help display information about builtin commands
man access online manual pages
-h,--help flags for obtaining help

Useful information

Flags and command options (-) are used to alter program functions
Wildcards (e.g., *)
Tab complete for less typing
Accessing user history with the "up" and "down" arrows on the keyboard
Working with file content (<, >, >>)
Combining commands with pipe (|). Where the heck is pipe anyway?
Performing repetitive actions with Unix (for loop), GNU parallel
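The redirection operators, the pipe, wildcards, and for loops from the list above are easiest to understand by example. This is a minimal sketch with invented sample names, not an excerpt from the lesson; sort and uniq are standard utilities that are not otherwise covered here.

```shell
# Scratch directory; sample names are made up.
scratch="$(mktemp -d)"
cd "$scratch"

# ">" writes (or overwrites) a file; ">>" appends to it.
echo "sample_A" > samples.txt
echo "sample_B" >> samples.txt
echo "sample_A" >> samples.txt

# "<" feeds a file to a command's standard input.
wc -l < samples.txt

# "|" sends the output of one command to the next:
# sort the names, collapse duplicates, then count what is left.
n_unique="$(sort samples.txt | uniq | wc -l | tr -d ' ')"
echo "$n_unique"

# A for loop repeats a command for each item in a list.
for name in sample_A sample_B; do
    touch "${name}.fastq"
done

# The * wildcard matches any run of characters in a file name.
ls *.fastq
```

The same loop pattern scales from two toy files to hundreds of real ones; on Biowulf, swarm or GNU parallel are alternatives when the iterations can run at the same time.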

File download

wget The non-interactive network downloader
curl transfer a URL

Remote connection

ssh secure shell protocol for remote login to Biowulf / Helix

Biowulf

batchlim show cpu and job limits for batch jobs
freen show free and total nodes and cores
jobdata show lots of info for a single jobid
sacct select slurm jobs
sbatch submit slurm job
scancel delete slurm jobs
sinfo view information about Slurm nodes and partitions
sinteractive allocate an interactive session
sjobs show brief summary of queued and running jobs
squeue display status of slurm batch jobs
sstat display various status information of a running job/step
swarm submit a swarm of commands to cluster

Modules on Biowulf

module avail list available applications on Biowulf
module load load an application
module purge purge applications

Resources for learning Unix

Learning Unix: Classes / Courses

Additional useful Unix resources

R Refresher

R can be accessed from the command line using R, which opens the R console, or it can be accessed via an integrated development environment (IDE) (e.g., RStudio, VSCode, etc.). R commands can be submitted together in a script or interactively in a console.

Navigating directories

setwd() Set working directory (equivalent to cd)
getwd() Get working directory (equivalent to pwd)

Getting help

help() and ? "provide access to the documentation pages for R functions, data sets, and other objects".

help.search() "allows for searching the help system for documentation matching a given character string in the (file) name, alias, title, concept or keyword entries (or any combination thereof)"; equivalent to ??pattern

args() returns information on function arguments including names and defaults

See more on getting help here.

Installing and loading packages

To take full advantage of R, you need to install R packages. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc (cran.r-project.org).

An R library is, effectively, a directory of installed R packages which can be loaded and used within an R session. ---renv

install.packages() install packages from CRAN
library() load packages in R session

You will need to install and use the BiocManager to install and use Bioconductor packages:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.17")

.libPaths() reports the directory where your installed R packages are located.

devtools::install_github() to install an R package from Github

Commenting

You can annotate your code by starting annotations with #. Comments to the right of # will be ignored by R.

Use # ---- to create navigable code sections.

Assignment operators

Anything that you want assigned to memory must be assigned to an R object.

<- the primary assignment operator, assigning values on the right to objects on the left.
= can also be used to assign values to objects, but is usually reserved for other purposes (e.g., function arguments)

Use ls() to list objects created in R. rm() can be used to remove an object from memory.

Object naming conventions

There are rules regarding the naming of objects:

Avoid spaces or special characters EXCEPT '_' and '.'
No numbers or underscores at the beginning of an object name.
Avoid common names with special meanings (See ?Reserved) or assigned to existing functions (These will auto complete).

Note

R is case sensitive, so an object with the name "FOO" is not the same as "foo".

Object data types

There are many functions in R to understand the types of objects you are working with. For example:

class() returns the class of an object
typeof() returns type or storage mode of object
mode() returns object storage mode

Importing and exporting data

Use the read functions to import data (e.g., read.csv, read.delim, etc.). Use write functions to export data (e.g., write.table).

Using functions

An R function is like a unix command. Functions perform specific tasks. R has a ton of built-in functions and functions available through additional packages. You can also create your own functions.

The general syntax for a function is the name followed by parentheses, function_name() (e.g., round()).

To create a function:

function_name <- function(arg_1, arg_2, ...) {
   # function body: the code that runs when the function is called
}

Vectors

A vector is a collection of values that are all of the same type (numbers, characters, etc.) --- datacarpentry.org

c() - used to combine elements of a vector

When you combine elements of different types in the same vector, they are forced into the same type via "coercion" (logical < numeric < character).

length() - returns the number of elements in a vector

Use brackets to extract elements of a vector:

a <- 1:10
a[2]

Lists

Unlike vectors, lists can hold values of different types.

list(1, "apple", 3)

Data frames

Data frames hold tabular data comprised of rows and columns; they can be created using data.frame().

To understand more about the structure of an object and data frame, consider the following functions:

str() displays the structure of an object, not just data frames
dplyr::glimpse() similar to str() but applies to data frames and produces cleaner output
summary() produces summary statistics for each column of a data frame; it also summarizes the results of various model fitting functions
ncol() returns number of columns in data frame
nrow() returns number of rows of data frame
dim() returns row and column numbers
unique() returns a vector with duplicates removed; also see dplyr::distinct()

We can subset data frames using bracket notation:

df <- data.frame(Counts = seq(1, 5), animals = c("racoon", "squirrel", "bird", "dog", "cat"))
# to return just the animals column
df[, "animals"]

We can also use functions from dplyr such as filter() for subsetting by row and select() for subsetting by column.

Plotting

There are 3 primary plotting systems with R: base R, ggplot2, and lattice.

Check out the R Graph Gallery for data visualization examples and code.

Getting info on R Session

sessionInfo() Print version information about R, the OS and attached or loaded packages.

Resources for learning R

Base R cheat sheet

Other cheat sheets can be found here.

There is also a nice review here.

BTEP courses

Test your Knowledge

Are your Unix skills satisfactory?

Complete the scavenger hunt from https://sanderslab.github.io/code/.

Are your R skills ready?

Use this assessment to determine whether you need to further brush up on your R skills.

Do you need a Biowulf refresher?

So you think you know Biowulf? Quiz yourself using the hpc.nih.gov biowulf-quiz.