Lesson 1: Introduction to Biowulf, Unix, and R
Learning Objectives
- Learn about why you may want to use R on Biowulf.
- Refresh Unix and R skills.
This lesson will not be hands on.
Why use R for bioinformatics?
R is both a language and an environment for statistical computing and graphics. It is open-source and widely used, not just by bioinformaticians. R is a particularly great resource for statistical analysis, plotting, and report generation, and it has become a powerhouse for biological assay data analysis (e.g., RNA-seq, scRNA-seq, ChIP-seq, population genomics). Package repositories like Bioconductor have influenced the rise of R programming in the -omics fields.
What is Bioconductor?
Bioconductor is an R package repository for free open-source software that "facilitates rigorous and reproducible analysis of data from current and emerging biological assays". Bioconductor is released semi-annually, with two working Bioconductor releases for every release of R. Packages in Bioconductor undergo rigorous testing to ensure the interoperability of included software.
Bioconductor provides not only methods-focused software packages, which offer new methods for analyzing specific data types, but also software focused on core infrastructure. Package developers are encouraged to use existing Bioconductor infrastructure for storing and accessing data, which increases the usability of packages by minimizing the time spent learning new data structures for different workflows. This emphasis on common infrastructure classes makes the use of Bioconductor software scalable to emerging data types and methods. Developers can build on existing infrastructure and methods to rapidly deploy new packages as technology advances in the molecular sciences. Beyond software, Bioconductor offers other types of packages, including annotation packages that provide access to well-known databases such as Entrez Gene, Ensembl, UCSC, the Gene Ontology Consortium, KEGG, etc. In addition, there are experiment data packages that provide datasets for package validation or package tutorials, and workflow packages that combine aspects of multiple Bioconductor packages to complete a particular type of analysis.
The latest version of Bioconductor (v 3.17, compatible with R v.4.3) includes 2,230 software packages, 419 experiment data packages, 912 annotation packages, 27 workflows, and 3 books. The Bioconductor project strives to "further scientific understanding" through extensive documentation and training opportunities. Each package includes one or more quality vignettes outlining the use of included functions.
What is Biowulf, and why use R on Biowulf?
Biological datasets can be massive. Often our local computers (laptops, desktops) do not have the storage space or computational power to analyze these datasets. Biowulf is the NIH high-performance computing cluster. It has more than 90k processors and can easily run large numbers of simultaneous jobs. Biowulf also includes more than 600 preinstalled scientific software packages and databases.
You should use Biowulf when: software is unavailable or difficult to install on your local computer and is available on Biowulf, you are working with large amounts of data that can be parallelized to shorten computational time, or you are performing computational tasks that are memory intensive.
Many of the initial data processing steps for most data types will be performed with Unix-based bioinformatics software, often requiring one to use Biowulf, especially in the case of Windows users. Users may want to further analyze the output of these initial workflows, which can still include "large data", using Bioconductor or other R packages. Instead of transferring data from Biowulf to your local computer, it may be easier to use R directly on Biowulf compute nodes.
Warning
Never run computational tasks on the login node. Computational tasks on Biowulf should be submitted as a job (sbatch, swarm) or run through an interactive session (sinteractive).
Danger
Do not put data with PII (personally identifiable information), patient data for example, on Biowulf.
Getting a Biowulf account
If you do not already have a Biowulf account, you can obtain one by following the instructions here. NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory. Obtaining an account requires PI approval and a nominal fee of $35 per month. Accounts are renewed annually, contingent upon PI approval.
When you apply for a Biowulf account you will be issued two primary storage spaces:
- /home/$USER (16 GB)
- /data/$USER (100 GB)
You may request more space in /data/$USER by filing an online storage request.
NIH HPC Documentation
The NIH HPC systems are well-documented at hpc.nih.gov. The User guides, Training documentation, and How To docs are fantastic resources for getting help with most HPC tasks.
Additional help
- Contact staff@hpc.nih.gov: The HPC team welcomes questions and is happy to offer guidance to address your concerns.
- Monthly Zoom consult sessions: The HPC team offers monthly Zoom consult sessions. "All problems and concerns are welcome, from scripting problems to node allocation, to strategies for a particular project, to anything that is affecting your use of the HPC systems. The Zoom details are emailed to all Biowulf users the week of the consult." (https://hpc.nih.gov/training/)
- Bioinformatics Training and Education Program: BTEP is here to help with all training needs. We are happy to help you get started with Biowulf and begin analyzing your data. If you experience any difficulties or challenges, especially with different bioinformatics applications, please do not hesitate to email us.
Unix Refresher
Biowulf compute nodes use a Unix-like (Linux) operating system (distributions RHEL8/Rocky8). Unix itself is a proprietary operating system, as are Windows and macOS (the latter is Unix-based). There are many Unix and Unix-like operating systems, including open-source Linux and its multiple distributions. Working on Biowulf requires using the command-line interface (shell): we issue Unix commands to tell the computer what we want it to do.
A basic foundation of Unix is advantageous for most scientists, as many bioinformatics open-source tools are available or accessible by command line on Unix-like systems.
How much Unix do I need to know to work on Biowulf?
As with any language, the learning curve for Unix can be quite steep. However, to work on Biowulf you really need to understand the following:
- Directory navigation: what the directory tree is, how to navigate and move around with cd
- Absolute and relative paths: how to access files located in directories
- What simple Unix commands do: ls, mv, rm, mkdir, cat, man
- Getting help: how to find out more on what a unix command does
- What are “flags”: how to customize typical unix programs ls vs ls -l
- Shell redirection: what is the standard input and output, how to “pipe” or redirect the output of one program into the input of the other --- Biostar Handbook
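To make the last two items concrete, here is a minimal sketch of a flag, a redirect, and a pipe. The file names are made up for illustration, and everything happens in a temporary directory so your own files are not touched.

```shell
# Work in a throwaway directory (created by mktemp) with a few empty files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sample_A.txt sample_B.txt notes.md

ls          # names only
ls -l       # the -l flag switches to a long listing (permissions, size, date)

# Redirect: standard output goes into a file instead of onto the screen.
ls > listing.txt

# Pipe: the output of ls becomes the input of wc -l, which counts lines.
# tr strips the padding that some versions of wc add around the number.
# The count includes listing.txt itself, which now sits in this directory.
n_entries=$(ls | wc -l | tr -d ' ')
echo "entries in directory: $n_entries"
```

The same pattern (one program producing text, the next one consuming it) is how most command-line bioinformatics workflows are chained together.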
Accessing your local terminal or command prompt
Mac OS
- Type cmd + spacebar and search for "terminal". Once open, right click on the app logo in the dock. Select Options and Keep in Dock.
Windows 10 or greater
You can start an SSH session in your command prompt by executing ssh user@machine and you will be prompted to enter your password. ---Windows documentation
To find the Command Prompt, type cmd in the search box (lower left), then press Enter to open the highlighted Command Prompt shortcut.
If this yields any major issues, try installing PuTTY, Solar-PuTTY, or MobaXterm.
Unix commands to know
The following list is not comprehensive. Only select commands are included.
Navigating the file system
- pwd: print working directory
- ls: list
- cd: change directory; by itself it takes you home, cd .. takes you up one directory, and cd /results_dir/exp1 goes directly to that directory
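As a rough illustration of these commands, the sketch below builds a small practice tree and moves around in it using both absolute and relative paths. The directory names (results_dir, exp1) are borrowed from the list above and are created under a temporary location, not under /.

```shell
# Build a practice directory tree in a temporary location.
top=$(mktemp -d)
mkdir -p "$top/results_dir/exp1"

cd "$top/results_dir/exp1"   # absolute path: spelled out from the root (/)
pwd                          # print where we are now

cd ..                        # relative path: up one level, into results_dir
here=$(basename "$PWD")
echo "now in: $here"

cd exp1                      # relative path: down into exp1 from results_dir
ls -a                        # -a also lists hidden entries, including . and ..

# cd with no argument would take you back to your home directory.
```

An absolute path works from anywhere; a relative path depends on where you currently are, which is why pwd is worth running often.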
File management
- touch: creates an empty file
- nano: basic editor for creating small text files
- rm: remove files or directories. Be careful!
- mkdir: make a directory, and rmdir: remove a directory with NO files
- mv: rename or move files and directories
- less and more: view files; less can also be used to view zipped files on Biowulf. Use q to escape.
- cp: copy files or directories
- cat, head, and tail: print to screen, print first few lines to the screen, print last few lines to the screen
- zcat: viewing zipped files
- chmod, chown: modify file / directory permissions
- wc: number of lines (-l), words (-w), and bytes (-c, usually one byte per character); for number of characters use -m.
- grep: search files using regular expressions
- cut: cuts selected portions of a file (e.g., column selection)
- sed and awk: file editing (find and replace, column selection, filtering, etc.)
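The sketch below strings several of these commands together on a tiny tab-separated table. The gene names and counts are invented for the example, and the work is done in a temporary directory.

```shell
work=$(mktemp -d)
cd "$work"

# Create a small tab-separated table (made-up values).
printf 'gene\tcount\nTP53\t12\nBRCA1\t7\nMYC\t30\n' > counts.tsv

cp counts.tsv backup.tsv        # copy a file
mv backup.tsv counts_copy.tsv   # rename it
mkdir archive
mv counts_copy.tsv archive/     # move it into a directory

head -n 2 counts.tsv                          # first two lines
n_lines=$(wc -l < counts.tsv | tr -d ' ')     # line count, header included
genes=$(cut -f 1 counts.tsv | tail -n +2)     # first column, header skipped
hits=$(grep -c '^BRCA' counts.tsv)            # lines that start with BRCA

echo "lines: $n_lines; lines starting with BRCA: $hits"
echo "$genes"

rm archive/counts_copy.tsv      # rm is permanent: there is no trash can
rmdir archive                   # succeeds only because archive is now empty
```

Note that rmdir refuses to delete a directory that still contains files, which makes it a safer habit than rm -r while you are learning.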
Obtaining help
- help: display information about builtin commands
- man: access online manual pages
- -h, --help: flags for obtaining help
Useful information
- Flags and command options (-) are used to alter program functions
- Wildcards (e.g., *)
- Tab complete for less typing
- Accessing user history with the "up" and "down" arrows on the keyboard
- Working with file content (<, >, >>)
- Combining commands with pipe (|). Where the heck is pipe anyway?
- Performing repetitive actions with Unix (for loop), GNU parallel
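Here is one way wildcards, redirection, and a for loop can work together. The FASTQ files are tiny, hand-written stand-ins (hypothetical sample names s1 to s3), and the sketch assumes each FASTQ record spans exactly four lines.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three toy FASTQ files plus one unrelated file.
printf '@r1\nACGT\n+\nIIII\n' > s1.fastq
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > s2.fastq
printf '@r1\nGGCC\n+\nIIII\n' > s3.fastq
touch readme.txt

# The * wildcard expands to every matching file name; the loop visits each one.
for fq in *.fastq; do
    lines=$(wc -l < "$fq" | tr -d ' ')
    # Four lines per record, so reads = lines / 4.
    # >> appends to the file, where > would overwrite it on every pass.
    echo "$fq $((lines / 4))" >> read_counts.txt
done

# < feeds a file to standard input; | hands the result to the next command.
sort -k2,2nr < read_counts.txt | head -n 1 > top_sample.txt
cat top_sample.txt
```

For many large files, the same loop body could be handed to GNU parallel or written out as a swarm file, but a plain for loop is the easiest place to start.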
File download
- wget: the non-interactive network downloader
- curl: transfer a URL
Remote connection
- ssh: secure shell protocol for remote login to Biowulf / Helix
Biowulf
- batchlim: show cpu and job limits for batch jobs
- freen: show free and total nodes and cores
- jobdata: show lots of info for a single jobid
- sacct: select slurm jobs
- sbatch: submit slurm job
- scancel: delete slurm jobs
- sinfo: view information about Slurm nodes and partitions
- sinteractive: allocate an interactive session
- sjobs: show brief summary of queued and running jobs
- squeue: display status of slurm batch jobs
- sstat: display various status information of a running job/step
- swarm: submit a swarm of commands to cluster
Modules on Biowulf
- module avail: list available applications on Biowulf
- module load: load an application
- module purge: purge applications
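Putting the Slurm and module commands above together, a batch job for an R analysis might look roughly like the sketch below. The job name, resource requests, and the R script name (my_analysis.R) are placeholders chosen for illustration, not recommended values. The snippet only writes the batch script to a temporary directory; the submission commands are left as comments because they work only on Biowulf.

```shell
# Write a small Slurm batch script into a temporary directory.
job_dir=$(mktemp -d)
cat > "$job_dir/run_r_job.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=r_demo
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=01:00:00

module purge            # start from a clean environment
module load R           # load the R module
Rscript my_analysis.R   # run the (hypothetical) analysis script
EOF

# Check that the script was written as intended.
n_directives=$(grep -c '^#SBATCH' "$job_dir/run_r_job.sh")
echo "wrote $job_dir/run_r_job.sh with $n_directives #SBATCH lines"

# On Biowulf (not on your own computer) you would then run:
#   sbatch run_r_job.sh     # submit the job; Slurm prints the job ID
#   squeue -u "$USER"       # check your queued and running jobs
#   scancel <jobid>         # cancel the job if something is wrong
```

Lines starting with #SBATCH are read by Slurm as resource requests, while the shell treats them as ordinary comments. For exploratory work, sinteractive gives you a shell on a compute node where the same module load R step applies.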
Resources for learning Unix
Learning Unix: Classes / Courses
- Introduction to Biowulf (May – Jun, 2023)
- Introduction to Unix on Biowulf (Jan – Feb, 2023)
- Bioinformatics for Beginners: Module 1 Unix/Biowulf
Additional useful Unix resources
R Refresher
R can be accessed from the command line using R, which opens the R console, or it can be accessed via an integrated development environment (IDE) (e.g., RStudio, VSCode, etc.). R commands can be submitted together in a script or interactively in a console.
Navigating directories
- setwd(): set working directory (equivalent to cd)
- getwd(): get working directory (equivalent to pwd)
Getting help
- help() and ?: "provide access to the documentation pages for R functions, data sets, and other objects".
- help.search(): "allows for searching the help system for documentation matching a given character string in the (file) name, alias, title, concept or keyword entries (or any combination thereof)"; equivalent to ??pattern
- args(): returns information on function arguments including names and defaults
See more on getting help here.
Installing and loading packages
To take full advantage of R, you need to install R packages. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc (cran.r-project.org).
An R library is, effectively, a directory of installed R packages which can be loaded and used within an R session. ---renv
- install.packages(): install packages from CRAN
- library(): load packages in R session
You will need to install and use BiocManager to install and use Bioconductor packages:
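# Install BiocManager from CRAN if it is not already available
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install (or switch to) Bioconductor release 3.17
BiocManager::install(version = "3.17")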
- .libPaths(): reports the directory where your installed R packages are located.
- devtools::install_github(): install an R package from GitHub
Commenting
You can annotate your code by starting annotations with #. Comments to the right of # will be ignored by R.
Use # ---- to create navigable code sections.
Assignment operators
Anything that you want assigned to memory must be assigned to an R object.
- <-: the primary assignment operator, assigning values on the right to objects on the left.
- =: can also be used to assign values to objects, but is usually reserved for other purposes (e.g., function arguments)
Use ls() to list objects created in R. rm() can be used to remove an object from memory.
Object naming conventions
There are rules regarding the naming of objects:
- Avoid spaces or special characters EXCEPT '_' and '.'
- No numbers or underscores at the beginning of an object name.
- Avoid common names with special meanings (See ?Reserved) or assigned to existing functions (These will auto complete).
Note
R is case sensitive, so an object with the name "FOO" is not the same as "foo".
Object data types
There are many functions in R to understand the types of objects you are working with. For example:
- class(): returns the class of an object
- typeof(): returns type or storage mode of object
- mode(): returns object storage mode
Importing and exporting data
Use the read functions to import data (e.g., read.csv, read.delim, etc.). Use write functions to export data (e.g., write.table).
Using functions
An R function is like a unix command. Functions perform specific tasks. R has a ton of built-in functions and functions available through additional packages. You can also create your own functions.
The general syntax for a function is the name followed by parentheses, function_name() (e.g., round()).
To create a function:
function_name <- function(arg_1, arg_2, ...) {
  # function body
}
Vectors
A vector is a collection of values that are all of the same type (numbers, characters, etc.) --- datacarpentry.org
- c(): used to combine elements of a vector
When you combine elements of different types in the same vector, they are forced into the same type via "coercion" (logical < numeric < character).
- length(): returns the number of elements in a vector
Use brackets to extract elements of a vector:
a <- 1:10
a[2]
Lists
Unlike vectors, lists can hold values of different types.
list(1, "apple", 3)
Data frames
Data frames hold tabular data composed of rows and columns; they can be created using data.frame().
To understand more about the structure of an object and data frame, consider the following functions:
- str(): displays the structure of an object, not just data frames
- dplyr::glimpse(): similar to str() but applies to data frames and produces cleaner output
- summary(): produces a summary of an object; for a data frame this is summary statistics for each column, and for model fitting functions it is a summary of the results
- ncol(): returns the number of columns in a data frame
- nrow(): returns the number of rows in a data frame
- dim(): returns the numbers of rows and columns
- unique(): returns a vector with duplicates removed; also see dplyr::distinct()
We can subset data frames using bracket notation:
df <- data.frame(Counts = seq(1, 5), animals = c("racoon", "squirrel", "bird", "dog", "cat"))
# to return just the animals column
df[, "animals"]
We can also use functions from dplyr, such as filter() for subsetting by row and select() for subsetting by column.
Plotting
There are 3 primary plotting systems with R: base R, ggplot2, and lattice.
Check out the R Graph Gallery for data visualization examples and code.
Getting info on R Session
- sessionInfo(): print version information about R, the OS and attached or loaded packages.
Resources for learning R
Other cheat sheets can be found here.
There is also a nice review here.
BTEP courses
Test your Knowledge
Are your Unix skills satisfactory?
Complete the scavenger hunt from https://sanderslab.github.io/code/.
Are your R skills ready?
Use this assessment to determine whether you need to further brush up on your R skills.
Do you need a Biowulf refresher?
So you think you know Biowulf? Quiz yourself using the hpc.nih.gov biowulf-quiz.