ncibtep@nih.gov

Bioinformatics Training and Education Program

Why should Biologists learn Unix for Bioinformatics?

Unix, what is it, and why should biologists take the time to learn it?

The Unix operating system forms the basis of many bioinformatics analysis resources, such as the NIH High Performance Computing (HPC) cluster, Biowulf/Helix. Here we examine why it may be worth your time and effort to learn to interact with Unix-based operating systems using the command line.

As scientists, many of us are used to working with Windows or Mac operating systems. These point-and-click systems are intuitive and familiar from throughout our careers. Molecular biology packages for working with sequence data exist for both systems, allowing us to generate PCR primers, engineer plasmids, and align sequences. It follows that we also look to these systems for data management and find that tools like Excel can help us organize and analyze data. At some point, however, when your spreadsheets run to hundreds of thousands or millions of lines and your sequence data files are too large to open, you need another way.

Here are some advantages of learning how to interact with Unix-based operating systems for the analysis of next-generation sequence (NGS) data:

  • HPC systems can have hundreds of thousands of processor cores available for analysis and enough storage space for very large (“big data”) files.
  • Many bioinformatics analysis programs are written to run only on Unix-based operating systems (macOS and various flavors of Linux).
  • Pipelines can be built that link together the output and input of a series of analysis tools. By streamlining analyses, the process is less error-prone with fewer opportunities for data to get corrupted or lost.
  • Eventually, Microsoft Excel spreadsheets will not be able to capture the size and complexity of your NGS data; a single worksheet is limited to 1,048,576 rows.
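
To make the pipeline idea from the list above concrete, here is a minimal sketch of chaining small Unix tools so that the output of one becomes the input of the next. The file name demo.fasta, the sequences, and the species labels are made up for illustration; real NGS pipelines would use dedicated analysis tools in place of these basic commands.

```shell
# Work in a throwaway directory so no existing files are touched.
cd "$(mktemp -d)"

# Create a tiny FASTA file (hypothetical sequences, illustration only).
cat > demo.fasta <<'EOF'
>seq1 human
ACGTACGTAC
>seq2 mouse
GGGTTTAAAC
>seq3 human
TTTTACGCGA
EOF

# Count sequences: every FASTA record begins with a ">" header line.
n_seqs=$(grep -c '^>' demo.fasta)
echo "sequences: $n_seqs"

# Pipeline: extract header lines, keep the second field (the species label),
# sort the labels, count each distinct label, and tidy the output so each
# line reads "label count".
species_counts=$(grep '^>' demo.fasta | cut -d' ' -f2 | sort | uniq -c | awk '{print $2, $1}')
echo "$species_counts"
```

Each `|` passes data directly from one program to the next, so no intermediate file has to be opened, copied, or pasted by hand. The same pattern works unchanged on a file with millions of records.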

Our local system, Biowulf/Helix, is a Linux cluster containing over 100,000 compute cores. (Linux, a variety of Unix, is a common, free, open-source OS.) This dedicated resource is available to everyone at NIH ($35/month). Once you have your Biowulf/Helix account, you will be able to upload your sequence data, analyze it, and create visualizations and figures for publication. There is no need to worry about keeping up with the latest versions of bioinformatics analysis software, as the HPC staff does a great job of updating and maintaining it. Can’t find your favorite tool on Biowulf? Work with HPC staff to get it installed or discover other options already available on the system. Programming tools like Python and R/RStudio are easily installed and frequently included in Unix systems. Biowulf/Helix also has a built-in system, Globus, for moving large data files, so you can retrieve your sequences from the sequencing facility or share them with collaborators.

What kind of commitment does it really take to learn the Unix command line? Keep in mind that you do not need to learn hundreds or thousands of Unix commands or complicated scripts to get started. With about 12 commands, most scientists can begin working in a Unix system: creating and maintaining directories and data files, running bioinformatics analysis programs, and working with data output for further analysis or visualization. You may be working on a Unix machine and not even know it! Unix is the underlying OS for Mac computers, so the commands you learn for Unix will work on your Mac and provide you with another way to work with your data.
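
As a rough illustration of how far a dozen commands can go, the sketch below walks through one plausible starter set. The article does not name its 12 commands, so this selection (pwd, mkdir, cd, ls, cat, head, tail, wc, cp, mv, grep, rm) is an assumption, and the directory names and the counts.tsv table are invented for practice.

```shell
# Start in a scratch directory so the example cannot touch real data.
cd "$(mktemp -d)"
pwd                                        # pwd: print the current directory

mkdir -p project/raw project/results       # mkdir: create directories
cd project                                 # cd: move into a directory

# Set up a small made-up table (sample name, read count) to practice on.
printf 'sample\treads\nA\t1200\nB\t950\nC\t1830\n' > raw/counts.tsv

ls -l raw                                  # ls: list the files in a directory
cat raw/counts.tsv                         # cat: print a whole file
head -n 2 raw/counts.tsv                   # head: show the first lines
tail -n 1 raw/counts.tsv                   # tail: show the last lines

# wc: count lines (tr strips the padding some systems add to the number).
n_lines=$(wc -l < raw/counts.tsv | tr -d ' ')
echo "lines: $n_lines"

cp raw/counts.tsv results/counts_copy.tsv          # cp: copy a file
mv results/counts_copy.tsv results/counts_v1.tsv   # mv: rename or move a file

# grep: keep only the lines that match a pattern (here, sample B).
sample_b=$(grep '^B' results/counts_v1.tsv)
echo "$sample_b"

rm results/counts_v1.tsv                   # rm: delete a file (there is no undo)
```

The same commands behave the same way in the Terminal application on a Mac and on a Linux cluster, which is why a small set transfers so well between machines.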

BTEP offers several options to help you learn Unix. Check out Module 1 of our recent Bioinformatics for Beginners course series or our recent Unix on Biowulf course series. Also, regularly check the BTEP calendar for upcoming classes offered by BTEP and the HPC Biowulf teams. Or – if you’d like to learn at your own pace, request a Dataquest license.

— Amy Stonelake (BTEP)