Lesson 1: Introduction to Unix and the Shell

Lesson Objectives

  1. Review the course syllabus and general structure of lessons to come.
  2. Introduce Unix and describe how it differs from other operating systems.
  3. Introduce and get set up on DNAnexus and the GOLD system.
  4. Discuss ways to use the command line outside of the DNAnexus teaching environment.
  5. Introduce conda.

Why Bioinformatics?

A word about Statistics

What is Unix?

  1. An operating system, just like Windows or MacOS

  2. Something that is worth learning

  3. Often used interchangeably with Linux, which for our purposes is just a version of Unix

Why learn Unix?

  1. Many tools (like a bazillion) for biological data analysis are freely available and supported on Unix systems

  2. Useful for working with big data, like genomic sequence files

  3. To use the NIH High Performance Cluster (HPC) Biowulf for data analysis

A few things about the Unix shell...

  1. It gives a command line interface where users can type commands

  2. Also a scripting language, used to automate repetitive tasks

  3. The Bash shell (the Bourne Again SHell) is the most popular Unix shell.
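To illustrate the scripting point above, here is a minimal sketch of a Bash loop that repeats one task over several files. The scratch directory and the file names (sample1.txt and so on) are made up for this example and are not part of the course materials.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and file names, used only for illustration.
workdir="$(mktemp -d)"

# Create three small example files, each with two lines.
for name in sample1 sample2 sample3; do
    printf 'line one\nline two\n' > "$workdir/$name.txt"
done

# Repeat the same task (counting lines) for every file.
total=0
for file in "$workdir"/*.txt; do
    n=$(wc -l < "$file")
    echo "$(basename "$file"): $n lines"
    total=$((total + n))
done
echo "total lines: $total"
```

The same loop works unchanged for three files or three thousand, which is the main appeal of scripting repetitive work.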

How is Unix different from other operating systems?

  1. Does not rely on a Graphical User Interface (GUI), better known as a "point and click" environment; most work is done by typing commands.

  2. The user has to learn a series of commands for interacting with a Unix system

  3. BUT...a few commands, like the ones we will learn over the next several lessons, will allow us to perform a number of bioinformatics tasks

How much Unix do I need to know to get started?

As with any language, the learning curve for Unix can be quite steep. However, to get started analyzing data you really only need to understand the following:

  1. Directory navigation: what the directory tree is, how to navigate and move around with cd
  2. Absolute and relative paths: how to access files located in directories
  3. What simple Unix commands do: ls, mv, rm, mkdir, cat, man
  4. Getting help: how to find out more on what a unix command does
  5. What are “flags”: how to customize typical unix programs ls vs ls -l
  6. Shell redirection: what is the standard input and output, how to “pipe” or redirect the output of one program into the input of the other --- Biostar Handbook

These will be the primary focus of lesson 2.
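As a preview, the sketch below touches each item in the list above. It is only an illustration: the directory and file names (project, reads.txt, and so on) are hypothetical, and everything happens inside a temporary scratch directory so no real files are affected.

```shell
#!/usr/bin/env bash
# All directory and file names below are hypothetical examples.
demo="$(mktemp -d)"          # scratch directory so nothing real is touched
cd "$demo"

# Directory navigation and simple commands
mkdir project                # make a new directory
cd project                   # relative path: move into it
pwd                          # print the absolute path of the current directory
printf 'ATGC\nGGCC\nATGC\n' > reads.txt   # redirect output into a new file
cat reads.txt                # print the file contents
mv reads.txt sequences.txt   # rename the file
cd ..                        # move up one level

# Flags change how a command behaves
ls project                   # names only
ls -l project                # long listing: permissions, size, date

# Getting help (opens the manual page; press q to quit)
# man ls

# Absolute path to the same file
cat "$demo/project/sequences.txt"

# Pipes and redirection: sort the lines, drop duplicates, save the result
sort project/sequences.txt | uniq > project/unique.txt
unique_count=$(wc -l < project/unique.txt)
echo "unique sequences: $unique_count"

# Clean up the example files
rm project/sequences.txt project/unique.txt
```

Note that rm deletes files permanently; there is no recycle bin on the command line.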

Getting started with DNAnexus

What is DNAnexus?

DNAnexus provides a secure cloud-based platform for the analysis and sharing of next generation sequencing data. This class will use a pre-built teaching environment, the GOLD System, which comes with all of the required software installed and ready to go.

Obtaining a DNAnexus account

If you have not already created a free DNAnexus account, please do so here. Once you have obtained your free account, you will need to email us your username at ncibtep@nih.gov to obtain access to the course page and GOLD System.

Finding the course and getting started with the GOLD system

ADD INSTRUCTIONS ONCE NEW COURSE IS SET UP.

Getting started outside of DNAnexus

Ultimately, we want you to be able to analyze data on your own without having to use the GOLD teaching environment. Most bioinformatics software works on Unix-based systems (MacOS or Linux). Therefore, if you are working on a Windows operating system, you will need a workaround.

Working with the command line from a MacBook

  • Type cmd + spacebar and search for "terminal". Once open, right click on the app logo in the dock. Select Options and Keep in Dock.

  • The default shell starting with macOS version 10.15 (Catalina) is the zsh shell. While this is not really a problem, you can configure your computer to use the bash shell using the following:

chsh -s /bin/bash

  • Follow the Biostar Handbook instructions if you want the bioinfo environment, which contains the software used in the book, installed on your local machine. Regardless, you will likely need to install Xcode. You can search for and install this directly from the "App Store". Then install the additional Xcode command line tools from the terminal using

xcode-select --install
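Before or after changing shells with chsh, it can help to check which shell your account uses. The following is a minimal sketch using standard commands; the variable name bash_path is introduced here for illustration only.

```shell
# Print the login shell recorded for your account (may be unset in some environments).
echo "login shell: ${SHELL:-unknown}"

# List the shells installed on this machine, if the file exists.
if [ -r /etc/shells ]; then
    cat /etc/shells
fi

# Report where bash lives and which version it is, if bash is installed.
if command -v bash >/dev/null 2>&1; then
    bash_path="$(command -v bash)"
    bash --version | head -n 1
else
    bash_path=""
    echo "bash not found"
fi
```

A change made with chsh applies to new terminal sessions, so open a new terminal window before checking the result.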

Working with command line from a Windows computer

If you are using a Windows operating system, Windows 10 or greater, you can use the Windows Subsystem for Linux (WSL) for your computational needs.

The Windows Subsystem for Linux (WSL) is a feature of the Windows operating system that enables you to run a Linux file system, along with Linux command-line tools and GUI apps, directly on Windows, alongside your traditional Windows desktop and apps. --- docs.microsoft.com

To install WSL, you will need to submit a help ticket to service.cancer.gov. There are multiple Linux distributions. We recommend new users install "Ubuntu".

If you do not plan to use your local machine for bioinformatics analyses, you can connect to the NIH HPC Biowulf using an SSH client. The secure shell (ssh) protocol is commonly used to connect to remote servers. More on Biowulf later.
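As a rough sketch of what connecting over SSH looks like, the block below checks for an SSH client and prints the connection command without running it. The username is a placeholder, and the hostname biowulf.nih.gov is an assumption made for this example; confirm the correct address in the NIH HPC documentation.

```shell
# Placeholder values: replace with your own NIH HPC username.
# The hostname is an assumption here; confirm it in the NIH HPC documentation.
user="your_username"
host="biowulf.nih.gov"

# Check that an SSH client is installed before trying to connect.
if command -v ssh >/dev/null 2>&1; then
    ssh_status="available"
else
    ssh_status="missing"
fi
echo "ssh client: $ssh_status"

# The connection command itself (printed, not run, in this sketch).
ssh_command="ssh ${user}@${host}"
echo "To connect, run: $ssh_command"
```

Running the printed command prompts for your password and, on success, opens a shell on the remote machine; typing exit closes the connection.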

Windows now has its own connectivity tool using the SSH protocol called OpenSSH. If this yields any major issues, try installing PuTTY, Solar-PuTTY, or MobaXterm.

Note about PowerShell: Just as the bash shell works effectively with a Linux operating system, Windows has its own shell for interacting with the Windows operating system. This shell is PowerShell. However, because most bioinformatics software is Unix-based, this shell will not be useful for bioinformatics scripting.

What is conda?

The Biostar Handbook works with programs installed within a conda environment named bioinfo. Conda is commonly used for bioinformatics package installations.

Conda is often used for scientific software installation because...

Installing software is hard. Installing scientific software is often even more challenging. In order to minimize the burden of installing and updating software (data) scientists often install software packages that they need for their various projects system-wide.

Installing software system-wide has a number of drawbacks:

It can be difficult to figure out what software is required for any particular research project. It is often impossible to install different versions of the same software package at the same time. Updating software required for one project can often “break” the software installed for another project. --- Pugh and Tocknell, Introduction to Conda for (Data) Scientists

Conda solves these problems by making software installation far easier. As a package and environment management system, conda also enhances both the portability and reproducibility of scientific workflows by isolating software and its dependencies in "environments". These environments do not interact with system-wide programs and therefore do not wreak havoc on your local machine due to software incompatibilities.

Conda runs on Windows, macOS, Linux and z/OS. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language. --- docs.conda.io

Activating / deactivating a conda environment

To activate a conda environment use

conda activate bioinfo  

To deactivate your environment

conda deactivate
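To see which environments exist before activating one, you can list them. The sketch below is guarded so that it also runs on a machine without conda; the variable name conda_found is introduced here for illustration only.

```shell
# Guarded so the sketch also runs on machines without conda.
if command -v conda >/dev/null 2>&1; then
    conda_found="yes"
    conda --version        # confirm which conda is installed
    conda env list         # list environments; the active one is marked with *
else
    conda_found="no"
    echo "conda is not on the PATH; install it before activating an environment"
fi
```

If bioinfo appears in the list, conda activate bioinfo should work as shown above.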