Lesson 1: Introduction to Unix and the Shell
Lesson Objectives
- Review the course syllabus and general structure of lessons to come.
- Introduce Unix and describe how it differs from other operating systems.
- Introduce and get set up on DNAnexus and the GOLD system.
- Discuss ways to use the command line outside of the DNAnexus teaching environment.
- Introduce conda.
Why Bioinformatics?
A word about Statistics
What is Unix?
- An operating system, just like Windows or macOS
- Something that is worth learning
- Sometimes used interchangeably with Linux, which for our purposes is just a version of Unix
Why learn Unix?
- Many tools (like a bazillion) for biological data analysis are freely available and supported on Unix systems
- Useful for working with big data, like genomic sequence files
- To use the NIH High Performance Cluster (HPC) Biowulf for data analysis
A few things about the Unix shell...
- It gives a command line interface where users can type commands
- Also a scripting language, used to automate repetitive tasks
- The Bash shell (the Bourne Again SHell) is the most popular Unix shell.
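To make the scripting point concrete, here is a minimal sketch of a Bash loop that repeats one task over several files. The temporary directory and the sample_1.txt to sample_3.txt file names are hypothetical; the script creates them itself so it can be run anywhere.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing on your system is touched.
workdir=$(mktemp -d)

# Create three small example files (hypothetical sample names).
for i in 1 2 3; do
    printf 'ACGT\nTTGA\n' > "$workdir/sample_$i.txt"
done

# The repetitive task: count the lines in every file, one file at a time.
total=0
for f in "$workdir"/sample_*.txt; do
    n=$(wc -l < "$f")
    echo "$(basename "$f"): $n lines"
    total=$((total + n))
done
echo "total lines: $total"
```

Typing the same wc command by hand for three files is manageable; for three hundred files, the loop does the same work with no extra typing.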
How is Unix different from other operating systems?
- Is often used without a Graphical User Interface (GUI), better known as a "point and click" environment.
- The user has to learn a series of commands for interacting with a Unix system
- BUT...a few commands, like the ones we will learn over the next several lessons, will allow us to perform a number of bioinformatics tasks
How much Unix do I need to know to get started?
As with any language, the learning curve for Unix can be quite steep. However, to get started analyzing data you really need to understand the following:
- Directory navigation: what the directory tree is, how to navigate and move around with cd
- Absolute and relative paths: how to access files located in directories
- What simple Unix commands do: ls, mv, rm, mkdir, cat, man
- Getting help: how to find out more on what a unix command does
- What are “flags”: how to customize typical unix programs ls vs ls -l
- Shell redirection: what is the standard input and output, how to “pipe” or redirect the output of one program into the input of the other
--- Biostar Handbook
These will be the primary focus of lesson 2.
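As a preview, the short session below touches most of the items in the list above. It is only a sketch: the data directory and the genes.txt file are hypothetical names, and everything happens inside a temporary directory that the script creates.

```shell
#!/usr/bin/env bash
# Start in a throwaway directory (hypothetical layout, created here).
cd "$(mktemp -d)"

mkdir data                      # make a new directory
cd data                         # relative path: move into it
pwd                             # print the absolute path of where we are
printf 'geneA\ngeneB\ngeneC\n' > genes.txt   # redirect output into a file

ls                              # list files
ls -l                           # same command, customized with the -l flag
cat genes.txt                   # print the file contents
# man ls                        # getting help: opens the manual page (press q to quit)

# Pipe: send the output of cat into wc, which counts lines.
n_genes=$(cat genes.txt | wc -l)
echo "number of genes: $n_genes"

mv genes.txt gene_list.txt      # rename (move) the file
cd ..                           # relative path: go up one directory
renamed=$(ls data)              # list the contents of data from the parent directory
rm data/gene_list.txt           # remove the file
```

The man command is left commented out because it opens an interactive viewer; try it on its own at the prompt.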
Getting started with DNAnexus
What is DNAnexus?
DNAnexus provides a secure cloud-based platform for the analysis and sharing of next-generation sequencing data. This class will use a pre-built teaching environment, the GOLD System, which comes with all of the required software installed and ready to go.
Obtaining a DNAnexus account
If you have not already created a free DNAnexus account, please do so here. Once you have obtained your free account, you will need to email us your username at ncibtep@nih.gov to obtain access to the course page and GOLD System.
Finding the course and getting started with the GOLD system
ADD INSTRUCTIONS ONCE NEW COURSE IS SET UP.
Getting started outside of DNAnexus
Ultimately we want you to be able to get started analyzing data on your own without having to use the GOLD teaching environment. Most bioinformatics software will work with Unix-based systems (macOS or Linux). Therefore, if you are working on a Windows operating system, you will need a workaround.
Working with the command line from a MacBook
- Press cmd + spacebar and search for "terminal". Once it is open, right click on the app logo in the dock, then select Options and Keep in Dock.
- The default shell starting with macOS version 10.15 (Catalina) is the zsh shell. While this is not really a problem, you can configure your computer to use the bash shell using the following:
chsh -s /bin/bash
- You may also want the conda environment bioinfo, which contains the software used in the book (the Biostar Handbook), installed on your local machine. Regardless, you will likely need to install the Xcode compiler. You can search for and install this directly from the "App Store". Then install the additional Xcode command line tools from the terminal using
xcode-select --install
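Before changing your shell, it can help to check which shell you are using now and whether bash is available. The sketch below only reads information and changes nothing; the variable names login_shell and bash_path are ours, chosen for illustration.

```shell
#!/usr/bin/env bash
# $SHELL holds the path of your login shell (it may be unset in some minimal environments).
login_shell="${SHELL:-unknown}"
echo "login shell: $login_shell"

# Confirm that bash exists before switching to it.
if command -v bash >/dev/null 2>&1; then
    bash_path=$(command -v bash)
    echo "bash found at: $bash_path"
    bash --version | head -n 1
else
    bash_path=""
    echo "bash not found on this system"
fi
```

If you do switch shells with chsh, open a new terminal window afterwards; the change applies to new sessions, not the one you are already in.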
Working with the command line from a Windows computer
If you are using a Windows operating system, Windows 10 or greater, you can use the Windows Subsystem for Linux (WSL) for your computational needs.
The Windows Subsystem for Linux (WSL) is a feature of the Windows operating system that enables you to run a Linux file system, along with Linux command-line tools and GUI apps, directly on Windows, alongside your traditional Windows desktop and apps. --- docs.microsoft.com
To install WSL, you will need to submit a help ticket to service.cancer.gov. There are multiple Linux distributions. We recommend new users install "Ubuntu".
If you do not plan to use your local machine for bioinformatics analyses, you can connect to the NIH HPC Biowulf using an SSH client. The secure shell (ssh) protocol is commonly used to connect to remote servers. More on Biowulf later.
Windows now has its own connectivity tool using the SSH protocol called OpenSSH. If this yields any major issues, try installing PuTTY, Solar-PuTTY, or MobaXterm.
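For orientation, an SSH connection is usually started with a command of the form ssh user@host. The sketch below checks whether an ssh client is installed and prints the command you would type, without opening a connection. Both values are assumptions that do not come from this lesson: "username" is a placeholder for your own NIH username, and the hostname biowulf.nih.gov should be confirmed against the NIH HPC documentation.

```shell
#!/usr/bin/env bash
# Placeholder values: replace "username" with your own NIH username.
user="username"
host="biowulf.nih.gov"   # assumed hostname; check the NIH HPC documentation

# Check whether an ssh client is installed on this machine.
if command -v ssh >/dev/null 2>&1; then
    client_status="available"
else
    client_status="missing"
fi
echo "ssh client: $client_status"

# Build the command without running it, so this sketch never opens a connection.
ssh_command="ssh ${user}@${host}"
echo "to connect, run: $ssh_command"
```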
Note about PowerShell: Just as the bash shell works effectively with a Linux operating system, Windows has its own shell, PowerShell, for interacting with the Windows operating system. However, because most bioinformatics software is Unix-based, PowerShell is generally not useful for bioinformatics scripting.
What is conda?
The Biostar Handbook works with programs installed within a conda environment named bioinfo. Conda is commonly used for bioinformatics package installations.
Conda is often used for scientific software installation because...
Installing software is hard. Installing scientific software is often even more challenging. In order to minimize the burden of installing and updating software (data) scientists often install software packages that they need for their various projects system-wide.
Installing software system-wide has a number of drawbacks:
- It can be difficult to figure out what software is required for any particular research project.
- It is often impossible to install different versions of the same software package at the same time.
- Updating software required for one project can often “break” the software installed for another project.
--- Pugh and Tocknell, Introduction to Conda for (Data) Scientists
Conda solves these problems by making software installation far easier. As a package and environment management system, conda also enhances both the portability and reproducibility of scientific workflows by isolating software and its dependencies in "environments". These environments do not interact with system-wide programs and therefore do not wreak havoc on your local machine through software incompatibilities.
Conda runs on Windows, macOS, Linux and z/OS. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language. --- docs.conda.io
Activating / deactivating a conda environment
To activate a conda environment, use:
conda activate bioinfo
To deactivate your environment, use:
conda deactivate
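Activation only works if conda is installed and the environment already exists. The sketch below is one way to check both before activating; it assumes the environment is named bioinfo, as above, and the variable names env_name and env_status are ours.

```shell
#!/usr/bin/env bash
env_name="bioinfo"   # environment name used in this lesson

if command -v conda >/dev/null 2>&1; then
    # List all environments; the first column of each row is the environment name.
    envs=$(conda env list)
    echo "$envs"
    if echo "$envs" | awk '{print $1}' | grep -qx "$env_name"; then
        env_status="found"
    else
        env_status="missing"
    fi
else
    env_status="conda-not-installed"
fi
echo "environment '$env_name': $env_status"
```

If the status is "found", conda activate bioinfo should work from an interactive terminal. If it is "missing", the environment needs to be created first.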