Why Learn Bioinformatics?

Analyze your own data
Expand scientific training and skills
Provide a path to a new career
Have a better understanding of how other people analyze data

What is Unix?

an operating system, just like Windows or MacOS
something that is worthwhile learning
sometimes called Linux, which is a version of Unix

Why learn Unix?

many tools (like a bazillion) for biological data analysis are freely available and supported on Unix systems
useful for working with big data, like genomic sequence files
to use Biowulf, the NIH High Performance Computing (HPC) cluster, for data analysis

A few things about the Unix shell...

it gives a command line interface where users can type commands
also a scripting language, used to automate repetitive tasks
the Bash shell is the most popular Unix shell

How is Unix different from other operating systems?

is usually used through a command line rather than a Graphical User Interface (GUI), better known as a "point and click" environment
the user has to learn a series of commands for interacting with a Unix system

Understanding the Unix Directory Structure

(image)

Unix in 12 Commands or Less

ls (list)
pwd (print working directory)
touch (creates an empty file)
nano (basic editor for creating small text files)
rm (remove files; be careful!)
mkdir (make a directory) and rmdir (remove a directory, must be empty of all files)
cd (change directory), by itself will take you home, cd .. (will take you up one directory), cd /results_dir/exp1 (go directly to this directory)
mv (for renaming files or moving files)
less (for viewing files, "more" is the older version of this)
man command (for viewing the man pages when you need help on a command)
wc (word count, line count and character count)
grep (search for text patterns in files)

Getting Started

First Unix command (ls)

ls

You may see something like this:

public    reads.tar   sample.fasta    sample.fastq

The "ls" command "lists" the contents of the directory you are in. You may see files and other directories here.

How can you tell the difference between a file and a directory?

ls -lh

will show permissions and indicate directories

Or, many systems are set up to display directories in blue font.
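
As a small sketch of the difference: in a long listing, the first character of each line is "d" for a directory and "-" for a regular file. The "-F" flag is another option, since it adds a "/" after directory names. The scratch directory and the names demo_dir and demo_file.txt are made up for this illustration and are not part of the course data.

```shell
# Work in a throwaway directory so nothing in your home directory changes.
scratch=$(mktemp -d)
cd "$scratch"

mkdir demo_dir          # hypothetical directory
touch demo_file.txt     # hypothetical empty file

# Long listing: lines for directories start with "d", regular files with "-".
ls -lh

# -F marks directories with a trailing slash, e.g. "demo_dir/".
ls -F
```

This is handy on systems where directories are not shown in a different color.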

Where am I? (pwd)

pwd

You should see something like this.

/home/username

where username is your user name. This is your home directory.

The "pwd" command is very helpful for figuring out where you are in the directory structure. If you are getting "file not found" errors while trying to run something, it is a good idea to "pwd" and see if you are where you think you are. Type the "pwd" command and make a note of the directory you are in.
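
Here is a minimal sketch of that kind of troubleshooting. It assumes a made-up layout (a scratch directory containing a "project" directory with one file in it) so that it can be run anywhere without touching real data.

```shell
# Build a small, hypothetical directory layout in a scratch area.
scratch=$(mktemp -d)
mkdir "$scratch/project"
touch "$scratch/project/sample.fasta"

# We start in the wrong place, so the file cannot be found.
cd "$scratch"
ls sample.fasta || echo "not here; current directory is: $(pwd)"

# pwd showed us where we were; move to the right directory and try again.
cd project
pwd
ls sample.fasta
```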

Creating files (touch)

The "touch" command creates an empty file. You will not use it very often, but it is good to know about.

touch file1.txt
touch file2.txt
ls

Now we see something like this.

file1.txt  file2.txt   public    reads.tar   sample.fasta    sample.fastq
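
One more detail, shown as a sketch: if the file already exists, "touch" leaves the contents alone and only updates the file's timestamp. The file name notes.txt is made up for this example, and "wc -c" (covered later) is used only to show the size in bytes.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

touch notes.txt                 # hypothetical name; creates an empty file
wc -c notes.txt                 # reports 0 bytes

echo "first line" > notes.txt   # put one line of text in the file
touch notes.txt                 # file exists: only its timestamp is updated
wc -c notes.txt                 # the text is still there (11 bytes)
```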

The "nano" editor is a text editor useful for small files.

nano file2.txt

Let's put something in this file.

Unix is an operating system, just like Windows or MacOS. Linux is a variety of Unix, and sometimes the names are used interchangeably.

Nano commands for saving your file and exiting nano:

control X - to Exit
Save modified buffer? Y(es)
File name to write: file2.txt (Hit return/enter on your keyboard to save the file with this name and exit nano.)

This brings us to our next topic which is very important!

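Choosing good names for your files and directories. Avoid spaces in Unix file and directory names: the shell treats a space as a separator between arguments, so a name containing one has to be quoted or escaped every time you use it. Here is a good method to use: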

Use the underscore (_) where a space would go, like this, to name a directory containing RNA-Seq data.

my_RNA_Seq_data

These are examples of file names:

brain_rna.fastq
liver_rna.fastq

The first part of the file name provides info about the file, and the extension (.fastq) tells what kind of file it is. (Examples of file extensions are .csv, .txt, .fastq, .fasta and many more.)
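
The sketch below shows what the shell does with a space in a name, and why the underscore is the easier choice. All file names here are invented for the example, and everything happens in a scratch directory.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Unquoted, the shell splits on the space and touch sees TWO names.
touch brain rna.fastq
ls            # two files: "brain" and "rna.fastq"

# Quoting keeps the space, but every later command must quote it too.
touch "liver rna.fastq"
ls

# With an underscore there is nothing to quote or escape.
touch kidney_rna.fastq
ls
```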

A word about file extensions

It's important to understand file extensions, to know what kinds of data you are working with.

.txt are text files

.csv are "comma-separated values" - good for importing into MS Excel spreadsheets

.tar.gz indicates a tar archive that has been compressed with gzip - so it is a compressed file

.fastq tells you that these are FASTQ files, containing sequence data and quality scores

.fasta indicates FASTA formatted sequence data, either protein or nucleotide
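
An extension is only a naming convention, so it can help to look at the content as well. As a sketch: each record in a FASTA file starts with a header line beginning with ">", so counting those lines with "grep -c" gives the number of sequences. The file tiny.fasta and its two records are invented for this illustration.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Build a tiny two-record FASTA file (made-up sequences).
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' > tiny.fasta

# Count header lines, i.e. lines that start with ">".
# The quotes matter: an unquoted > would be treated as output redirection.
grep -c '^>' tiny.fasta
```

For this file the count is 2, one per sequence.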

Removing files with "rm"

Warning - a Unix system will delete something when you ask to delete it and there is usually no way of getting it back.

With the "-i" option, "rm" asks for confirmation before it deletes each file.

You can modify your profile on a Unix system to always ask before deleting. This is a good idea when you're just getting started.

rm -i file1.txt

will remove a file we created.
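
Here is a sketch of how the "-i" prompt behaves. Normally you type the answer yourself; in this example the answer is piped in so the commands can run unattended. The alias on the last line is one common way to make the prompt permanent, assuming your shell is Bash and reads ~/.bashrc at startup.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
touch file1.txt

# Answering "n" keeps the file.
echo n | rm -i file1.txt
ls            # file1.txt is still listed

# Answering "y" deletes it for good.
echo y | rm -i file1.txt
ls            # nothing left

# To always be asked, a line like this can go in ~/.bashrc (assuming Bash):
# alias rm='rm -i'
```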

Creating (mkdir) and removing (rmdir) directories

One thing to note: this is a good time to give your directories meaningful names, which will help you keep things organized. For example, for your RNA-Seq data...

mkdir RNA_Seq_data

Removing directories (rmdir)

Directories must be completely empty of all files and other contents before "rmdir" can delete them. There are ways to "recursively" remove files and directories using the "-r" option of the "rm" command, but these can be dangerous. Keep in mind that once these files are deleted they are gone for good. Be extremely careful with the "-r" option. For beginners, it is safer to go to the directory and remove its contents first.

Getting around the Unix directory structure (cd)

This is a very helpful command used for moving around the directory structure.

It can be used to go to a specific directory. Let's "go to" the directory we just made, and make another directory within it.

cd RNA_Seq_data
pwd
mkdir exp_one
ls
cd exp_one
touch myseq.txt
ls
pwd

So, we've moved to the RNA_Seq_data directory, checked our location with "pwd", created a directory called exp_one, and listed the contents of RNA_Seq_data to see the directory we just created. Then we moved into exp_one with "cd", created a file with "touch", listed the contents with "ls", and printed our working directory with "pwd".

By itself, the "cd" command takes you "home". Let's try that, and then do a "pwd" to see where we are.

cd
pwd

We are now in our home directory.

/home/username

How can we go back to the exp_one directory we created? We need to give the "path" to that directory.

cd RNA_Seq_data/exp_one
pwd
ls

Check where you are with "pwd" and look at the contents of the directory with "ls". What do you see? It should be the file "myseq.txt".

Here's another way to get around the directory structure using "cd".

cd ~/RNA_Seq_data/exp_one

where the tilde "~" stands for your home directory.

How is this command different from the last one?

cd RNA_Seq_data/exp_one

The first "cd" command (the one starting with "~") provides the full path to where you want to go. It is called an "absolute" path, and it works no matter which directory you are in.

For the second version, you need to be in the directory that contains RNA_Seq_data (here, your home directory), or the command will not work. This is known as a "relative" path.

Paths are the sequence of directories that hold your data. In this path...

~/RNA_Seq_data/exp_one

there is a directory named "exp_one", within a directory named "RNA_Seq_data", within our home directory.

You will become more comfortable with paths as you build up your directories and data.
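
The difference between the two kinds of path can be sketched as follows. To keep the example self-contained, a scratch directory stands in for the home directory (the variable name fake_home is made up), and the absolute path is written out through that variable instead of "~".

```shell
# Stand-in for a home directory so the example is self-contained.
fake_home=$(mktemp -d)
mkdir "$fake_home/RNA_Seq_data"
mkdir "$fake_home/RNA_Seq_data/exp_one"

# Relative path: works only because we start in the directory that
# contains RNA_Seq_data.
cd "$fake_home"
cd RNA_Seq_data/exp_one
pwd

# From inside exp_one the same relative path no longer resolves.
cd RNA_Seq_data/exp_one || echo "relative path failed from $(pwd)"

# Absolute path: works from anywhere, even from the top of the file system.
cd /
cd "$fake_home/RNA_Seq_data/exp_one"
pwd
```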

Another way to use the "cd" command is to go up one level in the directory structure, like this.

cd ..

This can be very helpful as you move around the directory tree. There are many more ways to use the "cd" command.
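
Two of those other ways, as a short sketch: "cd ../.." goes up two levels in one step, and "cd -" returns to the directory you were in just before. The directories are created in a scratch area (the variable name base is made up) so the example can be run anywhere.

```shell
# Build the example directories in a scratch area.
base=$(mktemp -d)
mkdir "$base/RNA_Seq_data"
mkdir "$base/RNA_Seq_data/exp_one"
cd "$base/RNA_Seq_data/exp_one"

cd ..        # up one level, now in RNA_Seq_data
pwd

cd exp_one   # back down
cd ../..     # up two levels in one step, now in the scratch directory
pwd

cd -         # return to the previous directory (exp_one); prints its path
```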

Getting back to removing directories (rmdir)

Remember that a directory must be completely empty before "rmdir" will remove it.

What do you see when you try to remove this directory?

rmdir exp_one

What should we do? We need to remove the contents of a directory before we can remove the directory. Here's one safe option.

cd exp_one
ls
rm myseq.txt
ls
cd ..
ls
rmdir exp_one
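
The same steps can be sketched without changing directories, by giving "rm" the path to the file. This version also shows the refusal you get from "rmdir" when the directory is not empty. It runs in a scratch directory, so the real exp_one is not touched.

```shell
# Recreate the situation in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir exp_one
touch exp_one/myseq.txt

# Fails because exp_one still contains a file; nothing is deleted.
rmdir exp_one || echo "rmdir refused: exp_one is not empty"

# Remove the contents first, then the directory.
rm exp_one/myseq.txt
rmdir exp_one
ls            # exp_one is gone
```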

Moving and renaming files and directories, all with one command (mv)

The "mv" command is a handy way to rename files if you've created them with a typo or decide to use a more descriptive name. For example:

cd
mv file2.txt README.txt
ls

Be careful when moving files: a mistake in the command can yield unexpected results, such as overwriting an existing file. The "-i" (interactive) option helps keep you safe by asking before anything is overwritten.

mv -i README.txt RNA_Seq_data
cd RNA_Seq_data
ls

and for directories...

mkdir dir1
mkdir dir2
mv dir1 dir2
cd dir2
ls
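
What "mv" does depends on the destination, which the following sketch tries to make visible. All file and directory names are invented, and the answer to the "-i" prompt is piped in so the example runs unattended. The "cat" command, not covered above, simply prints a file's contents.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
echo "keep me" > README.txt
echo "draft"   > notes.txt
mkdir archive

# Destination is a new name: rename.
mv notes.txt draft_notes.txt

# Destination is an existing directory: move into it, keeping the name.
mv draft_notes.txt archive
ls archive

# Destination is an existing file: -i asks before overwriting.
# Answering "n" leaves README.txt untouched.
echo "other" > other.txt
echo n | mv -i other.txt README.txt || true   # a declined move may return a non-zero status
cat README.txt
```

Without "-i", the last "mv" would have replaced README.txt without asking.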

Less is more and more is less (less)

We can use the "less" command to view the contents of a file. The sample files are in our home directory, so we go back there first.

cd
less sample.fasta

You'll need to type "q" to get out of less and back to the command line. This is easy to forget, so I'll repeat it: remember to hit "q" to get out of less and back to the command line. Before the "less" command was available, the "more" command was commonly used to look at file content. The "less" command has more options for scrolling through files, so it is now the preferred command.

Help! (man)

Most Unix commands have a "man" or "manual" page that describes how to use them. If you need help remembering how to use the command "ls", you would type:

man ls

There are quite a few flags/options that we can use with the "ls" command, and we can learn all about them on the man page. My favorite flags for "ls" are "-l" and "-h". We'll learn more about flags in the next section. Basically, they modify the behavior of a command. You won't get far in Unix without knowing about flags. Try this:

cd
ls -lh

The man page describes these two flags as follows.

-h When used with the -l option, use unit suffixes: Byte, Kilobyte, Megabyte, Gigabyte, Terabyte and Petabyte in order to reduce the number of digits to three or less using base 2 for sizes.

-l (The lowercase letter ``ell''.) List in long format. (See below.) If the output is to a terminal, a total sum for all the file sizes is output on a line before the long listing.

Compare the results between these two commands.

cd
ls
ls -lh
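
To see what "-h" changes, it helps to use a file whose size you already know. This sketch creates a file of exactly 2048 bytes (the name two_kb.txt is made up) in a scratch directory and lists it both ways. The exact formatting of the human-readable size can vary a little between systems.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Make a file of exactly 2048 bytes (2048 spaces) to have a known size.
printf '%2048s' '' > two_kb.txt

ls -l  two_kb.txt    # size column shows the byte count: 2048
ls -lh two_kb.txt    # same listing with a unit suffix, such as 2.0K

# Flags can be given separately or combined; these two forms are equivalent.
ls -l -h two_kb.txt
```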

Counting lines, words and characters (wc)

This is a very useful command. Without opening a file, we can find out how many lines, words and characters are in it. Line counts are extremely useful to assess your data output.

wc RNA_Seq_data/README.txt
     1      23   135 RNA_Seq_data/README.txt

So in the file we edited with nano (originally file2.txt, now README.txt in the RNA_Seq_data directory), there is 1 line, 23 words, and 135 characters. What if we created a file where we were expecting there to be 1000 lines of output? The "wc" command provides a quick way to check.
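
That kind of check might look like the sketch below. The file results.txt is a stand-in generated on the spot with a small loop (loops are not covered in this section), and the expected count of 1000 is just the number from the example above.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Generate a stand-in results file with exactly 1000 lines.
i=1
while [ "$i" -le 1000 ]; do
    echo "record $i"
    i=$((i + 1))
done > results.txt

wc results.txt       # lines, words, characters, then the file name
wc -l results.txt    # only the line count

# Reading from standard input drops the file name, which makes the
# number easy to compare against what you expected.
line_count=$(wc -l < results.txt | tr -d ' ')
if [ "$line_count" -eq 1000 ]; then
    echo "OK: 1000 lines"
else
    echo "Unexpected line count: $line_count"
fi
```

The "-l", "-w" and "-c" flags select the line, word and byte counts individually.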