Unix for Bioinformatics Beginners

ssh (connect to remote computer such as a high performance computing cluster)
pwd (print working directory)
cd (change directory)
ls (list directory content)
mkdir (make a directory) and rmdir (remove a directory, must be empty of all files)
touch (creates an empty file)
nano (basic editor for creating small text files)
cp (make a copy of files/folders)
mv (rename or move files/folders)
rm (remove files or folders)
cat (print file content to screen)
less (page through file content)
column (display tabular data nicely aligned on screen)
wc (word count, line count and character count)
grep (pattern search)
module (find, load, and unload applications installed on high performance computing clusters)
man (view manual for command)

Tip

Most Unix commands follow the syntax command option(s) input. The input is what the command acts on, and the options change the default way in which the command runs.
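As a minimal sketch of this syntax, the block below runs ls first with only an input and then with an option added. The temporary folder and the file names (reads.fq, .hidden) are hypothetical and exist only for illustration; the -a option, which also lists hidden items, is not covered elsewhere in this lesson.

```shell
# Work in a throwaway folder so nothing important is touched.
# The file names below are made up for this illustration.
demo_dir=$(mktemp -d)
touch "$demo_dir/reads.fq" "$demo_dir/.hidden"

# command + input: list the visible items in demo_dir
plain=$(ls "$demo_dir")

# command + option + input: -a also lists items whose names start with a dot
with_all=$(ls -a "$demo_dir")

echo "ls:"
echo "$plain"
echo "ls -a:"
echo "$with_all"

# Clean up the throwaway folder.
rm -r "$demo_dir"
```

The same command produces different output depending on the option, while the input stays the same.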

Connecting to Remote Computer

A motive for learning the Unix command line is to enable scientists to perform analyses on high performance computing systems such as Biowulf at NIH. To connect to a remote computer, use the ssh command construct below, which breaks down as follows.

ssh: The command for connecting to a remote computer.
username: The user name that is used to sign onto a remote computer such as a high performance computing system (for Biowulf, username will be the user's NIH user name). Username is followed by @ (i.e. "at").
remote: This is the name of the remote computer (Biowulf would be biowulf.nih.gov) to connect to.

ssh username@remote

To sign onto Biowulf, the NIH high performance computing system, use the following. Again, replace username with the user's NIH user name.

ssh username@biowulf.nih.gov

Print the Working Directory

Definition

The working directory is the folder in which a user is currently located.

It is a good idea to know which directory a user is currently in as a data analysis progresses. To check, use the following.

pwd

For instance, upon signing on to Biowulf, users will land in their home directory. Running pwd will give the directory path below (replace username with the user's specific user name).

/home/username

Note

In Unix, a directory path indicates where in the file system hierarchy a user is. Directory paths that start with / are known as absolute paths. The parts of a path are separated by /, and each part names a folder one level further down the file system hierarchy.
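The sketch below contrasts an absolute path with a relative one, that is, a path that does not start with / and is resolved against the working directory. The folder and file names (project, sample.txt) are hypothetical and are created in a temporary location.

```shell
# Create a throwaway directory tree (names are illustrative only).
base_dir=$(mktemp -d)
mkdir "$base_dir/project"
touch "$base_dir/project/sample.txt"

# Absolute path: starts with / and works from any working directory.
abs_listing=$(ls "$base_dir/project")

# Relative path: resolved against the working directory,
# so change into base_dir before using it.
start_dir=$(pwd)
cd "$base_dir"
rel_listing=$(ls project)
cd "$start_dir"

echo "absolute path lists: $abs_listing"
echo "relative path lists: $rel_listing"

# Clean up the throwaway tree.
rm -r "$base_dir"
```

Both listings show the same file because the two paths point to the same folder.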

Change Directory

Note

The home directory on Biowulf is not the place to store analysis input and output. The data directory can be used for this.

To change directory, use the cd command followed by the directory to change into. For instance, upon signing onto Biowulf, users should change into their data directory to perform analyses. To do this use the following. Replace username with the user's specific user name.

cd /data/username

The pwd command can be used to check whether the cd command was used successfully.

pwd

/data/username

To go back to the user's home directory use cd, cd ~, or cd /home/username. Use cd .. to go up one directory, to the parent of the working directory.
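A small sketch of moving down and back up the hierarchy is shown here. It uses a temporary folder with a hypothetical subfolder called analysis rather than real Biowulf paths, so it can be tried on any system.

```shell
# Build a small throwaway hierarchy (the name analysis is made up).
top_dir=$(mktemp -d)
mkdir "$top_dir/analysis"
start_dir=$(pwd)

# Change into the subfolder and record where we are.
cd "$top_dir/analysis"
inside=$(pwd)

# .. refers to the parent directory, so this goes up one level.
cd ..
parent=$(pwd)

echo "after cd into analysis: $inside"
echo "after cd ..:            $parent"

# Return to where we started and clean up.
cd "$start_dir"
rm -r "$top_dir"
```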

Listing Directory Content

To view the files and subfolders within a directory, use the ls command. For instance, ls prints the following when issued in a directory that contains a single file.

file.txt

Recall that Unix commands come with options that alter their default behavior. The -l option of ls gives a detailed view of each item in a directory. For instance, in the first column of the listing below, lines that start with - are files while those starting with d are folders. The fifth column lists the file or folder size in bytes. Note that file.txt has a size of 0 since it is empty.

-rw-r-----. 1 owner group    0 Jun 15 16:20 file.txt
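To see the difference between the - and d markers side by side, the sketch below creates one file and one folder in a temporary location and prints the long listing for each. The -d option, which is not covered elsewhere in this lesson, tells ls to describe a folder itself rather than list what is inside it.

```shell
# Throwaway location with one empty file and one folder.
demo_dir=$(mktemp -d)
touch "$demo_dir/file.txt"
mkdir "$demo_dir/folder"

# Long listing of a single file: the line starts with -
file_line=$(ls -l "$demo_dir/file.txt")

# -d describes the folder itself instead of its content: the line starts with d
folder_line=$(ls -l -d "$demo_dir/folder")

echo "$file_line"
echo "$folder_line"

# Clean up.
rm -r "$demo_dir"
```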

Creating a New Directory

To make a new folder, use the mkdir command followed by the name of the folder.

mkdir folder

Recall

As mentioned previously, if the first column in the ls -l results starts with d then this indicates a directory, which is what folder should be.

ls -l

-rw-r-----. 1 owner group    0 Jun 16 10:09 file.txt
drwxr-x---. 2 owner group 4096 Jun 16 10:18 folder
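When a folder needs to be created inside another folder that does not exist yet, the -p option of mkdir can help. The sketch below is one way this might look; the folder names project and raw_data are hypothetical.

```shell
# Throwaway location for the demonstration.
work_dir=$(mktemp -d)

# -p creates any missing parent folders along the way (here, project)
# and does not complain if the folder already exists.
mkdir -p "$work_dir/project/raw_data"
mkdir -p "$work_dir/project/raw_data"   # repeating the command is harmless
mkdir_status=$?

# Confirm that raw_data now exists inside project.
created=$(ls "$work_dir/project")
echo "project contains: $created"

# Clean up.
rm -r "$work_dir"
```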

Editing Files

Nano is a basic text editor available on most Unix systems that enables editing of plain text files, including txt, csv, fasta, genbank, gtf, and fastq files. To edit a file using Nano, use nano followed by the file name.

Unix is an operating system, just like Windows or MacOS.
Linux is a variety of Unix, and sometimes the names are used interchangeably.

For instance, to add the above text to file.txt, use:

nano file.txt

Then copy and paste the text into the editor. Press control-x, type y to save the changes, and press Enter to confirm the file name and go back to the command prompt.

Note

If a file does not exist, then the nano command will create it as well as open the editor to enable editing. To create a blank file without opening the Nano editor, use the touch command instead.

touch bioinformatics.txt

ls -l

-rw-r-----. 1 owner group    0 Jun 16 18:13 bioinformatics.txt
-rw-r-----. 1 owner group  135 Jun 16 18:13 file.txt
drwxr-x---. 2 owner group 4096 Jun 16 10:18 folder

Copying Files and Folders

To copy a file use cp followed by the file name and the name of the duplicate.

For instance, to make a copy of bioinformatics.txt called btep_bioinformatics.txt, do:

cp bioinformatics.txt btep_bioinformatics.txt

ls

bioinformatics.txt  btep_bioinformatics.txt  file.txt  folder

The cp command can be used to copy a file into a folder.

For instance:

cp bioinformatics.txt folder/bioinformatics_for_beginners.txt

Tip

To see the contents of a folder other than the working directory, supply the path to the folder to the ls command.

ls folder

bioinformatics_for_beginners.txt

To copy a folder include the -r option in cp. The -r option recursively copies everything in a folder. The arguments are the folder to copy and the name of the duplicate folder.

cp -r folder bioinformatics_folder

The -1 option of ls prints directory content one item per line.

ls -1

bioinformatics_folder
bioinformatics.txt
btep_bioinformatics.txt
file.txt
folder

ls bioinformatics_folder

bioinformatics_for_beginners.txt
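The sketch below pulls the cp variations together in a temporary location. It also shows a case not covered above: when the destination is an existing folder and no new name is given, the copy keeps the original file name. The name folder_backup is hypothetical.

```shell
# Work inside a throwaway location.
work_dir=$(mktemp -d)
start_dir=$(pwd)
cd "$work_dir"

touch bioinformatics.txt
mkdir folder

# Destination is an existing folder: the copy keeps the original file name.
cp bioinformatics.txt folder
in_folder=$(ls folder)

# -r copies the folder and everything in it; the original stays in place.
cp -r folder folder_backup
in_backup=$(ls folder_backup)
top_level=$(ls)

echo "folder:        $in_folder"
echo "folder_backup: $in_backup"
echo "top level:"
echo "$top_level"

# Return to the starting directory and clean up.
cd "$start_dir"
rm -r "$work_dir"
```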

Renaming Files and Folders

The mv command can be used to rename files and folders.

To rename a file, use the mv command followed by the name of the file to rename and the new name of the file.

mv btep_bioinformatics.txt bioinformatics_for_noobies.txt

ls -1

bioinformatics_folder
bioinformatics_for_noobies.txt
bioinformatics.txt
file.txt
folder

To rename a folder, use the mv command followed by the name of the folder to rename and the new name of the folder.

mv folder project_folder

ls -1

bioinformatics_folder
bioinformatics_for_noobies.txt
bioinformatics.txt
file.txt
project_folder

Moving Files and Folders

To move a file from one folder to another, the mv command can be used. The arguments in this application of the mv command are the file to be moved and the name of the folder into which the file will be moved.

mv bioinformatics.txt bioinformatics_folder

ls bioinformatics_folder

bioinformatics_for_beginners.txt  bioinformatics.txt

Tip

mv can also move folders. Just supply the name of the folder to be moved as the first argument and the path of the destination folder as the second.
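A minimal sketch of moving a folder is given here, using hypothetical folder names (results and archive) in a temporary location. Note that the outcome depends on whether the destination already exists.

```shell
# Work inside a throwaway location.
work_dir=$(mktemp -d)
start_dir=$(pwd)
cd "$work_dir"

mkdir results archive
touch results/counts.csv

# archive already exists, so results is moved inside it.
# If archive did not exist, results would be renamed to archive instead.
mv results archive

remaining=$(ls)
moved=$(ls archive/results)

echo "top level now contains: $remaining"
echo "archive/results contains: $moved"

# Return to the starting directory and clean up.
cd "$start_dir"
rm -r "$work_dir"
```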

Deleting Files and Folders

To delete a file use the rm command followed by the file name.

For instance, to delete bioinformatics_for_noobies.txt do:

rm bioinformatics_for_noobies.txt

Warning

There is no trash can or recycling bin in Unix. Once a file is deleted, it cannot be recovered. Use the -i option with rm to be asked for confirmation before each deletion.

rm -i bioinformatics_for_noobies.txt

Type n for no and y for yes (to delete).

rm: remove regular empty file 'bioinformatics_for_noobies.txt'? n

Empty folders can be deleted using rmdir. On the other hand, folders with content can be removed using rm -r, where the -r option tells rm to delete the folder and recursively delete everything inside that folder.

rm -r -i project_folder

The command above will ask if the user wants to delete the project_folder and everything in it. Type y to confirm deletion. Users will be asked to confirm the deletion of each item in the folder.

rm: descend into directory 'project_folder'? y
rm: remove regular empty file 'project_folder/bioinformatics_for_beginners.txt'? y
rm: remove directory 'project_folder'? y

ls -1

bioinformatics_folder
bioinformatics_for_noobies.txt
file.txt
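The difference between rmdir and rm -r can be sketched as follows. The example works only inside a temporary location, and the file name notes.txt is hypothetical. The if statement simply records whether rmdir succeeded.

```shell
# Throwaway location containing a folder with one file in it.
work_dir=$(mktemp -d)
mkdir "$work_dir/project_folder"
touch "$work_dir/project_folder/notes.txt"

# rmdir refuses to remove a folder that still has content.
if rmdir "$work_dir/project_folder" 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi

# rm -r removes the folder together with everything inside it.
# There is no undo, so double check the path before running it.
rm -r "$work_dir/project_folder"

if [ -e "$work_dir/project_folder" ]; then
    after_rm="still there"
else
    after_rm="gone"
fi

echo "rmdir on a non-empty folder: $rmdir_result"
echo "folder after rm -r: $after_rm"

# work_dir is empty now, so rmdir is enough to remove it.
rmdir "$work_dir"
```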

Viewing File Content

The cat command will print the entire content of a file to the terminal screen.

For instance, to view the content of file.txt in the terminal screen do:

cat file.txt

Unix is an operating system, just like Windows or MacOS.
Linux is a variety of Unix, and sometimes the names are used interchangeably.

The cat command can also be used to view fastq or fq files, which contain sequences from high throughput sequencers.

cat example_fastq.fq

Tabular data in the form of csv files can be displayed by cat as well.

cat example_rna_sequencing_counts.csv

Paging through Files

Rather than printing file content in its entirety to the terminal, users can page through files using less.

less example_rna_sequencing_counts.csv

Use the up and down arrow keys to scroll through the file line by line, and press q to exit less.

Printing Tabular Data to Terminal Nicely Aligned

For tabular data in the form of csv files, which could contain multiple columns, the columns do not print to the terminal nicely aligned. The column command can fix this.

The options and arguments in the column command include:

Option -t: Creates a table.
Option -s: Specifies the column separator (e.g. a comma for csv files)
Argument: Name of the file which users would like to view (e.g. example_rna_sequencing_counts.csv)

column -t -s ',' example_rna_sequencing_counts.csv

Geneid              HBR_1.bam  HBR_2.bam  HBR_3.bam  UHR_1.bam  UHR_2.bam  UHR_3.bam
U2                  0          0          0          0          0          0
CU459211.1          0          0          0          0          0          0

Word, Character, and Line Count in a File

The wc command is used to obtain word, character, and line count in a file.

wc file.txt

The output (from left to right) for wc can be interpreted as follows.

2 indicates that the file has two lines.
23 indicates that the file has 23 words.
135 indicates that the file has 135 bytes, which equals the number of characters for plain ASCII text such as this file.
file.txt is the file for which the statistics were generated.

  2  23 135 file.txt

The wc results above can be obtained separately. For instance, to just get the number of lines in a file include the -l option.

wc -l file.txt

2 file.txt

Word count can be obtained using the -w option.

wc -w file.txt

23 file.txt

Character count can be obtained using the -m option.

wc -m file.txt

135 file.txt
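The counts above can be reproduced with the sketch below, which recreates file.txt in a temporary location using the two sentences from the Editing Files section. Writing the file with printf and reading it with < are conveniences assumed for this example; they are not covered elsewhere in this lesson.

```shell
# Recreate file.txt in a throwaway location, one sentence per line.
work_dir=$(mktemp -d)
printf '%s\n' \
    'Unix is an operating system, just like Windows or MacOS.' \
    'Linux is a variety of Unix, and sometimes the names are used interchangeably.' \
    > "$work_dir/file.txt"

# Feeding the file in with < makes wc print only the number,
# without the file name after it.
line_count=$(wc -l < "$work_dir/file.txt")
word_count=$(wc -w < "$work_dir/file.txt")
char_count=$(wc -m < "$work_dir/file.txt")

echo "lines: $line_count, words: $word_count, characters: $char_count"

# Clean up.
rm -r "$work_dir"
```

The character count includes the newline at the end of each line, which is why it is slightly larger than the number of visible characters.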

Pattern Searching

Sometimes users may want to search for a keyword in a file. The grep command can be used to do this. The grep command prints every line in a file that contains the search pattern. The arguments in grep are as follows.

Search pattern (e.g. Linux)
File to search (e.g. file.txt)

grep Linux file.txt

Linux is a variety of Unix, and sometimes the names are used interchangeably.

To search for lines that do not contain a pattern include the -v option.

grep -v Linux file.txt

Unix is an operating system, just like Windows or MacOS.

What would happen if "linux" is used as the search pattern instead? Nothing is printed because grep is case sensitive. To ignore case, include the -i option.

grep linux file.txt

grep -i linux file.txt

Linux is a variety of Unix, and sometimes the names are used interchangeably.
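Two further grep options that are often useful, -c and -n, are sketched below. Neither is covered elsewhere in this lesson. The example recreates file.txt in a temporary location from the two sentences used throughout.

```shell
# Recreate file.txt in a throwaway location, one sentence per line.
work_dir=$(mktemp -d)
printf '%s\n' \
    'Unix is an operating system, just like Windows or MacOS.' \
    'Linux is a variety of Unix, and sometimes the names are used interchangeably.' \
    > "$work_dir/file.txt"

# -c prints the number of matching lines instead of the lines themselves.
unix_lines=$(grep -c Unix "$work_dir/file.txt")

# -n prefixes each matching line with its line number.
numbered=$(grep -n Linux "$work_dir/file.txt")

# Options can be combined: count matching lines while ignoring case.
linux_any_case=$(grep -i -c linux "$work_dir/file.txt")

echo "lines containing Unix: $unix_lines"
echo "$numbered"
echo "lines containing linux in any case: $linux_any_case"

# Clean up.
rm -r "$work_dir"
```

Both sentences contain the word Unix, so the first count is 2, while only the second line mentions Linux.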

Working with Software Installed on Biowulf

The module command is important for anyone who wishes to work with software installed on Biowulf. The avail subcommand enables users to browse the software available on the cluster.

module avail

Glimpse of module avail results. Users can scroll up/down using the arrow keys to learn what software is available.

----------------------------------------------------- Global Aliases -----------------------------------------------------
   bowtie1        -> bowtie/1.3.1                  deeptrio/1.6.0 -> deepvariant/1.6.0-deeptrio
   bowtie2        -> bowtie/2-2.5.3                deeptrio/1.6.1 -> deepvariant/1.6.1-deeptrio
   deeptrio/1.5.0 -> deepvariant/1.5.0-deeptrio

-------------------------------------------- /data/classes/BTEP/apps/modules ---------------------------------------------
   biostars/1.0

---------------------------------------------- /usr/local/lmod/modulefiles -----------------------------------------------
   3DSlicer/4.8.1                                           hwloc/2.9.3/gcc-8.5.0
   3DSlicer/5.2.2                               (D)         hwloc/2.9.3/gcc-9.2.0

To find out whether a particular software package is available, include its name after avail.

module avail fastqc

----------------------------------------------------- Global Aliases -----------------------------------------------------


---------------------------------------------- /usr/local/lmod/modulefiles -----------------------------------------------
   fastqc/0.11.8    fastqc/0.11.9    fastqc/0.12.1 (D)

  Where:
   D:  Default Module

Module defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.
See https://lmod.readthedocs.io/en/latest/060_locating.html for details.

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

To load a software package, use module load followed by the name of the package.

module load fastqc

[+] Loading singularity  4.0.1  on cn4288 
[+] Loading fastqc  0.12.1
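Loaded software can also be listed and unloaded. The sketch below assumes a cluster where the module command is defined, as on Biowulf. On any other computer it only prints a message, and the exact output will differ between systems.

```shell
# Assumption: the module command is available, as on Biowulf.
# On other systems this block only prints a message.
if command -v module >/dev/null 2>&1; then
    module load fastqc      # load the default fastqc version
    module list             # show the modules that are currently loaded
    module unload fastqc    # unload fastqc when it is no longer needed
    module_status="ran"
else
    echo "module command not found; run this on a cluster that provides it"
    module_status="skipped"
fi
```

Unloading a package that is no longer needed keeps the environment tidy and reduces the chance of conflicts between tools.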

Getting Help

To get help, use the man command which pulls up the manual for a command. Use the up and down arrow keys to scroll through the manual and learn about the different options.

For instance, to pull up the manual for grep, do the following.

man grep