Unix for Bioinformatics Beginners
- ssh (connect to a remote computer such as a high performance computing cluster)
- pwd (print working directory)
- cd (change directory)
- ls (list directory content)
- mkdir (make a directory) and rmdir (remove a directory, which must be empty)
- touch (create an empty file)
- nano (basic editor for creating small text files)
- cp (copy files/folders)
- mv (rename or move files/folders)
- rm (remove files or folders)
- cat (print file content to screen)
- less (page through file content)
- column (display tabular data nicely aligned on screen)
- wc (word count, line count, and character count)
- grep (pattern search)
- module (find, load, and unload applications installed on high performance computing clusters)
- man (view the manual for a command)
Tip
Most Unix commands follow the syntax command option(s) input. The input is what users would like the command to act on, and the option(s) change the default way in which the command runs.
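As a minimal sketch of this syntax, the lines below run the same command (ls) with no option, with one option, and with two options. The temporary folder and the file name example.txt are made up for illustration.

```shell
# Work in a throwaway folder so nothing real is touched
workdir=$(mktemp -d)
touch "$workdir/example.txt"

ls "$workdir"          # command + input
ls -l "$workdir"       # command + one option + input
ls -l -a "$workdir"    # command + two options + input

# Keep the plain listing so it can be inspected afterwards
listing=$(ls "$workdir")
```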
Connecting to Remote Computer
One reason to learn the Unix command line is that it lets scientists run analyses on high performance computing systems such as Biowulf at NIH. To connect to a remote computer, use the ssh command construct below, which breaks down as follows.
- ssh: The command for connecting to a remote computer.
- username: The user name used to sign onto the remote computer, such as a high performance computing cluster (for Biowulf, username will be the user's NIH user name). The user name is followed by @ (i.e., "at").
- remote: The name of the remote computer to connect to (for Biowulf, this is biowulf.nih.gov).
ssh username@remote
To sign onto Biowulf, the NIH high performance computing system, use the following. Again, replace username with the user's NIH user name.
ssh username@biowulf.nih.gov
Print the Working Directory
Definition
The working directory is the folder in which a user is currently located.
It is a good idea to know which directory a user is currently in as a data analysis progresses. To check, use the following.
pwd
For instance, upon signing on to Biowulf, users will land in their home directory. Running pwd will give the directory path below (replace username with the user's specific user name).
/home/username
Note
In Unix, a directory path indicates where in the file system hierarchy a user is. Directory paths that start with / are known as absolute paths because they begin at the top (root) of the file system. Each folder in the path is separated from the next by /.
Change Directory
Note
The home directory on Biowulf is not the place to store analysis input and output. The data directory can be used for this.
To change directory, use the cd command followed by the directory to change into. For instance, upon signing onto Biowulf, users should change into their data directory to perform analyses. To do this, use the following. Replace username with the user's specific user name.
cd /data/username
The pwd command can be used to check whether the cd command was used successfully.
pwd
/data/username
To go back to the user's home directory, use cd, cd ~, or cd /home/username. Use cd .. to move up one level, into the parent of the working directory.
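The difference between absolute paths and cd .. can be sketched as follows. The folders project and raw_data are hypothetical and are created inside a temporary folder, so the example does not depend on Biowulf.

```shell
# Build a small, made-up folder layout in a throwaway location
workdir=$(mktemp -d)
mkdir -p "$workdir/project/raw_data"

cd "$workdir/project/raw_data"   # absolute path: starts with /
pwd

cd ..                            # up one level, into project
pwd

cd raw_data                      # relative path: a subfolder of the working directory
cd "$workdir"                    # absolute path again, back to the top of the layout
here=$(pwd)
```

A path that does not start with / is interpreted relative to the working directory, which is why cd raw_data only works while inside project.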
Listing Directory Content
To view the files and subfolders within a directory, use the ls command. For instance, in a directory that contains a single file, ls prints the following.
file.txt
Recall that Unix commands come with options that alter their default behavior. The -l option of ls gives a detailed view of each item in a directory. For instance, in the leftmost column of the listing below, lines that start with - are files while those starting with d are folders. The fifth column lists the file or folder size in bytes. Note that file.txt has a file size of 0 since it is empty.
-rw-r-----. 1 owner group 0 Jun 15 16:20 file.txt
Creating a New Directory
To make a new folder, use the mkdir command followed by the name of the folder.
mkdir folder
Recall
As mentioned previously, if the first column in the ls -l results starts with d, then this indicates a directory, which is what folder should be.
ls -l
-rw-r-----. 1 owner group 0 Jun 16 10:09 file.txt
drwxr-x---. 2 owner group 4096 Jun 16 10:18 folder
Editing Files
Nano is a basic text editor that is available on most Unix systems and enables editing of plain text files including txt, csv, fasta, genbank, gtf, and fastq. To edit a file using Nano, just use nano followed by the file name.
Unix is an operating system, just like Windows or MacOS.
Linux is a variety of Unix, and sometimes the names are used interchangeably.
For instance, to add the above text to file.txt, use:
nano file.txt
Then copy and paste the text into the editor. Press Control-X, type Y to save the changes, and press Enter to confirm the file name and go back to the command prompt.
Note
If a file does not exist, then the nano command will create it as well as open the editor to enable editing. To create a blank file without opening the Nano editor, use the touch command instead.
touch bioinformatics.txt
ls -l
-rw-r-----. 1 owner group 0 Jun 16 18:13 bioinformatics.txt
-rw-r-----. 1 owner group 135 Jun 16 18:13 file.txt
drwxr-x---. 2 owner group 4096 Jun 16 10:18 folder
Copying Files and Folders
To copy a file, use cp followed by the file name and the name of the duplicate.
For instance, to make a copy of bioinformatics.txt called btep_bioinformatics.txt, do:
cp bioinformatics.txt btep_bioinformatics.txt
ls
bioinformatics.txt btep_bioinformatics.txt file.txt folder
The cp command can also be used to copy a file into a folder.
For instance:
cp bioinformatics.txt folder/bioinformatics_for_beginners.txt
Tip
To see the contents of a folder other than the working directory, supply the path to that folder to the ls command.
ls folder
bioinformatics_for_beginners.txt
To copy a folder, include the -r option in cp. The -r option recursively copies everything in a folder. The arguments are the folder to copy and the name of the duplicate folder.
cp -r folder bioinformatics_folder
The -1 option of ls prints directory content one item per line.
ls -1
bioinformatics_folder
bioinformatics.txt
btep_bioinformatics.txt
file.txt
folder
ls bioinformatics_folder
bioinformatics_for_beginners.txt
Renaming Files and Folders
The mv command can be used to rename files and folders.
To rename a file, use the mv command followed by the name of the file to rename and the new name of the file.
mv btep_bioinformatics.txt bioinformatics_for_noobies.txt
ls -1
bioinformatics_folder
bioinformatics_for_noobies.txt
bioinformatics.txt
file.txt
folder
To rename a folder, use the mv command followed by the name of the folder to rename and the new name of the folder.
mv folder project_folder
ls -1
bioinformatics_folder
bioinformatics_for_noobies.txt
bioinformatics.txt
file.txt
project_folder
Moving Files and Folders
To move a file from one folder to another, the mv command can be used. The arguments in this application of the mv command are the file to be moved and the name of the folder into which the file will be moved.
mv bioinformatics.txt bioinformatics_folder
ls bioinformatics_folder
bioinformatics_for_beginners.txt bioinformatics.txt
Tip
mv can also move folders. Just supply the name of the folder to be moved as the first argument and the path of the destination folder as the second.
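A minimal sketch of moving a folder is shown below. The folder names results and archive are hypothetical, and everything happens inside a temporary folder.

```shell
# Set up two made-up folders in a throwaway location
workdir=$(mktemp -d)
cd "$workdir"
mkdir results archive
touch results/summary.txt

# Because archive already exists, results is moved inside it
mv results archive

ls archive            # shows: results
ls archive/results    # shows: summary.txt
```

If the destination folder did not already exist, the same command would rename results to archive instead of moving it.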
Deleting Files and Folders
To delete a file, use the rm command followed by the file name.
For instance, to delete bioinformatics_for_noobies.txt do:
rm bioinformatics_for_noobies.txt
Warning
There is no trash can or recycling bin in Unix. Once a file is deleted, it cannot be recovered. Use the -i option with rm to be asked for confirmation before each deletion.
rm -i bioinformatics_for_noobies.txt
Type n for no and y for yes (to delete).
rm: remove regular empty file 'bioinformatics_for_noobies.txt'? n
Empty folders can be deleted using rmdir. On the other hand, folders with content can be removed using rm -r, where the -r option tells rm to delete the folder and recursively delete everything inside that folder.
rm -r -i project_folder
The command above will ask if the user wants to delete project_folder and everything in it. Type y to confirm deletion. Users will be asked to confirm the deletion of each item in the folder.
rm: descend into directory 'project_folder'? y
rm: remove regular empty file 'project_folder/bioinformatics_for_beginners.txt'? y
rm: remove directory 'project_folder'? y
ls -1
bioinformatics_folder
bioinformatics_for_noobies.txt
file.txt
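The difference between rmdir and rm -r can be sketched as follows, using two made-up folders (empty_folder and full_folder) inside a temporary folder.

```shell
# Set up one empty and one non-empty folder in a throwaway location
workdir=$(mktemp -d)
cd "$workdir"
mkdir empty_folder full_folder
touch full_folder/notes.txt

rmdir empty_folder    # succeeds because the folder is empty

# rmdir refuses to remove a folder that still has content
rmdir full_folder || echo "rmdir refused: full_folder is not empty"

rm -r full_folder     # removes the folder and everything inside it
ls -1                 # prints nothing because both folders are gone
```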
Viewing File Content
The cat command will print the entire content of a file to the terminal screen.
For instance, to view the content of file.txt in the terminal screen do:
cat file.txt
Unix is an operating system, just like Windows or MacOS.
Linux is a variety of Unix, and sometimes the names are used interchangeably.
The cat command can also be used to view fastq or fq files, which contain sequences from high throughput sequencers.
cat example_fastq.fq
Tabular data in the form of csv files can be displayed by cat as well.
cat example_rna_sequencing_counts.csv
Paging through Files
Rather than printing file content in its entirety to the terminal, users can page through files using less.
less example_rna_sequencing_counts.csv
Users can use the up and down arrow keys to scroll through the file line by line. Press the space bar to move forward one page at a time and q to quit and return to the command prompt.
Printing Tabular Data to Terminal Nicely Aligned
Tabular data in csv files can contain many columns, and these do not print to the terminal nicely aligned. The column command can fix this.
The options and arguments in the column command include:
- Option -t: Creates a table.
- Option -s: Specifies the column separator (i.e., comma for csv files).
- Argument: Name of the file to view (i.e., example_rna_sequencing_counts.csv).
column -t -s ',' example_rna_sequencing_counts.csv
Geneid      HBR_1.bam  HBR_2.bam  HBR_3.bam  UHR_1.bam  UHR_2.bam  UHR_3.bam
U2          0          0          0          0          0          0
CU459211.1  0          0          0          0          0          0
Word, Character, and Line Count in a File
The wc command is used to obtain the word, character, and line count of a file.
wc file.txt
The output (from left to right) for wc can be interpreted as follows.
- 2 indicates that the file has two lines.
- 23 indicates that the file has 23 words.
- 135 indicates that the file has 135 bytes, which equals the character count for plain ASCII text such as this file.
- file.txt is the file for which the statistics were generated.
2 23 135 file.txt
The wc results above can be obtained separately. For instance, to get just the number of lines in a file, include the -l option.
wc -l file.txt
2 file.txt
Word count can be obtained using the -w option.
wc -w file.txt
23 file.txt
Character count can be obtained using the -m option.
wc -m file.txt
135 file.txt
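A common bioinformatics use of wc -l is estimating the number of reads in a FASTQ file. The sketch below assumes the usual layout of four lines per read and uses a made-up file, toy.fq. The < redirection, which feeds the file to wc so that only the number is printed, is not covered elsewhere in this tutorial.

```shell
# Create a tiny, made-up FASTQ file with two reads in a throwaway location
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > toy.fq

# Count lines, then divide by four (assumes four lines per read)
line_count=$(wc -l < toy.fq)
read_count=$((line_count / 4))
echo "$read_count"
```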
Pattern Searching
Sometimes users may want to search for a keyword in a file. The grep command can be used to do this. It prints every line in a file that contains the search pattern. The arguments in grep are as follows.
- Search pattern (i.e., Linux)
- File to search (i.e., file.txt)
grep Linux file.txt
Linux is a variety of Unix, and sometimes the names are used interchangeably.
To search for lines that do not contain a pattern, include the -v option.
grep -v Linux file.txt
Unix is an operating system, just like Windows or MacOS.
What would happen if "linux" is used as the search pattern instead? Nothing is printed because grep is case sensitive. To ignore case, include the -i option.
grep linux file.txt
grep -i linux file.txt
Linux is a variety of Unix, and sometimes the names are used interchangeably.
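Beyond printing matching lines, grep can count them with the -c option, which is handy for counting sequences in a FASTA file because each sequence header starts with >. The sketch below uses a made-up file, toy.fasta. The pattern is quoted because an unquoted > would be treated by the shell as output redirection, and ^ restricts the match to the start of a line.

```shell
# Create a tiny, made-up FASTA file with three sequences in a throwaway location
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > toy.fasta

# -c prints the number of matching lines instead of the lines themselves
seq_count=$(grep -c '^>' toy.fasta)
echo "$seq_count"
```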
Working with Software Installed on Biowulf
The module command is important for anyone who wishes to work with software installed on Biowulf. The avail subcommand enables users to browse the software available on the cluster.
module avail
Below is a glimpse of the module avail results. Users can scroll up/down using the arrow keys to see what software is available.
----------------------------------------------------- Global Aliases -----------------------------------------------------
bowtie1 -> bowtie/1.3.1 deeptrio/1.6.0 -> deepvariant/1.6.0-deeptrio
bowtie2 -> bowtie/2-2.5.3 deeptrio/1.6.1 -> deepvariant/1.6.1-deeptrio
deeptrio/1.5.0 -> deepvariant/1.5.0-deeptrio
-------------------------------------------- /data/classes/BTEP/apps/modules ---------------------------------------------
biostars/1.0
---------------------------------------------- /usr/local/lmod/modulefiles -----------------------------------------------
3DSlicer/4.8.1 hwloc/2.9.3/gcc-8.5.0
3DSlicer/5.2.2 (D) hwloc/2.9.3/gcc-9.2.0
To find out whether a particular software package is available, include its name after avail.
module avail fastqc
----------------------------------------------------- Global Aliases -----------------------------------------------------
---------------------------------------------- /usr/local/lmod/modulefiles -----------------------------------------------
fastqc/0.11.8 fastqc/0.11.9 fastqc/0.12.1 (D)
Where:
D: Default Module
Module defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.
See https://lmod.readthedocs.io/en/latest/060_locating.html for details.
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
To load a software package, use module load followed by the name of the package.
module load fastqc
[+] Loading singularity 4.0.1 on cn4288
[+] Loading fastqc 0.12.1
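The module command can also list and unload software. The sketch below assumes a cluster that provides the module command through Lmod, as the output above suggests for Biowulf. On a computer without module, it only prints a message. The variable module_demo is made up for this example.

```shell
# Only run the module commands where the module command exists (e.g., on the cluster)
if command -v module >/dev/null 2>&1; then
    module load fastqc      # load the default version of fastqc
    module list             # show the modules that are currently loaded
    module unload fastqc    # unload fastqc when it is no longer needed
    module_demo="ran"
else
    echo "module command not found; try this on the cluster instead"
    module_demo="skipped"
fi
```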
Getting Help
To get help, use the man command, which pulls up the manual for a command. Use the up and down arrow keys to scroll through the manual and learn about the different options, and press q to exit.
For instance, to pull up the manual for grep, do the following.
man grep