Lesson 3: Useful Unix
Lesson 2 Review
- Directories are folders that can store files, other directories, links, executables, etc.
- The file system is hierarchical with the root directory (/) at the top.
- Commands useful for navigating our file system and handling files include:
pwd = print working directory
ls = list contents
cd = change directory
mkdir, rmdir = make directory; remove directory
rm = remove file
touch = create file
nano opens text editor
- absolute file paths include the entire location from the root of the file system
- relative file paths include the location from the working directory
Learning Objectives
In this lesson, we will continue to learn tips and concepts that facilitate working on the command line, including
- Flags and command options - making programs do what they do
- Wildcards (e.g., *)
- Tab complete for less typing
- Accessing user history with the "up" and "down" arrows on the keyboard
- cat, head, and tail for reading files
- Working with file content (input, output, and append)
- Combining commands with the pipe (|)
- Finding information in files with grep
- Performing repetitive actions with Unix (e.g., for loop)
- File Permissions
For this lesson, you will need to connect to Biowulf.
Reminder: How to Connect to Biowulf
For this lesson and the lessons that follow, we will use NIH HPC student accounts to connect to Biowulf.
Open a terminal and type the following:
username = NIH/Biowulf login username. Remember to use the student account username here.
Type in your password at the prompt. The cursor will not move as you type your password!
Congrats! You are now connected to the login node. Your current working directory will be your home directory.
Getting Started
To begin this lesson, let's move to our data directory /data/$USER.
Let's also grab some files that we will need for this lesson from BTEP Teaching Materials.
Not sure what cp does? See Lesson 2.
Navigate to your Module_1 directory.
Flags and command options - making programs do what they do
In Lesson 2, I introduced the idea of command options (flags). Options change the behavior of a command by modifying its parameters, and choosing the right ones is essential to getting the results you expect.
Let's return to ls for an example. Compare the output from these commands:
ls -h (when used with the -l option, prints file sizes in a human-readable format with the unit suffixes: Byte, Kilobyte, Megabyte, Gigabyte, Terabyte. This reduces the number of digits displayed.)
ls -l (list in long format). The column output is as follows:
- directory / file type
- Content permissions
- Number of hard links to content
- Owner
- Group owner
- Content size (bytes)
- Last modified date / time
- File / directory name
ls -S (sort from largest to smallest file size)
What do you see when combining the -h and -l flags?
There are many flags you can use with ls. How would we find out what they are?
Or to see a more user friendly display, google to the rescue. Google "man ls unix" and see what you get. Here's a useful, readable explanation of the "ls" command with examples.
Try combining some of the ls flags.
-a (show hidden "dot" files, whose names begin with .)
-t (sort by modification time with most recent listed first)
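As a minimal sketch of combining flags, the commands below build a small scratch directory and list it several ways. The directory and the file names (small.txt, big.txt, .hidden) are made up for illustration and are not part of the lesson materials.

```shell
# Hypothetical scratch directory and files, created only for this demonstration.
demo=$(mktemp -d)
cd "$demo"
printf 'ACGT\n' > small.txt          # a 5-byte file
head -c 5000 /dev/zero > big.txt     # a 5000-byte file
touch .hidden                        # a hidden "dot" file

ls -l      # long format: type/permissions, links, owner, group, size, date, name
ls -lh     # same, with human-readable sizes (e.g., 4.9K instead of 5000)
ls -lS     # long format, largest file first
ls -lt     # long format, most recently modified first
ls -a      # also shows .hidden (plus . and ..)
```

Single-letter flags can be stacked behind one dash, so ls -l -h and ls -lh should behave the same way.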
Flags and options add a layer of complexity to Unix commands but are necessary to get the command or program to behave in the way you expect. For example, here is a command line for running "blastn", an NCBI BLAST application. The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
module load blast # Try this out by first loading the blast module
blastn -db /fdb/blastdb/nt -query seq1.fasta -out results.out
What's going on in this command line? First, the BLAST algorithm is specified, in this case blastn. Then the -db flag is used to choose the database to search against (nt for nucleotide). The -query flag specifies the input sequence, which is in FASTA format, and the -out flag specifies the name of the output file. See more about using Blast on NIH Biowulf here.
Note
We are not running this today. This is just an example. If you want to run this, make sure you are not on the login node (use sinteractive) and be sure to allocate enough memory. See the Biowulf Blast help docs.
Use of wildcards
Wildcard characters are a handy tool when working at the command line. Wildcards "are symbols or special characters that represent other characters." Want to list all your FASTA files?
Use *:
The * wildcard matches zero or more characters, including spaces.
This example will work as long as all your FASTA files end in .fasta. But sometimes they don't (e.g., fas or fa). Read more about FASTA file extensions here.
For example, to find a file ending in .fa, you could use the following:
In addition to the asterisk (*) character, you can use other wildcards on the Unix command line, including but not limited to the following:
? - matches a single character
{} - used for multiple matches
[] - specify a range of characters or numbers. For example, [a-z] would specify all lower case letters and [0-9] would mean all digits.
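The wildcards above can be tried out safely in a scratch directory. This is a minimal sketch: the file names are hypothetical, and {} is left out because brace expansion is a bash feature rather than part of basic sh pattern matching.

```shell
# Hypothetical scratch directory and empty files, created only for this demonstration.
demo=$(mktemp -d)
cd "$demo"
touch seq1.fasta seq2.fasta seqAB.fasta sample.fa notes.txt

ls *.fasta          # every file ending in .fasta
ls *.fa             # only sample.fa
ls *.fa*            # .fa and .fasta files together
ls seq?.fasta       # ? matches exactly one character, so seqAB.fasta is skipped
ls seq[0-9].fasta   # exactly one digit after "seq"

# echo shows what the shell expands each pattern to before the command runs.
single_char=$(echo seq?.fasta)
one_digit=$(echo seq[0-9].fasta)
fa_only=$(echo *.fa)
```

The shell, not ls, does the matching: ls simply receives the list of names that the pattern expanded to.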
To see some more practical examples of using wildcards, see this article from tecmint.com and this one from medium.com. The second article provides a nice discussion of how wildcards differ from regular expressions.
Using tab complete for less typing
Here's a good Unix trick to know - tab complete. Start typing the name of the file or directory you want, and hit the tab key. The system will auto-complete the name of the file or directory if the name is unique. It may only give a partial name if there is more than one file with a similar name in the directory. You can fill in the next part of the name and then try tab-complete again.
Let's create some files to test this.
Start typing...
What do you get? How does that differ from:
The tab complete will save you lots of typing, and also help to figure out if you are where you think you are in the directory structure.
Access your history with the "up" and "down" arrows on your keyboard
Here's another Unix trick to make your life easier. Access previous commands with the up and down arrows on your keyboard. You can scroll backwards and forwards. This helps when you've got a typo or small mistake in your command lines that you can fix without retyping the whole thing.
The history command
You can also search, view, and retrieve recently used commands using history. See this guide.
Keyboard shortcuts
There are also a few handy keyboard shortcuts to make life on the command line easier. For example:
ctrl c to kill a running process
ctrl l to clear the screen
ctrl a to skip to the beginning of a command
ctrl e to skip to the end of a line
See more examples here.
cat, head, and tail
Who says Unix programmers don't have a sense of humor? Let me introduce cat, head, and tail. The cat command (short for "concatenate") is an extremely useful command for creating new files and viewing file content. You can use it to open files for reading input and writing output. Or you can use it to copy several files into a new file. Also you can "append" the contents of a file to the end of another file.
This command reads the content of sample.fasta and outputs it to standard output (i.e., the screen). This is not helpful for very large files, as the output scrolls to the end of the file quickly. less is a better option for reading large files.
You can use cat to combine several files into one file, such as:
However, this again prints to standard output (the screen). To capture that output, we need to learn how to redirect output. (Coming up next!)
In the meantime, let's take a quick look at head and tail.
head - prints the first 10 lines of a specified file (by default)
tail - prints the last 10 lines of a specified file (by default)
You can specify how many lines you would like to see (-n), or you can use the default value, which is 10.
Want to be sure of those 20 lines? Let's use cat -n to show the file with numbered lines and compare.
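Here is a minimal sketch of head, tail, and cat -n. It uses a hypothetical file, lines.txt, containing the numbers 1 to 50 so that it is easy to see which lines each command returns; it stands in for a real FASTA file.

```shell
# Hypothetical scratch file: 50 numbered lines stand in for a real sequence file.
demo=$(mktemp -d)
cd "$demo"
seq 1 50 > lines.txt

head lines.txt          # first 10 lines (the default)
tail lines.txt          # last 10 lines (the default)
head -n 20 lines.txt    # first 20 lines
tail -n 5 lines.txt     # last 5 lines
cat -n lines.txt        # whole file, each line prefixed with its line number
```

Because each line's content equals its line number in this file, the numbers added by cat -n should match the content beside them.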
What if you didn't know what -n did? How could you find out more about "cat"?
Working with file content (input <, output >, and append >>)
By default, commands take input from the standard input (your keyboard) and send the results to standard output (your screen). If you want to send the output (results) of a command to a file, you need to use output redirection. Similarly, we can redirect input so that, instead of coming from the keyboard, it comes from a file. This is known as input redirection. Learn more here.
< - input redirection operator
> - output redirection operator
>> - append redirection operator
Want to put the output from cat, head, or tail into a new file?
Or we could put the last 20 lines into a file with tail.
What if we want the first 20 lines and the last 20 lines in one file, with the first at the top and the last at the bottom? Use append (>>) to paste the second set of lines to the bottom of the first file. Let's try it.
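A minimal sketch of output redirection followed by append. The file names lines.txt and ends.txt are hypothetical; lines.txt holds the numbers 1 to 100 so the result is easy to inspect.

```shell
# Hypothetical scratch file with 100 numbered lines.
demo=$(mktemp -d)
cd "$demo"
seq 1 100 > lines.txt

head -n 20 lines.txt > ends.txt     # > creates ends.txt (or overwrites it)
tail -n 20 lines.txt >> ends.txt    # >> adds to the bottom of ends.txt

cat ends.txt                        # lines 1-20 followed by lines 81-100
```

Swapping the >> for a second > would leave only the last 20 lines in ends.txt.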
Keep in mind that if you redirect output with > into the same file multiple times, each command overwrites the previous contents. For example, what is the final content of our file covid.fasta?
head -n 20 seq1.fasta > covid.fasta
head -n 20 seq2.fasta > covid.fasta
head -n 20 seq3.fasta > covid.fasta
How many lines are now in covid.fasta? How can you check?
Let's try wc, short for word count.
wc is a very useful command. Without opening a file, we can find out how many lines, words, and characters are in it. Line counts are extremely useful for assessing your data output.
Suppose we created a file and expected it to contain 1000 lines of output. The wc command provides a quick way to check.
What happened to all of our content? The final results are from "seq3.fasta" only. The other two results files have been overwritten.
So, how would you get all three files into covid.fasta? You'll need to use append.
How could you test to see if the file has the expected number of lines?
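The difference between overwriting and appending can be checked with wc -l. This is a sketch under stated assumptions: the three input files are hypothetical 30-line stand-ins generated on the spot, not the lesson's real seq1.fasta, seq2.fasta, and seq3.fasta.

```shell
# Hypothetical 30-line stand-ins for the three sequence files.
demo=$(mktemp -d)
cd "$demo"
seq 101 130 > seq1.fasta
seq 201 230 > seq2.fasta
seq 301 330 > seq3.fasta

# Using > three times: each command replaces the previous contents.
head -n 20 seq1.fasta > covid.fasta
head -n 20 seq2.fasta > covid.fasta
head -n 20 seq3.fasta > covid.fasta
overwritten=$(wc -l < covid.fasta)
echo "lines after overwriting: $overwritten"

# Using > once and then >>: the later output is added to the end.
head -n 20 seq1.fasta > covid.fasta
head -n 20 seq2.fasta >> covid.fasta
head -n 20 seq3.fasta >> covid.fasta
appended=$(wc -l < covid.fasta)
echo "lines after appending: $appended"
```

With 20 lines taken from each file, the appended version should hold 60 lines, while the overwritten version holds only the 20 from the last command.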
To redirect input, we would use something like this:
Yes, you could use less covid.fasta. However, these commands act differently. With less covid.fasta, the file name is passed to less as an argument, and less opens the file itself. With less < covid.fasta, the shell opens the file and passes the content of covid.fasta to less on standard input.
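The difference is easier to see with wc than with less, because wc prints the file name only when it knows it. A minimal sketch, assuming a hypothetical three-line covid.fasta created just for this purpose:

```shell
# Hypothetical three-line file standing in for covid.fasta.
demo=$(mktemp -d)
cd "$demo"
printf '>demo\nACGT\nTTGA\n' > covid.fasta

with_name=$(wc -l covid.fasta)      # wc is given the file name as an argument
from_stdin=$(wc -l < covid.fasta)   # wc only sees the content on standard input

echo "$with_name"     # line count followed by the file name
echo "$from_stdin"    # line count only
```

In the second form wc never learns the file name, which is why it cannot print it.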
Combining commands with pipe (|). Where the heck is pipe anyway?
The hardest thing about using pipe (|) is finding it on your keyboard!
The pipe symbol "|" (a.k.a., vertical bar) is way over on the right hand side of your keyboard, above the backslash \.
Pipe is used to take the output from one command, and use it as input for the next command.
For example,
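One small illustration, using a hypothetical scratch directory: when its output is piped, ls writes one name per line, and wc -l counts those lines, so the pipeline reports how many entries the directory holds.

```shell
# Hypothetical scratch directory with three empty files.
demo=$(mktemp -d)
cd "$demo"
touch a.txt b.txt c.txt

ls | wc -l               # the output of ls becomes the input of wc -l
file_count=$(ls | wc -l)
echo "files found: $file_count"
```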
Using what we've learned, all together now.
Let's say you've got a very large FASTA or FASTQ file, and you want to run an analysis on it. Before working on the whole file, it can be useful to set up a smaller test file instead.
Here's one way to do it.
This combines several things we have learned about. The cat command reads the file sample.fasta and writes its content to standard output. The pipe (|) takes that output and runs it through the head command, where we keep only the first 20 lines, and the output redirect (>) writes them into a file called "output.fasta".
Note
We could simply use head without cat, but we wanted to use the pipe for an example.
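The steps described above can be sketched as follows. The sample.fasta used here is a hypothetical 200-line stand-in generated on the spot, and output_direct.fasta is an extra file added only to show that head alone gives the same result.

```shell
# Hypothetical 200-line stand-in for a large sequence file.
demo=$(mktemp -d)
cd "$demo"
seq 1 200 > sample.fasta

# cat writes the file to standard output, head keeps the first 20 lines,
# and > saves those lines in output.fasta.
cat sample.fasta | head -n 20 > output.fasta

# The same result without cat or the pipe.
head -n 20 sample.fasta > output_direct.fasta

wc -l sample.fasta output.fasta output_direct.fasta
```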
Let's compare the files. How are they different?
and
Finding information in files with grep
The grep utility is used to search files looking for a pattern match. It is used like this.
As our first example, we will look for restriction enzyme (EcoRI) sites in a sequence file (eco.fasta). The file has four EcoRI sites, but two of them span a line break (and won't be found).
We can modify the eco.fasta file to remove the line breaks (\n) at the ends of the lines using tr.
-v: This prints out all the lines that do not match the pattern, effectively removing the header line (>) of the file.
The Unix tr (translate) command is used for translating or deleting characters.
Usage:
-d: Deletes characters in the first set from the output
So this part of the command line is finding the line breaks (\n) and removing them.
And now we can see all four of the EcoRI sites.
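The effect of the line breaks can be reproduced on a tiny file. This is a sketch, not the lesson's data: the eco.fasta below is a made-up sequence with four EcoRI sites (GAATTC), two within a line and two split across line breaks, and eco_oneline.txt is a hypothetical output name.

```shell
# Made-up FASTA file: two GAATTC sites sit inside a line, two span a line break.
demo=$(mktemp -d)
cd "$demo"
printf '>eco_demo\nAAGAATTCGGGA\nATTCCCGAATTC\nTTGAA\nTTCAA\n' > eco.fasta

grep GAATTC eco.fasta    # prints only the two lines with an unbroken site

# Drop the header line, delete every newline, and save the joined sequence.
grep -v '>' eco.fasta | tr -d '\n' > eco_oneline.txt

grep GAATTC eco_oneline.txt    # one long line that now contains all four sites
```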
What if we just wanted to count the occurrence of the EcoRI sites in the sequence?
Perhaps "1" is not what you expected. Check out man grep.
The -c option prints only a count of the lines that match a pattern. Here we are counting the entire line as one.
If we want to see each of the EcoRI sites listed:
And if we want to count the number of EcoRI sites:
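A minimal sketch of the difference between counting lines and counting sites, again on a made-up eco.fasta with four GAATTC sites. It relies on grep -o, which prints each match on its own line and is available in GNU and BSD grep.

```shell
# Made-up FASTA file with four GAATTC sites (two of them span a line break).
demo=$(mktemp -d)
cd "$demo"
printf '>eco_demo\nAAGAATTCGGGA\nATTCCCGAATTC\nTTGAA\nTTCAA\n' > eco.fasta

# -c counts matching lines; the joined sequence is a single line.
matching_lines=$(grep -v '>' eco.fasta | tr -d '\n' | grep -c GAATTC)

# -o prints every match on its own line, so wc -l counts the sites themselves.
grep -v '>' eco.fasta | tr -d '\n' | grep -o GAATTC
site_count=$(grep -v '>' eco.fasta | tr -d '\n' | grep -o GAATTC | wc -l)

echo "matching lines: $matching_lines"
echo "EcoRI sites: $site_count"
```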
Let's create a word file that we can input to grep. We can input multiple restriction enzyme sites and search for all of them.
Put in the words (GAATTC, TTTTT). Now we can use that file to find lines.
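One way to sketch this is with grep -f, which reads one pattern per line from a file. The file names sites.txt and reads.txt, and the sequences in them, are hypothetical.

```shell
# Hypothetical pattern file and sequence file.
demo=$(mktemp -d)
cd "$demo"
printf 'GAATTC\nTTTTT\n' > sites.txt
printf 'ACGAATTCAA\nCCCCGGGG\nAATTTTTGG\n' > reads.txt

grep -f sites.txt reads.txt      # prints lines that match any pattern in sites.txt
matches=$(grep -c -f sites.txt reads.txt)
echo "lines with at least one site: $matches"
```

A line is printed once even if it matches more than one of the patterns.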
We can find a list of options to be used with grep using man:
Performing repetitive actions with Unix
We can create a "for" loop to do iterative actions in Unix.
The commands in a for loop can be written all on one line or on separate lines (i can be any variable name). These steps can be saved as a file, thereby creating a simple Unix script.
What does this "for loop" do?
What do these command lines do?
The first pulls out all the header (">") lines in the FASTA files, while the second pulls out the header lines only from files named seq*.fasta. Learn more about for loops here.
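A minimal sketch of a for loop that collects header lines, written first on separate lines and then on one line. The three FASTA files and the output name headers.txt are hypothetical.

```shell
# Hypothetical FASTA files, one record each.
demo=$(mktemp -d)
cd "$demo"
printf '>seq1 demo\nACGT\n' > seq1.fasta
printf '>seq2 demo\nGGCC\n' > seq2.fasta
printf '>other demo\nTTAA\n' > other.fasta

# Multi-line form: i takes the name of each file matching seq*.fasta in turn.
for i in seq*.fasta
do
    echo "Processing $i"
    grep '>' "$i" >> headers.txt
done

# One-line form over every .fasta file; semicolons replace the line breaks.
for i in *.fasta; do grep '>' "$i"; done
```

Quoting "$i" keeps the loop working even if a file name contains spaces.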
Permissions - when all else fails check the permissions
Permissions dictate who can access your files and directories, and what actions they can perform. If you are consistently getting error messages from your command line and you're sure you are typing it correctly, it's worthwhile checking the permissions. So what does it all mean? How do we read the permissions information?
As we have seen, ls -l provides information about file types, the owner of the file, and other permissions.
For example:
Here is an overview of what these permissions mean:
Image from booleanworld.com
The first character indicates the file type: d for a directory, - for a regular file, or another letter for a special file type. The next 3 positions are the owner's/user's permissions. In this image, the owner can "read", "write" and "execute". So they can create files and directories here, read files here, and execute/run programs. The next 3 positions show the permissions for the "group". The last 3 positions show the permissions for everyone else ("other").
You can modify permissions using chmod. Let's see this in action.
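Here is a minimal sketch of chmod with symbolic modes, where u, g, and o stand for user (owner), group, and other. The file script.sh is a hypothetical empty file created for the demonstration.

```shell
# Hypothetical empty file in a scratch directory.
demo=$(mktemp -d)
cd "$demo"
touch script.sh

chmod u=rw,g=r,o= script.sh    # owner: read/write, group: read, other: nothing
ls -l script.sh                # permissions start with -rw-r-----

chmod u+x script.sh            # add execute permission for the owner only
ls -l script.sh                # permissions start with -rwxr-----

# Keep just the first 10 characters of the long listing: file type + permissions.
perms=$(ls -l script.sh | cut -c1-10)
echo "$perms"
```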
Help Session
Let's complete a Unix treasure hunt.