Useful unix

More useful Unix

Flags and command options - making programs do what they do
Use of wildcards
Using tab complete for less typing
Access your history with the "up" and "down" arrows on your keyboard
cat and head and tail
Working with file content (input and output, append)
Combining commands with pipe (|). Where the heck is pipe anyway?
Finding information in files with grep
Performing repetitive actions with Unix
Permissions - when all else fails check the permissions

1. Flags and command options - making programs do what they do

Compare the output from these commands:

ls
ls -S
ls -lh

"ls -h" (when used with "-l" option, use Byte, Kilobyte, Megabyte, Gigabyte, Terabyte so as to reduce the number of digits displayed)

"ls -l" (list in long format, output total sum of all file sizes to terminal)

"-S" (sort from largest to smallest file size)

What do you see? By adding the "-h" and "-l" flags to the "ls" command we can see more details about the files and directories, including permissions (we'll talk more about these in a minute), owner, users and groups, file size and date. There are many flags you can use with "ls" command. How would we find out what they are?

man ls

Or - to see a more user friendly display, google to the rescue. Google "man ls unix" and see what you get. Here's a useful, readable explanation of the "ls" command with examples.

https://shapeshed.com/unix-ls/

Try combining some of the "ls" flags.

ls -lhS
ls -alt

"-a" (show hidden dot files ".")

"-t" (sort by modification time with most recent listed first)

Flags and options add a layer of complexity to Unix commands but are necessary to get the command or program to behave in the way you expect. For example, here is a command line for running "blastn", an NCBI BLAST application.

blastn -db nt -query seq1.fasta -out results.out

What's going on in this command line? First, the BLAST program is specified, in this case "blastn". Then the "-db" flag is used to choose the database to search against ("nt" for nucleotide). The "-query" flag specifies the input sequence file, which is in FASTA format, and the "-out" flag specifies the name of the output file.
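Going back to the "ls" flags for a moment, here is a small sketch you can run to convince yourself that combined flags behave the same as separate ones. The temporary directory and the file names (small.txt, large.txt, .hidden) are made up for illustration and are not part of the lesson data.

```shell
# Work in a throwaway directory so none of your own files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'a\n' > small.txt
printf 'aaaaaaaaaaaaaaaaaaaa\n' > large.txt
touch .hidden

# Flags given separately and flags combined produce the same listing.
separate=$(ls -l -h -S)
combined=$(ls -lhS)

# "-S" puts the largest file first.
largest_first=$(ls -S | head -n 1)

# "-a" also shows names that start with a dot.
with_hidden=$(ls -a)

echo "$combined"
echo "$with_hidden"
```

Without "-a", the file ".hidden" does not appear in the listing at all.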

2. Use of wildcards

Wildcard characters are a handy tool when working at the command line. Want to list all your FASTA files?

ls *.fasta

This will work as long as all your FASTA files end in ".fasta". But sometimes they don't.

ls *.fa

Or perhaps you want all the FASTA and FASTQ files.

ls *.f*

In addition to the asterisk (*) character, you can use other wildcards on the Unix command line, including the question mark "?", square brackets "[]" and curly braces "{}".

To see some more practical examples of using wildcards, see https://www.tecmint.com/use-wildcards-to-match-filenames-in-linux/
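Here is a short sketch of how "?" and "[]" differ from "*". The file names are invented for the demonstration and live in a temporary directory. Curly braces are left out because they are a feature of shells such as bash rather than of every Unix shell.

```shell
# Work in a throwaway directory with a few empty, made-up files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch seq1.fasta seq2.fasta seq10.fasta reads.fastq notes.txt

# "?" matches exactly one character, so seq10.fasta is left out.
one_char=$(ls seq?.fasta)

# "[]" matches exactly one character from the set inside the brackets.
in_set=$(ls seq[12].fasta)

# "*" matches any number of characters, so this finds FASTA and FASTQ files.
fasta_and_fastq_count=$(ls *.f* | wc -l | tr -d ' ')

echo "$one_char"
echo "$fasta_and_fastq_count"
```

Here "seq?.fasta" and "seq[12].fasta" both list seq1.fasta and seq2.fasta, while "*.f*" matches four of the five files (everything except notes.txt).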

3. Using tab complete for less typing

Here's a good Unix trick to know - tab complete. Start typing the name of the file or directory you want, and hit the "tab" key. The system will auto-complete the name of the file or dir if the name is unique. It may only give a partial name if there is more than one file with a similar name in the directory. You can fill in the next part of the name and then try tab-complete again, as many times as you like. Let's create some files to test this.

touch file.txt
touch file.fasta
touch file.fastq

Start typing...

less f (hit tab)

what do you get? How does that differ from:

less file (hit tab)

The tab complete will save you lots of typing, and also help to figure out if you are where you think you are in the directory structure.

4. Access your history with the "up" and "down" arrows on your keyboard

Here's another Unix trick to make your life easier. Access previous commands with the up and down arrows on your keyboard. You can scroll backwards and forwards. This helps when you've got a typo or small mistake in your command line that you can fix without retyping the whole thing.

5. cat and head and tail

Who says Unix programmers don't have a sense of humor? Let me introduce cat and head and tail. The "cat" command (short for "concatenate") is an extremely useful command for creating new files and viewing file contents. You can use it to open files for reading input and writing output. Or you can use it to copy several files into a new file. Also you can "append" the contents of a file to the end of another file.

The command below reads the contents of the file "sample.fasta" and writes them to standard output, which is the screen. This is not helpful for very large files, as the output scrolls past quickly until it reaches the end of the file. The "less" command is a better option for reading large files.

cat sample.fasta

You can use "cat" to combine several files into one file, such as:

cat file1.txt file2.txt

although this again prints to standard output, the screen. To capture that output, we need to learn how to redirect output. (Coming up next!)

In the meantime, let's take a quick look at commands "head" and "tail".

head

head sample.fasta

tail

tail sample.fasta

You can specify how many lines you would like to see, or you can use the default value, which is 10. Want more?

head -n 20 sample.fasta

Want to be sure of those 20 lines? Use "cat" again, this time with the "-n" flag, which numbers every line of the file.

cat -n sample.fasta

The "cat" command can be used in more ways than what is shown here. How could you find out more about "cat"?

man cat

6. Working with file content (input <, and output >, append >>)

<
>
>>

Want to put the output from "cat", "head" or "tail" into a new file?

head -n 20 seq1.fasta > smaller.fasta

Or we could put the last 20 lines into a file with "tail".

tail -n 20 seq1.fasta > smaller2.fasta

What if we want the first 20 lines and the last 20 lines in one file, with the first at the top and the last at the bottom? Use "append, >>" to paste the second file to the bottom of the first file. Let's try it.

head -n 20 sample.fasta > smaller.fasta
tail -n 20 sample.fasta >> smaller.fasta

Keep in mind that if you redirect output with ">" into the same file multiple times, you are overwriting the previous contents each time. For example, what is the final content of our file "covid.fasta"?

head -n 20 seq1.fasta > covid.fasta
head -n 20 seq2.fasta > covid.fasta
head -n 20 seq3.fasta > covid.fasta

How many lines are now in "covid.fasta"? How can you check?

wc covid.fasta

What happened to all of our content? The final results are from "seq3.fasta" only. The output from the other two files has been overwritten.

So, how would you get all three files into "covid.fasta"? You'll need to use append.

cat seq1.fasta > covid.fasta
cat seq2.fasta >> covid.fasta
cat seq3.fasta >> covid.fasta

How could you test to see if the file has the expected number of lines?

wc covid.fasta

To feed a file to a command as its input, use "<".

less < covid.fasta
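To see the difference between ">", ">>" and "<" side by side, here is a minimal sketch using two tiny made-up FASTA files in a temporary directory. The file names and sequences are assumptions for illustration only.

```shell
# Work in a throwaway directory with two small, made-up FASTA files.
work_dir=$(mktemp -d)
cd "$work_dir"
printf '>seq1\nACGT\n' > seq1.fasta
printf '>seq2\nGGCC\n' > seq2.fasta

# ">" replaces the file every time, so only seq2 survives.
cat seq1.fasta > overwritten.fasta
cat seq2.fasta > overwritten.fasta

# ">>" adds to the end of the file, so both records are kept.
cat seq1.fasta > appended.fasta
cat seq2.fasta >> appended.fasta

# "cat" with several files plus one ">" gives the same result in one step.
cat seq1.fasta seq2.fasta > combined.fasta

# "<" hands the file to wc as input, so wc prints only the number.
overwritten_lines=$(wc -l < overwritten.fasta)
appended_lines=$(wc -l < appended.fasta)
echo "overwritten: $overwritten_lines lines, appended: $appended_lines lines"
```

The overwritten file ends up with 2 lines and the appended file with 4, and "combined.fasta" is identical to "appended.fasta".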

7. Combining commands with pipe (|). Where the heck is pipe anyway?

The hardest thing about using pipe (|) is finding it on your keyboard!

On most US keyboards, the pipe symbol "|" is way over on the right-hand side, on the same key as the backslash "\" (hold Shift to type it).

Pipe is used to take the output from one command, and use it as input for the next command, all in one command line. Let's look at some examples.

head -n 20 sample.fasta | wc

Using what we've learned, all together now.

Let's say you've got a very large FASTA or FASTQ file, and you want to run an analysis on it. Before working on the whole file, it can be useful to set up a smaller test file instead.

Here's one way to do it.

cat sample.fasta | head -n 20 > output.fasta

This combines several things we have learned about. The "cat" command opens the file "sample.fasta" for reading and sends its contents to standard output. The pipe "|" takes that output and runs it through the "head" command, where we only want the first 20 lines, and we redirect those lines with ">" into a file called "output.fasta". Let's compare the files. How are they different?

ls -lh

and

less sample.fasta
less output.fasta
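Two things are worth trying once the pipe makes sense. First, "head" can read a file by itself, so the "cat" step is optional. Second, you can chain more than two commands. The sketch below uses a made-up 50-line text file (sample.txt, created with awk) in a temporary directory; none of these names come from the lesson data.

```shell
# Work in a throwaway directory with a generated 50-line file.
work_dir=$(mktemp -d)
cd "$work_dir"
awk 'BEGIN { for (i = 1; i <= 50; i++) print "line " i }' > sample.txt

# With and without "cat": both write the first 20 lines to a file.
cat sample.txt | head -n 20 > via_cat.txt
head -n 20 sample.txt > direct.txt

# Chain two filters: take the first 20 lines, then the last 5 of those.
middle=$(head -n 20 sample.txt | tail -n 5)
echo "$middle"
```

The two output files are identical, and the chained command prints lines 16 to 20 of the original file.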

8. Finding information in files with grep

The "grep" utility is used to search files looking for a pattern match. It is used like this.

grep [options] pattern filename

As our first example we will look for restriction enzyme (EcoRI) sites in a sequence file (eco.fasta). The file has four EcoRI sites, but two of them are split across a line break. Because "grep" searches one line at a time, those two won't be found.

cd /data
ls
grep -n GAATTC eco.fasta

We can remove the line breaks (\n) at the ends of the lines before searching. This changes only the text passed along the pipe, not the "eco.fasta" file itself.

grep -v ">" eco.fasta | tr -d "\n" | grep GAATTC

-v : This prints out all the lines that do not match the pattern

The unix "tr" (translate) command is used for translating or deleting characters.

Usage:

tr [option] set1 [set2]

-d : Deletes characters in the first set from the output

So this part of the command line is finding the line breaks "\n" and removing them.

The header lines (>) can also be removed.

And now we can see all four of the EcoRI sites.

What if we just wanted to count the occurrence of the EcoRI sites in the sequence?

grep -v ">" eco.fasta | tr -d "\n" | grep -c GAATTC

-c : This prints only a count of the lines that match a pattern.

The count is 1, because the whole sequence is now a single line and "-c" counts matching lines, not individual matches.

To see each of the EcoRI sites listed, use "-o".

grep -v ">" eco.fasta | tr -d "\n" | grep -o GAATTC

And if we want to count the number of EcoRI sites, pipe that list to "wc".

grep -v ">" eco.fasta | tr -d "\n" | grep -o GAATTC | wc
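If you don't have "eco.fasta" handy, the same effect can be reproduced with a tiny made-up file. In the sketch below, demo.fasta is an invented three-line file containing three EcoRI sites, one of which is split across a line break. It is not the lesson's eco.fasta, so the counts differ from the ones above.

```shell
# Work in a throwaway directory with a small, made-up FASTA file.
# Joined sequence: AAGAATTCAAGAATTCAAGAATTC (three GAATTC sites).
work_dir=$(mktemp -d)
cd "$work_dir"
printf '>demo\nAAGAATTCAAGA\nATTCAAGAATTC\n' > demo.fasta

# Searching line by line misses the site that spans the line break.
per_line_sites=$(grep -o GAATTC demo.fasta | wc -l)

# After removing the header and the line breaks there is a single line,
# so "-c" (which counts matching lines) reports 1.
joined_lines=$(grep -v ">" demo.fasta | tr -d "\n" | grep -c GAATTC)

# "-o" prints each match on its own line, so wc -l counts the sites.
joined_sites=$(grep -v ">" demo.fasta | tr -d "\n" | grep -o GAATTC | wc -l)

echo "per line: $per_line_sites, joined lines: $joined_lines, joined sites: $joined_sites"
```

Searching line by line finds 2 sites, while searching the joined sequence finds all 3.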

Let's create a word file that we can input to "grep". We can input multiple restriction enzyme sites and search for all of them. We'll need to be in our home dir to do this as we do not have write privileges in /data.

cd
nano wordfile.txt

Type in the patterns, one per line (GAATTC and TTTTT), then save the file and exit. Now we can use that file to find matches.

grep -v ">" /data/eco.fasta | tr -d "\n" | grep -f wordfile.txt -o | wc

Here is a list of options for grep (from geeksforgeeks.org).

-c : This prints only a count of the lines that match a pattern

-h : Display the matched lines, but do not display the filenames

-i : Ignores case when matching

-l : Displays a list of filenames only

-n : Display the matched lines and their line numbers

-v : This prints out all the lines that do not match the pattern

-e exp : Specifies expression with this option. Can use multiple times

-f file : Takes patterns from file, one per line

-E : Treats pattern as an extended regular expression (ERE)

-w : Match whole word

-o : Print only the matched parts of a matching line, with each such part on a separate output line.

-A n : Prints the matched line and n lines after it

-B n : Prints the matched line and n lines before it

-C n : Prints the matched line and n lines before and after it
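A few of these options are easiest to understand by trying them. The sketch below uses an invented two-record FASTA file (demo.fasta) in a temporary directory, with one EcoRI site written in lowercase on purpose.

```shell
# Work in a throwaway directory with a small, made-up FASTA file.
work_dir=$(mktemp -d)
cd "$work_dir"
printf '>seq1\nACGTgaattcAC\n>seq2\nTTTTTGAATTC\n' > demo.fasta

# Counting header lines is a quick way to count sequences in a FASTA file.
record_count=$(grep -c ">" demo.fasta)

# Without "-i" the lowercase site is missed; with "-i" it is found.
case_sensitive=$(grep -c GAATTC demo.fasta)
ignore_case=$(grep -c -i GAATTC demo.fasta)

# "-e" can be repeated to search for several patterns; "-n" adds line numbers.
numbered=$(grep -n -e TTTTT -e gaattc demo.fasta)

echo "records: $record_count"
echo "$numbered"
```

The last command reports lines 2 and 4, since each of them matches one of the two patterns.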

There are also existing programs to find motifs (patterns) in sequence data.

EMBOSS - fuzznuc - a motif finding program

fuzznuc -help
fuzznuc -pattern GAATTC -rformat2 tagseq eco.fasta -outfile /home/username/results

We put the results in our home dir as we do not have write permissions in "/data".

-pattern : Nucleotide pattern we are looking for

-rformat2 : Report format

tagseq : tag sequence

-outfile : name and path to the results file

9. Performing repetitive actions with Unix.

We can create a "for" loop to do iterative actions in Unix.

The commands of a "for" loop can be written all on one line or on separate lines ("i" can be any variable name). These steps can be saved in a file, thereby creating a simple Unix script.

What does this "for loop" do?

for i in *.fasta; do ls $i; done

What do these command lines do?

This one pulls out all the header ">" lines in the fasta files.

for i in *.fasta; do echo $i; grep ">" $i; done

While this one just pulls out the ones from files named seq*.fasta.

for i in seq*.fasta; do echo $i; grep ">" $i; done
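The same kind of loop can be spread over several lines, which is how it would usually look when saved in a script file. This is only a sketch: the two FASTA files and the output file name (header_counts.txt) are made up, and the variable is called "fasta_file" instead of "i" to show that any name works.

```shell
# Work in a throwaway directory with two small, made-up FASTA files.
work_dir=$(mktemp -d)
cd "$work_dir"
printf '>seq1 first\nACGT\n' > seq1.fasta
printf '>seq2 second\nGGCC\n>seq3 third\nTTAA\n' > seq2.fasta

# One command per line; the output of the whole loop goes into one file.
for fasta_file in *.fasta
do
    header_count=$(grep -c ">" "$fasta_file")
    echo "$fasta_file: $header_count"
done > header_counts.txt

cat header_counts.txt
```

Putting the variable in double quotes ("$fasta_file") keeps the loop working even if a file name contains spaces.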

10. Permissions - when all else fails check the permissions

Permissions dictate who can access your files and directories, and what actions they can perform. If you are consistently getting error messages from your command line and you're sure you are typing it correctly, it's worthwhile checking the permissions. So what does it all mean? How do we read the permissions information?

In the "ls -lh" example below, we are looking at a directory (d) and a file (-). Let's look at the directory permissions.

The first letter "d" indicates that this is a directory. The next 3 positions are the owner's permissions. In this example, the owner can "read", "write" and "execute". So they can create files and directories here, read files here, and execute/run programs. The next 3 positions show the permissions for the "group". In this example, "group" members can "read" and "execute" files, but not write/create files. A dash symbol "-" indicates no privileges in that position. The last 3 positions show permissions for everyone else. Here they can "read" and "execute" files.

The number "53" before the username is the link count, which for a directory depends on how many items it contains. Right after that is your username, or my username in this case, "stonelakeak". Following that is my group, "NIH\Domain Users". The size is next at "1.7K", then there is a time stamp showing when the item was last modified. Here it is Apr 5 at 3:46 in the afternoon. And then finally, the last word on the line is the name, here "teaching". And that's how you read permissions.

ls -lh

drwxr-xr-x   53 stonelakeak  NIH\Domain Users   1.7K Apr  5 15:46 teaching
-rw-r--r--    1 stonelakeak  NIH\Domain Users    18B Apr  6 11:06 wordfile.txt

To change permissions, we use the "chmod" (change mode) command.
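As a minimal sketch of what "chmod" does, the example below sets and then changes the permissions on a made-up script file (hello.sh) in a temporary directory. It shows the two common ways of writing permissions: numbers such as 644, and letters such as u+x.

```shell
# Work in a throwaway directory with a small, made-up script file.
work_dir=$(mktemp -d)
cd "$work_dir"
printf '#!/bin/sh\necho hello\n' > hello.sh

# 644 means: owner can read and write, group and everyone else can only read.
chmod 644 hello.sh
before=$(ls -l hello.sh | cut -c 1-10)

# u+x adds execute permission for the owner (the "user") only.
chmod u+x hello.sh
after=$(ls -l hello.sh | cut -c 1-10)

echo "before: $before"
echo "after:  $after"
```

The permissions go from "-rw-r--r--" to "-rwxr--r--". Only the owner's execute position changes.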