Lesson 3: Working with files and directories, interactive sessions, and exploring Next Generation Sequencing data

Quick review

Lesson 2 introduced the Biowulf directory structure and distinguished between the home and data directories. It introduced commands for

listing the present working directory (pwd)
listing directory content in short format (ls)
listing directory content in detailed format (ls -l)
moving from one directory to another (cd)
copying a folder (cp -r)

Learning objectives

After this lesson, the learner should be able to

Make new directories
Move and rename files and folders
Delete files
Search for patterns in files
Describe the difference between the Biowulf log-in node and compute nodes
Request an interactive session
Explore next generation sequencing data using Unix commands

Commands that will be discussed

mkdir: make new directory
mv: move or rename file or directory
rm: delete
sinteractive: request an interactive session
head: view content at the beginning of a file
tail: view content at the end of a file
less: page through a file
grep: search for pattern in a file
wc: word count

Before getting started

Sign in to Biowulf using the assigned student account. Remember that Windows users will need to open the Command Prompt and Mac users will need to open the Terminal. Also remember to connect to the NIH network, either by being on campus or through VPN, before attempting to sign in. The command to sign in to Biowulf is below, where username should be replaced by the student ID.

ssh username@biowulf.nih.gov

The student account assignment sheet lists the account assigned to each student. NIH credentials are required to view it.

After connecting to Biowulf, change into the data directory. Again, replace username with the student account ID.

cd /data/username

Biowulf does not keep data in the student accounts after class, so copy the folder unix_on_biowulf_2023_documents in /data/classes/BTEP to the present working directory, which should be /data/username.

cp -r /data/classes/BTEP/unix_on_biowulf_2023_documents .

Change into unix_on_biowulf_2023_documents.

cd unix_on_biowulf_2023_documents

Next, change into unix_on_biowulf_2023.

cd unix_on_biowulf_2023

Make a new directory called lesson3

mkdir lesson3

Moving files

Move the file counts.csv to lesson3 using the mv command, where the arguments are

item to move (i.e., the file counts.csv)
destination (i.e., the folder lesson3)

mv counts.csv lesson3

List the contents of the lesson3 folder to make sure the counts.csv file was successfully moved.

ls lesson3/

counts.csv

Move the counts.csv file from the folder lesson3 back to the present working directory, which should be /data/username/unix_on_biowulf_2023_documents/unix_on_biowulf_2023.

mv lesson3/counts.csv .

Use the ls command to list the content of the present working directory (/data/username/unix_on_biowulf_2023_documents/unix_on_biowulf_2023) to ensure that counts.csv has been moved back.

ls

counts.csv  lesson3  results.csv  text_1.txt

Moving folders

The mv command can also be used to move one folder into another. To demonstrate this, make a copy of the folder lesson3 and call it lesson3_copy.

cp -r lesson3 lesson3_copy

Move lesson3_copy into lesson3.

mv lesson3_copy lesson3

List the content of the lesson3 folder to confirm that lesson3_copy has been moved successfully.

ls lesson3

lesson3_copy

Renaming files

Rename results.csv to deg_results.csv.

mv results.csv deg_results.csv
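The examples above use mv both to move and to rename. Which one happens depends on the destination: if it is an existing directory, the item is moved into it; otherwise the item takes the destination as its new name. The sketch below illustrates both cases in a throwaway directory created with mktemp, so none of the course files are touched. The names sandbox, dir1, a.txt, and b.txt are made up for illustration.

```shell
# Create a temporary scratch directory (all names here are hypothetical).
sandbox=$(mktemp -d)
mkdir "$sandbox/dir1"
touch "$sandbox/a.txt"

# dir1 already exists as a directory, so a.txt is moved into it.
mv "$sandbox/a.txt" "$sandbox/dir1"
ls "$sandbox/dir1"

# b.txt does not exist yet, so the file is moved back up and renamed to b.txt.
mv "$sandbox/dir1/a.txt" "$sandbox/b.txt"
ls "$sandbox"
```

The first ls should list a.txt inside dir1, and the second should list b.txt next to the now empty dir1.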

Starting an interactive session

Upon logging on to Biowulf, the user is taken to a log-in node, which should not be used for computationally intensive tasks. To perform computationally intensive tasks, the user should work on a compute node.

Request an interactive session on a compute node with the sinteractive command.

sinteractive

The jobhist command can be used to look at compute allocations for an interactive session. The argument for jobhist is the job id, which can be referenced using the variable SLURM_JOBID. Note that "$" is used to reference variables in Unix.

jobhist $SLURM_JOBID

JobId              : 65090666
User               : wuz8
Submitted          : 20230509 15:40:56
Started            : 20230509 15:45:46
Ended              : 

Jobid        Partition       State  Nodes  CPUs      Walltime       Runtime         MemReq  MemUsed  Nodelist
65090666    interactive     RUNNING      1     2       8:00:00          2:48            2GB      1MB  cn4280
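As a small illustration of how "$" references a variable, the sketch below assigns a value to a variable and reads it back. The variable name sample is made up for this example. SLURM_JOBID is set by the scheduler only inside a job, so a fallback message is printed when the variable is empty or unset.

```shell
# Assign a value to a variable (no spaces around "=").
sample="SRR1553606"

# Prefix the variable name with "$" to read its value.
echo "$sample"

# Braces mark where the variable name ends when text follows directly.
echo "${sample}_1.fastq"

# Print the job id, or a fallback message when not inside a Slurm job.
echo "${SLURM_JOBID:-not inside a Slurm job}"
```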

Note

The default sinteractive allocation is 1 core (2 CPUs), 0.768 GB of memory per CPU, and a walltime of 8 hours. Resource allocations can be adjusted depending on the task.

Working with next generation sequencing files

Go back up one directory to the unix_on_biowulf_2023_documents folder and then change into the SRR1553606 folder.

cd ..

cd SRR1553606

There are two next generation sequencing data files (fastq files) in this folder.

ls

SRR1553606_1.fastq  SRR1553606_2.fastq

It is possible to use Unix commands to learn about the content of fastq files prior to analyzing them with more advanced tools that are available on Biowulf.

The head command prints the first 10 lines (default) of a file and can be applied to fastq files.

head SRR1553606_1.fastq

@SRR1553606.1 1 length=101
ATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACAGGAGTCGCCCAGCCCTGCTCAACGAGCTGCAG
+SRR1553606.1 1 length=101
@@@FDFDFHHHHHIJGIJJHHIIJIGHGHIJGFI9DDFH?FFHIGGGH>EHGIJEECCABBDABD####################################
@SRR1553606.2 2 length=101
CAACAACAACACTCATCACCAAGATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGGACTGATCA
+SRR1553606.2 2 length=101
CCCFFFFFHHGHHJJIJJJJIJJIDIJJJJHIGIJJJJIFHIJJJJJIJJJJGFFFDEEEEDDDDEEECDDDDDDDCEDDCCDDDD>CDDDDDDDDDDDCE
@SRR1553606.3 3 length=101
CTTGCATACTGCACTGGATTGAATTGCGGGACGGTCTGGATCGTCAGGCGCTCGATATTCCACGCTGCGCTCTTGGCGTTCCATTCGCAGTTATCGTGAAA

To print more or fewer than the default 10 lines, use the -n option followed by the number of lines desired. The head command below prints the first sequence record of the fastq file, which spans four lines.

head -n 4 SRR1553606_1.fastq

@SRR1553606.1 1 length=101
ATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACAGGAGTCGCCCAGCCCTGCTCAACGAGCTGCAG
+SRR1553606.1 1 length=101
@@@FDFDFHHHHHIJGIJJHHIIJIGHGHIJGFI9DDFH?FFHIGGGH>EHGIJEECCABBDABD####################################

The tail command can be used to view content at the end of a file. By default it shows the last 10 lines, and the -n option can be used to change this number.

tail SRR1553606_1.fastq
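To try head and tail without touching the course data, the sketch below builds a tiny fastq file and prints its first and last records. It assumes the usual four-line record layout (header starting with @, sequence, separator line starting with +, quality string). The read names and sequences are made up for illustration.

```shell
# Build a toy fastq file with three made-up records (four lines each).
toy=$(mktemp)
printf '%s\n' \
  '@read.1' 'ACGT' '+' 'IIII' \
  '@read.2' 'TTGA' '+' 'HHHH' \
  '@read.3' 'GGCC' '+' 'FFFF' > "$toy"

# First record: header, sequence, separator, quality string.
head -n 4 "$toy"

# Last record of the file.
tail -n 4 "$toy"
```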

There is also the less command, which allows for paging up and down through file contents. Hit the up and down arrow keys to scroll up and down line by line or hit the space bar to scroll down page by page. Hit q to exit the less command.

less SRR1553606_1.fastq

The grep command is used to search for patterns within files. For instance, we can search for the sequencing data header that corresponds to every sequence in a fastq file.

grep @SRR1553606 SRR1553606_1.fastq

Because each sequence in a fastq file has a header line, searching for the header lines and counting their occurrences is a plausible way to obtain the number of sequences. Again, grep can be used to search for the header, and the pipe ("|") can be used to send the results to wc -l to count the number of lines. In addition to lines, the wc command can count the words and characters in a file.

grep @SRR1553606 SRR1553606_1.fastq | wc -l
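One caveat with counting headers is that quality strings may also begin with @, so the search pattern should include more than the @ character alone. The sketch below, run on a made-up two-record fastq file, counts headers with an anchored pattern and cross-checks the result by dividing the total line count by four. It assumes every record spans exactly four lines. The read names are hypothetical, and grep -c is used as a shorthand for piping to wc -l.

```shell
# Toy fastq file: two made-up records; the first quality string starts with "@".
toy=$(mktemp)
printf '%s\n' \
  '@read.1 1 length=4' 'ACGT' '+read.1 1 length=4' '@@@F' \
  '@read.2 2 length=4' 'TTGA' '+read.2 2 length=4' 'CCCF' > "$toy"

# Count header lines. "^" anchors the match to the start of the line,
# and the quotes keep the shell from interpreting the pattern.
by_header=$(grep -c '^@read' "$toy")

# Cross-check: total number of lines divided by four lines per record.
total_lines=$(wc -l < "$toy")
by_lines=$((total_lines / 4))

echo "records by header: $by_header"
echo "records by line count: $by_lines"

# Matching "@" alone overcounts, because the quality line "@@@F" also matches.
grep -c '^@' "$toy"

# Without options, wc reports lines, words, and bytes, in that order.
wc "$toy"
```

Both counts should come out as 2 here, while the pattern '^@' alone reports 3.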

Deleting files or folders

Go back up one folder to unix_on_biowulf_2023_documents, this time using its absolute path. Replace username with the student account ID.

cd /data/username/unix_on_biowulf_2023_documents

Make a copy of SRP045416.swarm and call it SRP045416_copy_1.swarm

cp SRP045416.swarm SRP045416_copy_1.swarm

List the contents of the present working directory (/data/username/unix_on_biowulf_2023_documents) to confirm the existence of SRP045416_copy_1.swarm.

ls

SRP045416_copy_1.swarm  SRR1553606        unix_on_biowulf_2023.zip
SRP045416.swarm     unix_on_biowulf_2023

Delete SRP045416_copy_1.swarm.

rm SRP045416_copy_1.swarm

To remove folders, use rm with the -r option.
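A minimal sketch of removing a folder is shown below. It works in a throwaway directory created with mktemp, and the names sandbox, old_results, and run1 are made up for illustration. Note that rm does not move items to a trash folder, so it is worth double-checking the path before running it.

```shell
# Create a scratch directory containing a nested folder and a file (hypothetical names).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/old_results/run1"
touch "$sandbox/old_results/run1/counts.csv"

# The -r option removes the folder together with everything inside it.
rm -r "$sandbox/old_results"

# The scratch directory is now empty.
ls "$sandbox"
```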