Lesson 3: Working with files and directories, interactive sessions, and exploring Next Generation Sequencing data
Quick review
Lesson 2 introduced the Biowulf directory structure and the difference between the home and data directories. It introduced commands for
- printing the present working directory (pwd)
- listing directory content in short format (ls)
- listing directory content in detailed format (ls -l)
- moving from one directory to another (cd)
- copying a folder (cp -r)
Learning objectives
After this lesson, the learner should be able to
- Make new directories
- Move and rename files and folders
- Delete files
- Search for patterns in files
- Describe the difference between the Biowulf log-in node and compute nodes
- Request an interactive session
- Explore next generation sequencing data using Unix commands
Commands that will be discussed
- mkdir: make new directory
- mv: move or rename file or directory
- rm: delete
- sinteractive: request an interactive session
- head: view content at the beginning of a file
- tail: view content at the end of a file
- less: page through a file
- grep: search for pattern in a file
- wc: word count
Before getting started
Sign onto Biowulf using the assigned student account. Remember, Windows users will need to open the Command Prompt and Mac users will need to open the Terminal. Also remember to connect to the NIH network either by being on campus or through VPN before attempting to sign in. The command to sign in to Biowulf is below, where username should be replaced by the student ID.
ssh username@biowulf.nih.gov
See here for student account assignment. Enter NIH credentials to see the student account assignment sheet after clicking the link.
After connecting to Biowulf, change into the data directory. Again, replace username with the student account ID.
cd /data/username
Biowulf does not keep data in the student accounts after class, so copy the folder unix_on_biowulf_2023_documents in /data/classes/BTEP to the present working directory, which should be /data/username.
cp -r /data/classes/BTEP/unix_on_biowulf_2023_documents .
Change into unix_on_biowulf_2023_documents.
cd unix_on_biowulf_2023_documents
Next, change into unix_on_biowulf_2023.
cd unix_on_biowulf_2023
Make a new directory called lesson3
mkdir lesson3
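The mkdir command can also create nested folders in one step. The sketch below runs in a temporary directory so that nothing in the class folder is touched; the folder names results and plots are made up for illustration.

```shell
# Work in a throwaway directory (created by mktemp) instead of the class folder.
workdir=$(mktemp -d)
cd "$workdir"

# Make a single new directory, as in the lesson.
mkdir lesson3

# The -p option creates any missing parent directories and does not
# complain if the directory already exists.
mkdir -p lesson3/results/plots   # hypothetical subfolders
mkdir -p lesson3

# List every directory that now exists under lesson3.
find lesson3 -type d | sort
```

Without -p, running mkdir on a folder that already exists, or whose parent is missing, reports an error.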
Moving files
Move the file counts.csv to lesson3 using the mv command, where the arguments are
- item to move (i.e., the file counts.csv)
- destination (i.e., the folder lesson3)
mv counts.csv lesson3
Move the counts.csv file from the folder lesson3 back to the present working directory, which should be /data/username/unix_on_biowulf_2023_documents/unix_on_biowulf_2023.
mv lesson3/counts.csv .
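The same pattern extends to several files at once: when more than two arguments are given, mv treats the last one as the destination folder. The sketch below uses empty stand-in files in a temporary directory rather than the real class data.

```shell
# Work in a throwaway directory with empty stand-in files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir lesson3
touch counts.csv results.csv

# Several items can be moved at once; the last argument is the destination.
mv counts.csv results.csv lesson3
ls lesson3

# Move one file back; "." stands for the present working directory.
mv lesson3/counts.csv .
ls
```

Running ls after each mv is a simple way to confirm that a file ended up where it was expected.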
Moving folders
The mv command can also be used to move a folder into another folder. To demonstrate this, make a copy of the folder lesson3 and call it lesson3_copy.
cp -r lesson3 lesson3_copy
Move lesson3_copy into lesson3. Because lesson3 already exists, lesson3_copy becomes a subfolder of it.
mv lesson3_copy lesson3
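Whether mv moves or renames a folder depends on whether the destination already exists. The sketch below shows both cases in a temporary directory; the name lesson3_archive is hypothetical and only serves as a destination that does not exist yet.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir lesson3
cp -r lesson3 lesson3_copy

# lesson3 exists, so lesson3_copy is placed inside it.
mv lesson3_copy lesson3
ls lesson3

# lesson3_archive (hypothetical name) does not exist, so lesson3 is renamed.
mv lesson3 lesson3_archive
ls
```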
Renaming files
Rename results.csv to deg_results.csv.
mv results.csv deg_results.csv
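One thing to keep in mind when renaming: if a file with the new name already exists, mv replaces it without asking. A minimal sketch of one way to guard against this, using stand-in files in a temporary directory, is to test for the target name first.

```shell
# Work in a throwaway directory with two stand-in files.
workdir=$(mktemp -d)
cd "$workdir"
echo "new results" > results.csv
echo "old results" > deg_results.csv   # a file already using the target name

# Rename only if nothing is using the target name yet.
if [ -e deg_results.csv ]; then
    echo "deg_results.csv already exists; not renaming"
else
    mv results.csv deg_results.csv
fi

# The existing file is untouched.
cat deg_results.csv
```

When working at the prompt, mv -i offers similar protection by asking for confirmation before overwriting.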
Starting an interactive session
Upon logging on to Biowulf, the user is taken to a log-in node, which should not be used for computationally intensive tasks. To perform computationally intensive tasks, the user should work on a compute node.
To request an interactive session, use the sinteractive command.
sinteractive
The jobhist command can be used to look at the compute allocations for an interactive session. The argument for jobhist is the job ID, which can be referenced using the variable SLURM_JOBID. Note that "$" is used to reference variables in Unix.
jobhist $SLURM_JOBID
JobId : 65090666
User : wuz8
Submitted : 20230509 15:40:56
Started : 20230509 15:45:46
Ended :
Jobid     Partition    State    Nodes  CPUs  Walltime  Runtime  MemReq  MemUsed  Nodelist
65090666  interactive  RUNNING  1      2     8:00:00   2:48     2GB     1MB      cn4280
Note
The default sinteractive allocation is 1 core (2 CPUs), 0.768 GB of memory per CPU, and a walltime of 8 hours. Resource allocations can be adjusted depending on the task.
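The "$" notation used with SLURM_JOBID in the jobhist command works for any shell variable. The sketch below uses a made-up variable named sample to show how a value is assigned and read back; it runs anywhere, including outside an interactive session.

```shell
# Assign a variable (no spaces around "=") and read it back with "$".
sample=SRR1553606
echo "$sample"

# Braces mark where the variable name ends when more text follows.
echo "${sample}_1.fastq"

# SLURM_JOBID is only set inside a job; print a placeholder otherwise.
echo "Job id: ${SLURM_JOBID:-not in a job}"
```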
Working with next generation sequencing files
Go back up one directory to the unix_on_biowulf_2023_documents folder and then change into the SRR1553606 folder.
cd ..
cd SRR1553606
There are two next generation sequencing data files (fastq files) in this folder.
ls
SRR1553606_1.fastq SRR1553606_2.fastq
It is possible to use Unix commands to learn about the content of fastq files prior to analyzing them with more advanced tools that are available on Biowulf.
The head command prints the first 10 lines (default) of a file and can be applied to fastq files.
head SRR1553606_1.fastq
@SRR1553606.1 1 length=101
ATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACAGGAGTCGCCCAGCCCTGCTCAACGAGCTGCAG
+SRR1553606.1 1 length=101
@@@FDFDFHHHHHIJGIJJHHIIJIGHGHIJGFI9DDFH?FFHIGGGH>EHGIJEECCABBDABD####################################
@SRR1553606.2 2 length=101
CAACAACAACACTCATCACCAAGATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGGACTGATCA
+SRR1553606.2 2 length=101
CCCFFFFFHHGHHJJIJJJJIJJIDIJJJJHIGIJJJJIFHIJJJJJIJJJJGFFFDEEEEDDDDEEECDDDDDDDCEDDCCDDDD>CDDDDDDDDDDDCE
@SRR1553606.3 3 length=101
CTTGCATACTGCACTGGATTGAATTGCGGGACGGTCTGGATCGTCAGGCGCTCGATATTCCACGCTGCGCTCTTGGCGTTCCATTCGCAGTTATCGTGAAA
To print more or fewer than the default 10 lines, use the -n option followed by the number of lines desired. Each fastq record spans four lines (header, sequence, "+" separator, and quality scores), so the head command below prints the first sequence record of the fastq file.
head -n 4 SRR1553606_1.fastq
@SRR1553606.1 1 length=101
ATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACAGGAGTCGCCCAGCCCTGCTCAACGAGCTGCAG
+SRR1553606.1 1 length=101
@@@FDFDFHHHHHIJGIJJHHIIJIGHGHIJGFI9DDFH?FFHIGGGH>EHGIJEECCABBDABD####################################
The tail command can be used to view content at the end of a file. By default it shows the last 10 lines, and the -n option can be used to change this.
tail SRR1553606_1.fastq
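Because each fastq record occupies four lines, the -n option of head and tail can be set to a multiple of four to view whole records. The sketch below builds a tiny made-up fastq file (toy.fastq, with invented read names) in a temporary directory so that it can be run without the class data.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up fastq file with three records of four lines each.
printf '%s\n' \
  '@read.1' 'ACGT' '+' 'IIII' \
  '@read.2' 'TTGA' '+' 'HHHH' \
  '@read.3' 'GGCC' '+' '@@@@' > toy.fastq

# First n records: n records x 4 lines each.
n=2
head -n $((4 * n)) toy.fastq

# Last record: the final 4 lines of the file.
last=$(tail -n 4 toy.fastq)
echo "$last"
```

The same arithmetic applies to SRR1553606_1.fastq; only the file name changes.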
There is also the less command, which allows paging up and down through file contents. Use the up and down arrow keys to scroll line by line, or the space bar to scroll down page by page. Press q to exit less.
less SRR1553606_1.fastq
The grep command is used to search for patterns within files. For instance, we can search for the header line that accompanies every sequence in a fastq file.
grep @SRR1553606 SRR1553606_1.fastq
Because each sequence in a fastq file has a header line, searching for the header lines and counting their occurrences is a plausible way to obtain the number of sequences. Again, grep can be used to search for the header, and the pipe ("|") can then send the results to wc -l to count the number of lines. Besides lines, the wc command can also count the characters and words in a file.
grep @SRR1553606 SRR1553606_1.fastq | wc -l
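The header count can be cross-checked in a couple of ways. The sketch below uses a tiny made-up fastq file (toy.fastq, with invented read names) whose last quality line happens to start with "@", which is legal in fastq and is the reason for searching on the full read prefix rather than on "@" alone.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up fastq file with three records; the last quality line starts with "@".
printf '%s\n' \
  '@read.1' 'ACGT' '+' 'IIII' \
  '@read.2' 'TTGA' '+' 'HHHH' \
  '@read.3' 'GGCC' '+' '@@@@' > toy.fastq

# Count header lines as in the lesson. Quoting the pattern and anchoring it
# with ^ restricts matches to lines that start with the read prefix.
by_grep=$(grep '^@read' toy.fastq | wc -l)

# grep -c counts matching lines directly, without wc.
by_grep_c=$(grep -c '^@read' toy.fastq)

# Every record is four lines, so total lines / 4 is an independent check.
total_lines=$(wc -l < toy.fastq)
by_lines=$((total_lines / 4))

echo "$by_grep $by_grep_c $by_lines"
```

All three counts agree for this toy file. If they disagree on real data, the file may be truncated or the search pattern may be matching lines other than headers.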
Deleting files or folders
Return to the unix_on_biowulf_2023_documents folder, which is one level up.
cd /data/username/unix_on_biowulf_2023_documents
Make a copy of SRP045416.swarm and call it SRP045416_copy_1.swarm
cp SRP045416.swarm SRP045416_copy_1.swarm
Delete SRP045416_copy_1.swarm.
rm SRP045416_copy_1.swarm
To remove folders, use rm with the -r option.
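A minimal sketch of removing a folder is shown below. It runs in a temporary directory, and the folder name old_results is made up for illustration. Files deleted with rm do not go to a trash folder, so it is worth checking the path before pressing enter.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small folder tree to delete (hypothetical names).
mkdir -p old_results/plots
touch old_results/plots/figure.txt

# rm alone refuses to delete a directory; -r removes it and everything inside.
rm -r old_results

# Show what is left in the working directory (nothing).
ls -A
```

Adding -i (rm -ri) makes rm ask for confirmation before each deletion, which can be a useful safeguard when working at the prompt.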