Lesson 2: Navigating file systems with Unix
Lesson 1 Review
- Biowulf is the high performance computing cluster at NIH.
- When you apply for a Biowulf account you will be issued two primary storage spaces: 1) /home/$USER and 2) /data/$USER, with 16 GB and 100 GB of default disk space, respectively.
- Hundreds of pre-installed bioinformatics programs are available through the module system.
- Computational tasks on Biowulf should be submitted as a job (sbatch, swarm) or run through an interactive session (sinteractive).
- Connect to Biowulf using HPC OnDemand or ssh.
- Do not run computational tasks on the login node.
Lesson Objectives
In lesson 1, you learned about the NIH HPC cluster, Biowulf. Biowulf nodes use a Unix-like (Linux) operating system (distributions RHEL8/Rocky8), which requires knowledge and use of the command line interface (shell) to direct computational functionality. The purpose of today's lesson is to get you familiar with working on the command line. To this end, we will...
- Learn the basic structure of a unix command.
- Learn how to navigate your file system, including absolute vs relative paths.
- Learn unix commands related to navigating directories, creating and removing files or directories, and getting help.
A word about mistakes
YOU WILL MAKE MISTAKES...but, it is okay. We all make mistakes, and mistakes are how we learn. Remember, existing safeguards make it nearly impossible for individual Biowulf users to irreparably mess up the system for others. However, you can make your life difficult, for example, by misusing commands, ignoring existing tools, overwriting files, failing to redirect or output results, disregarding warnings, and not seeking help.
How can we overcome mistakes?
We practice. The more you use unix and bash scripting the better you will become.
You will need to learn how to troubleshoot error messages. Often this will involve searching the web for the error message together with the command you entered. There are many forums that post help regarding specific errors (e.g., Stack Overflow, program repositories such as GitHub).
File system
We manage files and directories through the operating system's file system. A directory is synonymous with a "folder", which is used to organize files, other directories, executables, etc.
On a Windows or Mac computer, we usually open and scroll through our directories and files using a GUI. For example, Finder is the default file management GUI from which we can access files or launch programs on a Mac.
This same file system can be accessed and navigated via command line from the unix shell.
Some useful unix commands to navigate our file system and tell us some things about our files
- pwd (print working directory)
- ls (list)
- touch (creates an empty file)
- nano (basic editor for creating small text files)
- rm (remove files; be careful!)
- mkdir (make a directory) and rmdir (remove a directory, which must be empty of all files)
- cd (change directory); by itself it will take you home, cd .. will take you up one directory, and cd /results_dir/exp1 will go directly to that directory
- mv (for renaming files or moving files)
- less (for viewing files, or more)
- man (for viewing the man pages when you need help on a command)
- cp (copy) for copying files
Getting Started
We have already seen some unix commands relevant to Biowulf. For example, we learned about ssh. The ssh command is used to securely log into a remote machine and execute commands on that machine.
In ssh username@biowulf.nih.gov, ssh is the command and username@biowulf.nih.gov is a command line argument, where username is the username that you wish to connect as on the remote system and biowulf.nih.gov is the hostname of the remote machine. For this lesson and the lessons that follow, we will use NIH HPC student accounts to connect to Biowulf.
More about student accounts
At the beginning of each class you must sign up for a student account. You can sign up for a student account using a Google spreadsheet, the link for which will be supplied at the beginning of each class via Webex. Click on the supplied link, and find an empty slot under the "Name" column. Type your name in the empty slot. The username under "Account Username" will now serve as your username for logging in to Biowulf.
This task will be repeated at the beginning of each lesson to allow students the option of flexible attendance.
Let's go ahead and get connected. Open a terminal and type the following:
ssh username@biowulf.nih.gov
username = NIH/Biowulf login username. Remember to use the student account username here.
Note
If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".
Type in your password at the prompt. The cursor will not move as you type your password!
Success
We will connect to Biowulf at the beginning of each session.
We are now on the login node. Remember, you should not do work on the login node. However, you can do basic file management on the login node and edit and compile code. For now, we will stay on the login node.
Our Second Unix Command (ls)
Let's continue learning about the structure of Linux commands using another common command, ls. The ls command "lists" the contents of the directory you are in. You may see files and other directories here.
At this point, you are in your home directory, and so you will see whatever files and directories are located here. For example, if I had logged in to my Biowulf account, I would see the following:
However, since this is a student account, I do not see anything, as I have not yet added any files.
How can you tell the difference between a file and a directory?
We can add some additional options (flags) to our command.
ls -lh
This will show permissions and indicate directories (d). The -l and -h are flags: -l refers to listing in long format, while -h provides human readable file sizes.
Or, many systems distinguish directories and files using colors (e.g., blue for directories). If you don't see colorized output, try the -G flag on macOS/BSD, or --color=auto with GNU ls on Linux systems such as Biowulf (where -G instead hides the group column).
We can also label output by adding a marker to indicate files, directories, links, and executables using the -F flag.
/ = directory
@ = link
* = executable
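As a rough illustration of these flags, the sketch below builds a throwaway directory and lists it both ways. The directory and file names are made up for the example, and chmod and ln are used only to create an executable and a link to look at; they are not part of this lesson.

```shell
# Build a small scratch area so the listing has something to show
demo_dir="$(mktemp -d)"      # hypothetical scratch directory
cd "$demo_dir"
mkdir results                # a directory
touch notes.txt              # an ordinary (empty) file
touch run_me.sh
chmod +x run_me.sh           # mark this file as executable
ln -s notes.txt latest       # a symbolic link pointing at notes.txt

# Long format: lines for directories start with d, links with l
ls -lh

# -F appends / to directories, @ to links, and * to executables
ls -F
```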
Anatomy of a command
Using ls as an example, we can get an idea of the overall structure of a unix command.
Image inspired by "Learn Enough Command Line to Be Dangerous"
The first thing we see is the command line prompt, usually $ or %, which will vary by operating system. The prompt lets us know that the computer is waiting for a command. Next we see the actual command, in this case, ls, telling the computer to list the files and directories. Most commands will have various options / flags that can be included to modify the command function. We can also supply an argument, which in the case of ls is optional. For example, here we supplied an alternative directory from which we are interested in listing files and directories. We hit enter or return after each command, and when the command has finished running, the command prompt will reappear prompting us to enter more commands.
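To make the command / options / argument pattern concrete, here is a minimal sketch. The target directory and its file names are invented for the example; on Biowulf you would pass a real directory instead.

```shell
# Pattern:  command  options  argument
#           ls       -l -h    "$target_dir"
target_dir="$(mktemp -d)"    # hypothetical directory to inspect
touch "$target_dir/sample_a.txt" "$target_dir/sample_b.txt"

ls "$target_dir"             # command + argument, no options
ls -l -h "$target_dir"       # command + two separate flags + argument
ls -lh "$target_dir"         # the same two flags combined into one
```

Single-letter flags can usually be combined like this, so the last two commands produce the same listing.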
Where am I? (pwd)
pwd stands for "print working directory". When you run this, you should see something like this.
/home/username
where username is your name or student account. This is your home directory - where you start from when you open a terminal. This is an example of a "path". The path tells us the location of a file or directory.
Note
While Windows computers use a \ as a path separator, unix systems use a /.
The pwd command is very helpful for figuring out where you are in the directory structure. If you are getting "file not found" errors while trying to run something, it is a good idea to run pwd and see if you are where you think you are. Type the pwd command and make a note of the directory you are in.
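The following sketch shows how pwd can help with a "file not found" situation. It runs in a temporary directory, and project and reads.txt are hypothetical names chosen for the example.

```shell
scratch="$(mktemp -d)"       # hypothetical scratch directory
cd "$scratch"
mkdir project
touch project/reads.txt

# We are one level too high, so the file is not found from here
ls reads.txt || echo "reads.txt is not in this directory"

pwd                          # check where we actually are
cd project                   # move to where the file lives
pwd
ls reads.txt                 # now the file is found
```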
More on the home directory (~)
We see that we are in our home directory. But where is that exactly?
The file system on any computer is hierarchical. On a Unix system, the top level of the file system, or root directory, is denoted by /. All subdirectories on the file system branch from this root directory. See the below example.
Example of file system hierarchy structure.
In our example hierarchy, we have subdirectories /home and /data, and within data, we see additional subdirectories, P_in and P_out. Only the first / denotes a directory (root). All other /s in the path serve as separating characters.
Absolute vs Relative paths
A file path that starts with the root or / is known as an absolute path. A path that does not start with the root directory is called a relative path. For example, in Unix, . is used to denote the present working directory and .. is used to denote the directory one level up. Thus, a path that starts with . or .. is a relative path, as is a bare name such as data, which is interpreted relative to the current directory. Going back to pwd: the output (/home/username) is an absolute file path. Hard-coded absolute file paths can break scripts when collaborating, because the likelihood that your file system matches another's is low.
We can use the tree command to get an idea of the structure of our home directory on Biowulf.
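As a small sketch of the difference between absolute and relative paths, the example below recreates the data/P_in and data/P_out part of the example hierarchy inside a temporary directory. The temporary location is an assumption made so the example can run anywhere, and mkdir -p (create parent directories as needed) goes slightly beyond what this lesson covers.

```shell
sandbox="$(mktemp -d)"       # hypothetical stand-in for the top of a file system
mkdir -p "$sandbox/data/P_in" "$sandbox/data/P_out"

cd "$sandbox/data/P_in"      # absolute path: starts with /
here="$(pwd)"                # remember the absolute path of P_in

cd ../P_out                  # relative path: up one level, then into P_out
pwd

cd "$here"                   # an absolute path works from any location
pwd

ls .                         # . is the current directory (empty here)
ls ..                        # .. is its parent, which holds P_in and P_out
```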
Getting around the Unix directory structure (cd)
How do we navigate this directory tree? We use cd, which means "change directory". Let's change directory to our data directory, which is the larger of the two allocations we are allotted on Biowulf.
cd /data/$USER #change to your data directory
pwd #print working directory
ls #list the contents of /data/$USER
$USER and other environment variables
$USER is an example of an environment variable.
Environment variables contain user-specific or system-wide values that either reflect simple pieces of information (your username), or lists of useful locations on the file system. --- Griffith Lab
We can display these variables using echo.
echo $USER
$PATH is an important environment variable.
echo $PATH
This results in a colon separated list of directories containing programs that you can run without specifying those directories each time you run the program.
You will likely need to add to your $PATH at some point in the future. To do this use:
export PATH=$PATH:/path/to/folder
This change will not remain when you close the terminal. To permanently add a location to your path, add the above line to your bash shell configuration file, ~/.bashrc.
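A minimal sketch of inspecting and extending $PATH for the current session is shown below. The tools_dir directory is a hypothetical stand-in for /path/to/folder, and tr and tail are used only to print the list one entry per line; they are not covered in this lesson.

```shell
# Display the search path, one directory per line
echo "$PATH" | tr ':' '\n'

# Append a hypothetical tools directory (stand-in for /path/to/folder)
tools_dir="$(mktemp -d)"
export PATH="$PATH:$tools_dir"

# The new directory is now the last entry, for this shell session only
echo "$PATH" | tr ':' '\n' | tail -n 1
```

Appending to the end of $PATH means programs in the directories already listed still take precedence over anything with the same name in the new directory.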
By itself, the cd command takes you home. Let's try that, and then do a pwd to see where we are.
cd
pwd
We are back in our home directory.
How can we go back to the /data/$USER directory? We need to give the "path" to that directory.
cd /data/$USER
Check where you are with pwd and look at the contents of the directory with ls. What do you see?
Once we create more files and directories, we can learn a bit more about the directory structure and absolute vs relative file paths.
Creating files (touch)
The touch command creates an empty file. It is not a command you will use very often, but it is good to know about.
Now we see something like this.
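As a small sketch of what touch does, the example below creates an empty file in a temporary directory and lists it. The directory and the name example_file.txt are hypothetical and are not the file used in class.

```shell
practice_dir="$(mktemp -d)"  # hypothetical scratch directory
cd "$practice_dir"

touch example_file.txt       # create an empty file (hypothetical name)
ls -l example_file.txt       # the size column shows 0
```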
The nano editor is a text editor useful for small files.
Let's put something in this file.
Unix is an operating system, just like Windows or MacOS.
Linux is a Unix like operating system;
sometimes the names are used interchangeably.
Nano commands for saving your file and exiting nano:
- control O - write file (equivalent to save as)
- File name to write: file2.txt (Hit return/enter on your keyboard to save the file with this name).
- control X - to Exit
This brings us to our next topic which is very important!
Avoid spaces in file names and directories.
Unix does allow spaces in file and directory names, but the shell treats a space as a separator between arguments, so a name containing spaces has to be quoted or escaped every time it is used. It is much easier to avoid spaces altogether. There are many strategies that can be used to do this; whichever you choose, consistency is key. One good method is using snake_case, in which words are separated by an _.
For example, we can use the underscore (_) where a space would go, like this, to name a directory for module 1: Module_1
To use snake_case with file names, we may see something like this:
The first part of the file name provides info about the file, and the extension (.fastq) tells what kind of file it is. (Examples of file extensions are .csv, .txt, .fastq, .fasta and many more.)
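The sketch below illustrates why spaces cause trouble. It runs in a temporary directory, and the names my, project, and my_project are invented for the example.

```shell
naming_dir="$(mktemp -d)"    # hypothetical scratch directory
cd "$naming_dir"

mkdir my project             # the space splits this into TWO directories
mkdir "my project"           # quotes are needed to keep the space in one name
mkdir my_project             # snake_case avoids the problem entirely

ls -F                        # shows four directories in total
```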
More on file organization to come.
Understanding file extensions
It's important to understand file extensions, to know what kinds of data you are working with.
.txt are text files. These are likely but not always tab delimited.
.tsv are tab delimited files.
.csv are "comma-separated values" - good for importing into MS Excel spreadsheets
.tar.gz indicates a tar archive compressed with gzip - so it is a compressed file
.fastq tells you that these are FASTQ files, containing sequence data and quality scores
.fasta indicates FASTA formatted sequence data, either protein or nucleotide
Removing files with rm
Warning
A Unix system will delete something when you ask to delete it and there is usually no way of getting it back. Be extremely careful when removing files and directories.
Adding the -i option makes rm ask for confirmation before it deletes each file. Generally speaking, when a file on a Unix system is deleted, it is gone.
Running rm -i followed by a file name will remove a file we created, once we confirm.
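Here is a minimal sketch of rm -i in a temporary directory. The file names are hypothetical. Normally you would type y or n at the prompt yourself; in this example the answer is piped in with echo only so that it can run unattended.

```shell
rm_dir="$(mktemp -d)"        # hypothetical scratch directory
cd "$rm_dir"
touch keep_me.txt delete_me.txt

echo n | rm -i keep_me.txt   # answer "n": the file is left alone
echo y | rm -i delete_me.txt # answer "y": the file is deleted for good

ls                           # only keep_me.txt remains
```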
Creating directories (mkdir)
A couple things to note - this is a good time to give your directories meaningful names, which will help you keep things organized. Organization is key. I generally like to have a new directory per project, and within that directory, subdirectories separating raw data from analysis files. From there, each analysis would also get its own subdirectory. However, there are many ways to organize files and you should do whatever makes sense for your data and helps you (and others) stay organized.
Keep raw data raw
Always keep your raw data raw, and save outputs to new files. Do not overwrite raw data! Consider setting the permissions on these files to "read only". More on permissions later.
For now, let's create a directory called Module_1, where we can store Module 1 lesson content. To create a directory, we use mkdir.
mkdir Module_1
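One possible way to lay out a project as described above, with raw data kept apart from analysis files, is sketched below. All names (rnaseq_project, raw_data, analysis, qc) are hypothetical, and the layout is only a suggestion.

```shell
projects="$(mktemp -d)"      # hypothetical scratch location
cd "$projects"

mkdir rnaseq_project                 # one directory per project
mkdir rnaseq_project/raw_data        # raw data, never overwritten
mkdir rnaseq_project/analysis        # everything derived from the raw data
mkdir rnaseq_project/analysis/qc     # one subdirectory per analysis

ls -F rnaseq_project                 # lists analysis/ and raw_data/
```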
Removing directories (rmdir)
Directories must be completely empty of all files and other contents before you can delete them with rmdir. There are ways to "recursively" remove files and directories using the -r option of rm, for example (rm -r directory). This would remove our hypothetical directory along with all of the files and subdirectories inside it. Keep in mind that once these files are deleted they are gone for good. Be extremely careful with the -r option. As beginners it can be safer to navigate to the directory and remove content directly.
Let's take a quick second to apply some of the things we have learned and create more content to work with.
Navigate to the directory we just made (Module_1), and make another directory within it called directory_to_delete.
cd Module_1 #change directory
pwd #print working directory (wd)
mkdir directory_to_delete #make a new directory
ls #list the contents of wd
Next, change into directory_to_delete and create a file, myseq.txt.
cd directory_to_delete #change directory
touch myseq.txt #create an empty file
Let's check the contents of our directory and see where we are located in our directory tree.
ls #list the contents of wd
pwd #print working directory (wd)
To summarize what we have done: We've moved to the Module_1 directory, checked our directory with pwd, created a directory called directory_to_delete, and listed the contents of Module_1, so we can see the directory we just created. We then navigated to directory_to_delete using cd, created a file with touch, listed the contents with ls, and printed our working directory (pwd).
Let's move up one directory back to Module_1.
cd .. #move up one directory
Test Your Knowledge: Question 1
How could you move back to your home directory?
Test Your Knowledge: Question 2
If you changed to your home directory, how can you return to directory_to_delete?
Answer Question 2
We need to give the "path" to that directory.
Getting back to removing directories (rmdir)
Now that we have created some directories, let's use rmdir to remove one.
rmdir directory_to_delete
What do you see when you try to remove this directory?
What should we do? We need to remove the contents of a directory before we can remove the directory. Here's one safe option.
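One possible sequence is sketched below in a temporary directory, so it is safe to run anywhere. It may differ from the exact commands used in class: rmdir refuses to delete a non-empty directory, so the file is removed first and rmdir is tried again.

```shell
work="$(mktemp -d)"          # hypothetical scratch directory
cd "$work"
mkdir directory_to_delete
touch directory_to_delete/myseq.txt

# First attempt fails because the directory still contains a file
rmdir directory_to_delete || echo "rmdir refused: the directory is not empty"

rm directory_to_delete/myseq.txt   # remove the contents explicitly
rmdir directory_to_delete          # now the empty directory can be removed
ls                                 # nothing is left
```

Removing the contents by name keeps you aware of exactly what is being deleted, which is harder to guarantee with rm -r.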
Moving and renaming files and directories, all with one command (mv)
The mv command is a handy way to rename files if you've created them with a typo or decide to use a more descriptive name. For example:
Be careful when moving files; a mistake in the command can yield unexpected results, such as silently overwriting an existing file.
The -i interactive option will help keep you safe by asking before anything is overwritten.
For example:
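The sketch below shows renaming, moving, and the -i safeguard in a temporary directory. All file and directory names are hypothetical, and the answer to the -i prompt is piped in with echo only so the example can run unattended; normally you would type it yourself.

```shell
mv_dir="$(mktemp -d)"        # hypothetical scratch directory
cd "$mv_dir"
touch myseq.txt other.txt
mkdir archive

mv myseq.txt my_sequence.txt         # rename a file
mv my_sequence.txt archive/          # move it into a directory

# The target already exists, so -i asks first; answering "n" declines.
# "|| true" keeps the example going whatever exit status mv reports.
echo n | mv -i other.txt archive/my_sequence.txt || true

ls -F . archive                      # other.txt is still in place
```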
Copying files (cp)
This is similar to mv but will create an actual copy of a file. You will need to specify what you are copying (the source) and where you want to make the copy (the target).
For example, let's copy a file from the BTEP teaching materials to Module_1.
Remember, the . is a relative path shortcut denoting our current directory, so we are copying this into our current working directory.
* is a wildcard, matching zero or more characters including spaces. We are using this to copy two files that differ in the last letter of the file extension. More on these in Lesson 3.
We can also copy an entire directory using the recursive flag (cp -r). For example, let's copy the directory Practice_Sessions from the BTEP teaching materials to our current directory.
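Because the location of the BTEP teaching materials is not reproduced here, the sketch below imitates the same steps in a temporary directory. The source_materials directory and the sample file names are hypothetical stand-ins, and mkdir -p is used only to set up the example.

```shell
cp_dir="$(mktemp -d)"        # hypothetical scratch directory
cd "$cp_dir"
mkdir -p source_materials/Practice_Sessions Module_1
touch source_materials/sample.fasta source_materials/sample.fastq
touch source_materials/Practice_Sessions/session1.txt

cd Module_1
cp ../source_materials/sample.fast* .           # wildcard matches .fasta and .fastq
cp -r ../source_materials/Practice_Sessions .   # copy a whole directory

ls -F                        # the copies are here; the originals are untouched
```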
Viewing file content
There are several ways to view files. We can use the less command to view the contents of a file like this.
You'll need to type q to get out of less and back to the command line. Before the less command was available, the more command was commonly used to look at file content. The less command has more options for scrolling through files, so it is now the preferred command.
Another command for reading files is cat, but this will print the contents in their entirety.
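A brief sketch of viewing a small file is given below. The file short_notes.txt is hypothetical and is created with printf and output redirection, which are not covered in this lesson. Because less is interactive, it is shown only as a comment.

```shell
view_dir="$(mktemp -d)"      # hypothetical scratch directory
cd "$view_dir"

# Create a three-line text file to look at
printf 'line one\nline two\nline three\n' > short_notes.txt

cat short_notes.txt          # prints the entire file to the terminal

# less short_notes.txt       # interactive viewer; press q to quit
```

For long files, cat will scroll everything past at once, which is why less is usually the better choice there.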
Help! (man)
Lastly, most Unix commands have a man or "manual" page that describes how to use them. If you need help remembering how to use the command ls, you would type:
man ls
To exit man, again use q.
There are quite a few flags/options that we can use with the ls command, and we can learn all about them on the man page. My favorite flags for ls are -l and -h. We will use flags often, and you won't get far in Unix without knowing about them. Try this:
ls -lh
We have already seen these flags, but as a reminder...
-h When used with the -l option, use unit suffixes (Byte, Kilobyte, Megabyte, Gigabyte, Terabyte and Petabyte) in order to reduce the number of digits to three or less using base 2 for sizes.
-l (The lowercase letter "ell".) List in long format. (See below). If the output is to a terminal, a total sum for all the file sizes is output on a line before the long listing.
Additional Resources
Software Carpentry: The Unix Shell
Help Session
Practice navigating the file system and creating files. Instructions are here.