Lesson 1: Overview of Unix command line and signing onto Biowulf

Learning objectives

After this lesson, participants will be able to

  • Describe the Unix operating system
  • Describe Biowulf
  • Connect to Biowulf from a local computer

Overview of Unix

In Windows and macOS, we usually interact with the computer through a graphical user interface (GUI). In Unix, by contrast, we interact with the computer by typing commands.

Basic Unix command syntax

The Unix command syntax is composed of

  • The command
  • Option(s) that will alter how a command functions
  • Argument(s), what you want the command to operate on

command options argument

For instance, to make a new folder in Unix, we use the command mkdir. Here, we enter the command followed by the argument(s) that we want the command to operate on. In this case, the argument is the name of the folder that we would like to create. This is different from the graphical approach that we use to create new folders in Windows or macOS.

mkdir new_folder
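
As a sketch of how the three parts of the syntax fit together, the example below adds an option to each command. The folder names demo_project and raw_data are made up for illustration; -p (create any missing parent folders) and -l (long listing) are standard options of mkdir and ls.

```shell
# command: mkdir | option: -p (create parent folders as needed) | argument: the path to create
mkdir -p demo_project/raw_data

# command: ls | option: -l (long listing format) | argument: the folder to list
ls -l demo_project
```

Without -p, mkdir would report an error here because demo_project does not exist yet when raw_data is requested.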

Above, we just learned our first Unix command, which is just one of many. Before moving further, we should clarify the rationale for using Unix. While there is a steep learning curve, once we have mastered working in Unix, we can perform many of our computing tasks from the command line. Unix allows for easy file management and editing of text files, and lets us view tabular data that is too large for Excel. Further, many of the applications used in bioinformatics are made to work in Unix.

Overview of Biowulf

Biowulf is the Unix-based high performance computing system at NIH. Below are some reasons for using Biowulf.

  • Biowulf offers more computing power and space for data storage compared to our local machine.
  • Biowulf also houses many applications for bioinformatics, which are installed and updated by Biowulf staff.
  • The GUI-based bioinformatics package, Partek Flow, runs on Biowulf.

Visit https://hpc.nih.gov/docs/accounts.html to learn how to obtain a Biowulf account.

Figure 1 shows an example of the hierarchy of a high performance computing cluster. Understanding this hierarchy helps us know what we are asking for when requesting compute resources.

Figure 1: In Biowulf, many computers make up a cluster. Each individual computer or node has disk space for storage and random access memory (RAM) for running tasks. The individual computer is composed of processors, which are further divided into cores, and cores are divided into CPUs. In this example, the individual computer has 2 processors, 4 cores, and 8 CPUs.
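
The arithmetic behind the Figure 1 example can be sketched in the shell. The per-processor and per-core counts below are assumptions chosen so that the totals match the caption (2 processors, 4 cores, 8 CPUs); real nodes will have different numbers.

```shell
processors=2            # processors in the example node (from the Figure 1 caption)
cores_per_processor=2   # assumption: the 4 cores are split evenly across the 2 processors
cpus_per_core=2         # assumption: each core is divided into 2 CPUs

total_cores=$((processors * cores_per_processor))
total_cpus=$((total_cores * cpus_per_core))

echo "cores: $total_cores, CPUs: $total_cpus"
```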

Biowulf student accounts

For this course series, participants will be using one of the student accounts provided by Biowulf staff. See the course overview section for student account ID assignments.

Signing onto Biowulf

When working on Biowulf, we are working on a remote computer; thus, we need a way to connect to it. We can use the Secure Shell Protocol (ssh) to connect to Biowulf. To connect, we need to be on the NIH network, either by being on campus or through VPN.

Signing onto Biowulf with a PC

For those using Windows 10 or newer, ssh is built into the command prompt (Figure 2 and Figure 3).

Figure 2: At the search box next to the Windows start menu, type cmd and click on the command prompt application.

Figure 3: When the command prompt opens, you can type ssh to confirm that it is available.

Signing onto Biowulf with a Mac

The best way to sign onto Biowulf from a Mac is to use the built-in Terminal (Figure 4). Use the Spotlight search in the Mac menu bar to search for the Terminal application. Click on it to open the Terminal.

Figure 4: Use the Mac Spotlight search to find the Terminal.
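
In the Mac Terminal (or any other Unix shell), a quick way to confirm that an ssh client is installed is sketched below. The variable name ssh_status is made up for this example. The ssh -V option, which prints the client version, can also be typed on its own in the Windows Command Prompt.

```shell
# Check whether an ssh client can be found on this computer
if command -v ssh >/dev/null 2>&1; then
    ssh_status="available"
    ssh -V                      # print the version of the installed ssh client
else
    ssh_status="missing"
fi

echo "ssh is $ssh_status"
```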

Connect to Biowulf

Remember that if you are not on campus, you need to connect to the NIH network through VPN. Regardless of whether you are using the Windows Command Prompt or the Mac Terminal, the command for connecting to Biowulf takes the form ssh username@biowulf.nih.gov (see Figure 5).

The username in the ssh command is either

  • your NIH username if you are using your own Biowulf account for this course OR
  • one of the student accounts

ssh username@biowulf.nih.gov

For first time users, when connecting you may see the message below. Respond with yes.

The authenticity of host 'biowulf.nih.gov (128.231.2.9)' can't be established.
ECDSA key fingerprint is SHA256:BoP/KLS17g+gUuQ7mrCHa9oPPO+MHi/h8WML44iA1dw.
Are you sure you want to continue connecting (yes/no)? yes

Next, you will see a message warning you that you are accessing a government computer system and that you should not do anything suspicious. At the end of the message, you will be asked to enter your password, which is either your NIH password (if you are using your own Biowulf account) or the password for the student accounts. The cursor will not move and nothing will be displayed when entering your password, but keep typing.

Warning: Permanently added 'biowulf.nih.gov' (ED25519) to the
list of known hosts.
                           ***WARNING***

You are accessing a U.S. Government information system, which 
includes (1) this computer, (2) this computer network, (3) all 
computers connected to this network, and (4) all devices 
and storage media attached to this network or to a computer on 
this network. This information system is provided for U.S.  
Government-authorized use only.

Unauthorized or improper use of this system may result in 
disciplinary action, as well as civil and criminal penalties.

By using this information system, you understand and consent to the
following:

* You have no reasonable expectation of privacy regarding any
communications or data transiting or stored on this information 
system. At any time, and for any lawful Government purpose, 
the government may monitor, intercept, record, and search and
seize any communication or data transiting or stored on this 
information system.

* Any communication or data transiting or stored on this information
system may be disclosed or used for any lawful Government purpose.

--
Notice to users:  This system is rebooted for patches and 
maintenance on the first Sunday of every month at 8:00 pm unless 
Monday is a holiday, in which case it is rebooted the following 
Sunday evening at 8:00 pm.  Running cluster jobs are not 
affected by the monthly reboot.

username@biowulf.nih.gov's password:

You will be taken to the prompt after successfully entering your password (see below). It is at the prompt where we type commands and interact with Biowulf. Again, replace username with the student ID that you were assigned.

[username@biowulf ~]$

Finding group affiliation on a high performance computing cluster

The id command lists the groups that a user belongs to. This is important when collaborating with others on Biowulf, because group membership determines whether we have access to shared data.

id

Running the id command shows the user ID (uid) and primary group ID (gid), followed by every group that the user belongs to. In the example output below, the user wuz8 is a member of the GAU group.

uid=58740(wuz8) gid=58740(wuz8) groups=58740(wuz8),57888(GAU)
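
When only part of this information is needed, id accepts standard options that narrow the output. The sketch below shows a few of them; the variable name user_name is made up for this example, and the output will differ for each user.

```shell
id -un    # user name only
id -gn    # name of the primary group only
id -Gn    # names of all groups the user belongs to

# Store the user name in a variable so it can be reused in later commands
user_name=$(id -un)
echo "Signed in as: $user_name"
```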

Log-in node

Upon signing onto Biowulf, users land on the log-in node. Compute nodes will be introduced later in this series; for now, the key point is that the log-in node should not be used for compute-intensive tasks.

Definition

"The log in node is your point of access to the Biowulf cluster" -- Biowulf accounts and log in node

The log in node is meant for the following (Source: Biowulf accounts and log in node)

  • Submitting jobs (main purpose)
  • Editing/compiling code
  • File management
  • File transfer
  • Brief testing of code or debugging (under 20 minutes)

Biowulf directory spaces

Upon signing onto Biowulf, users will land in their home directory, which is denoted by /home/username or ~. Again, replace username with your assigned student ID, or with your NIH username once you have a personal Biowulf account.

Note

"Each user has a home directory called /home/username which is accessible from every HPC system. The /home area has a quota of 16 GB which cannot be increased. It is commonly used for configuration files (aka dotfiles), code, notes, executables, state files, and caches." -- Biowulf.

The pwd command is used to find which directory the user is currently in.

pwd

Again, upon logging in to Biowulf, the current directory should be the home directory.

/home/username
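
As a small sketch, the commands below return to the home directory from anywhere and confirm the location. The variable name current_dir is made up for this example; HOME is a standard environment variable that holds the path of the home directory.

```shell
cd ~                        # ~ is shorthand for the home directory
current_dir=$(pwd)          # store the output of pwd in a variable

echo "Current directory: $current_dir"
echo "Home directory:    $HOME"
```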

Data directory

The data directory is much larger than the home directory, and its quota can be increased. It can be used to store analysis input and output. The path to the data directory is /data/username. To change into the data directory, use the following.

cd /data/username
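
To avoid typing the username by hand, the path can be built from the output of id -un. This is a sketch that assumes the data directory is named after the signed-in user, as in /data/username; the variable name data_dir is made up for this example. The check with -d keeps the commands safe to run on a computer where the directory does not exist.

```shell
data_dir="/data/$(id -un)"   # assumption: the data directory matches the signed-in user name

if [ -d "$data_dir" ]; then
    cd "$data_dir"
    pwd                      # confirm the new location
else
    echo "$data_dir was not found; this sketch is meant to be run on Biowulf"
fi
```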

lscratch

In Biowulf, lscratch is local storage space available on individual nodes. It is helpful for jobs that read or write many temporary files. We will discuss lscratch further in a future lesson.

Scratch

The scratch area is a shared storage space accessible to users for storing temporary files. The path to this is /scratch/username. A word of caution is that files in scratch are deleted after 10 days. While each user can store up to 10 TB (terabyte) of data in scratch, it is not guaranteed that this amount will always be available. Finally, Biowulf staff will delete files if scratch becomes more than 80% full.
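
Because files in scratch are temporary, it can help to see which ones have been sitting there for a while. The sketch below lists files that were last modified more than 7 days ago. It assumes the scratch directory is named after the signed-in user, and it uses modification time only as a rough indicator, since the exact rule Biowulf uses to age files is not described here. The variable name scratch_dir is made up for this example.

```shell
scratch_dir="/scratch/$(id -un)"   # assumption: the scratch directory matches the signed-in user name

if [ -d "$scratch_dir" ]; then
    # -type f: regular files only; -mtime +7: last modified more than 7 days ago
    find "$scratch_dir" -type f -mtime +7
else
    echo "$scratch_dir was not found; this sketch is meant to be run on Biowulf"
fi
```

Anything worth keeping should be copied to the data directory before it is removed from scratch.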

Snapshots

When working in Unix, we need to keep in mind that there is no Recycle Bin (Windows) or Trash (Mac) that holds deleted items and allows us to recover them. Once we delete something in Unix, it is gone. Fortunately, Biowulf keeps snapshots, which are read-only copies of data at a certain time, and we can use these to restore content that we deleted. See the Biowulf documentation for more on snapshots. To change into the snapshot directory for the data folder, use the following.

cd /data/username/.snapshot

The home directory snapshot is located at /home/username/.snapshot.
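
Restoring a deleted file amounts to copying it out of a snapshot and back into the working directory. The sketch below lists the available snapshots and shows the general shape of the copy command. SNAPSHOT_NAME and my_file.txt are placeholders rather than real names, so the cp line is left commented out; the variable name snapshot_dir is also made up for this example.

```shell
snapshot_dir="/data/$(id -un)/.snapshot"   # assumption: the data directory matches the signed-in user name

if [ -d "$snapshot_dir" ]; then
    # Each entry is a read-only copy of the data directory at one point in time
    ls "$snapshot_dir"

    # To restore a file, copy it from a snapshot back into the data directory.
    # SNAPSHOT_NAME and my_file.txt are placeholders; replace them before running.
    # cp "$snapshot_dir/SNAPSHOT_NAME/my_file.txt" "/data/$(id -un)/"
else
    echo "$snapshot_dir was not found; this sketch is meant to be run on Biowulf"
fi
```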