Lesson 7: Downloading data, viewing file content, and data wrangling in Unix

Quick review:

In this course series, we have learned how to connect to and navigate around Biowulf. We have also learned how to use applications installed on Biowulf to download sequencing data from the SRA (using fastq-dump) and then assess the quality of the downloaded data (using fastqc). Further, we learned to transfer files from Biowulf to our local computer (using scp). Finally, we learned to request an interactive session (using sinteractive) or submit a batch job (using sbatch) to perform compute-intensive tasks.

Lesson objectives:

After this lesson, we should

Be able to download data from the web
Know how to view file content
Know how to perform pattern search

Unix commands that we will learn in this lesson

wget (to download data from the web)
curl (to download data from the web)
tar (to unpack tape archives)
unzip (to unpack zipped files)
cat (to display file content)
head (to display beginning of file content; defaults to first 10 lines)
tail (to display end of file content; defaults to last 10 lines)
zcat (to display compressed file content)
less and more (to scroll through files)
grep (to search for patterns)

Downloading data from a URL

In Unix, we can use wget or curl to download data from a URL.

As an example, let's download the sequencing data (FASTQ) files for the Human Brain Reference (HBR) and Universal Human Reference (UHR) from the Griffith lab RNA sequencing tutorial. You can read more about the HBR-UHR dataset on that page.

Before getting started, let's use pwd to make sure that we are in our data directory.

pwd

/data/username

If not, change into it.

cd /data/username

The next step is to create a folder to store the HBR-UHR sequencing data. Let's call this folder hbr_uhr_rna_sequencing. To create this folder, we will use the mkdir command.

mkdir hbr_uhr_rna_sequencing

After creating hbr_uhr_rna_sequencing, change into it.

cd hbr_uhr_rna_sequencing

To download the FASTQ files for the HBR-UHR dataset, type wget at the command line followed by the URL of the dataset, which is http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar.

wget http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar

During the download process, we will see a download progress bar in the terminal (Figure 1).

Figure 1: Unix wget download progress.
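
When repeating a lesson, it can be handy to avoid fetching the same archive twice. The sketch below wraps wget in a small shell function. The function name download_if_missing is hypothetical (it is not part of wget or Biowulf); the only wget feature it relies on is the -O option, which names the output file.

```shell
# Hypothetical helper (not part of wget): fetch a URL only when the
# target file is missing or empty.
download_if_missing() {
    url="$1"
    # Default output name: the last path component of the URL.
    out="${2:-$(basename "$url")}"
    if [ -s "$out" ]; then
        echo "skip: $out already exists"
        return 0
    fi
    # -O names the output file explicitly.
    wget -O "$out" "$url"
}
```

Calling download_if_missing http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar would download the archive the first time and print a skip message on later calls.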

Now, if we list the contents of the hbr_uhr_rna_sequencing folder

ls -l

We will see a tape archive or ".tar" file that we need to unpack to get to the HBR-UHR sequences. More on ".tar" files in a bit.

-rw-r-----. 1 wuz8 wuz8 116602880 Oct 23  2018 HBR_UHR_ERCC_ds_5pc.tar

First, let's remove HBR_UHR_ERCC_ds_5pc.tar and download it again using curl.

To delete a file, recall that we use rm followed by the name of the file that we want to delete.

rm HBR_UHR_ERCC_ds_5pc.tar

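Unlike wget, curl writes the downloaded data to standard output (the terminal) by default, so we need to tell it to save the data to a file. If we look at the help documentation, we see two options for specifying the output file: -o (lower case o) and -O (upper case O).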

curl --help

-o, --output FILE   Write output to <file> instead of stdout
-O, --remote-name   Write output to a file named as the remote file

curl -O  http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar

Listing the contents of the hbr_uhr_rna_sequencing directory, we see that the file HBR_UHR_ERCC_ds_5pc.tar appears when we use curl with the -O option, which writes a file that has the same name as the one in the URL (i.e. HBR_UHR_ERCC_ds_5pc.tar).

ls -l

-rw-r-----. 1 wuz8 wuz8 116602880 Jan  7 12:10 HBR_UHR_ERCC_ds_5pc.tar

Let's try downloading using curl but specifying a file name of our choice using the -o (lower case o) option. We will name the file HBR_UHR_READS.tar.

curl http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar -o HBR_UHR_READS.tar

Listing the contents of the hbr_uhr_rna_sequencing folder, we see that in addition to the file HBR_UHR_ERCC_ds_5pc.tar, which we downloaded using curl -O, we have the file HBR_UHR_READS.tar, which we downloaded using curl with the -o option (where we specified an output file name of our choice, rather than the one provided by the URL).

ls -l

-rw-r-----. 1 wuz8 wuz8 116602880 Jan  7 12:10 HBR_UHR_ERCC_ds_5pc.tar
-rw-r-----. 1 wuz8 wuz8 116602880 Jan  7 12:16 HBR_UHR_READS.tar

Let's go ahead and remove HBR_UHR_ERCC_ds_5pc.tar because it is the same as HBR_UHR_READS.tar.

rm HBR_UHR_ERCC_ds_5pc.tar
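
To compare -O and -o without using the network, the sketch below points curl at a local file through a file:// URL. The temporary directory, the file name sample.tar, and the use of file:// are assumptions made for illustration; for a real download, the URL would be the http address used earlier. The -s option only silences the progress meter.

```shell
# Work in a throwaway directory so nothing in /data is touched.
workdir=$(mktemp -d)
mkdir "$workdir/remote" "$workdir/downloads"
echo "pretend archive" > "$workdir/remote/sample.tar"
cd "$workdir/downloads"

# -O: keep the file name from the URL (sample.tar).
curl -s -O "file://$workdir/remote/sample.tar"

# -o: choose our own file name (my_reads.tar).
curl -s -o my_reads.tar "file://$workdir/remote/sample.tar"

# Both copies should now be listed.
ls
```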

Tar files and how to unpack them

Earlier, we mentioned that the ".tar" extension stands for Tape Archive. A tape archive packages many files and folders into a single file for easy transfer and sharing. We use the tar command to unpack these files. Options for the tar command can be found by using the command below.

tar --help

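The options that we will use for unpacking are below. Note that single-letter options can be strung together after a single "-" in many Unix commands, so -xvf is the same as -x -v -f. The f goes last because it is followed by the name of the archive.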

-x, --extract, --get       extract files from an archive
-v, --verbose              verbosely list files processed
-f, --file=ARCHIVE         use archive file or device ARCHIVE

tar -xvf HBR_UHR_READS.tar

Because we included the -v option in the tar command above, we see the files that are unpacked as the command runs.

HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz

Note that sequencing data from the HBR-UHR dataset come as FASTQ files but are g-zipped (.gz). We will learn how to view data in the Unix terminal and point out how to view g-zipped FASTQ files without having to unzip them. Note that some bioinformatics software, such as FastQC, can take g-zipped FASTQ files (i.e. fastq.gz) as input. Check the help documents for each software to find out.
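
To practice the tar options on something small, the sketch below builds a toy archive, lists its contents, and unpacks it. It uses -c (create) and -t (list), which are standard tar options not covered above; the file and directory names are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Make two small files and package them into one archive (-c = create).
echo "reads 1" > sample_1.txt
echo "reads 2" > sample_2.txt
tar -cf toy.tar sample_1.txt sample_2.txt

# -t lists what is inside the archive without unpacking it.
tar -tf toy.tar

# Extract into a separate folder so the originals are not overwritten.
mkdir unpacked
cd unpacked
tar -xvf ../toy.tar
```

Listing an archive with -t before extracting is a cheap way to check what it will write into the current directory.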

Viewing file content in Unix

This portion of the lesson will focus on viewing contents of files in Unix. We will focus on the three types of files below.

Plain text
Tabular data (i.e. data files that have many columns and many rows, like a matrix); these data tables can have columns that are
- comma separated (csv)
- tab separated (these will come in the form of txt files)
FASTQ files, which contain high throughput sequencing data

Viewing plain text files in Unix

For this portion of the lesson, let's change back into our data directory. Again, username is the username you used to sign into Biowulf. This could be your NIH username if you have a Biowulf account, or one of the student accounts that were set up for us.

cd /data/username

Next, we will go to the course documents and use wget to grab the file unix_on_biowulf_2023.zip (this is under the section labeled Course data in the course documents).

wget https://btep.ccr.cancer.gov/docs/unix-on-biowulf-2023/data/unix_on_biowulf_2023.zip

We will then use unzip to unpack the contents of unix_on_biowulf_2023.zip.

unzip unix_on_biowulf_2023.zip

Note that we get a status of what is being unpacked as the unzipping occurs.

Archive:  unix_on_biowulf_2023.zip
   creating: unix_on_biowulf_2023/
  inflating: unix_on_biowulf_2023/text_1.txt  
  inflating: unix_on_biowulf_2023/counts.csv  
  inflating: unix_on_biowulf_2023/results.csv

Listing the contents of our data folder, we will see a new folder called unix_on_biowulf_2023 (let's change into this).

ls -l

drwxr-xr-x.  2 wuz8 wuz8        4096 Jan  2 14:44 unix_on_biowulf_2023

cd unix_on_biowulf_2023

We will list the contents of the unix_on_biowulf_2023 directory to see what we have to work with.

ls

We have a gene expression counts table from the HBR-UHR dataset (counts.csv), the differential expression analysis results from the HBR-UHR dataset (results.csv), and a random text file (text_1.txt).

counts.csv  results.csv  text_1.txt

Let's see what is in text_1.txt by using cat, which prints the entire file to the terminal.

cat text_1.txt

oranges
blue
bananas
cats
dogs
apple
florida
gators
gainesville
alachua
county
btep

The head command can be used to view the top several lines of a file (default is 10 lines). We can use the -n option to specify how many lines we want (for instance -n 5 will show the first five lines).

head -n 5 text_1.txt

oranges
blue
bananas
cats
dogs

As the opposite of head, tail shows the bottom 10 lines of a file by default. Again, we can use -n to specify a number of lines other than the default.

tail -n 5 text_1.txt

gators
gainesville
alachua
county
btep
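
head and tail can be chained with a pipe ("|", which sends the output of one command to the next) to pull out lines from the middle of a file. A minimal sketch, which recreates the first lines of text_1.txt in a temporary file so that it can be run anywhere:

```shell
# Temporary stand-in for text_1.txt (first seven lines only).
demo=$(mktemp)
printf '%s\n' oranges blue bananas cats dogs apple florida > "$demo"

# Lines 3 to 5: keep the first 5 lines, then the last 3 of those.
middle=$(head -n 5 "$demo" | tail -n 3)
echo "$middle"
```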

We can use the zcat command to view the contents of compressed files without uncompressing them, for instance the FASTQ files that we downloaded for the HBR-UHR dataset. We will stay in the unix_on_biowulf_2023 directory for this and prepend "../hbr_uhr_rna_sequencing/" to the file name to reference the directory that holds the FASTQ files. ".." tells Unix to go up one directory, and from there we look in the folder hbr_uhr_rna_sequencing. We will use "|", the pipe, to send the output of zcat to head -n 4, which gives the first four lines of the FASTQ file HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz.

zcat ../hbr_uhr_rna_sequencing/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz | head -n 4

The FASTQ file contains many sequencing reads and these come in 4 lines each, which are

Metadata header that starts with "@"
Actual sequence
"+"
Quality scores, which encode the error likelihood of each base along the read

@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAACTAGTATCAAAATGCATTTCT
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFFFFEEEEEEEDDDDCDDEEDEE
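
Assuming the usual four-line layout described above, the number of reads in a FASTQ file is its line count divided by four. The sketch below shows this on a tiny two-read file created on the spot (the reads and the file name toy.fastq.gz are invented for illustration). It uses gzip -dc, which does the same job as zcat for ".gz" files.

```shell
workdir=$(mktemp -d)
fq="$workdir/toy.fastq.gz"

# Two made-up reads, four lines each, compressed with gzip.
printf '%s\n' \
    '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
    '@read2' 'TTGCAAGC' '+' 'HHHHHHHH' | gzip > "$fq"

# Count lines without writing an uncompressed copy, then divide by 4.
lines=$(gzip -dc "$fq" | wc -l)
reads=$(( lines / 4 ))
echo "reads: $reads"

# Show only the second read (lines 5 to 8).
gzip -dc "$fq" | head -n 8 | tail -n 4
```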

For larger files that have a lot of rows, we can use less to scroll. For instance, if we cat counts.csv, it will print the entire file to the terminal. So to view parts of it at a time while being able to scroll we can use the less command, which is known as a terminal pager. Note that we can use the down arrow to scroll down a file and the up arrow to scroll up when using less.

less counts.csv

The counts.csv file is a gene expression counts table and has seven columns, where the first column contains gene IDs. Note that the columns are separated by commas as suggested by the ".csv" extension.

Hit q to get out of less and return to the prompt.

Geneid,HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam
U2,0,0,0,0,0,0
CU459211.1,0,0,0,0,0,0
CU104787.1,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0
ACTR3BP6,0,0,0,0,0,0
5_8S_rRNA,0,0,0,0,0,0
AC137488.1,0,0,0,0,0,0
AC137488.2,0,0,0,0,0,0
CU013544.1,0,0,0,0,0,0
CT867976.1,0,0,0,0,0,0
CT867977.1,0,0,0,0,0,0
CT978678.1,0,0,0,0,0,0
CU459202.1,0,0,0,0,0,0
AC116618.1,0,0,0,0,0,0
CU463998.1,0,0,0,0,0,0
CU463998.3,0,0,0,0,0,0
CU463998.2,0,0,0,0,0,0
U6,0,0,0,0,0,0
LA16c-60D12.1,0,0,0,3,2,0
LA16c-13E4.3,0,0,0,0,0,1
LA16c-60D12.2,0,0,0,0,4,1
ZNF72P,0,0,0,0,1,0

We can also use the more command to scroll through counts.csv. At the bottom of the page, more prints the percentage of the file shown so far. We can hit enter to scroll line by line or the space bar to scroll page by page. Hit q to exit more and return to the prompt. Note that on Biowulf, we cannot scroll up with more.

more counts.csv

Geneid,HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam
U2,0,0,0,0,0,0
CU459211.1,0,0,0,0,0,0
CU104787.1,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0
ACTR3BP6,0,0,0,0,0,0
5_8S_rRNA,0,0,0,0,0,0
AC137488.1,0,0,0,0,0,0
AC137488.2,0,0,0,0,0,0
CU013544.1,0,0,0,0,0,0
CT867976.1,0,0,0,0,0,0
CT867977.1,0,0,0,0,0,0
CT978678.1,0,0,0,0,0,0
CU459202.1,0,0,0,0,0,0
AC116618.1,0,0,0,0,0,0
CU463998.1,0,0,0,0,0,0
CU463998.3,0,0,0,0,0,0
CU463998.2,0,0,0,0,0,0
U6,0,0,0,0,0,0
LA16c-60D12.1,0,0,0,3,2,0
LA16c-13E4.3,0,0,0,0,0,1
LA16c-60D12.2,0,0,0,0,4,1
ZNF72P,0,0,0,0,1,0
--More--(1%)

The less command allows for horizontal scrolling if we add the -S option. We can also combine it with the column command to print tabular data with the columns nicely aligned. The -t option in column counts the number of columns and creates a table, while the -s option tells column which character separates the columns (a comma in this case, denoted by ',' in the command below). We use the pipe ("|") to send the output of column to less -S. Hit q to get out of the following command and return to the prompt.

column -t -s ',' results.csv | less -S

name                baseMean  baseMeanA  baseMeanB  foldChange  log2FoldChange  lfcSE  stat    PValue    PAdj      FDR     falsePos  HBR_1.bam  HBR_2.bam  HBR_3.bam  UHR_1.bam  UHR_2.bam  UHR_3.bam
SYNGR1              526.9     1012.5     41.3       0.04        -4.6            0.15   -31.66  5.3e-220  5.2e-217  0       0         986.6      1025.6     1025.3     37.4       50.5       36
SEPT3               500.7     960.8      40.7       0.042       -4.6            0.15   -30.71  4.6e-207  4.5e-204  0       0         932.3      933        1017       37.4       37.6       47.1
YWHAH               797.4     1361.1     233.8      0.172       -2.5            0.09   -29.72  4.8e-194  4.7e-191  0       0         1330.6     1402.1     1350.5     232.5      217.8      251
RPL3                1710.7    828.2      2593.2     3.139       1.7             0.07   24.91   5.6e-137  5.4e-134  0       0         852.2      782.8      849.7      2787       2382.1     2610.6
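
The same column command also works without less, which is convenient for a quick look at the top of a table. A minimal sketch on a three-line excerpt of counts.csv, written to a temporary file so that the example is self-contained:

```shell
# Temporary stand-in for counts.csv (header plus two rows).
csv=$(mktemp)
printf '%s\n' \
    'Geneid,HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam' \
    'U2,0,0,0,0,0,0' \
    'LA16c-60D12.1,0,0,0,3,2,0' > "$csv"

# Align the first three lines into columns; -s ',' sets the separator.
aligned=$(head -n 3 "$csv" | column -t -s ',')
echo "$aligned"
```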

Pattern searching in Unix

We can use grep to search for patterns in files. For instance, the command below will find the word gainesville in text_1.txt. The syntax for grep is the command, followed by the pattern, and then the file in which we would like to find the pattern (text_1.txt in this case).

grep gainesville text_1.txt

gainesville

If we use the -v option, we can select the lines in a file that do not contain a pattern. The grep command below prints every line in text_1.txt that does not contain gainesville.

grep -v gainesville text_1.txt

oranges
blue
bananas
cats
dogs
apple
florida
gators
alachua
county
btep
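
A few other commonly used grep options are -c (count matching lines), -n (show line numbers), and -i (ignore case). These are standard grep options that are not covered above. The sketch below recreates text_1.txt in a temporary directory so that it can be run anywhere.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' oranges blue bananas cats dogs apple florida gators \
    gainesville alachua county btep > text_1.txt

# -c: how many lines contain "ga"? (gators and gainesville)
n_match=$(grep -c ga text_1.txt)

# -v with -c: how many lines do NOT contain "gainesville"?
n_other=$(grep -vc gainesville text_1.txt)

# -n: prefix each match with its line number.
numbered=$(grep -n county text_1.txt)

# -i: ignore upper/lower case when matching.
anycase=$(grep -i GAINESVILLE text_1.txt)

echo "$n_match $n_other $numbered $anycase"
```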