Lesson 7: Downloading data, viewing file content, and data wrangling in Unix
Quick review:
In this course series, we have learned how to connect to and navigate around Biowulf. In addition, we have learned how to use applications installed on Biowulf to download sequencing data from the SRA (using fastq-dump) and subsequently assess the quality of the downloaded sequencing data (using fastqc). Further, we learned to transfer files from Biowulf to our local computer (using scp). Finally, we learned to request an interactive session (using sinteractive) or submit a batch job (using sbatch) to perform compute-intensive tasks.
Lesson objectives:
After this lesson, we should
- Be able to download data from the web
- Know how to view file content
- Know how to perform pattern search
Unix commands that we will learn in this lesson
- wget (to download data from the web)
- curl (to download data from the web)
- tar (to unpack tape archives)
- unzip (to unpack zipped files)
- cat (to display file content)
- head (to display beginning of file content; defaults to first 10 lines)
- tail (to display end of file content; defaults to last 10 lines)
- zcat (to display compressed file content)
- less and more (to scroll through files)
- grep (to search for patterns)
Downloading data from URL
In Unix, we can use wget or curl to download data from a URL.
As an example, let's download the sequencing data (FASTQ) files for the Human Brain Reference (HBR) and Universal Human Reference (UHR) from the Griffith lab RNA sequencing tutorial. You can read more about the HBR-UHR dataset on that page.
Before getting started, let's use pwd to make sure that we are in our data directory.
pwd
/data/username
If not, change into it.
cd /data/username
The next step is to create a folder to store the HBR-UHR sequencing data. Let's call this folder hbr_uhr_rna_sequencing. To create this folder, we will use the mkdir command.
mkdir hbr_uhr_rna_sequencing
After creating hbr_uhr_rna_sequencing, change into it.
cd hbr_uhr_rna_sequencing
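As a side note, the folder setup above can be written so that it is safe to run more than once. The sketch below is only an illustration: it uses a temporary directory as a stand-in for /data/username (an assumption, so that it can run anywhere), and it uses the -p option of mkdir, which is not covered in this lesson. With -p, mkdir creates the folder if it is missing and does nothing if it already exists, instead of reporting an error.

```shell
# Stand-in for /data/username so the sketch can run anywhere (assumption).
data_dir=$(mktemp -d)
cd "$data_dir"

# -p: create the folder if it is missing; do nothing if it already exists.
mkdir -p hbr_uhr_rna_sequencing
mkdir -p hbr_uhr_rna_sequencing   # running it again does not raise an error

# Change into the new folder and confirm where we are.
cd hbr_uhr_rna_sequencing
pwd
```

The path printed by pwd should end in hbr_uhr_rna_sequencing.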
To download the FASTQ files for the HBR-UHR dataset, type wget at the command line followed by the URL of the dataset, which is http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar.
wget http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar
During the download process, we will see a download progress bar in the terminal (Figure 1).
Figure 1: Unix wget download progress.
Now, let's list the contents of the hbr_uhr_rna_sequencing folder.
ls -l
We will see a tape archive or ".tar" file that we need to unpack to get to the HBR-UHR sequences. More on ".tar" files in a bit.
-rw-r-----. 1 wuz8 wuz8 116602880 Oct 23 2018 HBR_UHR_ERCC_ds_5pc.tar
First, let's remove HBR_UHR_ERCC_ds_5pc.tar and download it again using curl.
To delete a file, recall that we use rm followed by the name of the file that we want to delete.
rm HBR_UHR_ERCC_ds_5pc.tar
With the curl command, we need to specify an output file name; otherwise, curl writes the downloaded content to the terminal (stdout). If we look at the help documents, we see two options for specifying the name of the output file with curl.
curl --help
-o, --output FILE Write output to <file> instead of stdout
-O, --remote-name Write output to a file named as the remote file
curl -O http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar
Listing the contents of the hbr_uhr_rna_sequencing directory, we see that the file HBR_UHR_ERCC_ds_5pc.tar appears when we use curl with the -O option, which writes a file that has the same name as the one in the URL (i.e., HBR_UHR_ERCC_ds_5pc.tar).
ls -l
-rw-r-----. 1 wuz8 wuz8 116602880 Jan 7 12:10 HBR_UHR_ERCC_ds_5pc.tar
Let's try downloading using curl but specifying a file name of our choice using the -o (lower case o) option. We will name the file HBR_UHR_READS.tar.
curl http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar -o HBR_UHR_READS.tar
Listing the contents of the hbr_uhr_rna_sequencing folder, we see that in addition to the file HBR_UHR_ERCC_ds_5pc.tar, which we downloaded using curl -O, we have the file HBR_UHR_READS.tar, which we downloaded using curl with the -o option (where we specified an output file name of our choice, rather than the one provided by the URL).
ls -l
-rw-r-----. 1 wuz8 wuz8 116602880 Jan 7 12:10 HBR_UHR_ERCC_ds_5pc.tar
-rw-r-----. 1 wuz8 wuz8 116602880 Jan 7 12:16 HBR_UHR_READS.tar
Let's go ahead and remove HBR_UHR_ERCC_ds_5pc.tar because it is the same as HBR_UHR_READS.tar.
rm HBR_UHR_ERCC_ds_5pc.tar
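To make the difference between the two curl output options concrete, here is a minimal sketch that can be run without network access. It rests on a few assumptions that are not part of the lesson: the "remote" file is a small made-up file served through a file:// URL in a temporary directory, the file names example_reads.tar and my_reads.tar are hypothetical, and -s (silent) is added only to hide the progress meter.

```shell
# Set up a pretend remote file and a downloads folder in a temporary directory.
workdir=$(mktemp -d)
mkdir "$workdir/remote" "$workdir/downloads"
printf 'pretend archive contents\n' > "$workdir/remote/example_reads.tar"
url="file://$workdir/remote/example_reads.tar"

cd "$workdir/downloads"

# No output option: curl writes the content to the terminal (stdout).
curl -s "$url"

# -O (upper case): save under the name taken from the URL.
curl -s -O "$url"

# -o (lower case): save under a name of our choice.
curl -s "$url" -o my_reads.tar

# Both files should now be listed.
ls -l
```

With a real dataset, only the URL changes; the -O and -o options behave the same way for http:// addresses.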
Tar files and how to unpack them
Earlier, we mentioned that the ".tar" extension stands for Tape Archive. A tape archive allows us to package many files and folders into a single file for easy transfer and sharing. We use the tar command to unpack these files. Options for the tar command can be found by using the command below.
tar --help
The options that we will use for unpacking are below. Note that we can use a single "-" to string together options in Unix commands.
-x, --extract, --get extract files from an archive
-v, --verbose verbosely list files processed
-f, --file=ARCHIVE use archive file or device ARCHIVE
tar -xvf HBR_UHR_READS.tar
Because we included the -v option in the tar command above, we see the files that are unpacked as the command runs.
HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
Note that sequencing data from the HBR-UHR dataset come as FASTQ files but are g-zipped (.gz). We will learn how to view data in the Unix terminal and point out how to view FASTQ files that are g-zipped without having to unzip them. Note that some bioinformatics software such as FASTQC can take g-zipped FASTQ files (i.e., fastq.gz) as input. Check the help documents of each tool to find out whether it accepts compressed input.
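To practice the tar options without touching the downloaded data, we can build a small archive ourselves and unpack it again. The sketch below is an illustration under stated assumptions: the folder and file names are made up, and it uses two tar options that are not covered above, -c (create an archive) and -t (list the contents of an archive without extracting).

```shell
# Build a small practice folder in a temporary directory (made-up file names).
workdir=$(mktemp -d)
cd "$workdir"
mkdir reads
printf 'sample one\n' > reads/sample_1.txt
printf 'sample two\n' > reads/sample_2.txt

# -c creates an archive; -f names the archive file.
tar -cf reads.tar reads
rm -r reads

# -t lists what is inside the archive without extracting anything.
listing=$(tar -tf reads.tar)
echo "$listing"

# -x extracts; -v prints each file as it is unpacked.
tar -xvf reads.tar
```

Listing an archive with -t before extracting is a cheap way to check what will be written into the current folder.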
Viewing file content in Unix
This portion of the lesson will focus on viewing contents of files in Unix. We will focus on the three types of files below.
- Plain text
- Tabular data (i.e., data files that have many columns and many rows, like a matrix); these data tables can have columns that are
- comma separated (csv)
- tab separated (these will come in the form of txt files)
- FASTQ files, which contain high throughput sequencing data
Viewing plain text files in Unix
For this portion of the lesson, let's change back into our data directory. Again, username is the username you used to sign into Biowulf. This could be your NIH username if you have a Biowulf account, or one of the student accounts that were set up for us.
cd /data/username
Next, we will use wget to grab the file unix_on_biowulf_2023.zip, which is listed under the section labeled Course data in the course documents.
wget https://btep.ccr.cancer.gov/docs/unix-on-biowulf-2023/data/unix_on_biowulf_2023.zip
We will then use unzip to unpack the contents of unix_on_biowulf_2023.zip.
unzip unix_on_biowulf_2023.zip
Note that we get a status of what is being unpacked as the unzipping occurs.
Archive: unix_on_biowulf_2023.zip
creating: unix_on_biowulf_2023/
inflating: unix_on_biowulf_2023/text_1.txt
inflating: unix_on_biowulf_2023/counts.csv
inflating: unix_on_biowulf_2023/results.csv
Listing the contents of our data folder, we will see a new folder called unix_on_biowulf_2023 (let's change into this).
ls -l
drwxr-xr-x. 2 wuz8 wuz8 4096 Jan 2 14:44 unix_on_biowulf_2023
cd unix_on_biowulf_2023
We will list the contents of the unix_on_biowulf_2023 directory to see what we have to work with.
ls
We have a gene expression counts table from the HBR-UHR dataset (counts.csv), the differential expression analysis results from the HBR-UHR dataset (results.csv), and a random text file (text_1.txt).
counts.csv results.csv text_1.txt
Let's see what is in text_1.txt by using cat
cat text_1.txt
oranges
blue
bananas
cats
dogs
apple
florida
gators
gainesville
alachua
county
btep
The head command can be used to view the top several lines of a file (default is 10 lines). We can use the -n option to specify how many lines we want (for instance, -n 5 will show the first five lines).
head -n 5 text_1.txt
oranges
blue
bananas
cats
dogs
The opposite of head, tail will show the bottom 10 lines of a file by default. Again, we can use -n to specify a number of lines other than the default.
tail -n 5 text_1.txt
gators
gainesville
alachua
county
btep
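Because head and tail both read from a pipe, they can be combined to pull out a block of lines from the middle of a file. The sketch below is a small illustration rather than part of the lesson data: it recreates the contents of text_1.txt in a temporary directory (an assumption, so that it can run anywhere) and then selects lines 6 to 8 by taking the first eight lines and keeping the last three of those.

```shell
# Recreate text_1.txt in a temporary directory so the sketch is self-contained.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' oranges blue bananas cats dogs apple florida gators \
  gainesville alachua county btep > text_1.txt

# First three lines of the file.
top_three=$(head -n 3 text_1.txt)
echo "$top_three"

# Last two lines of the file.
bottom_two=$(tail -n 2 text_1.txt)
echo "$bottom_two"

# Lines 6 to 8: take the first 8 lines, then keep the last 3 of those.
middle=$(head -n 8 text_1.txt | tail -n 3)
echo "$middle"
```

The last command prints apple, florida, and gators, which are lines 6, 7, and 8 of the file.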
We can use the zcat command to view the contents of compressed files without uncompressing them, for instance the FASTQ files that we downloaded for the HBR-UHR dataset. We will stay in the unix_on_biowulf_2023 directory for this but will put "../hbr_uhr_rna_sequencing/" in front of the file name to reference the directory that holds the FASTQ files. ".." tells Unix to go up one directory, and from there it looks in the folder hbr_uhr_rna_sequencing. We will use "|", the pipe, to send the output of zcat to the head -n 4 command to get the first four lines of the FASTQ file HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz.
zcat ../hbr_uhr_rna_sequencing/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz | head -n 4
The FASTQ file contains many sequencing reads and these come in 4 lines each, which are
- Metadata header that starts with "@"
- Actual sequence
- "+"
- Quality scores, which encode the error likelihood of each base along the read
@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAACTAGTATCAAAATGCATTTCT
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFFFFEEEEEEEDDDDCDDEEDEE
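Since every read occupies exactly four lines, the number of reads in a FASTQ file is the number of lines divided by four. The sketch below illustrates this on a toy file and is not part of the HBR-UHR data: the two reads are made up, the file name toy_reads.fastq is hypothetical, and the line counting uses wc -l, a command not covered in this lesson. It also assumes a Linux system such as Biowulf, where zcat reads .gz files directly.

```shell
# Work in a temporary directory so nothing in the data folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up reads, four lines each: header, sequence, "+", quality scores.
printf '%s\n' \
  '@read_1/1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read_2/1' 'TTGCAATGCC' '+' 'FFFFFHHHHH' > toy_reads.fastq
gzip toy_reads.fastq   # produces toy_reads.fastq.gz

# View the first read (four lines) without uncompressing the file on disk.
first_read=$(zcat toy_reads.fastq.gz | head -n 4)
echo "$first_read"

# Count lines in the compressed file, then divide by four to get the reads.
line_count=$(zcat toy_reads.fastq.gz | wc -l)
read_count=$((line_count / 4))
echo "Number of reads: $read_count"
```

The same two pipelines can be pointed at any of the downloaded .fastq.gz files by replacing the file name.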
For larger files that have a lot of rows, we can use less to scroll. For instance, if we cat counts.csv, it will print the entire file to the terminal. To view parts of it at a time while being able to scroll, we can use the less command, which is known as a terminal pager. Note that we can use the down arrow to scroll down a file and the up arrow to scroll up when using less.
less counts.csv
The counts.csv file is a gene expression counts table and has seven columns, where the first column contains gene IDs. Note that the columns are separated by commas, as suggested by the ".csv" extension.
Hit q to get out of less and return to the prompt.
Geneid,HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam
U2,0,0,0,0,0,0
CU459211.1,0,0,0,0,0,0
CU104787.1,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0
ACTR3BP6,0,0,0,0,0,0
5_8S_rRNA,0,0,0,0,0,0
AC137488.1,0,0,0,0,0,0
AC137488.2,0,0,0,0,0,0
CU013544.1,0,0,0,0,0,0
CT867976.1,0,0,0,0,0,0
CT867977.1,0,0,0,0,0,0
CT978678.1,0,0,0,0,0,0
CU459202.1,0,0,0,0,0,0
AC116618.1,0,0,0,0,0,0
CU463998.1,0,0,0,0,0,0
CU463998.3,0,0,0,0,0,0
CU463998.2,0,0,0,0,0,0
U6,0,0,0,0,0,0
LA16c-60D12.1,0,0,0,3,2,0
LA16c-13E4.3,0,0,0,0,0,1
LA16c-60D12.2,0,0,0,0,4,1
ZNF72P,0,0,0,0,1,0
We can also use the more command to scroll through counts.csv. At the bottom of the page, more prints out the percentage of the file content shown so far. We can hit enter to scroll line by line or the space bar to scroll page by page. Hit q to exit more and return to the prompt. Note that on Biowulf, we cannot scroll up with more.
more counts.csv
Geneid,HBR_1.bam,HBR_2.bam,HBR_3.bam,UHR_1.bam,UHR_2.bam,UHR_3.bam
U2,0,0,0,0,0,0
CU459211.1,0,0,0,0,0,0
CU104787.1,0,0,0,0,0,0
BAGE5,0,0,0,0,0,0
ACTR3BP6,0,0,0,0,0,0
5_8S_rRNA,0,0,0,0,0,0
AC137488.1,0,0,0,0,0,0
AC137488.2,0,0,0,0,0,0
CU013544.1,0,0,0,0,0,0
CT867976.1,0,0,0,0,0,0
CT867977.1,0,0,0,0,0,0
CT978678.1,0,0,0,0,0,0
CU459202.1,0,0,0,0,0,0
AC116618.1,0,0,0,0,0,0
CU463998.1,0,0,0,0,0,0
CU463998.3,0,0,0,0,0,0
CU463998.2,0,0,0,0,0,0
U6,0,0,0,0,0,0
LA16c-60D12.1,0,0,0,3,2,0
LA16c-13E4.3,0,0,0,0,0,1
LA16c-60D12.2,0,0,0,0,4,1
ZNF72P,0,0,0,0,1,0
--More--(1%)
The less command allows for horizontal scrolling if we add the -S option, which stops long lines from wrapping so that we can move sideways with the left and right arrow keys. We can also combine it with the column command to print tabular data with the columns nicely aligned. The -t option in column determines the number of columns and creates a table, while the -s option tells column which character separates the columns in the data table (a comma in this case, denoted by ',' in the command below). We use "|" to pipe, or send, the output of column to less -S. Hit q to get out of the following command and return to the prompt.
column -t -s ',' results.csv | less -S
name baseMean baseMeanA baseMeanB foldChange log2FoldChange lfcSE stat PValue PAdj FDR falsePos HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
SYNGR1 526.9 1012.5 41.3 0.04 -4.6 0.15 -31.66 5.3e-220 5.2e-217 0 0 986.6 1025.6 1025.3 37.4 50.5 36
SEPT3 500.7 960.8 40.7 0.042 -4.6 0.15 -30.71 4.6e-207 4.5e-204 0 0 932.3 933 1017 37.4 37.6 47.1
YWHAH 797.4 1361.1 233.8 0.172 -2.5 0.09 -29.72 4.8e-194 4.7e-191 0 0 1330.6 1402.1 1350.5 232.5 217.8 251
RPL3 1710.7 828.2 2593.2 3.139 1.7 0.07 24.91 5.6e-137 5.4e-134 0 0 852.2 782.8 849.7 2787 2382.1 2610.6
Pattern searching in Unix
We can use grep to search for patterns in files. For instance, the command below will find the word gainesville in text_1.txt. The syntax for grep is the command, followed by the pattern, and then where we would like to find the pattern (text_1.txt in this case).
grep gainesville text_1.txt
gainesville
If we use the -v option, we can select the lines in a file that do not contain a pattern. In the grep command below, we will print out every line in text_1.txt that does not contain gainesville.
grep -v gainesville text_1.txt
oranges
blue
bananas
cats
dogs
apple
florida
gators
alachua
county
btep
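A few other grep options are often useful alongside -v. The sketch below is an illustration only: it recreates text_1.txt in a temporary directory (an assumption, so that it can run anywhere) and uses three options that are not covered above, -n (show the line number of each match), -c (count matching lines), and -i (ignore upper versus lower case). It also uses wc -l, which is not part of this lesson, to count the lines kept by -v.

```shell
# Recreate text_1.txt in a temporary directory so the sketch is self-contained.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' oranges blue bananas cats dogs apple florida gators \
  gainesville alachua county btep > text_1.txt

# -n: print the line number in front of each matching line.
numbered=$(grep -n gainesville text_1.txt)
echo "$numbered"        # 9:gainesville

# -c: count the lines that contain the pattern instead of printing them.
ga_count=$(grep -c ga text_1.txt)
echo "$ga_count"        # 2 (gators and gainesville)

# -i: ignore case, so BTEP matches btep.
any_case=$(grep -i BTEP text_1.txt)
echo "$any_case"        # btep

# -v piped to wc -l: count the lines that do NOT contain the pattern.
kept=$(grep -v gainesville text_1.txt | wc -l)
echo "$kept"
```

The final count is 11, one fewer than the 12 lines in the file, which matches the output of grep -v shown above.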