Lesson 7: Downloading the RNA-Seq Data and Dataset Overview

Lesson Review

pwd (print working directory)
ls (list)
touch (creates an empty file)
nano (basic editor for creating small text files)
rm (remove files). Be careful!
mkdir (make a directory) and rmdir (remove a directory, must be empty of all files)
cd (change directory), by itself will take you home, cd .. (will take you up one directory), cd /results_dir/exp1 (go directly to this directory)
mv (for renaming files or moving files)
less (for viewing files, "more" is the older version of this)
man command (for viewing the man pages when you need help on a command)
cp (copy) for copying files
Flags and command options - making programs do what they do
Wildcards (e.g., *)
Tab complete - for less typing
Accessing user history with the "up" and "down" arrows on the keyboard
cat, head, and tail - print to screen, print first few lines to the screen, print last few lines to the screen
Working with file content (<, >, >>)
Combining commands with pipe (|). Where the heck is pipe anyway?
Finding information in files with grep
Performing repetitive actions with Unix (for loop), GNU parallel
Permissions (chmod, chown)
wc - number of lines (-l), words (-w), and bytes (-c, usually one byte per character); for number of characters use -m.
grep - search files using regular expressions
cut - cuts selected portions of a file
fastq-dump and fasterq-dump - SRA file download
ssh - secure shell protocol for remote login to Biowulf / Helix

Learning Objectives

Introduce the RNA-Seq data
Use wget and curl to download files
Learn to compress / decompress and unarchive files
Learn sed and awk for file editing

Getting Project files

UHR and HBR data

For this class, we are going to work with data from and associated with two commercially available sets of RNA samples, Universal Human Reference (UHR) and Human Brain Reference (HBR).

UHR - bulk RNA from 10 cancer cell lines.
HBR - bulk RNA from 23 brains; subjects were Caucasian, both sexes, and mostly between 60 and 80 years of age.

The data are paired-end with three replicates from each set (UHR, HBR).

These data are from:
Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith. Informatics for RNA-seq: A web resource for analysis on the cloud. PLoS Computational Biology 11(8):e1004393 (2015).

There is also an accompanying tutorial.

Downloading the data

Let's download the data and learn how to decompress it. First, we will create a place to store the data.

Go to the directory you created for working with class material. If you haven't created a class directory (biostar_class), do that now.

mkdir biostar_class

Now, change to that directory.

cd biostar_class

Create a directory for the data we are going to download.

mkdir -p RNA_Seq/raw_data

What does the -p flag do? Now, go to the raw_data directory you have created.

cd RNA_Seq/raw_data
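
One way to answer the -p question is to try mkdir with and without it in a throwaway location. The sketch below assumes a scratch directory made with mktemp (not part of the lesson), so nothing in biostar_class is touched and your working directory does not change.

```shell
# Scratch directory so the experiment does not touch real class files.
scratch=$(mktemp -d)

# Without -p, mkdir fails because the parent directory RNA_Seq does not exist yet.
mkdir "$scratch/RNA_Seq/raw_data" 2>/dev/null || echo "plain mkdir failed: parent directory is missing"

# With -p, any missing parent directories are created along the way.
mkdir -p "$scratch/RNA_Seq/raw_data"

# Running it again is not an error; -p stays quiet if the directory already exists.
mkdir -p "$scratch/RNA_Seq/raw_data"

# Show the nested directories that were created.
ls -R "$scratch"
```

When you are done, the scratch directory can be removed with rm -r "$scratch".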

Now that we're in the correct directory, we will use curl to download some bulk RNA-Seq data.

curl http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz --output griffith-data.tar.gz

Let's take a look at this Unix command line... The curl command is used to retrieve data from web sites. A similar command is wget. The Unix system you are working with may have either curl or wget installed. To see which is active on your system, just type the command at the command line like this...

wget

You may see an error like this if wget is not installed.

-bash: wget: command not found

Next, try the curl command.

curl

If curl is active on the system, you may see something like this...

curl: try 'curl --help' or 'curl --manual' for more information

We can do as the instructions say...

curl --help

and see information on the usage of curl. So it looks like curl is installed on this system.
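
Typing a command name to see whether it exists works, but the shell also has a built-in way to ask. The sketch below uses command -v for the check; the helper names check_tool and fetch are made up for this illustration and are not part of the lesson. The fetch function is only defined here, not run, because it needs a network connection.

```shell
# check_tool is a hypothetical helper: report whether a program can be found.
check_tool() {
    # command -v prints the program's location and succeeds only if it is found.
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: installed at $(command -v "$1")"
    else
        echo "$1: not found"
    fi
}

check_tool curl
check_tool wget

# fetch is a hypothetical helper: download URL $1 to file $2 with whichever tool exists.
fetch() {
    if command -v curl >/dev/null 2>&1; then
        curl -L "$1" --output "$2"    # -L follows redirects
    elif command -v wget >/dev/null 2>&1; then
        wget "$1" -O "$2"             # -O names the output file
    else
        echo "neither curl nor wget is installed" >&2
        return 1
    fi
}

# Example call (commented out because it downloads the class data):
# fetch http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz griffith-data.tar.gz
```

Note that wget, unlike curl, saves a download under its remote file name by default, so the -O option is only needed when you want a different name.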

Moving on. Let's take a look at this command line. We now know what curl means, but how about the rest of it? The URL http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz represents the "path" to this data. As we have discussed, paths are a very important concept in Unix. An incorrect path can result in frustrating "file not found" errors.

The path to the file griffith-data.tar.gz is data.biostarhandbook.com/rnaseq/projects/griffith. In other words, on the data.biostarhandbook.com server there is a directory (folder) named rnaseq, which contains projects, which contains griffith, which in turn holds the file griffith-data.tar.gz. Notice that there are no blank spaces in the path name; Unix does not handle spaces in file names, directories, or paths easily.
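
The split between "where the file lives" and "what the file is called" can be seen with the standard basename and dirname commands. This is a small sketch; the variable names url, file_name, and location are chosen here for illustration.

```shell
url="http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz"

# basename keeps only the last component of the path: the file name.
file_name=$(basename "$url")

# dirname strips the last component, leaving the location that holds the file.
location=$(dirname "$url")

echo "file:     $file_name"
echo "location: $location"
```

The double quotes around "$url" are a good habit: they keep a path together as one argument even if it ever contains a space.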

Another way to get to this data file is via your browser. Open a browser window and enter http://data.biostarhandbook.com/rnaseq/projects/griffith. You will see an index page listing all the directories at this location.


If you look closely, you will find a file named griffith-data.tar.gz. What happens if you click on this link? Does it download? Can you open a tar file in the Mac environment? How about on PC? How would you do it?

Okay, let's take a look at the file name griffith-data.tar.gz. What does the .tar.gz extension mean? tar refers to "tape archive" and is used to archive a set of files into a single file. The tar command can also be used to compress an archive using some form of compression. The -z flag, for example, compresses the archive using gzip, which results in the extension .gz. Note: gzip is a command on its own and can be run independently.

How do we deal with tar.gz files? On a Unix system, we untar and unzip the file using tar with the flags -x, -v, and -f. tar auto-detects the compression type, so nothing specific is needed to handle the compression type.

tar -xvf filename.tar

What does -xvf mean? If we check the man page for tar, we could find out...

man tar

-x means - extract to disk from the tar (tape archive),

-v means - produce verbose output. When using this flag tar will list each file name as it is read from the tar (tape archive).

-f (file) means read the tar (tape archive) from or to the specified file.
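
These flags can be tried safely on a small throwaway archive before touching the class data. The sketch below assumes a scratch directory from mktemp and made-up file names (project, a.txt, b.txt); it also uses -c (create), -z (gzip), -t (list), and -C (change to a directory first), which are standard tar options not covered above.

```shell
# Build a tiny directory to archive (names are made up for this demo).
demo=$(mktemp -d)
mkdir "$demo/project"
echo "sample A" > "$demo/project/a.txt"
echo "sample B" > "$demo/project/b.txt"

# -c create an archive, -z compress it with gzip, -f write to the named file.
tar -czf "$demo/project.tar.gz" -C "$demo" project

# -t lists the contents of the archive without extracting anything.
tar -tzf "$demo/project.tar.gz"

# -x extract, -v list each file as it is processed, -f read from the named file.
# -C extracts into a separate directory so the originals are not overwritten.
mkdir "$demo/unpacked"
tar -xvf "$demo/project.tar.gz" -C "$demo/unpacked"
```

Listing with -t first is a useful habit with unfamiliar archives, since it shows what will be written to disk before anything is extracted.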

Let's untar and unzip our file.

tar -xvf griffith-data.tar.gz

What happens when you run the tar command?

You should see each of the files listed as the tar is decompressed. Two directories were created in this process: a reads directory and a refs directory. In the reads directory there are 12 fastq files. In the refs directory, there are 4 files, containing genome and annotation information. Keep in mind that we will be using a subsetted reference file from human chromosome 22.

The fastq files are unzipped, but you may obtain zipped fastq files in the future. Because many bioinformatics programs can work directly with fastq.gz files, let's compress these files to save space.

gzip reads/*.fq

Note the use of the * wildcard. We are using gzip to zip all files ending in .fq in the directory reads.
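
To see what gzip does to a set of files, here is a small round trip on toy data. The file names (toy_R1.fq, toy_R2.fq) and their contents are invented for this sketch and live in a scratch directory.

```shell
# Two tiny made-up FASTQ files in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/reads"
printf '@read1\nACGT\n+\nIIII\n' > "$demo/reads/toy_R1.fq"
printf '@read1\nTGCA\n+\nIIII\n' > "$demo/reads/toy_R2.fq"

# The shell expands *.fq to both file names before gzip runs.
# gzip replaces each file with a compressed copy ending in .gz.
gzip "$demo"/reads/*.fq
ls "$demo/reads"

# gzip -t tests that each compressed file is intact.
gzip -t "$demo"/reads/*.fq.gz && echo "all compressed files OK"

# gunzip reverses the compression and restores the original .fq file.
gunzip "$demo/reads/toy_R1.fq.gz"
ls "$demo/reads"
```

Note that gzip replaces the original file rather than keeping both copies, which is why the .fq files disappear after compression.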

To peek inside these files after zipping, you can use zcat (or gzcat on a Mac) paired with head. This works similarly to cat paired with head.

zcat reads/HBR_1_R1.fq.gz | head -n 8

In this case, we are "piping" the results of zcat into head with the pipe symbol | and selecting the top 8 lines of the file (-n 8).

The results should show the top 8 lines of the .fq.gz file.

@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAACTAGTATCAAAATGCATTTCT
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFFFFEEEEEEEDDDDCDDEEDEE
@HWI-ST718_146963544:7:2215:16531:12741/1
CAAAATATTTTTTTTTTCTGTATATGACAAGACACACATCAGATCATAAGCTACAAGAAAACAAACAAAAAAGATATGAAAAAGATATAAAGACCTCCCC
+
@@@DDDDDFFFFFIIII;??::::9?99?G8;)9/8'787.)77;@==D=?;?A>D?@BDC@?CC=?BBBBB?<:4::@BBBB<?:>:@DD343<>:?BB
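
Each FASTQ record occupies four lines (name, sequence, +, qualities), so the 8 lines above are two reads. The same piping idea can count reads in a compressed file. The sketch below uses a made-up three-read file (toy.fq) and gzip -cd, which does the same job as zcat / gzcat and behaves the same on Linux and Mac.

```shell
# Three made-up reads, four lines each, in a scratch directory.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > "$demo/toy.fq"
gzip "$demo/toy.fq"

# gzip -cd writes the decompressed text to the screen and leaves the .gz file as is.
gzip -cd "$demo/toy.fq.gz" | head -n 4

# Count the lines, then divide by 4 to get the number of reads.
n_lines=$(gzip -cd "$demo/toy.fq.gz" | wc -l)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```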

Keep in mind, there are several Unix commands that can be used to look at the contents of files; each has its own flags/options and is used slightly differently. For example:

less
more
cat
head 
tail

less, in particular, can also be used to examine zipped files with the help of lesspipe on certain Unix systems. On Biowulf, for example, you can use less to view compressed / archived files.

A brief introduction to `awk` and `sed`

What is `awk`?

A scripting language that can be used for manipulating data and generating reports.

Awk is a utility that enables a programmer to write tiny but effective programs in the form of statements that define text patterns that are to be searched for in each line of a document and the action that is to be taken when a match is found within a line. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then performs the associated actions. ---https://www.geeksforgeeks.org/awk-command-unixlinux-examples/

awk works line by line. The basic syntax is

awk 'CONDITION { ACTIONS }'

For each line, awk tries to match the CONDITION, and if that condition matches, it performs the ACTIONS. ---Biostar Handbook, The Art of Bioinformatics Scripting

There can be multiple conditions and actions.
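
To make the CONDITION { ACTIONS } pattern concrete, here is a small sketch on a made-up table (toy_counts.csv, with invented sample names and read counts). It shows a condition with no action, where awk simply prints the matching line, and a program with two condition and action pairs.

```shell
# A made-up three-column table: sample, group, read count.
demo=$(mktemp -d)
printf 'HBR_1,brain,1200\nHBR_2,brain,800\nUHR_1,cancer,1500\n' > "$demo/toy_counts.csv"

# CONDITION only: with no action given, awk prints each line where column 3 exceeds 1000.
awk -F ',' '$3 > 1000' "$demo/toy_counts.csv"

# Two CONDITION { ACTION } pairs in one program; each line is tested against both.
awk -F ',' '$2 == "brain" { print $1, "is HBR" } $2 == "cancer" { print $1, "is UHR" }' "$demo/toy_counts.csv"
```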

Let's see awk in action. Let's return to runinfo.csv from Lesson 6. We can use awk to print columns of interest.

For example,

cd ../../sra_data  
awk -F ',' '{ print $1,$4,$7 }' runinfo.csv | head > awk_example.txt

Here, the action is to simply print the first $1, fourth $4, and seventh $7 columns from runinfo.csv. Since there is no condition to be met, awk acts on all lines. The -F flag is used to specify the field separator. In this case, we are looking at a comma separated file, so we use ,. If we also want the output to be comma separated, we need to use the special awk variable OFS.

awk -F ',' '{ OFS=","; print $1,$4,$7 }' runinfo.csv
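
Setting OFS inside the action works, but it is reassigned on every line. A common alternative is a BEGIN block, which awk runs once before reading any input. The sketch below uses a made-up file (toy_runinfo.csv, with invented accessions and values) rather than the real runinfo.csv.

```shell
# A made-up comma separated file with a header row.
demo=$(mktemp -d)
printf 'Run,Spots,Platform\nSRR0000001,100,ILLUMINA\nSRR0000002,250,ILLUMINA\n' > "$demo/toy_runinfo.csv"

# BEGIN runs once, so the input (FS) and output (OFS) separators are set a single time.
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3 }' "$demo/toy_runinfo.csv"

# NR is the current line number; the condition NR > 1 skips the header row.
# Here the output separator is a tab instead of a comma.
awk 'BEGIN { FS = ","; OFS = "\t" } NR > 1 { print $1, $2 }' "$demo/toy_runinfo.csv"
```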

There are many resources online for getting started with awk. There is a chapter in the Biostar Handbook, IV Awk Programming in the Art of Bioinformatics Scripting. You may also find this article series explaining awk one-liners handy.

What is `sed`?

sed stands for stream editor. Functions include searching, find and replace, and insertion / deletion.

sed is often used for its "find and replace" capabilities.

For example, let's replace "SRR" in awk_example.txt with "ACC".

sed 's/SRR/ACC/' awk_example.txt

Notice the single quotes containing our substitution phrase. The s specifies sed's substitution command, while the / characters separate the search pattern and the replacement string. Only the first occurrence in each line will be substituted. To substitute all occurrences in a line, use the global option: 's/SRR/ACC/g'.
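
The difference between replacing the first match and replacing every match is easiest to see on a line that contains the pattern twice. This sketch uses a made-up line of text and invented variable names.

```shell
line="SRR1 and SRR2"

# Without g, only the first match on the line is replaced.
first_only=$(echo "$line" | sed 's/SRR/ACC/')

# With g, every match on the line is replaced.
all_matches=$(echo "$line" | sed 's/SRR/ACC/g')

echo "$first_only"     # ACC1 and SRR2
echo "$all_matches"    # ACC1 and ACC2
```

Keep in mind that sed prints the edited text to the screen and leaves the input file unchanged; to keep the result, redirect it to a new file with >.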

You can pair sed with regular expressions. For example, let's say we want to replace a few of the run accessions, those ending with a "17", "18", or "19", with "Unknown".

sed 's/SRR197291./Unknown/' awk_example.txt

The . in this context matches any single character, so the pattern matches SRR197291 followed by any one character.
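
Because . matches any character, the pattern above would also replace an accession such as SRR1972916 if one were present. A character class is one way to restrict the match to specific digits. The sketch below uses a made-up accession list (toy_runs.txt) rather than awk_example.txt.

```shell
# Made-up accession list for the demo.
demo=$(mktemp -d)
printf 'SRR1972916\nSRR1972917\nSRR1972918\nSRR1972919\nSRR1972920\n' > "$demo/toy_runs.txt"

# . matches any one character, so the accession ending in 16 is replaced too.
sed 's/SRR197291./Unknown/' "$demo/toy_runs.txt"

# [789] matches exactly one character that is 7, 8, or 9.
sed 's/SRR197291[789]/Unknown/' "$demo/toy_runs.txt"
```

Comparing the two outputs line by line shows which accessions each pattern touches.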

For more information, see https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/. Also, check out these handy one-liners and their explanations.

Help Session

For this help session, you will be downloading the Golden Snidget data. Practice materials are located here.