Lesson 7: Downloading the RNA-Seq Data and Dataset Overview
Lesson Review
- pwd (print working directory)
- ls (list)
- touch (creates an empty file)
- nano (basic editor for creating small text files)
- using the rm command to remove files. Be careful!
- mkdir (make a directory) and rmdir (remove a directory, must be empty of all files)
- cd (change directory), by itself will take you home, cd .. (will take you up one directory), cd /results_dir/exp1 (go directly to this directory)
- mv (for renaming files or moving files)
- less (for viewing files, "more" is the older version of this)
- man command (for viewing the man pages when you need help on a command)
- cp (copy) for copying files
- Flags and command options - making programs do what they do
- Wildcards (e.g., *)
- Tab complete - for less typing
- Accessing user history with the "up" and "down" arrows on the keyboard
- cat, head, and tail - print to screen, print first few lines to the screen, print last few lines to the screen
- Working with file content (<, >, >>)
- Combining commands with pipe (|). Where the heck is pipe anyway?
- Finding information in files with grep
- Performing repetitive actions with Unix (for loop), GNU parallel
- Permissions (chmod, chown)
- wc - number of lines (-l), words (-w), and bytes (-c, usually one byte per character); for number of characters use -m.
- grep - search files using regular expressions
- cut - cuts selected portions of a file
- fastq-dump and fasterq-dump - SRA file download
- ssh - secure shell protocol for remote login to Biowulf / Helix
Learning Objectives
- Introduce the RNA-Seq data
- Use wget and curl to download files
- Learn to compress / decompress and unarchive files
- Learn sed and awk for file editing
Getting Project files
UHR and HBR data
For this class, we are going to work with data from and associated with two commercially available sets of RNA samples, Universal Human Reference (UHR) and Human Brain Reference (HBR).
- UHR - bulk RNA from 10 cancer cell lines.
- HBR - bulk RNA from 23 brains; subjects were Caucasian, both sexes, and mostly between 60 and 80 years of age.
The data are paired-end with three replicates from each set (UHR, HBR).
These data are from:
Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith.
There is also an accompanying tutorial.
Downloading the data
Let's download the data and learn how to decompress it. First, we will create a place to store the data.
Go to the directory you created for working with class material. If you haven't created a class directory (biostar_class), do that now.
mkdir biostar_class
Now, change to that directory and create a nested directory for the raw RNA-Seq data.
cd biostar_class
mkdir -p RNA_Seq/raw_data
What does the -p flag do? Now, go to the raw_data directory you have created.
cd RNA_Seq/raw_data
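To see what -p changes, here is a small sketch that runs mkdir with and without the flag. It works inside a throwaway directory made with mktemp (an assumption for the demo, so your class folder is not touched).

```shell
# Work in a throwaway directory so nothing in your class folder is touched.
demo_dir=$(mktemp -d)

# Without -p, mkdir fails because the parent directory RNA_Seq does not exist yet.
if mkdir "$demo_dir/RNA_Seq/raw_data" 2>/dev/null; then
    without_p="succeeded"
else
    without_p="failed"
fi
echo "mkdir without -p: $without_p"

# With -p, mkdir creates RNA_Seq and raw_data in one step.
mkdir -p "$demo_dir/RNA_Seq/raw_data"

# Running it again is harmless: -p does not complain if the directory exists.
mkdir -p "$demo_dir/RNA_Seq/raw_data" && echo "mkdir -p is safe to repeat"

# Show the nested directories that were created.
ls -R "$demo_dir"
```

In short, -p creates any missing parent directories and does not report an error when the directory is already there.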
Now that we're in the correct directory, we will use curl to download some bulk RNA-Seq data.
curl http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz --output griffith-data.tar.gz
Let's take a look at this Unix command line...
The curl command is used to retrieve data from web sites. A similar command is wget. The Unix system you are working with may have either curl or wget installed. To see which is active on your system, just type the command at the command line like this...
wget
You may see an error like this if wget is not installed.
-bash: wget: command not found
Next, try the curl command.
curl
If curl is active on the system, you may see something like this...
curl: try 'curl --help' or 'curl --manual' for more information
We can do as the instructions say...
curl --help
This prints the help text for curl. So it looks like curl is installed on this system.
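Typing each command by hand works, but the same check can be scripted. The sketch below uses the shell built-in command -v, which reports whether a program can be found. The variable name downloader is made up for this example.

```shell
# command -v prints the path of a program if the shell can find it,
# and returns a non-zero exit status if it cannot.
if command -v curl >/dev/null 2>&1; then
    downloader="curl"
elif command -v wget >/dev/null 2>&1; then
    downloader="wget"
else
    downloader="none"
fi
echo "Download tool found: $downloader"
```

If only wget is available, running wget followed by the URL should save the file under the name it has on the server, so no --output option is needed.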
Moving on. Let's take a look at this command line. We now know what curl means, but how about the rest of it? The URL http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz represents the "path" to this data. As we have discussed, paths are a very important concept in Unix. An incorrect path can result in frustrating "file not found" errors.
The path to the file griffith-data.tar.gz is data.biostarhandbook.com/rnaseq/projects/griffith, which can be translated as: on the data.biostarhandbook.com server, there is a directory (folder) named rnaseq that contains projects, which contains griffith and subsequently the file griffith-data.tar.gz. Notice how there are no blank spaces in the path name. Unix treats a space as a separator between arguments, so spaces in file names, directories, or paths must be quoted or escaped and are best avoided.
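The breakdown of the URL described above can be reproduced on the command line. This is a minimal sketch using shell parameter expansion together with basename and dirname. The variable names are chosen for illustration only.

```shell
url="http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz"

# Strip the "http://" prefix, leaving server + path.
no_scheme="${url#*://}"

# Everything before the first "/" is the server name.
server="${no_scheme%%/*}"

# basename keeps only the final component (the file name).
file_name=$(basename "$url")

# dirname drops the final component, leaving the directory part.
dir_path=$(dirname "/${no_scheme#*/}")

echo "server:    $server"
echo "directory: $dir_path"
echo "file:      $file_name"
```

basename and dirname work the same way on local paths, which makes them handy for building output file names in scripts.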
Another way to get to this data file is via your browser. Open a browser window and enter http://data.biostarhandbook.com/rnaseq/projects/griffith. You will see an index page listing all the directories at this location.
If you look closely, you will find a file named griffith-data.tar.gz. What happens if you click on this link? Does it download? Can you open a tar file in the Mac environment? How about on a PC? How would you do it?
Okay, let's take a look at the file name griffith-data.tar.gz. What does the .tar.gz extension mean? tar refers to "tape archive" and is used to archive a set of files into a single file. The tar command can also compress the archive as it is created. The -z flag, for example, compresses the archive using gzip, which results in the extension .gz. Note: gzip is a command on its own and can be run independently.
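To make the archive-then-compress idea concrete, the sketch below builds a small .tar.gz from two made-up text files in a temporary directory and then lists what is inside it. The file and directory names are hypothetical and are not part of the class data.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/project/reads"
echo "sample one" > "$demo_dir/project/reads/a.txt"
echo "sample two" > "$demo_dir/project/reads/b.txt"

# -c create an archive, -z compress it with gzip, -f name of the archive file.
# -C changes into demo_dir first so the archive stores relative paths.
tar -czf "$demo_dir/project.tar.gz" -C "$demo_dir" project

# -t lists the contents of an archive without extracting anything.
tar -tzf "$demo_dir/project.tar.gz"

# Count the .txt entries stored in the archive.
n_txt=$(tar -tzf "$demo_dir/project.tar.gz" | grep -c '\.txt$')
echo "text files in archive: $n_txt"
```

Listing with -t is a quick way to check what an archive contains before unpacking it.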
How do we deal with tar.gz files? On a Unix system, we untar and unzip the file using tar with the flags -x, -v, and -f. tar auto-detects the compression type when reading an archive, so no extra flag is needed to handle the compression.
tar -xvf filename.tar.gz
What does -xvf mean? If we check the man page for tar, we could find out...
man tar
-x means extract to disk from the tar (tape archive).
-v means produce verbose output. When using this flag, tar will list each file name as it is read from the tar (tape archive).
-f (file) means read the tar (tape archive) from or to the specified file.
Let's untar and unzip our file.
tar -xvf griffith-data.tar.gz
What happens when you run the tar command?
You should see each of the files listed as the archive is extracted. Two directories were created in this process: a reads directory and a refs directory. In the reads directory there are 12 fastq files. In the refs directory, there are 4 files containing genome and annotation information. Keep in mind that we will be using a subsetted reference file from human chromosome 22.
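A quick way to confirm that an extraction produced what you expect is to count the files in each directory. The sketch below does this on a tiny stand-in archive with made-up file names rather than the class data, so it can be run anywhere.

```shell
# Build a tiny stand-in archive (hypothetical file names, not the class data).
work=$(mktemp -d)
mkdir -p "$work/src/reads" "$work/src/refs"
touch "$work/src/reads/S1_R1.fq" "$work/src/reads/S1_R2.fq" "$work/src/refs/genome.fa"
tar -czf "$work/demo.tar.gz" -C "$work/src" reads refs

# Extract into a separate directory; no -z is needed because tar
# detects the gzip compression on its own when reading.
mkdir "$work/out"
tar -xf "$work/demo.tar.gz" -C "$work/out"

# Count the extracted files in each directory (tr removes padding spaces).
n_reads=$(ls "$work/out/reads" | wc -l | tr -d ' ')
n_refs=$(ls "$work/out/refs" | wc -l | tr -d ' ')
echo "reads: $n_reads files, refs: $n_refs files"
```

For the class data, the same ls piped into wc -l, run on reads and on refs, should report the 12 and 4 files mentioned above.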
The fastq files are uncompressed, but you may obtain compressed (gzipped) fastq files in the future. Because many bioinformatics programs can work directly with fastq.gz files, let's compress these files to save space.
gzip reads/*.fq
Notice the * wildcard. We are using gzip to compress all files ending in .fq in the directory reads.
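Before compressing with a wildcard, it can help to preview which files the pattern matches. The sketch below does that on two made-up fastq files in a temporary directory, then compresses them and decompresses one again.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/reads"
for sample in HBR_1 UHR_1; do
    printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/reads/${sample}_R1.fq"
done

# Before compressing, preview what the wildcard expands to.
echo "$demo_dir"/reads/*.fq

# gzip replaces each matching file with a .gz version.
gzip "$demo_dir"/reads/*.fq
ls "$demo_dir/reads"

# Count the compressed files.
n_gz=$(ls "$demo_dir"/reads/*.fq.gz | wc -l | tr -d ' ')
echo "compressed files: $n_gz"

# gunzip (or gzip -d) reverses the step if you need plain text again.
gunzip "$demo_dir/reads/HBR_1_R1.fq.gz"
```

Note that gzip does not keep the original file by default; each .fq is replaced by its .fq.gz.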
To peek inside these files after zipping, you can use zcat, or gzcat on a Mac, paired with head. This works similarly to cat paired with head.
zcat reads/HBR_1_R1.fq.gz | head -n 8
In this case, we are "piping", with the pipe symbol |, the results of zcat into head and selecting the top 8 lines of the file (-n 8).
The results should show the top 8 lines of the .fq.gz file.
@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAACTAGTATCAAAATGCATTTCT
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFFFFEEEEEEEDDDDCDDEEDEE
@HWI-ST718_146963544:7:2215:16531:12741/1
CAAAATATTTTTTTTTTCTGTATATGACAAGACACACATCAGATCATAAGCTACAAGAAAACAAACAAAAAAGATATGAAAAAGATATAAAGACCTCCCC
+
@@@DDDDDFFFFFIIII;??::::9?99?G8;)9/8'787.)77;@==D=?;?A>D?@BDC@?CC=?BBBBB?<:4::@BBBB<?:>:@DD343<>:?BB
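The output above shows two reads, each taking four lines (header, sequence, +, quality scores). That layout makes it possible to count reads in a compressed file by counting lines and dividing by four. Here is a minimal sketch on a made-up two-read file. It uses gzip -dc in place of zcat, on the assumption that this is more portable between Linux and macOS.

```shell
demo_dir=$(mktemp -d)
# Two made-up records, four lines each: header, sequence, "+", qualities.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nHHHH\n' > "$demo_dir/demo.fq"
gzip "$demo_dir/demo.fq"

# gzip -dc writes the decompressed text to standard output, like zcat.
n_lines=$(gzip -dc "$demo_dir/demo.fq.gz" | wc -l | tr -d ' ')

# Each FASTQ record spans four lines, so divide by 4.
n_reads=$((n_lines / 4))
echo "lines: $n_lines  reads: $n_reads"
```

The compressed file itself is never unpacked on disk; the text only flows through the pipe.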
Keep in mind that several Unix commands can be used to look at the contents of files; each has its own flags/options and is used slightly differently. For example:
less
more
cat
head
tail
less, in particular, can also be used to examine zipped files with the help of lesspipe, on certain Unix systems. On Biowulf, for example, you can use less to view compressed / archived files.
A brief introduction to awk and sed
What is awk?
A scripting language that can be used for manipulating data and generating reports.
Awk is a utility that enables a programmer to write tiny but effective programs in the form of statements that define text patterns that are to be searched for in each line of a document and the action that is to be taken when a match is found within a line. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then performs the associated actions. ---https://www.geeksforgeeks.org/awk-command-unixlinux-examples/
awk works line by line. The basic syntax is
awk 'CONDITION { ACTIONS }'
For each line, awk tries to match the CONDITION, and if that condition matches, it performs the ACTIONS. ---Biostar Handbook, The Art of Bioinformatics Scripting
There can be multiple conditions and actions.
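As an illustration of the CONDITION { ACTIONS } pattern, the sketch below runs three small awk programs on a made-up CSV file. The column names and values are hypothetical. NR is awk's built-in line counter, used here to skip the header row.

```shell
demo_csv=$(mktemp)
printf 'sample,tissue,reads\nHBR_1,brain,120\nUHR_1,cell_line,95\nHBR_2,brain,101\n' > "$demo_csv"

# CONDITION: not the header line (NR > 1) and third column greater than 100.
# ACTION: print the first column.
high=$(awk -F ',' 'NR > 1 && $3 > 100 { print $1 }' "$demo_csv")
echo "$high"

# A condition with no action prints the whole matching line.
awk -F ',' '$2 == "brain"' "$demo_csv"

# Two CONDITION { ACTION } pairs in one program, plus END, which runs
# once after the last line has been read.
counts=$(awk -F ',' '$2 == "brain" { b++ } $2 == "cell_line" { c++ } END { print b, c }' "$demo_csv")
echo "brain and cell_line rows: $counts"
```

The NR > 1 test matters here: without it, the header text in the third column would be compared as a string and could match by accident.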
Let's see awk in action. Let's return to runinfo.csv from Lesson 6. We can use awk to print columns of interest.
For example,
cd ../../sra_data
awk -F ',' '{ print $1,$4,$7 }' runinfo.csv | head > awk_example.txt
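This prints the first ($1), fourth ($4), and seventh ($7) columns from runinfo.csv, keeps the first 10 lines with head, and saves them to awk_example.txt. Since there is no condition to be met, awk acts on all lines. The -F flag is used to specify the field separator. In this case, we are looking at a comma-separated file, so we use a comma (,). If we also want the output to be comma separated, we need to set the special awk variable OFS (output field separator); by default, print joins fields with a single space.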
awk -F ',' '{ OFS=","; print $1,$4,$7 }' runinfo.csv
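Another common way to set the separators is a BEGIN block, which awk runs once before reading any input. The sketch below sets both FS and OFS there and also skips the header line with NR > 1. The CSV contents and column names are made up for the demo.

```shell
demo_csv=$(mktemp)
printf 'Run,LibraryName,spots\nSRR0000001,libA,100\nSRR0000002,libB,200\n' > "$demo_csv"

# BEGIN runs once before any line is read, so FS and OFS are set up front.
# NR > 1 skips the header row.
out=$(awk 'BEGIN { FS = ","; OFS = "," } NR > 1 { print $1, $3 }' "$demo_csv")
echo "$out"
```

Setting the variables once in BEGIN avoids reassigning OFS on every line, although both forms give the same output.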
There are many resources online for getting started with awk. There is a chapter in the Biostar Handbook, IV Awk Programming in the Art of Bioinformatics Scripting. You may also find this article series explaining awk one-liners handy.
What is sed?
sed stands for stream editor. Functions include searching, find and replace, and insertion / deletion.
sed is often used for its "find and replace" capabilities.
For example, let's replace "SRR" in awk_example.txt with "ACC".
sed 's/SRR/ACC/' awk_example.txt
s specifies sed's substitution command, while the / characters separate the search pattern and the replacement string. The first occurrence in each line will be substituted. To substitute all occurrences in a line, use the global option, 's/SRR/ACC/g'.
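The difference between the default and the global option is easiest to see on a single line with several matches. A minimal sketch, using a made-up line of text:

```shell
line="SRR001,SRR002,SRR003"

# Without g, only the first match on the line is replaced.
first_only=$(echo "$line" | sed 's/SRR/ACC/')

# With g, every match on the line is replaced.
all=$(echo "$line" | sed 's/SRR/ACC/g')

echo "$first_only"
echo "$all"
```

In both cases sed writes the edited text to the screen and leaves its input unchanged; to keep the result, redirect the output to a new file with >.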
You can pair sed with regular expressions. For example, let's say we want to replace a few of the run accessions, those ending with a "17", "18", or "19", with "Unknown".
sed 's/SRR197291./Unknown/' awk_example.txt
The . in this context means any single character.
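Because . matches any character, the pattern above would also change an accession ending in, say, "16" if one were present. A bracket expression such as [789] restricts the match to the listed characters. The sketch below compares the two on a few made-up lines (not the contents of awk_example.txt).

```shell
demo_txt=$(mktemp)
printf 'SRR1972916 a\nSRR1972917 b\nSRR1972918 c\nSRR1972919 d\n' > "$demo_txt"

# "." matches any single character, so all four accessions are replaced.
dot=$(sed 's/SRR197291./Unknown/' "$demo_txt" | grep -c 'Unknown')

# [789] matches only a 7, 8, or 9 in that position, so SRR1972916 is kept.
bracket=$(sed 's/SRR197291[789]/Unknown/' "$demo_txt" | grep -c 'Unknown')

echo "lines changed with '.':    $dot"
echo "lines changed with [789]:  $bracket"
```

Which form is appropriate depends on which accessions are actually in the file, so it is worth checking the input before choosing the looser pattern.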
For more information, see https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/. Also, check out these handy one-liners and their explanations.
Help Session
For this help session, you will be downloading the Golden Snidget data. Practice materials are located here.