
07. How much data in the SRA copy

This page uses content directly from the Biostar Handbook by Istvan Albert.

Review:

* cd
* mkdir
* curl
* tar
* cat
* grep
* wc
* outputting data
* piping data from one command to another
* cut

Learn:

* du
* pip
* csvkit
* datamash

In this session, we are going to:

1. Download SRA run information (runinfo) in CSV format
2. Decompress the data with "tar"
3. Look at the data in various ways, including:
    * check the sra directory size with "du"
    * get the number of SRR runs using "grep" and "wc"
    * count the number of bases in the data with "datamash"
    * tabulate the data: number of runs per machine, per machine model, and per organism
4. Along the way, we will use "pip" to install the "csvkit" software and create a temporary alias called "tabulate"

Always remember to activate the bioinformatics environment first.

conda activate bioinfo

Let's create a new directory for this data, inside our biostar_class directory.

cd biostar_class
mkdir sra_runs
cd sra_runs

How much data is in the NCBI SRA (Sequence Read Archive)?

First, we will obtain the SRA runinfo data from the Biostar Handbook website.

curl http://data.biostarhandbook.com/sra/sra-runinfo-2019-01.tar.gz --output sra-runinfo-2019-01.tar.gz
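The archive is fairly large, so the download may take a little while. You can confirm that the file arrived and check its size with:

ls -lh sra-runinfo-2019-01.tar.gz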

Now we can unpack the data.

tar xzvf sra-runinfo-2019-01.tar.gz

Remember the flags for tar? What do they mean? (A tip on previewing an archive follows the list below.)

-x extract to disk from the archive

-z filter the archive through gzip (here, decompressing it as it is extracted)

-v produce verbose output

-f read the archive from or write the archive to the specified file
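As promised, here is the preview tip: swapping -x for -t (list the table of contents) lets you inspect an archive without unpacking it:

tar tzf sra-runinfo-2019-01.tar.gz | head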

Now check the size of the unpacked directory.

du -hs sra

Here is another new Unix command, "du". This command "displays disk usage statistics". How would you find out what "du" does and what the flags/options are for working with it?
man du

-h produce human-readable output

-s display only a single entry (the total) for each specified file or directory

The sra directory was produced when we did the "tar" extraction. How big is it?

2.7G sra
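To see how that space is distributed, drop the -s flag and "du" will report each subdirectory separately (adding -a includes individual files as well):

du -h sra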
Let's find out how many runs there were by pulling out the lines that contain SRR run numbers. From the sra_runs directory, run:

cat sra/*.csv | grep SRR > allruns.csv

What is this command line doing?

We use "cat" to open and read all of the ".csv" files in the "sra" directory (* is a wildcard). We pipe that output into the "grep" command to search for the pattern "SRR", and the matching lines are placed in the file "allruns.csv".

Alternatively, you could "cd" into "sra" and run the command from there:

cd sra
cat *.csv | grep SRR > allruns.csv

How many runs are there?

cat allruns.csv | wc -l

Use the "cat" command to open and read "allruns.csv" and pipe that output into "wc" (word count) with the "-l" flag to count only the lines.

This gives us

4373757
So there are 4373757 SRR runs in this data.
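As a sanity check, you can get the same count without creating the intermediate file, because grep can count matching lines itself with its -c flag:

cat sra/*.csv | grep -c SRR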


How to extract columns from comma-separated files.

We will install a new tool, "csvkit", for working with comma-separated files. Its "csvcut" command is careful to cut only at the commas that act as delimiters, not at commas that are part of a text value. First we need to install "csvkit".

pip install csvkit

What is "pip"?

pip --help
shows
Commands: 
install Install packages
etc.
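Once the install finishes, you can verify that the package is present; "pip show" reports the installed version and location:

pip show csvkit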
csvkit is a package for working with comma-separated values (CSV) files. It ensures that when we "cut" out columns, we split only at the commas that delimit values, not at commas that happen to be part of a text entry. For example, we'd want to cut:

value_one, value_two, value_three, value_four, value_five

at the commas, since they are the placeholders between the values, but we would not want to cut inside text like these two lines (a quick demonstration follows them):

value_one is red,blue,green
value_two is purple,orange,yellow
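You can see the difference on a single made-up CSV line (the values here are purely illustrative). Plain "cut" splits at every comma, while "csvcut" respects the quotes around a field:

echo 'name,"red,blue,green",size' | cut -d , -f 2
echo 'name,"red,blue,green",size' | csvcut -c 2

The first command should print only "red (the field is broken apart), while the second returns the whole quoted field.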
Now, a real example that combines csvcut with datamash:

cat allruns.csv | csvcut -c 5 | datamash sum 1

Open and read "allruns.csv" with the "cat" command, pipe that output into "csvcut" to cut out column 5 (the number of bases in each run), and use "datamash" to perform a command-line numeric operation on that column. Here "sum 1" adds up the values in field 1 of its input, which are the column 5 values. This gives:

2.25739301947e+16
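That is roughly 2.3 x 10^16 bases sequenced across all runs. datamash can also compute several statistics in one pass; for example, to get the mean and the largest single run alongside the sum (the operation names below are standard datamash ones):

cat allruns.csv | csvcut -c 5 | datamash sum 1 mean 1 max 1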


Creating an alias.

We will create an alias for the following command pipeline.

sort | uniq -c | sort -rn

The "sort" utility sorts text and binary files by lines.

The "uniq" utility reads input, compares adjacent lines and writes a copy of each UNIQUE input line to the output file.

uniq -c : precede each output line with the number of times the line occurred in the input, followed by a single space

sort -r : sort in reverse order

sort -n : sort numerically

Sort the input so duplicate lines become adjacent, count each distinct line with "uniq -c", then sort the counts numerically in reverse order (highest count first).

alias tabulate='sort | uniq -c | sort -rn'
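You can try the new alias on a tiny hand-made input; the most frequent line should come out on top with its count (aliases like this work in an interactive shell):

printf 'human\nmouse\nhuman\n' | tabulate

This should print "2 human" above "1 mouse" (the counts are padded with spaces).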
Let's see all the aliases we have set previously.

alias

shows
alias cp='cp -i'
alias ls='ls -hG'
alias mv='mv -i'
alias rm='rm -i'


Tabulating data in the SRA.

So, if column 29 is the organism, how would you tabulate all organisms in the SRA? What about the number of runs per sequencing machine (column 19) or per machine model (column 20)?

cat allruns.csv | csvcut -c 29 | tabulate | head
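The same pattern answers the other two questions; assuming, as above, that column 19 holds the machine (platform) and column 20 the model:

cat allruns.csv | csvcut -c 19 | tabulate | head
cat allruns.csv | csvcut -c 20 | tabulate | head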