07. How much data is in the SRA?
This page uses content directly from the Biostar Handbook by Istvan Albert.
Review:
* cd
* mkdir
* curl
* tar
* cat
* grep
* wc
* outputting data
* piping data from one command to another
* cut

Learn:
* du
* pip
* csvkit
* datamash
In this session, we are going to:
1. Download SRA data (runinfo) in CSV format
2. Decompress the data with "tar"
3. Look at the data in various ways, including:
   * check the sra directory size with "du"
   * get the number of SRR runs using "grep" and "wc"
   * count the number of bases in the data with "datamash"
   * tabulate the data: number of runs per machine, per machine model, and per organism
4. Along the way we will use "pip" to install the "csvkit" software, and create a temporary alias called "tabulate"
Always remember to activate the bioinformatics environment first.
conda activate bioinfo
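If you are not sure the environment is active, you can list your conda environments; the active one is marked with an asterisk:
conda env list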
Let's create a new directory for this data, inside our biostar_class directory.
cd biostar_class
mkdir sra_runs
cd sra_runs
How much data is in the NCBI SRA (Sequence Read Archive)?
First we will obtain the SRA runinfo data from the Biostar Handbook website.
curl http://data.biostarhandbook.com/sra/sra-runinfo-2019-01.tar.gz --output sra-runinfo-2019-01.tar.gz
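Before unpacking, it is worth checking that the download completed. A quick look at the file (the exact size will depend on the archive):
ls -lh sra-runinfo-2019-01.tar.gz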
Now we can unpack the data.
tar xzvf sra-runinfo-2019-01.tar.gz
-x extract files to disk from the archive
-z filter the archive through gzip (here, decompress it)
-v produce verbose output
-f read the archive from or write the archive to the specified file
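If you would rather peek inside the archive before extracting it, the -t flag lists the contents instead of unpacking them. For example:
tar tzf sra-runinfo-2019-01.tar.gz | head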
Now check the size of the unpacked directory.
du -hs sra
man du
-h produce human-readable output (sizes reported in K, M, G)
-s display only a summary total for each specified file or directory, rather than an entry for everything inside it
The sra directory was produced when we did the "tar" extraction. How big is it?
2.7G sra
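To see where that space goes, drop the -s flag and add -a so that "du" reports every file rather than a single summary (the output is long, so we pipe it into "head"):
du -ah sra | head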
cat sra/*.csv | grep SRR > allruns.csv
We use "cat" to open and read all (* is a wildcard) of the ".csv" files in the "sra" directory. We pipe that output into the "grep" command and search for the pattern "SRR". Output is placed in the file "allruns.csv".
or...you could "cd" to "sra" and run the command from there.
cd sra
cat *.csv | grep SRR > allruns.csv
cat allruns.csv | wc -l
This gives us the number of SRR runs:
4373757
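As an aside, "grep" can count matching lines itself with -c, so the same number can be obtained without the intermediate file:
cat sra/*.csv | grep -c SRR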
How to extract columns from comma-separated files.
We will download a new tool, "csvkit", for working with comma-separated files. This tool is careful to cut only at the commas used as field delimiters, and not at commas that are part of the text. First we need to install "csvkit".
pip install csvkit
What is "pip"?
pip --help
Commands:
install Install packages
etc.
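To confirm that the installation worked, "pip" can report what it installed:
pip show csvkit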
Why is that care necessary? Suppose a file has this header row:
value_one,value_two,value_three,value_four,value_five
and value_one holds "red,blue,green" while value_two holds "purple,orange,yellow". The values themselves contain commas. In a well-formed CSV file such values are protected by quotes, and "csvcut" splits only at the real delimiters, whereas a plain "cut -d ," would split inside the quoted values.
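Here is a tiny demonstration of the difference, using a throwaway file (demo.csv is a made-up example, not part of the SRA data):
printf '%s\n' 'colors,shapes' '"red,blue,green",circle' > demo.csv
cut -d , -f 2 demo.csv
csvcut -c 2 demo.csv
The plain "cut" splits inside the quotes and prints "blue" for the second line, while "csvcut" correctly prints "circle".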
cat allruns.csv | csvcut -c 5 | datamash sum 1
Open and read "allruns.csv" with the "cat" command, pipe that output into "csvcut" to cut out column 5 (the number of bases in each run), and use "datamash" to perform a command-line numeric operation on that column. Here we "sum" the values of field 1, which are the column 5 values. This gives the answer:
2.25739301947e+16
That is about 2.26 × 10^16 bases, or roughly 22.6 petabases.
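"datamash" can compute several statistics in one pass. For example, still assuming column 5 holds the base count of each run, the following sketch would report the total, the mean, and the largest run:
cat allruns.csv | csvcut -c 5 | datamash sum 1 mean 1 max 1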
Creating an alias.
We will create an alias for this command.
sort | uniq -c | sort -rn
The "uniq" utility reads input, compares adjacent lines and writes a copy of each UNIQUE input line to the output file.
uniq -c, precede each output line with the number of times the line occurred in the input, followed by a single space
sort -r , sort in reverse order
sort -n , sort numerically
Sort the input, count how many times each unique line occurs, then sort again numerically in reverse order (highest count first).
alias tabulate='sort | uniq -c | sort -rn'
alias
alias cp='cp -i'
alias ls='ls -hG'
alias mv='mv -i'
alias rm='rm -i'
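A quick sanity check of the new alias on toy input:
printf '%s\n' apple banana apple apple banana | tabulate
This should print "3 apple" followed by "2 banana".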
Tabulating data in the SRA.
So, if column 29 is the organism, how would you tabulate all organisms in the SRA? What about the number of runs per sequencing platform (column 19) or per instrument model (column 20)?
cat allruns.csv | csvcut -c 29 | tabulate | head
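The same pattern answers the other two questions; just swap in the column of interest:
cat allruns.csv | csvcut -c 19 | tabulate | head
cat allruns.csv | csvcut -c 20 | tabulate | head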