06. Downloading sequence data from the SRA (Sequence Read Archive) at NCBI
This page uses content directly from the Biostar Handbook by Istvan Albert.
Review:
* downloading data from SRA
* decompressing tar files
* e-utilities
* fastq-dump
Learn:
* sra-stat
* XML format
* automating SRA downloads
* working with comma-separated values (CSV) format
* cut
* parallel
In this session, we are going to:
1. Download data from the SRA with "fastq-dump"
   * split files into forward and reverse reads
   * download part, not all, of the data
2. Look at XML-formatted data with "sra-stat"
3. Use "e-utilities" ("esearch" and "efetch") to get "runinfo" CSV-formatted data
4. Work with CSV-formatted data using the "cut" command to isolate columns
5. Learn to automate data retrieval with the "parallel" command
Always remember to set the bioinformatics environment with every new Terminal window.
conda activate bioinfo
We'll be downloading data from the NCBI SRA, so let's create a new folder for this data within our "biostar_class" directory.
cd biostar_class
mkdir sra_data
cd sra_data
What is the SRA?
The SRA (Sequence Read Archive) at NCBI is a large, public database of DNA sequencing data. The repository holds "short reads" generated by high-throughput next-generation sequencing, usually less than 1,000 bp.
We will download data from the SRA using the command line package "sratoolkit", although you could also download from the web site.
Helpful documentation for the SRA. Look at each of these:
1. SRA handbook at NCBI
2. SRA Toolkit Documentation (NCBI)
3. GitHub SRA Toolkit
We will be using the tool "fastq-dump" to download FASTQ-formatted data. This command retrieves the data in SRA format and converts it to FASTQ format.
fastq-dump SRR1553607
SRR1553607.fastq
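A FASTQ record is always four lines (header, sequence, separator, qualities), so a quick sanity check after a download is to count lines and divide by four. A minimal sketch using a tiny stand-in file (the real file would be SRR1553607.fastq):

```shell
# Build a tiny two-read FASTQ file to stand in for a real download (hypothetical data)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > demo.fastq

# Every FASTQ record is exactly four lines, so reads = lines / 4
reads=$(( $(wc -l < demo.fastq) / 4 ))
echo "reads: $reads"
```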
What are "spots"?
"Spot" is a legacy term referring to a location on the flow cell of Illumina sequencers. For our purposes, we can think of "spots" as "reads". When downloading spots, always split them into the original files using:
--split-files
For paired-end reads, we need to separate the data into two different files like this:
fastq-dump --split-files SRR1553607
SRR1553607_1.fastq
SRR1553607_2.fastq
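With paired-end data the two files must stay in sync: read N in the _1 file is the mate of read N in the _2 file, so both files must contain the same number of records. A quick check, sketched with tiny hypothetical stand-in files:

```shell
# Tiny stand-in mate files (hypothetical one-pair data)
printf '@r1/1\nACGT\n+\nIIII\n' > demo_1.fastq
printf '@r1/2\nCCAA\n+\nIIII\n' > demo_2.fastq

# Both files must hold the same number of four-line records
n1=$(( $(wc -l < demo_1.fastq) / 4 ))
n2=$(( $(wc -l < demo_2.fastq) / 4 ))
[ "$n1" -eq "$n2" ] && echo "mates in sync: $n1 pair(s)"
```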
To download only part of the data, add the "-X" flag, which limits the number of spots retrieved. This downloads just the first 10,000:
fastq-dump --split-files -X 10000 SRR1553607
To generate an XML report on a data set showing the "spot_count" (read count), the "base_count" (number of bases), and the size of the data file:
$ sra-stat --xml --quick SRR1553610
<Run accession="SRR1553610" spot_count="219837" base_count="44407074" base_count_bio="44407074" cmp_base_count="44407074">
<Member member_name="A" spot_count="219837" base_count="44407074" base_count_bio="44407074" cmp_base_count="44407074" library="NM042.3_r1" sample="NM042.3_r1"/>
<Size value="24940700" units="bytes"/>
<AlignInfo>
</AlignInfo>
<QualityCount>
<Quality value="2" count="10625034"/>
<Quality value="5" count="32319"/>
<Quality value="6" count="19302"/>
<Quality value="7" count="59898"/>
<Quality value="8" count="64340"/>
<Quality value="9" count="16649"/>
<Quality value="10" count="69974"/>
<Quality value="11" count="16943"/>
<Quality value="12" count="32543"/>
<Quality value="13" count="17576"/>
<Quality value="14" count="15896"/>
<Quality value="15" count="104797"/>
<Quality value="16" count="46497"/>
<Quality value="17" count="71027"/>
<Quality value="18" count="93656"/>
<Quality value="19" count="59225"/>
<Quality value="20" count="208389"/>
<Quality value="21" count="53511"/>
<Quality value="22" count="99178"/>
<Quality value="23" count="211361"/>
<Quality value="24" count="221444"/>
<Quality value="25" count="219366"/>
<Quality value="26" count="276627"/>
<Quality value="27" count="408456"/>
<Quality value="28" count="178196"/>
<Quality value="29" count="453004"/>
<Quality value="30" count="643473"/>
<Quality value="31" count="1195279"/>
<Quality value="32" count="662412"/>
<Quality value="33" count="2083024"/>
<Quality value="34" count="2018065"/>
<Quality value="35" count="8653053"/>
<Quality value="36" count="1237497"/>
<Quality value="37" count="2701757"/>
<Quality value="38" count="1264232"/>
<Quality value="39" count="2809650"/>
<Quality value="40" count="2295168"/>
<Quality value="41" count="5168256"/>
</QualityCount>
<Databases>
<Database>
<Table name="SEQUENCE">
<Statistics source="meta">
<Rows count="219837"/>
<Elements count="44407074"/>
</Statistics>
</Table>
</Database>
</Databases>
</Run>
The key values in this report are:
spot_count="219837" base_count="44407074"
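For quick one-off checks you can pull an attribute value out of the sra-stat XML with ordinary text tools. A sed sketch (this is a text-matching shortcut, not real XML parsing; use a proper XML parser for anything more involved):

```shell
# A one-line stand-in for the <Run> element shown above
xml='<Run accession="SRR1553610" spot_count="219837" base_count="44407074">'

# Extract the spot_count attribute value with sed (text matching, not XML parsing)
spots=$(printf '%s\n' "$xml" | sed -n 's/.*spot_count="\([0-9]*\)".*/\1/p')
echo "spot_count: $spots"
```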
Automating downloads of multiple SRA runs
From the publication REDO: Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak
First we get the BioProject accession number (PRJNA) from the publication: PRJNA257197
Next we're going to query the "sra" db using "esearch" from the "e-utilities" package.
esearch -db sra -query PRJNA257197
which gets us:
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>NCID_1_12080050_130.14.18.97_9001_1588955749_413109598_0MetA0_S_MegaStore</WebEnv>
<QueryKey>1</QueryKey>
<Count>891</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
Now we will use "efetch", also from the "e-utilities" package.
esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
There is something new in this command line: the "runinfo" format, which gets us data in CSV (comma-separated values) format.
Open MS Excel, and navigate to your "biostar_class/sra_data" folder to find the data.
We will work with the .csv file to pull out some info for our next query. To isolate the first column of the file:
cat runinfo.csv | cut -f 1 -d ',' | head
We are opening the file with "cat", piping it to the "cut" command, and then piping that to "head". So what is "cut"? It is a Unix command, so we can look at more info with the "man" command.
man cut
Running the cat command above, we get:
Run
SRR1972917
SRR1972918
SRR1972919
SRR1972920
SRR1972921
SRR1972922
SRR1972923
SRR1972924
SRR1972925
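The behavior of "cut" is easy to try out on a toy CSV file. A sketch with hypothetical run ids and fields standing in for runinfo.csv:

```shell
# A three-line stand-in for runinfo.csv (hypothetical values)
printf 'Run,ReleaseDate,spots\nSRRX000001,2014-01-01,10000\nSRRX000002,2014-01-02,20000\n' > demo.csv

# -d ',' sets the field delimiter; -f 1 keeps only the first field
cut -f 1 -d ',' demo.csv
```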
Now we can filter for lines that match SRR and store the first ten run ids in a file.
cat runinfo.csv | cut -f 1 -d',' | grep SRR | head > runids.txt
SRR1972917
SRR1972918
SRR1972919
SRR1972920
SRR1972921
SRR1972922
SRR1972923
SRR1972924
SRR1972925
SRR1972926
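The whole cut-grep-head pipeline can also be rehearsed on toy data before running it on the real runinfo.csv. A sketch with hypothetical run ids:

```shell
# A toy runinfo-style CSV (hypothetical run ids)
printf 'Run,spots\nSRRX000001,100\nSRRX000002,200\nSRRX000003,300\n' > demo_runinfo.csv

# Same pipeline as above: take the first column, keep lines matching SRR,
# then keep only the first two ids
cat demo_runinfo.csv | cut -f 1 -d ',' | grep SRR | head -2 > demo_runids.txt
cat demo_runids.txt
```

Note that "grep SRR" conveniently drops the "Run" header line, since it does not match.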
We could now download each run one command at a time:
fastq-dump -X 10000 --split-files SRR1972917
fastq-dump -X 10000 --split-files SRR1972918
fastq-dump -X 10000 --split-files SRR1972919
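Typing one fastq-dump per run quickly gets tedious. Before reaching for "parallel", note that a plain shell loop already automates this; a sketch with hypothetical run ids, and with echo standing in for the real fastq-dump call:

```shell
# Hypothetical run ids standing in for runids.txt
printf 'SRRX000001\nSRRX000002\n' > demo_ids.txt

# Loop over the ids; echo stands in for the real fastq-dump command
while read -r id; do
    echo "fastq-dump -X 10000 --split-files $id"
done < demo_ids.txt
```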
Using the "parallel" tool for automating downloads
The "parallel" tool executes commands in parallel, by default running one job per CPU core in your system. See Tool: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them
cat runids.txt | parallel fastq-dump -X 10000 --split-files {}
What's up with the curly braces {}?
The curly braces are a placeholder that "parallel" replaces with each input line. Appending the argument is the default behavior, so this command gets the same result:
cat runids.txt | parallel fastq-dump -X 10000 --split-files
The real power of the "parallel" command comes when using replacement strings such as "{.}", "{/}" and "{//}". Suppose the file "filelist" contains:
/dir/subdir/file.txt
/other/list.csv
/raw/seqs.fastq
"{.}" removes the file extension:
cat filelist | parallel echo {} "without extension" {.}
/dir/subdir/file.txt without extension /dir/subdir/file
/other/list.csv without extension /other/list
/raw/seqs.fastq without extension /raw/seqs
"{/}" removes the path:
cat filelist | parallel echo {} "without path" {/}
/dir/subdir/file.txt without path file.txt
/other/list.csv without path list.csv
/raw/seqs.fastq without path seqs.fastq
"{//}" keeps only the directory (root) part:
cat filelist | parallel echo {} "with root" {//}
/dir/subdir/file.txt with root /dir/subdir
/other/list.csv with root /other
/raw/seqs.fastq with root /raw
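GNU parallel may not be installed on every machine, but the same transformations can be reproduced with plain shell parameter expansion, which helps demystify what "{.}", "{/}" and "{//}" are doing:

```shell
p=/dir/subdir/file.txt

# Shell equivalents of parallel's replacement strings
echo "without extension: ${p%.*}"   # like {.}  : strip the extension
echo "without path: ${p##*/}"       # like {/}  : strip the directory part
echo "with root: ${p%/*}"           # like {//} : keep only the directory part
```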