
06. Downloading sequence data from the SRA (Sequence Read Archive) at NCBI

This page uses content directly from the Biostar Handbook by Istvan Albert.

Review:
* downloading data from SRA
* decompressing tar files
* e-utilities
* fastq-dump

Learn:
* sra-stat
* XML format
* automating SRA downloads
* working with comma-separated values (CSV) format
* cut
* parallel

In this session, we are going to:
1. Download data from the SRA with "fastq-dump"
    * split files into forward and reverse reads
    * download part, not all, of the data
2. Look at XML-formatted data with "sra-stat"
3. Use "e-utilities" ("esearch" and "efetch") to get "runinfo" CSV-formatted data
4. Work with CSV-formatted data using the "cut" command to isolate columns
5. Learn to automate data retrieval with the "parallel" command

Always remember to set the bioinformatics environment with every new Terminal window.

conda activate bioinfo

We'll be downloading data from the NCBI SRA, so let's create a new folder for this data within our "biostar_class" directory.

cd biostar_class
mkdir sra_data
cd sra_data

What is the SRA?

The SRA (Sequence Read Archive) at NCBI is a large, public database of DNA sequencing data. The repository holds "short reads" generated by high-throughput next-generation sequencing, usually less than 1,000 bp.

We will download data from the SRA using the command line package "sratoolkit", although you could also download from the web site.

Helpful documentation for the SRA. Look at each of these:
1. SRA handbook at NCBI
2. SRA Toolkit documentation (NCBI)
3. GitHub SRA Toolkit

We will be using the tool "fastq-dump" to download FASTQ-formatted data. This command downloads the data in SRA format, and converts it to FASTQ format.

fastq-dump SRR1553607
creates the file:
SRR1553607.fastq
Check the file to make sure it is fastq format. How would you do this? What is FASTQ format?
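One quick way to check is to peek at the first record and count the reads. The sketch below uses a tiny made-up FASTQ file ("example.fastq") so the commands can be tried anywhere; run the same "head" and counting commands on SRR1553607.fastq for the real check.

```shell
# A FASTQ record is exactly 4 lines: @header, sequence, '+', qualities.
# Make a tiny example file (a stand-in for SRR1553607.fastq):
printf '@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n' > example.fastq

# Peek at the first record:
head -4 example.fastq

# Count the reads: total line count divided by 4.
echo $(( $(wc -l < example.fastq) / 4 ))
```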


What are "spots"?

Spots are a legacy term referring to locations on the flow cell for Illumina sequencers. For our purposes, we can think of "spots" as "reads". When downloading "spots", always split the spots into original files using:

--split-files

For paired-end reads, we need to separate the data into two different files like this:

fastq-dump --split-files SRR1553607
which creates
SRR1553607_1.fastq
SRR1553607_2.fastq
The fastq-dump command first downloads the data in SRA format, then converts it to FASTQ. If we want to work with a subset of the data, for example the first 10,000 reads:
fastq-dump --split-files -X 10000 SRR1553607

To generate an XML report on a data set showing the read ("spot") count, the base count ("base_count"), and the size of the data file:

sra-stat --xml --quick SRR1553610
which yields this data in XML format:
<Run accession="SRR1553610" spot_count="219837" base_count="44407074" base_count_bio="44407074" cmp_base_count="44407074">
  <Member member_name="A" spot_count="219837" base_count="44407074" base_count_bio="44407074" cmp_base_count="44407074" library="NM042.3_r1" sample="NM042.3_r1"/>
  <Size value="24940700" units="bytes"/>
  <AlignInfo>
  </AlignInfo>
  <QualityCount>
    <Quality value="2" count="10625034"/>
    <Quality value="5" count="32319"/>
    <Quality value="6" count="19302"/>
    <Quality value="7" count="59898"/>
    <Quality value="8" count="64340"/>
    <Quality value="9" count="16649"/>
    <Quality value="10" count="69974"/>
    <Quality value="11" count="16943"/>
    <Quality value="12" count="32543"/>
    <Quality value="13" count="17576"/>
    <Quality value="14" count="15896"/>
    <Quality value="15" count="104797"/>
    <Quality value="16" count="46497"/>
    <Quality value="17" count="71027"/>
    <Quality value="18" count="93656"/>
    <Quality value="19" count="59225"/>
    <Quality value="20" count="208389"/>
    <Quality value="21" count="53511"/>
    <Quality value="22" count="99178"/>
    <Quality value="23" count="211361"/>
    <Quality value="24" count="221444"/>
    <Quality value="25" count="219366"/>
    <Quality value="26" count="276627"/>
    <Quality value="27" count="408456"/>
    <Quality value="28" count="178196"/>
    <Quality value="29" count="453004"/>
    <Quality value="30" count="643473"/>
    <Quality value="31" count="1195279"/>
    <Quality value="32" count="662412"/>
    <Quality value="33" count="2083024"/>
    <Quality value="34" count="2018065"/>
    <Quality value="35" count="8653053"/>
    <Quality value="36" count="1237497"/>
    <Quality value="37" count="2701757"/>
    <Quality value="38" count="1264232"/>
    <Quality value="39" count="2809650"/>
    <Quality value="40" count="2295168"/>
    <Quality value="41" count="5168256"/>
  </QualityCount>
  <Databases>
    <Database>
      <Table name="SEQUENCE">
        <Statistics source="meta">
          <Rows count="219837"/>
          <Elements count="44407074"/>
        </Statistics>
      </Table>
    </Database>
  </Databases>
</Run>
where we can see the spot count, the base count, and the number of reads corresponding to quality scores.
spot_count="219837" base_count="44407074"
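If you only want the counts and not the whole report, one possible approach (a sketch, not part of the lesson's required commands) is to pull the attributes out of the XML with "grep -o". Here the first line of the report above is saved to a file as a stand-in for the live sra-stat output:

```shell
# Stand-in for: sra-stat --xml --quick SRR1553610 > run.xml
echo '<Run accession="SRR1553610" spot_count="219837" base_count="44407074" base_count_bio="44407074" cmp_base_count="44407074">' > run.xml

# -o prints only the matching text. head -1 keeps the first match,
# since the base_count pattern also matches inside cmp_base_count.
grep -o 'spot_count="[0-9]*"' run.xml
grep -o 'base_count="[0-9]*"' run.xml | head -1
```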
Side bar: XML (Extensible Markup Language) is a document format that is both human- and machine-readable. XML aims for "simplicity, generality and usability across the Internet".


Automating downloads of multiple SRA runs

From the publication REDO: Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak

First we get the BioProject accession (PRJNA) number from the publication: PRJNA257197

Next we're going to query the "sra" db using "esearch" from the "e-utilities" package.

esearch -db sra -query PRJNA257197

which gets us:

<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>NCID_1_12080050_130.14.18.97_9001_1588955749_413109598_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>891</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
This is XML format, and the <Count> element shows there are 891 sequencing runs.

Now we will use "efetch", also from the "e-utilities" package.

esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

There is something new in this command line: the "runinfo" format, which gets us data in CSV (comma-separated values) format.

Open MS Excel and navigate to your "biostar_class/sra_data" folder to find the data.
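You can also inspect the file without leaving the Terminal. The sketch below prints the header row one field per line, using a tiny stand-in file ("mini.csv") with a few assumed column names; on the real file, run the same pipeline on runinfo.csv.

```shell
# A tiny stand-in for runinfo.csv (the real file has many more columns).
printf 'Run,ReleaseDate,spots,bases\nSRR1972917,2015-04-14,10000,2020000\n' > mini.csv

# Print the header row one field per line so you can see
# which field number each column is.
head -1 mini.csv | tr ',' '\n'
```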


We will modify the .csv file to pull out some info for our next query. To extract the first column of the file:

cat runinfo.csv | cut -f 1 -d ',' | head

We are opening the file with "cat", piping it to the "cut" command, and then piping that to "head". So what is "cut"? It is a Unix command, so we can look at more info with the "man" command.

man cut
There we find that "cut" is used to cut out selected portions of each line of a file. The "-f" option specifies which field to cut (here, the first field), and "-d" tells the program which delimiter to use. We know this is a comma-separated values file, so we use "-d ','" to specify the comma.

Running the cat command above, we get:

Run
SRR1972917
SRR1972918
SRR1972919
SRR1972920
SRR1972921
SRR1972922
SRR1972923
SRR1972924
SRR1972925

Now we can filter for lines that match SRR and store the first ten run ids in a file.

cat runinfo.csv | cut -f 1 -d',' | grep SRR | head > runids.txt
and get
SRR1972917
SRR1972918
SRR1972919
SRR1972920
SRR1972921
SRR1972922
SRR1972923
SRR1972924
SRR1972925
SRR1972926
If we want to retrieve files for each SRR number, we can use the fastq-dump command like this:
fastq-dump -X 10000 --split-files SRR1972917
fastq-dump -X 10000 --split-files SRR1972918
fastq-dump -X 10000 --split-files SRR1972919
Or, we could use the "parallel" tool to automate the downloads.

Using the "parallel" tool for automating downloads

The "parallel" tool executes commands in "parallel", one for each CPU core in your system. See Tool: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

cat runids.txt | parallel fastq-dump -X 10000 --split-files {}
This command gives us twenty files: two files for each of the 10 runs. It takes the place of the multiple "fastq-dump" commands we would otherwise need to run.
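A quick sanity check is to count the downloaded files, expecting two per run. The sketch below creates empty placeholder files so the counting command can be tried without downloading anything; after the real "parallel" run, count the actual FASTQ files the same way.

```shell
# Placeholder files standing in for three downloaded runs:
for i in SRR1972917 SRR1972918 SRR1972919; do
    touch "${i}_1.fastq" "${i}_2.fastq"
done

# Count the files: two per run (expect 20 after downloading all 10 runs).
ls SRR19729*_[12].fastq | wc -l
```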

What's up with the curly braces {}?

The curly braces {} are the placeholder for the input argument; if they are omitted, "parallel" appends the argument to the end of the command by default, so this gets the same result.

cat runids.txt | parallel fastq-dump -X 10000 --split-files

The real power of the "parallel" command comes when using something between the braces, such as "{.}", "{/}" and "{//}". Suppose we have a file "filelist" with these contents:

less filelist

/dir/subdir/file.txt
/other/list.csv
/raw/seqs.fastq

"{.}" is the input with its file extension removed:

cat filelist | parallel echo {} "without extension" {.}

/dir/subdir/file.txt without extension /dir/subdir/file
/other/list.csv without extension /other/list
/raw/seqs.fastq without extension /raw/seqs

"{/}" is the input with its path removed:

cat filelist | parallel echo {} "without path" {/}

/dir/subdir/file.txt without path file.txt
/other/list.csv without path list.csv
/raw/seqs.fastq without path seqs.fastq

"{//}" is only the path (the "root") of the input:

cat filelist | parallel echo {} "with root" {//}

/dir/subdir/file.txt with root /dir/subdir
/other/list.csv with root /other
/raw/seqs.fastq with root /raw