03. Working with FASTQ and FASTA data copy
The bulk RNA-Seq test data we've been working with is in FASTQ format. We'd like to do a BLAST search on a couple of these sequences. Data must be in FASTA format to do BLAST. How can we transform FASTQ to FASTA format?
We will continue working in the directory we already created for this data.
cd biostar_class
cd RNA_Seq
ls
We can use the EMBOSS Seqret program. This tool is also available on the web where you can copy and paste sequences into a graphical user interface (GUI). For working with very large files however, we will want to work at the command line. To use the "seqret" tool, the command line must be in this format, where "-sequence" is followed by the name of the input sequence, and "-outseq" is followed by the name of the output file.
seqret -sequence reads.fastq -outseq reads.fasta
To see all the options available for the "seqret" program, just type this at the command line.
seqret --help
and to see even more...
seqret --help --verbose
In this example, "reads.fastq" is the input file and "reads.fasta" is the output file. Let's try this on some of our bulk RNA-Seq test data. First, we need to "gunzip" one of the files, as "seqret" can not work with compressed files. Let's go ahead and gunzip one of the smaller ".fastq.gz" files. Check the file sizes of the fastq files with
ls -l
Now do the "gunzip" step on one of the smaller files.
gunzip HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
Note the size of this file before (7.3 M) and after (30M) doing the "gunzip" command. We can look inside this file with the "less" command. By "piping" it through the head command again, we can see just the first eight lines of the file.
less HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq | head -n 8
to see the results (in .fastq format)
@HWI-ST718_146963544:8:1212:5958:93757/1
CTGATTTAACAGCAACACATTTGACTTCAATATTATGGCAAATGAGTCCAAAGTCATACTGCCCCTCCCGTTTGGTATTTTAAGTACAGTAATAGACAAA
+
CCCFFFFFHHHHHJJIIJJJJJJJJJJJJJJIIJIJJJJIJJJJJJIIJJJJJHHIJJJIIIJJJJJJJJGHJHHHEHFFFFFFEFDEEFECFEDDDDDD
@HWI-ST718_146963544:7:2308:7250:88065/1
CGTCGATGTATGCACTCATTATTAGATCCTCAGTATGTATGGTTTCAGCTATGAATGAAAGCATTGCCTCCTTTCTTGTTGACCTGAGTTTACTAAGTAA
+
@C@DFFFAFHGFHIIIIAHIIIIIIIIGIIEHGHHGG<GGACHDHGCHGGEIIIIIIIIIIFIIIIIIIIGIGIIHGIGIFFHHHHFF@DFBEEDCC@AC
Now we can run the "seqret" command to transform the .fastq file to .fasta
seqret -sequence HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq -outseq HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fasta
Let's peek inside our .fasta output file with cat, like we did with the compressed .fastq file. This time we can use the "cat" command, since the file is not compressed. (The "less" command would also work here.)
cat HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fasta | head -n 8
Now we can see the first eight lines of the .fasta file.
>HWI-ST718_146963544:8:1212:5958:93757/1
CTGATTTAACAGCAACACATTTGACTTCAATATTATGGCAAATGAGTCCAAAGTCATACT
GCCCCTCCCGTTTGGTATTTTAAGTACAGTAATAGACAAA
>HWI-ST718_146963544:7:2308:7250:88065/1
CGTCGATGTATGCACTCATTATTAGATCCTCAGTATGTATGGTTTCAGCTATGAATGAAA
GCATTGCCTCCTTTCTTGTTGACCTGAGTTTACTAAGTAA
>HWI-ST718_146963544:8:1210:13422:9826/1
AGTTGGGGTCCTAAGCCAGAAGTTAACTATGCTTCATATATTCTTGCAAGTAGAAGTACA
Another useful tool for working with FASTQ files is seqkit. There are many examples of how to use this tool in the Biostar handbook, we'll review a couple of them here. To see all the available commands for "seqkit"...
seqkit --help
Here is a selection of "seqkit" commands.
Usage:
seqkit [command]
Available Commands:
amplicon retrieve amplicon (or specific region around it) via primer(s)
bam monitoring and online histograms of BAM record features
common find common sequences of multiple files by id/name/sequence
concat concatenate sequences with same ID from multiple files
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
duplicate duplicate sequences N times
faidx create FASTA index file and extract subsequence
fish look for short sequences in larger sequences using local alignment
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (with length/GC content/GC skew)
genautocomplete generate shell autocompletion script
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
head print first N FASTA/Q records
help Help about any command
locate locate subsequences/motifs, mismatch allowed
mutate edit sequence (point mutation, insertion, deletion)
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
Let's retrieve the data as shown in the Biostar Handbook (Advanced FASTQ processing). We'll use the "curl" command instead of "wget", and specify an output file using "-o" (output).
curl http://data.biostarhandbook.com/reads/duplicated-reads.fq.gz -o duplicated-reads.fq.gz
Do this for the next two data sets. Here's something a little different, we're retrieving from an "ftp" site (ftp = file transfer protocol) instead of an "http" site.
curl ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.2.1.genomic.fna.gz -o genomic.fna.gz
curl ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.2.protein.faa.gz -o protein.faa.gz
Let's run the "stat" function to get some information about these files we've downloaded.
seqkit stat *.gz
Here are the results.
file format type num_seqs sum_len min_len avg_len max_len
duplicated-reads.fq.gz FASTQ DNA 15,000 1,515,000 101 101 101
genomic.fna.gz FASTA DNA 8,431 212,623,662 200 25,219.3 2,243,109
protein.faa.gz FASTA Protein 247,311 64,386,104 7 260.3 8,573
Here's how we can get the GC content of those files.
seqkit fx2tab --name --only-id --gc *.gz | head
And here are the results.
SRR1972739.1 30.69
SRR1972739.2 49.50
SRR1972739.3 41.58
SRR1972739.4 51.49
SRR1972739.5 48.51
SRR1972739.6 31.68
SRR1972739.7 38.61
SRR1972739.8 36.63
SRR1972739.9 40.59
SRR1972739.10 31.68