Skip to content

Lesson 12 Practice

Objectives

In this practice session, we will work with something new, which is a dataset from the Griffith lab RNA sequencing tutorial. Here, we will have a chance to practice what we have learned up until this point of the course series including

  • Creating of new directory
  • Copying of files
  • Assaying sequencing data quality
  • Trimming of sequencing data

Where is my data

This data set is from the Griffith lab RNA sequencing tutorial and are kept at http://genomedata.org/rnaseq-tutorial/practical.tar. This dataset is derived from a study that profiled the transcriptome of HCC1395 breast cancer cell line and the HCC1395BL lymphoblastoid line, which makes this a tumor versus normal transcriptome comparison -- Griffith lab.

To begin to work with this dataset, create a directory called hcc1395 in the ~/biostar_class folder and then change into it. See if you can do this.

Solution

Make sure we are in the ~/biostar_class folder

pwd

If not, then change into it

cd ~/biostar_class

Create the hcc1395 folder

mkdir hcc1395

Then change into the hcc1395 folder

cd hcc1395

Next, we will need to download the data, which lives at http://genomedata.org/rnaseq-tutorial/practical.tar. How do we do this? What is the format (ie. extension) of the downloaded file and how do we unpack it?

Solution

We can either use wget or curl to download

wget http://genomedata.org/rnaseq-tutorial/practical.tar

If using curl, remember to specify the output

curl http://genomedata.org/rnaseq-tutorial/practical.tar -o practical.tar

After downloading, list the directory content in the long view to make sure we successfully downloaded something.

ls -l 
total 355132
-rw-rw-r-- 1 joe joe 363653120 Oct 23  2018 practical.tar

We have downloaded an archive (tar) of the practice data. We can use the tar command to unpack.

tar -xvf practical.tar

List the directory again to see what was added.

ls -l

We have the fastq files in fastq.gz format (ie. they are gzipped but we are will not be unzipping these as our tools can work with the gzipped versions).

total 710268
-rw-rw-r-- 1 joe joe  25955505 Mar 18  2017 hcc1395_normal_rep1_r1.fastq.gz
-rw-rw-r-- 1 joe joe  30766759 Mar 18  2017 hcc1395_normal_rep1_r2.fastq.gz
-rw-rw-r-- 1 joe joe  25409781 Mar 18  2017 hcc1395_normal_rep2_r1.fastq.gz
-rw-rw-r-- 1 joe joe  30213083 Mar 18  2017 hcc1395_normal_rep2_r2.fastq.gz
-rw-rw-r-- 1 joe joe  25132378 Mar 18  2017 hcc1395_normal_rep3_r1.fastq.gz
-rw-rw-r-- 1 joe joe  30174637 Mar 18  2017 hcc1395_normal_rep3_r2.fastq.gz
-rw-rw-r-- 1 joe joe  30361801 Mar 18  2017 hcc1395_tumor_rep1_r1.fastq.gz
-rw-rw-r-- 1 joe joe  35887220 Mar 18  2017 hcc1395_tumor_rep1_r2.fastq.gz
-rw-rw-r-- 1 joe joe  29769613 Mar 18  2017 hcc1395_tumor_rep2_r1.fastq.gz
-rw-rw-r-- 1 joe joe  35254974 Mar 18  2017 hcc1395_tumor_rep2_r2.fastq.gz
-rw-rw-r-- 1 joe joe  29472281 Mar 18  2017 hcc1395_tumor_rep3_r1.fastq.gz
-rw-rw-r-- 1 joe joe  35241854 Mar 18  2017 hcc1395_tumor_rep3_r2.fastq.gz
-rw-rw-r-- 1 joe joe 363653120 Oct 23  2018 practical.tar

Getting to know the data

Can you print the first sequencing read (from header line to quality score line) in hcc1395_normal_rep1_r1.fastq.gz?

Solution

zcat hcc1395_normal_rep1_r1.fastq.gz | head -4

How many sequencing reads are in hcc1395_normal_rep1_r1.fastq.gz?

Solution

zcat hcc1395_normal_rep1_r1.fastq.gz | grep @ | wc -l

331958

## Assessing sequencing data quality

For this portion of the practice, create a folder called qc within ~/biostar_class/hcc1395 (which should be the present working directory) to store the FASTQC outputs.

Solution

mkdir qc
cd qc

Can you generate quality reports for the FASTQ files in this dataset?

Solution

We use -o to specify output directory in the FASTQC command below. Since we want the FASTQC output to be written in the present working directory (which is ~/biostar_class/hcc1395/qc) we can just use "." to denote this ("." means here in the current directory).

fastqc -o . ../hcc1395_normal*.fastq.gz ../hcc1395_tumor*.fastq.gz

How do you view the FASTQC reports?

Solution

Copy them to the ~/public directory

cp *.html ~/public

Examine the FASTQC reports. How is the quality of the sequencing data? Are there any adapter contaminations?

Solution

While some of the sequence quality scores were not great especially in the read 2 files, we definitely need to trim for adapters.

Can you merge the FASTQC files for this dataset into an MultiQC report?

Solution

multiqc --filename multiqc_report_hcc1395 .

Trimming

For this exercise, go back to the ~/biostar_class/hcc1395 folder and create a new directory called trimmed_data.

Solution

cd ~/biostar_class/hcc1395
mkdir trimmed_data
cd trimmed_data

The adapters for this dataset can be obtained at http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa so first download these.

Solution

wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa

or

curl http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa -o illumina_multiplex.fa

Let's trim using the FASTQ files for hcc1395_normal_rep1 using the following thresholds:

  • Input files are
    • hcc1395_normal_rep1_r1.fastq.gz
    • hcc1395_normal_rep1_r2.fastq.gz
  • Quality - use SLIDINGWINDOW of 4 reads and a quality threshold of 30
  • Adapter - use ILLUMINACLIP:illumina_multiplex.fa:2:30:5
  • Minimum sequence length to keep - MINLEN:50

Solution

trimmomatic PE ../hcc1395_normal_rep1_r1.fastq.gz ../hcc1395_normal_rep1_r2.fastq.gz hcc1395_normal_rep1_trimmed_r1.fastq.gz hcc1395_normal_rep1_trimmed_unpair_r1.fastq.gz hcc1395_normal_rep1_trimmed_r2.fastq.gz hcc1395_normal_rep1_trimmed_unpair_r2.fastq.gz SLIDINGWINDOW:4:30 ILLUMINACLIP:illumina_multiplex.fa:2:30:5 MINLEN:50

QC reports for hcc1395 dataset

Pre-trimming

FASTQC report for hcc1395_normal_rep1_r1.fastq.gz

FASTQC report for hcc1395_normal_rep1_r2.fastq.gz

FASTQC report for hcc1395_normal_rep2_r1.fastq.gz

FASTQC report for hcc1395_normal_rep2_r2.fastq.gz

FASTQC report for hcc1395_normal_rep3_r1.fastq.gz

FASTQC report for hcc1395_normal_rep3_r2.fastq.gz

FASTQC report for hcc1395_tumor_rep1_r1.fastq.gz

FASTQC report for hcc1395_tumor_rep1_r2.fastq.gz

FASTQC report for hcc1395_tumor_rep2_r1.fastq.gz

FASTQC report for hcc1395_tumor_rep2_r2.fastq.gz

FASTQC report for hcc1395_tumor_rep3_r1.fastq.gz

FASTQC report for hcc1395_tumor_rep3_r2.fastq.gz

MultiQC report for hcc1395

Post-trimming

FASTQC report for hcc1395_normal_rep1_trimmed_r1.fastq.gz

FASTQC report for hcc1395_normal_rep1_trimmed_r2.fastq.gz