Lesson 10: Quantifying Gene Expression from bulk RNA Sequencing Data

Lesson 9 Review

In Lesson 9, participants learned to align the hcc1395 sequencing data to reference genome in order to determine where in the genome each of the sequences came from. The next step is to quantify at each gene, how many sequences mapped to it, which can be used to approximate expression.

Learning objectives

After this lesson, participants will be able to:

Convert SAM files to machine readable BAM.
Quantify gene expression from alignment results.

Sign onto Biowulf

Before getting started, sign onto Biowulf. Remember to replace user with the participant's assigned Biowulf student ID.

ssh user@biowulf.nih.gov

Then, change into the /data/user/hcc1395_b4b folder.

cd /data/user/hcc1395_b4b

Next, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.

sinteractive --cpus-per-task 6 --mem=12gb --gres=lscratch:10

Converting SAM to BAM

SAM files are human readable. Many bioinformatics programs take the binary version of SAM files known as BAM for input. Below, SAMTOOLS will be used to convert the hcc1395 SAM alignment outputs to BAM.

module load samtools

Change into the hcc1395_hisat2 folder from /data/user/hcc1395_b4b and list the contents.

cd hcc1395_hisat2

ls

The following SAM alignment files should be present.

hcc1395_normal_rep1.sam
hcc1395_normal_rep2.sam
hcc1395_normal_rep3.sam
hcc1395_tumor_rep1.sam
hcc1395_tumor_rep2.sam
hcc1395_tumor_rep3.sam

The command construct below can be used to convert the SAM files to BAM and it can be explained as follows:

cat is used to print the hcc1395 sample IDs to the terminal. The sample IDs are stored in a text file called hcc1395_sample_ids.txt one folder up, so it's path can be specified as ../hcc1395_sample_ids.txt. The output from this cat command is sent via pipe (|) to parallel. The samtools command used to convert SAM to BAM is enclosed in parentheses. Within samtools:
sort will sort the alignment output by coordinate (see http://www.htslib.org/doc/samtools-sort.html for more information).
-o: prompts for the output file name, which has the .bam extension and prepended by the sample ID (specified using {} as a place holder) sent by cat.
{}.sam is the input SAM file. Again, {} is used as a place holder for the sample IDs sent by cat.

cat ../hcc1395_sample_ids.txt | parallel -j 6 "samtools sort -o {}.bam {}.sam"

The next step is to index the BAM files. Again, cat and parallel will be used to performing the indexing step for all BAM files at once. The index subcommand of samtools is used for indexing. The option -b states create an index with extension .bai. The input BAM file follows the -b option and the output .bai file name is specified next.

cat ../hcc1395_sample_ids.txt | parallel -j 6 "samtools index -b {}.bam {}.bam.bai"

Quantifying Gene Expression

Go back to /data/user/hcc1395_b4b.

cd /data/user/hcc1395_b4b

Create a directory called hcc1395_featurecounts.

mkdir hcc1395_featurecounts

Go back into /data/user/hcc1395_b4b/hcc1395_hisat2 for this exercise.

The module featureCounts from the subread package will be used to quantify gene expression. So before getting started do the following.

module load subread

In the featureCounts command below:

-p: specifies the presence of paired end data.
--countReadPairs: specifies to count both reads in a pair.
-a: prompt for path the gtf annotation file, which is locate one directory up from the current directory hcc1395_hisat2 (specified as ../) and in the references folder with file name 22.gtf, so the path would be ../references/22.gtf.
-g: indicates to report expression by gene name as reported in the gtf file.
-o: prompts for path to write the output, which is one directory up from the current of hcc1395_hisat2 and in hcc1395_featurecounts with assigned file name hcc1395_gene_expression.txt. So the path for the output is ../hcc1395_featurecounts/hcc1395_gene_expression.txt.
*.bam: tells featureCounts to count all of the bam alignment output files in hcc1395_hisat2.

featureCounts -p --countReadPairs -a ../references/22.gtf -g gene_name -o ../hcc1395_featurecounts/hcc1395_gene_expression.txt *.bam

After featureCounts has completed, change into the /data/user/hcc1395_b4b/hcc1395_featurecounts folder, which resides one folder up in /data/user/hcc1395_b4b.

cd ../hcc1395_featurecounts

head -1 reveals that the first line in the featureCounts gene expression table is the program and command line call used to perform the quantification. This can be removed in order to generate a gene by sample expression matrix.

head -1 hcc1395_gene_expression.txt

# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "--countReadPairs" "-a" "../references/22.gtf" "-g" "gene_name" "-o" "../hcc1395_featurecounts/hcc1395_gene_expression.txt" "hcc1395_normal_rep1.bam" "hcc1395_normal_rep2.bam" "hcc1395_normal_rep3.bam" "hcc1395_tumor_rep1.bam" "hcc1395_tumor_rep2.bam" "hcc1395_tumor_rep3.bam"

To remove the header in the gene expression table hcc1395_gene_expression.txt, use sed where the option 1d indicates to delete the first line in the file (ie. the header information). Then > will be use to write the output to hcc1395_gene_expression_no_header.txt.

sed '1d' hcc1395_gene_expression.txt > hcc1395_gene_expression_no_header.txt

Use head -1 to view the first line in hcc1395_gene_expression_no_header.txt. The expression table contains the Geneid (gene names) in the first column, followed by chromosome coordinates and then on column for each sample with the estimated expression.

head -1 hcc1395_gene_expression_no_header.txt

Geneid  Chr    Start     End       Strand  Length  hcc1395_normal_rep1.bam  hcc1395_normal_rep2.bam  hcc1395_normal_rep3.bam  hcc1395_tumor_rep1.bam  hcc1395_tumor_rep2.bam  hcc1395_tumor_rep3.bam
U2      chr22  10736171  10736283  -       113     0                0                0                0               0               0

The chromosomal information in columns 2 thru 6 can be removed using the cut command below where -f prompts for the columns to extract (ie. gene IDs in column 1 and sample expression estimates in columns 7 thru 12) with the construct being -f1,7-12. The file to extract these columns from (ie. hcc1395_gene_expression_no_header.txt) is specified next. Then | is used to send the output of cut to tr which will swap out the tab ('\t') that separate the columns with comma essentially turning the gene expression table in to a comma separate value (CSV) file saved under the name hcc1395_gene_expression.csv using >.

cut -f1,7-12 hcc1395_gene_expression_no_header.txt | tr '\t' ',' > hcc1395_gene_expression.csv

The column and head commands can be used to view the first two lines of hcc1395_gene_expression.csv in a way that the columns are nicely aligned. In the column command -t indicates to create a table while -s prompts for specification of column separator (ie. comma for CSV files). The file in which the content is to be displayed is hcc1395_gene_expression.csv. The output of column is sent to head with the -2 option to print the first two lines. The output indicates that this table has the Geneid (gene names) in the first column and the subsequent columns contain the expression estimates for each gene for each sample.

column -t -s ',' hcc1395_gene_expression.csv | head -2

Geneid              hcc1395_normal_rep1.bam  hcc1395_normal_rep2.bam  hcc1395_normal_rep3.bam  hcc1395_tumor_rep1.bam  hcc1395_tumor_rep2.bam  hcc1395_tumor_rep3.bam
U2                  0                0                0                0               0               0