Lesson 10: Quantifying Gene Expression from bulk RNA Sequencing Data
Lesson 9 Review
In Lesson 9, participants learned to align the hcc1395 sequencing data to reference genome in order to determine where in the genome each of the sequences came from. The next step is to quantify at each gene, how many sequences mapped to it, which can be used to approximate expression.

Learning objectives
After this lesson, participants will be able to:
- Convert SAM files to machine readable BAM.
- Quantify gene expression from alignment results.
Sign onto Biowulf
Before getting started, sign onto Biowulf. Remember to replace user with the participant's assigned Biowulf student ID.
Then, change into the /data/user/hcc1395_b4b folder.
Next, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.
Converting SAM to BAM
SAM files are human readable. Many bioinformatics programs take the binary version of SAM files known as BAM for input. Below, SAMTOOLS will be used to convert the hcc1395 SAM alignment outputs to BAM.
Change into the hcc1395_hisat2 folder from /data/user/hcc1395_b4b and list the contents.
The following SAM alignment files should be present.
hcc1395_normal_rep1.sam
hcc1395_normal_rep2.sam
hcc1395_normal_rep3.sam
hcc1395_tumor_rep1.sam
hcc1395_tumor_rep2.sam
hcc1395_tumor_rep3.sam
The command construct below can be used to convert the SAM files to BAM and it can be explained as follows:
catis used to print the hcc1395 sample IDs to the terminal. The sample IDs are stored in a text file calledhcc1395_sample_ids.txtone folder up, so it's path can be specified as../hcc1395_sample_ids.txt. The output from thiscatcommand is sent via pipe (|) toparallel. Thesamtoolscommand used to convert SAM to BAM is enclosed in parentheses. Withinsamtools:sortwill sort the alignment output by coordinate (see http://www.htslib.org/doc/samtools-sort.html for more information).-o: prompts for the output file name, which has the.bamextension and prepended by the sample ID (specified using{}as a place holder) sent bycat.{}.samis the input SAM file. Again,{}is used as a place holder for the sample IDs sent bycat.
The next step is to index the BAM files. Again, cat and parallel will be used to performing the indexing step for all BAM files at once. The index subcommand of samtools is used for indexing. The option -b states create an index with extension .bai. The input BAM file follows the -b option and the output .bai file name is specified next.
Quantifying Gene Expression
Go back to /data/user/hcc1395_b4b.
Create a directory called hcc1395_featurecounts.
Go back into /data/user/hcc1395_b4b/hcc1395_hisat2 for this exercise.
The module featureCounts from the subread package will be used to quantify gene expression. So before getting started do the following.
In the featureCounts command below:
-p: specifies the presence of paired end data.--countReadPairs: specifies to count both reads in a pair.-a: prompt for path thegtfannotation file, which is locate one directory up from the current directoryhcc1395_hisat2(specified as../) and in thereferencesfolder with file name22.gtf, so the path would be../references/22.gtf.-g: indicates to report expression by gene name as reported in thegtffile.-o: prompts for path to write the output, which is one directory up from the current ofhcc1395_hisat2and inhcc1395_featurecountswith assigned file namehcc1395_gene_expression.txt. So the path for the output is../hcc1395_featurecounts/hcc1395_gene_expression.txt.*.bam: tellsfeatureCountsto count all of thebamalignment output files inhcc1395_hisat2.
featureCounts -p --countReadPairs -a ../references/22.gtf -g gene_name -o ../hcc1395_featurecounts/hcc1395_gene_expression.txt *.bam
After featureCounts has completed, change into the /data/user/hcc1395_b4b/hcc1395_featurecounts folder, which resides one folder up in /data/user/hcc1395_b4b.
head -1 reveals that the first line in the featureCounts gene expression table is the program and command line call used to perform the quantification. This can be removed in order to generate a gene by sample expression matrix.
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "--countReadPairs" "-a" "../references/22.gtf" "-g" "gene_name" "-o" "../hcc1395_featurecounts/hcc1395_gene_expression.txt" "hcc1395_normal_rep1.bam" "hcc1395_normal_rep2.bam" "hcc1395_normal_rep3.bam" "hcc1395_tumor_rep1.bam" "hcc1395_tumor_rep2.bam" "hcc1395_tumor_rep3.bam"
To remove the header in the gene expression table hcc1395_gene_expression.txt, use sed where the option 1d indicates to delete the first line in the file (ie. the header information). Then > will be use to write the output to hcc1395_gene_expression_no_header.txt.
Use head -1 to view the first line in hcc1395_gene_expression_no_header.txt. The expression table contains the Geneid (gene names) in the first column, followed by chromosome coordinates and then on column for each sample with the estimated expression.
Geneid Chr Start End Strand Length hcc1395_normal_rep1.bam hcc1395_normal_rep2.bam hcc1395_normal_rep3.bam hcc1395_tumor_rep1.bam hcc1395_tumor_rep2.bam hcc1395_tumor_rep3.bam
U2 chr22 10736171 10736283 - 113 0 0 0 0 0 0
The chromosomal information in columns 2 thru 6 can be removed using the cut command below where -f prompts for the columns to extract (ie. gene IDs in column 1 and sample expression estimates in columns 7 thru 12) with the construct being -f1,7-12. The file to extract these columns from (ie. hcc1395_gene_expression_no_header.txt) is specified next. Then | is used to send the output of cut to tr which will swap out the tab ('\t') that separate the columns with comma essentially turning the gene expression table in to a comma separate value (CSV) file saved under the name hcc1395_gene_expression.csv using >.
The column and head commands can be used to view the first two lines of hcc1395_gene_expression.csv in a way that the columns are nicely aligned. In the column command -t indicates to create a table while -s prompts for specification of column separator (ie. comma for CSV files). The file in which the content is to be displayed is hcc1395_gene_expression.csv. The output of column is sent to head with the -2 option to print the first two lines. The output indicates that this table has the Geneid (gene names) in the first column and the subsequent columns contain the expression estimates for each gene for each sample.