Lesson 10: Quantifying Gene Expression
Lesson 9 Review
In Lesson 9, participants learned to align the hcc1395 sequencing data to reference genome in order to determine where in the genome each of the sequences came from. The next step is to quantify at each gene, how many sequences mapped to it, which can be used to approximate expression.
Learning objectives
After this lesson, participants will be able to:
- Convert SAM files to machine readable BAM.
- Quantify gene expression from alignment results.
Sign onto Biowulf
Before getting started, sign onto Biowulf. Remember to replace user
with the participant's assigned Biowulf student ID.
Then, change into the /data/user/hcc1395_b4b
folder.
Next, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.
Converting SAM to BAM
SAM files are human readable. Many bioinformatics programs take the binary version of SAM files known as BAM for input. Below, SAMTOOLS will be used to convert the hcc1395 SAM alignment outputs to BAM.
Change into the hcc1395_hisat2
folder from /data/user/hcc1395_b4b
and list the contents.
The following SAM alignment files should be present.
hcc1395_normal_rep1.sam
hcc1395_normal_rep2.sam
hcc1395_normal_rep3.sam
hcc1395_tumor_rep1.sam
hcc1395_tumor_rep2.sam
hcc1395_tumor_rep3.sam
The command construct below can be used to convert the SAM files to BAM and it can be explained as follows:
cat
is used to print the hcc1395 sample IDs to the terminal. The sample IDs are stored in a text file calledhcc1395_sample_ids.txt
one folder up, so it's path can be specified as../hcc1395_sample_ids.txt
. The output from thiscat
command is sent via pipe (|
) toparallel
. Thesamtools
command used to convert SAM to BAM is enclosed in parentheses. Withinsamtools
:sort
will sort the alignment output by coordinate (see http://www.htslib.org/doc/samtools-sort.html for more information).-o
: prompts for the output file name, which has the.bam
extension and prepended by the sample ID (specified using{}
as a place holder) sent bycat
.{}.sam
is the input SAM file. Again,{}
is used as a place holder for the sample IDs sent bycat
.
The next step is to index the BAM files. Again, cat
and parallel
will be used to performing the indexing step for all BAM files at once. The index
subcommand of samtools
is used for indexing. The option -b
states create an index with extension .bai
. The input BAM file follows the -b
option and the output .bai
file name is specified next.
Quantifying Gene Expression
Go back to /data/user/hcc1395_b4b
.
Create a directory called hcc1395_featurecounts
.
Go back into /data/user/hcc1395_b4b/hcc1395_hisat2
for this exercise.
The module featureCounts
from the subread
package will be used to quantify gene expression. So before getting started do the following.
In the featureCounts
command below:
-p
: specifies the presence of paired end data.--countReadPairs
: specifies to count both reads in a pair.-a
: prompt for path thegtf
annotation file, which is locate one directory up from the current directoryhcc1395_hisat2
(specified as../
) and in thereferences
folder with file name22.gtf
, so the path would be../references/22.gtf
.-g
: indicates to report expression by gene name as reported in thegtf
file.-o
: prompts for path to write the output, which is one directory up from the current ofhcc1395_hisat2
and inhcc1395_featurecounts
with assigned file namehcc1395_gene_expression.txt
. So the path for the output is../hcc1395_featurecounts/hcc1395_gene_expression.txt
.*.bam
: tellsfeatureCounts
to count all of thebam
alignment output files inhcc1395_hisat2
.
featureCounts -p --countReadPairs -a ../references/22.gtf -g gene_name -o ../hcc1395_featurecounts/hcc1395_gene_expression.txt *.bam
After featureCounts
has completed, change into the /data/user/hcc1395_b4b/hcc1395_featurecounts
folder, which resides one folder up in /data/user/hcc1395_b4b
.
head -1
reveals that the first line in the featureCounts
gene expression table is the program and command line call used to perform the quantification. This can be removed in order to generate a gene by sample expression matrix.
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "--countReadPairs" "-a" "../references/22.gtf" "-g" "gene_name" "-o" "../hcc1395_featurecounts/hcc1395_gene_expression.txt" "hcc1395_normal_rep1.bam" "hcc1395_normal_rep2.bam" "hcc1395_normal_rep3.bam" "hcc1395_tumor_rep1.bam" "hcc1395_tumor_rep2.bam" "hcc1395_tumor_rep3.bam"
To remove the header in the gene expression table hcc1395_gene_expression.txt
, use sed
where the option 1d
indicates to delete the first line in the file (ie. the header information). Then >
will be use to write the output to hcc1395_gene_expression_no_header.txt
.
Use head -1
to view the first line in hcc1395_gene_expression_no_header.txt
. The expression table contains the Geneid (gene names) in the first column, followed by chromosome coordinates and then on column for each sample with the estimated expression.
Geneid Chr Start End Strand Length hcc1395_normal_rep1.bam hcc1395_normal_rep2.bam hcc1395_normal_rep3.bam hcc1395_tumor_rep1.bam hcc1395_tumor_rep2.bam hcc1395_tumor_rep3.bam
U2 chr22 10736171 10736283 - 113 0 0 0 0 0 0
The chromosomal information in columns 2 thru 6 can be removed using the cut
command below where -f
prompts for the columns to extract (ie. gene IDs in column 1 and sample expression estimates in columns 7 thru 12) with the construct being -f1,7-12
. The file to extract these columns from (ie. hcc1395_gene_expression_no_header.txt
) is specified next. Then |
is used to send the output of cut
to tr
which will swap out the tab ('\t') that separate the columns with comma essentially turning the gene expression table in to a comma separate value (CSV) file saved under the name hcc1395_gene_expression.csv
using >
.
The column
and head
commands can be used to view the first two lines of hcc1395_gene_expression.csv
in a way that the columns are nicely aligned. In the column
command -t
indicates to create a table while -s
prompts for specification of column separator (ie. comma for CSV files). The file in which the content is to be displayed is hcc1395_gene_expression.csv
. The output of column
is sent to head
with the -2
option to print the first two lines. The output indicates that this table has the Geneid (gene names) in the first column and the subsequent columns contain the expression estimates for each gene for each sample.