Lesson 10: Quantifying Gene Expression
Lesson 9 Review
In Lesson 9, participants learned to align the hcc1395 sequencing data to reference genome in order to determine where in the genome each of the sequences came from. The next step is to quantify at each gene, how many sequences mapped to it, which can be used to approximate expression.
Learning objectives
After this lesson, participants will be able to:
- Convert SAM files to machine readable BAM.
- Quantify gene expression from alignment results.
Sign onto Biowulf
Before getting started, sign onto Biowulf. Remember to replace user
with the participant's assigned Biowulf student ID.
Then, change into the /data/user/hcc1395_b4b
folder.
Next, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.
Converting SAM to BAM
SAM files are human readable. Many bioinformatics programs take the binary version of SAM files known as BAM for input. Below, SAMTOOLS will be used to convert the hcc1395 SAM alignment outputs to BAM.
Change into the hcc1395_hisat2
folder from /data/user/hcc1395_b4b
and list the contents.
The following SAM alignment files should be present.
hcc1395_normal_rep1.sam
hcc1395_normal_rep2.sam
hcc1395_normal_rep3.sam
hcc1395_tumor_rep1.sam
hcc1395_tumor_rep2.sam
hcc1395_tumor_rep3.sam
The command construct below can be used to convert the SAM files to BAM and it can be explained as follows:
cat
is used to print the hcc1395 sample IDs to the terminal. The sample IDs are stored in a text file calledhcc1395_sample_ids.txt
one folder up, so it's path can be specified as../hcc1395_sample_ids.txt
. The output from thiscat
command is sent via pipe (|
) toparallel
. Thesamtools
command used to convert SAM to BAM is enclosed in parentheses. Withinsamtools
:sort
will sort the alignment output by coordinate (see http://www.htslib.org/doc/samtools-sort.html for more information).-o
: prompts for the output file name, which has the.bam
extension and prepended by the sample ID (specified using{}
as a place holder) sent bycat
.{}.sam
is the input SAM file. Again,{}
is used as a place holder for the sample IDs sent bycat
.
The next step is to index the BAM files. Again, cat
and parallel
will be used to performing the indexing step for all BAM files at once. The index
subcommand of samtools
is used for indexing. The option -b
states create an index with extension .bai
. The input BAM file follows the -b
option and the output .bai
file name is specified next.
Do Some File Renaming
All of the files in the alignment output folder have "hcc1395_" prepended. To make these files easier to work with, this exercise will remove the "hcc1395" from these files using the for
loop below, which can be explained as follows:
for
loops are one of the options for performing repetitive tasks until a certain condition has been met and requires a variable to keep track of the loop's progress.- In the loop below,
file
will be used as this variable or index, which will go through every file in thehcc1395_hisat2
alignment results folder and rename all of the files such that "hcc1395_" is removed. in
will tell thefor
loop what range or set of conditions should the loop operate on. For instance, this could be a numeric range (ie. 1 through 10) or in this case, it's simply the wild care*
, which specifies for the loop to operate on every file in thehcc1395_hisat2
folder.do
: this statement tells thefor
loop to do something, which in this example is to set a variable callednew
to equal the part of the file name withouthcc1395_
. To set a variable in side the loop:- Use the
=
sign to assign value to the variable. - Enclose the value inside $( () ).
echo $file
will print the name of the file, which will be sent to thecut
command via|
. In thecut
command, extract fields2-3
of the file name by specifying-f2-3
. The delimiter in, specified after-d
in this example is_
.
- Use the
- Use
mv
to rename the file, where$file
references the current file name in the loop, and$new
references the variablenew
, which is the new for the file. - Close out the
for
loop withdone
.
After the for
loop finishes renaming files, list the contents using ls
to confirm that the hcc1395_
part has been removed.
Quantifying Gene Expression
The module featureCounts
from the subread
package will be used to quantify gene expression. In the featureCounts
command below:
-p
: specifies the presence of paired end data.--countReadPairs
: specifies to count both reads in a pair.-a
: prompt for path thegtf
annotation file, which is locate one directory up from the current directoryhcc1395_hisat2
(specified as../
) and in thereferences
folder with file name22.gtf
so the path would be../references/22.gtf
.-g
: indicates to report expression by gene name as reported in thegtf
file.-o
: prompts for path of output, which is one directory from current (ie.hcc1395_hisat2
) and inhcc1395_featurecounts
with assigned file namehcc1395_gene_expression.txt
. So the path for the output is../hcc1395_featurecounts/hcc1395_gene_expression.txt
.*.bam
: tellsfeatureCounts
to count all of thebam
alignment output files inhcc1395_hisat2
.
featureCounts -p --countReadPairs -a ../references/22.gtf -g gene_name -o ../hcc1395_featurecounts/hcc1395_gene_expression.txt *.bam
head -1
reveals that the first line in the featureCounts
gene expression table is the program and command line call used to perform the quantification. This can be removed in order to generate a gene by sample expression matrix.
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "--countReadPairs" "-a" "../references/22.gtf" "-g" "gene_name" "-o" "../hcc1395_featurecounts/hcc1395_gene_expression.txt" "normal_rep1.bam" "normal_rep2.bam" "normal_rep3.bam" "tumor_rep1.bam" "tumor_rep2.bam" "tumor_rep3.bam"
To remove the header in the gene expression table hcc1395_gene_expression.txt
, use sed
where 1d
indicates to delete the first line in the file (ie. the header information). Then >
will be use to write the output to hcc1395_gene_expression_no_header.txt
.
Use head -1
to view the first line in hcc1395_gene_expression_no_header.txt. The expression table contains the Geneid (gene names) in the first column, followed by chromosome coordinates and then on column for each sample with the estimated expression.
Geneid Chr Start End Strand Length normal_rep1.bam normal_rep2.bam normal_rep3.bam tumor_rep1.bam tumor_rep2.bam tumor_rep3.bam
U2 chr22 10736171 10736283 - 113 0 0 0 0 0 0
The chromosomal information in columns 2 thru 6 can be removed using the cut
command below where -f
prompts for the columns to extract (ie. 1 and 7 thru 12) with the construct being -f1,7-12
. The file to extract these columns from (ie. hcc1395_gene_expression_no_header.txt). Then |
is used to send the output of cut
to tr
which will swap out the tab ('\t') that separate the columns with comma essentially turning the gene expression table in to a comma separate value (CSV) file saved under the name hcc1395_gene_expression.csv using >
.
The column
and head
commands can be used to view the first two lines of hcc1395_gene_expression.csv in a way that the columns are nicely aligned. In the column
command -t
indicates to create a table while -s
prompts for specification of column separater (ie. comma for CSV files). The file in which content is to be displayed is hcc1395_gene_expression.csv
. The output of column
is sent to head
with the -2
option to print the first two lines. The output indicates that this table has the Geneid (gen names) in the first column and the subsequent columns contain the expression estimates for each gene for each sample.
Geneid normal_rep1.bam normal_rep2.bam normal_rep3.bam tumor_rep1.bam tumor_rep2.bam tumor_rep3.bam
U2 0 0 0 0 0 0
Unfortunately, the column headings for the samples have ".bam" in them. These can be removed using awk
and gsub
. gsub
is a command for pattern substitution.
Note
"Awk is a scripting language used for manipulating data and generating reports." -- https://www.geeksforgeeks.org/awk-command-unixlinux-examples/#
gsub
will replace all occurences of a patter with another.
The action for awk
is enclosed in single quotes and {}
. For the awk
construct below, the action is gsub
, which will find the pattern .bam
(enclosed by forward slashes or /) and replace with blank (denoted by ""). It will then print the result. Note that gsub
and print
are separated by a semicolon. The input file is hcc1395_gene_expression.csv
and the results will be written to file hcc1395_gene_expression_matrix.csv
.