Lesson 10: Quantifying Gene Expression

Lesson 9 Review

In Lesson 9, participants learned to align the hcc1395 sequencing data to reference genome in order to determine where in the genome each of the sequences came from. The next step is to quantify at each gene, how many sequences mapped to it, which can be used to approximate expression.

Learning objectives

After this lesson, participants will be able to:

Convert SAM files to machine readable BAM.
Quantify gene expression from alignment results.

Sign onto Biowulf

Before getting started, sign onto Biowulf. Remember to replace user with the participant's assigned Biowulf student ID.

ssh user@biowulf.nih.gov

Then, change into the /data/user/hcc1395_b4b folder.

cd /data/user/hcc1395_b4b

Next, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.

sinteractive --cpus-per-task 6 --mem=12gb --gres=lscratch:10

Converting SAM to BAM

SAM files are human readable. Many bioinformatics programs take the binary version of SAM files known as BAM for input. Below, SAMTOOLS will be used to convert the hcc1395 SAM alignment outputs to BAM.

module load samtools

Change into the hcc1395_hisat2 folder from /data/user/hcc1395_b4b and list the contents.

cd hcc1395_hisat2

ls

The following SAM alignment files should be present.

hcc1395_normal_rep1.sam
hcc1395_normal_rep2.sam
hcc1395_normal_rep3.sam
hcc1395_tumor_rep1.sam
hcc1395_tumor_rep2.sam
hcc1395_tumor_rep3.sam

The command construct below can be used to convert the SAM files to BAM and it can be explained as follows:

cat is used to print the hcc1395 sample IDs to the terminal. The sample IDs are stored in a text file called hcc1395_sample_ids.txt one folder up, so it's path can be specified as ../hcc1395_sample_ids.txt. The output from this cat command is sent via pipe (|) to parallel. The samtools command used to convert SAM to BAM is enclosed in parentheses. Within samtools:
sort will sort the alignment output by coordinate (see http://www.htslib.org/doc/samtools-sort.html for more information).
-o: prompts for the output file name, which has the .bam extension and prepended by the sample ID (specified using {} as a place holder) sent by cat.
{}.sam is the input SAM file. Again, {} is used as a place holder for the sample IDs sent by cat.

cat ../hcc1395_sample_ids.txt | parallel -j 6 "samtools sort -o {}.bam {}.sam"

The next step is to index the BAM files. Again, cat and parallel will be used to performing the indexing step for all BAM files at once. The index subcommand of samtools is used for indexing. The option -b states create an index with extension .bai. The input BAM file follows the -b option and the output .bai file name is specified next.

cat ../hcc1395_sample_ids.txt | parallel -j 6 "samtools index -b {}.bam {}.bam.bai"

Do Some File Renaming

All of the files in the alignment output folder have "hcc1395_" prepended. To make these files easier to work with, this exercise will remove the "hcc1395" from these files using the for loop below, which can be explained as follows:

for loops are one of the options for performing repetitive tasks until a certain condition has been met and requires a variable to keep track of the loop's progress.
In the loop below, file will be used as this variable or index, which will go through every file in the hcc1395_hisat2 alignment results folder and rename all of the files such that "hcc1395_" is removed.
in will tell the for loop what range or set of conditions should the loop operate on. For instance, this could be a numeric range (ie. 1 through 10) or in this case, it's simply the wild care *, which specifies for the loop to operate on every file in the hcc1395_hisat2 folder.
do: this statement tells the for loop to do something, which in this example is to set a variable called new to equal the part of the file name without hcc1395_. To set a variable in side the loop:
- Use the = sign to assign value to the variable.
- Enclose the value inside $( () ).
- echo $file will print the name of the file, which will be sent to the cut command via |. In the cut command, extract fields 2-3 of the file name by specifying -f2-3. The delimiter in, specified after -d in this example is _.
Use mv to rename the file, where $file references the current file name in the loop, and $new references the variable new, which is the new for the file.
Close out the for loop with done.

for file in *; 
do new=$( (echo $file | cut -f2-3 -d '_') ); 
mv $file $new; 
done

After the for loop finishes renaming files, list the contents using ls to confirm that the hcc1395_ part has been removed.

Quantifying Gene Expression

The module featureCounts from the subread package will be used to quantify gene expression. In the featureCounts command below:

-p: specifies the presence of paired end data.
--countReadPairs: specifies to count both reads in a pair.
-a: prompt for path the gtf annotation file, which is locate one directory up from the current directory hcc1395_hisat2 (specified as ../) and in the references folder with file name 22.gtf so the path would be ../references/22.gtf.
-g: indicates to report expression by gene name as reported in the gtf file.
-o: prompts for path of output, which is one directory from current (ie. hcc1395_hisat2) and in hcc1395_featurecounts with assigned file name hcc1395_gene_expression.txt. So the path for the output is ../hcc1395_featurecounts/hcc1395_gene_expression.txt.
*.bam: tells featureCounts to count all of the bam alignment output files in hcc1395_hisat2.

featureCounts -p --countReadPairs -a ../references/22.gtf -g gene_name -o ../hcc1395_featurecounts/hcc1395_gene_expression.txt *.bam

head -1 reveals that the first line in the featureCounts gene expression table is the program and command line call used to perform the quantification. This can be removed in order to generate a gene by sample expression matrix.

head -1 hcc1395_gene_expression.txt

# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "--countReadPairs" "-a" "../references/22.gtf" "-g" "gene_name" "-o" "../hcc1395_featurecounts/hcc1395_gene_expression.txt" "normal_rep1.bam" "normal_rep2.bam" "normal_rep3.bam" "tumor_rep1.bam" "tumor_rep2.bam" "tumor_rep3.bam"

To remove the header in the gene expression table hcc1395_gene_expression.txt, use sed where 1d indicates to delete the first line in the file (ie. the header information). Then > will be use to write the output to hcc1395_gene_expression_no_header.txt.

sed '1d' hcc1395_gene_expression.txt > hcc1395_gene_expression_no_header.txt

Use head -1 to view the first line in hcc1395_gene_expression_no_header.txt. The expression table contains the Geneid (gene names) in the first column, followed by chromosome coordinates and then on column for each sample with the estimated expression.

head -1 hcc1395_gene_expression_no_header.txt

Geneid  Chr    Start     End       Strand  Length  normal_rep1.bam  normal_rep2.bam  normal_rep3.bam  tumor_rep1.bam  tumor_rep2.bam  tumor_rep3.bam
U2      chr22  10736171  10736283  -       113     0                0                0                0               0               0

The chromosomal information in columns 2 thru 6 can be removed using the cut command below where -f prompts for the columns to extract (ie. 1 and 7 thru 12) with the construct being -f1,7-12. The file to extract these columns from (ie. hcc1395_gene_expression_no_header.txt). Then | is used to send the output of cut to tr which will swap out the tab ('\t') that separate the columns with comma essentially turning the gene expression table in to a comma separate value (CSV) file saved under the name hcc1395_gene_expression.csv using >.

cut -f1,7-12 hcc1395_gene_expression_no_header.txt | tr '\t' ',' > hcc1395_gene_expression.csv

The column and head commands can be used to view the first two lines of hcc1395_gene_expression.csv in a way that the columns are nicely aligned. In the column command -t indicates to create a table while -s prompts for specification of column separater (ie. comma for CSV files). The file in which content is to be displayed is hcc1395_gene_expression.csv. The output of column is sent to head with the -2 option to print the first two lines. The output indicates that this table has the Geneid (gen names) in the first column and the subsequent columns contain the expression estimates for each gene for each sample.

column -t -s ',' hcc1395_gene_expression.csv | head -2

Geneid              normal_rep1.bam  normal_rep2.bam  normal_rep3.bam  tumor_rep1.bam  tumor_rep2.bam  tumor_rep3.bam
U2                  0                0                0                0               0               0

Unfortunately, the column headings for the samples have ".bam" in them. These can be removed using awk and gsub. gsub is a command for pattern substitution.

Note

"Awk is a scripting language used for manipulating data and generating reports." -- https://www.geeksforgeeks.org/awk-command-unixlinux-examples/#

gsub will replace all occurences of a patter with another.

The action for awk is enclosed in single quotes and {}. For the awk construct below, the action is gsub, which will find the pattern .bam (enclosed by forward slashes or /) and replace with blank (denoted by ""). It will then print the result. Note that gsub and print are separated by a semicolon. The input file is hcc1395_gene_expression.csv and the results will be written to file hcc1395_gene_expression_matrix.csv.

awk '{gsub(/.bam/, ""); print}' hcc1395_gene_expression.csv > hcc1395_gene_expression_matrix.csv