
Lesson 13 Practice

In this practice session, participants will filter and perform some quality checks on the HBR-UHR gene expression data.

Before getting started, connect to Biowulf and start an interactive session with 12 GB of memory and 10 GB of local temporary storage.
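The lesson does not give the command for requesting the session. As a sketch, assuming Biowulf's standard sinteractive command with its --mem and --gres=lscratch options, the request could look like the following when run from a Biowulf login node. The guard only prints the command when sinteractive is not available, for example on a personal computer.

```shell
# Resource request from the text: 12 GB of memory and 10 GB of local
# temporary storage (lscratch). The exact flags are an assumption.
session_cmd="sinteractive --mem=12g --gres=lscratch:10"

if command -v sinteractive >/dev/null 2>&1; then
    # On Biowulf: start the interactive session.
    $session_cmd
else
    # Elsewhere: just show what would be run.
    echo "Run on a Biowulf login node: $session_cmd"
fi
```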

Change into the /data/user/hbr_uhr_b4b folder.

Solution
cd /data/user/hbr_uhr_b4b

Create a directory called hbr_uhr_deg to store the differential expression analysis outputs.

Solution
mkdir hbr_uhr_deg

In which folder is the HBR-UHR gene expression data table stored, and what is the file name?

Solution
The gene expression table is stored in the folder `hbr_uhr_expression/`. The name of the gene expression table is `hbr_uhr_gene_expression.csv`. To reference it from `hbr_uhr_b4b`, use `hbr_uhr_expression/hbr_uhr_gene_expression.csv`.

Load R.

Solution
module load R

The scripts are in the folder b4b_scripts.

Filter low-expression genes out of hbr_uhr_gene_expression.csv. Set the minimum number of samples per group with expression greater than 0 to 2. Use hbr_uhr as the study name and write the output to hbr_uhr_deg.

Solution
Rscript b4b_scripts/filter_expression.R hbr_uhr_expression/hbr_uhr_gene_expression.csv 2 hbr_uhr_phenotypes.csv hbr_uhr hbr_uhr_deg
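To make the threshold argument concrete, here is a minimal sketch of one plausible reading of the filter, applied to a toy table. The rule used (keep a gene when at least 2 samples in at least one group have expression greater than 0), the sample names, and the toy values are all assumptions for illustration; the actual logic in filter_expression.R may differ.

```shell
# Toy table (hypothetical sample names): two HBR and two UHR samples.
toy_csv=$(mktemp)
printf 'gene,HBR_1,HBR_2,UHR_1,UHR_2\ng1,4,7,0,0\ng2,0,3,1,0\ng3,0,0,0,0\n' > "$toy_csv"

# Assumed rule: keep a gene when at least `min` samples in at least one
# group have expression greater than 0. The header line is always kept.
kept=$(awk -F, -v min=2 '
    NR == 1 { print; next }
    {
        hbr = ($2 > 0) + ($3 > 0)   # HBR samples with expression above 0
        uhr = ($4 > 0) + ($5 > 0)   # UHR samples with expression above 0
        if (hbr >= min || uhr >= min) print
    }' "$toy_csv")

echo "$kept"
rm -f "$toy_csv"
```

In this toy table only g1 passes: both of its HBR samples are above 0, whereas g2 has a single expressing sample in each group and g3 has none.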

How many genes are in the filtered expression table?

Solution
wc -l hbr_uhr_deg/hbr_uhr_gene_expression_filtered.csv
The filtered gene expression CSV file has 562 lines. One of these is the header line, so 561 genes are left after filtering.
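The header subtraction can also be done directly on the command line. The sketch below uses a small toy CSV (hypothetical column names and values) as a stand-in for the filtered table and skips the first line before counting.

```shell
# Toy stand-in for the filtered table: one header line plus three gene rows.
tmp_csv=$(mktemp)
printf 'gene,HBR_1,UHR_1\nA,5,9\nB,2,0\nC,7,3\n' > "$tmp_csv"

# Start at line 2 to skip the header, then count the remaining lines.
# The arithmetic expansion strips any padding that wc adds on some systems.
n_genes=$(( $(tail -n +2 "$tmp_csv" | wc -l) ))

echo "genes: $n_genes"
rm -f "$tmp_csv"
```

Pointing the same tail and wc pipeline at hbr_uhr_deg/hbr_uhr_gene_expression_filtered.csv would report the gene count without the manual subtraction.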

Run QC on the filtered expression data. Assign hbr_uhr as the study name and write the output to the folder hbr_uhr_deg.

Solution
Rscript b4b_scripts/quality_check.R hbr_uhr_deg/hbr_uhr_gene_expression_filtered.csv hbr_uhr hbr_uhr_deg

After running QC, download the images to the Downloads folder of a personal computer to view the results. To do this, open a new terminal (Mac) or command prompt window (Windows 10 or above) and change into the local Downloads folder.

cd Downloads

Then use the scp command below to download the results folder. Remember to replace user with the participant's assigned Biowulf student ID.

scp -r user@helix.nih.gov:/data/user/hbr_uhr_b4b/hbr_uhr_deg .

Use the Mac Finder or Windows Explorer to navigate to hbr_uhr_deg in the local Downloads directory to begin exploring the results.

Does it look like the samples are separated by biology?

Solution
Yes, it appears that biology is driving the difference between the HBR and UHR samples, which are separated along the first principal component axis.

What does the distance plot show?

Solution
The distance plot is not optimal, because samples within the same group do not cluster closely together.

How does the expression distribution among samples look?

Solution
From the density and box plots, the expression distributions are not equal among samples. Hopefully normalization improves this.