Managing Bioinformatics Projects with Jupyter Lab

Learning Objectives

After this class, participants will have the foundation needed to start using Jupyter Lab as an all-in-one place to keep code, output, and descriptions of analysis steps. Participants will be able to:

Start a Jupyter Lab session
Describe the Jupyter Lab interface
Initiate a Jupyter Notebook
Know how to access Jupyter Lab
Know how to create formatted text and code in a Jupyter Notebook
Describe ways to export and share a Jupyter Notebook

Course recap

So far, this course series has addressed several areas that are important for anyone venturing into bioinformatics. These include:

Available bioinformatics resources and tools at NIH
Various assays to measure the different components of the Central Dogma of Biology (e.g., whole genome sequencing, RNA sequencing)
Data management
High performance computing systems (Biowulf at NIH)
Programming languages such as R and Python

Start Jupyter Lab

To start Jupyter Lab, type the following into the command prompt. The --no-browser option prevents a web browser from opening.

jupyter lab --no-browser

Copy and paste any of the following URLs into a web browser to start using Jupyter. These URLs will be different for every Jupyter Lab session.

To access the server, open this file in a browser:
        file:///Users/wuz8/Library/Jupyter/runtime/jpserver-30985-open.html
    Or copy and paste one of these URLs:
        http://localhost:8890/lab?token=1952fcce201164f8368f2666f2f2625c0bbeeea1ddc20b2c
        http://127.0.0.1:8890/lab?token=1952fcce201164f8368f2666f2f2625c0bbeeea1ddc20b2c
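
The address printed at startup combines a host, a port, and an access token, and the port and token are the parts that change between sessions. As a minimal sketch using only the Python standard library (the URL is the example shown above, and the variable names are illustrative), the pieces can be separated like this:

```python
from urllib.parse import urlparse, parse_qs

# Example URL copied from the startup message above; yours will differ.
url = "http://localhost:8890/lab?token=1952fcce201164f8368f2666f2f2625c0bbeeea1ddc20b2c"

parts = urlparse(url)
host = parts.hostname                      # machine running the Jupyter server
port = parts.port                          # port the server is listening on
token = parse_qs(parts.query)["token"][0]  # access token for this session

print(host, port, token)
```

If the browser asks for a token after the page loads, the value after token= is what it expects.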

Tip

Start a Jupyter Lab session in the project folder. This folder will contain input and analysis output.

Jupyter Lab interface

Jupyter Lab is compatible with many languages

Bash, Python, and R
See https://github.com/jupyter/jupyter/wiki/Jupyter-kernels for a list of Jupyter compatible languages

Jupyter Lab file explorer

Start a Python Jupyter Notebook

Click on the "Python 3 (ipykernel)" tab to start a Python Jupyter Notebook. The Jupyter Notebook is a part of Jupyter Lab.

The notebook is where users:

Write code
View output
Document analysis steps using formatted text written in markdown

Note

"Markdown is a lightweight markup language for creating formatted text using a plain-text editor." -- https://en.wikipedia.org/wiki/Markdown

Keeping code, output and analysis steps all in one place

Note

A new Jupyter Notebook is given the name "Untitled". Change this to something meaningful, either by using the save icon on the notebook menu bar or by right-clicking the "Untitled" notebook in the file explorer and choosing "Rename". Jupyter Notebooks have the extension ipynb, which stands for interactive Python notebook.
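
An ipynb file is plain JSON text in which every cell is tagged as either markdown or code. The sketch below assumes the version 4 notebook format and uses made-up cell contents; it builds a tiny notebook as a Python dictionary and reads the cell types back, which may help when a notebook needs to be inspected outside Jupyter Lab.

```python
import json

# A minimal notebook in the nbformat 4 layout: one markdown cell, one code cell.
# The cell contents are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Volcano plot\n", "Notes on the analysis go here."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file.
ipynb_text = json.dumps(notebook, indent=1)

# Reading it back shows which cells hold text and which hold code.
cell_types = [cell["cell_type"] for cell in json.loads(ipynb_text)["cells"]]
print(cell_types)
```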

Changing between markdown and code

Ways to access and use Jupyter Lab

Install on local machine (see https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)
Available on Biowulf (see https://hpc.nih.gov/apps/jupyter.html)

Writing formatted text

See https://www.markdownguide.org/basic-syntax/ for a markdown guide.

Custom heading sizes

Use # to specify heading level

# Heading level 1 (largest)
## Heading level 2 (second largest)
### Heading level 3 (third largest)
...

Lists

Un-ordered lists: use * or -

- DNA
- RNA
- protein
- metabolite

Ordered list: use numbers

1. Obtain sequencing data
2. Perform pre-alignment QC
3. Adapter and/or quality trim
4. Align sequencing data to reference genome
5. Obtain gene expression count matrix
6. Run differential expression analysis
7. Pathway analysis

Insert images

<img src="image_path" />

Insert links

[Description of website](insert url)

Code and visualization

Import data using Pandas

Pandas is a Python package used for working with tabular data. The dataset used here is the differential gene expression analysis results from the HBR and UHR study. To work with this, users will need to import it using the read_csv function of Pandas as the data is in a csv file (hbr_uhr_deg_chr22_with_significance.csv located in the folder jupyter_summer_series_2023_data). The path to this file is used as the argument for the read_csv function.

# Load the Pandas package
import pandas

# Import the data

hbr_uhr_deg_chr22=pandas.read_csv("./jupyter_summer_series_2023_data/hbr_uhr_deg_chr22_with_significance.csv")

# View the first several lines of hbr_uhr_deg_chr22
hbr_uhr_deg_chr22.head()

	name	log2FoldChange	PAdj	-log10PAdj	significance
0	SYNGR1	-4.6	5.200000e-217	216.283997	down
1	SEPT3	-4.6	4.500000e-204	203.346787	down
2	YWHAH	-2.5	4.700000e-191	190.327902	down
3	RPL3	1.7	5.400000e-134	133.267606	ns
4	PI4KA	-2.0	2.900000e-118	117.537602	down
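
The -log10PAdj column appears to be the negative base-10 logarithm of the PAdj column. The sketch below is a quick check of that assumption using only the standard library, with the first three rows of the table typed in by hand.

```python
import math

# (name, PAdj) pairs copied from the table above.
rows = [
    ("SYNGR1", 5.2e-217),
    ("SEPT3", 4.5e-204),
    ("YWHAH", 4.7e-191),
]

# -log10 turns very small adjusted p-values into large positive numbers,
# so the most significant genes end up at the top of a volcano plot.
neg_log10_padj = {name: -math.log10(padj) for name, padj in rows}

for name, value in neg_log10_padj.items():
    print(name, round(value, 6))
```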

Construct volcano plot using Seaborn

Seaborn is a popular visualization package for Python. Its scatterplot function generates scatter plots (in this case a volcano plot, which is a special type of scatter plot). The scatterplot function takes the following arguments:

data: hbr_uhr_deg_chr22 (differential gene expression analysis results)
x: x-axis values (i.e., gene expression log2FoldChange)
y: y-axis values (i.e., -log10 of adjusted p-value)
hue: color dots by whether gene expression is up, down, or has no change (see the significance column of the data)

# Load the seaborn plotting package
import seaborn

# Passing the table with data= keeps the call unambiguous across seaborn versions
seaborn.scatterplot(data=hbr_uhr_deg_chr22, x="log2FoldChange", y="-log10PAdj", hue="significance")

<Axes: xlabel='log2FoldChange', ylabel='-log10PAdj'>

The volcano plot is a special scatter plot that depicts gene expression change versus the statistical significance of the change.
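
The significance labels used to color the plot come with the dataset, and the rule that produced them is not stated here. As a hedged sketch, a common approach is to combine a fold-change cutoff with an adjusted p-value cutoff. The function name classify and both cutoff values below are assumptions chosen for illustration, although they do reproduce the labels of the rows shown earlier.

```python
def classify(log2_fold_change, padj, lfc_cutoff=2.0, padj_cutoff=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant).

    The cutoffs are illustrative assumptions, not values from the course.
    """
    if padj < padj_cutoff and log2_fold_change >= lfc_cutoff:
        return "up"
    if padj < padj_cutoff and log2_fold_change <= -lfc_cutoff:
        return "down"
    return "ns"


# (name, log2FoldChange, PAdj) taken from the table printed earlier.
genes = [
    ("SYNGR1", -4.6, 5.2e-217),
    ("RPL3", 1.7, 5.4e-134),
    ("PI4KA", -2.0, 2.9e-118),
]

labels = {name: classify(lfc, padj) for name, lfc, padj in genes}
print(labels)
```

Note that RPL3 has a very small adjusted p-value but is still labeled ns, because its fold change falls below the cutoff.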

Construct heatmap using Seaborn

This exercise will use Seaborn's clustermap function to construct a gene expression heatmap of top differentially expressed genes in the HBR and UHR study. Heatmaps are another common visualization in RNA sequencing and allow users to identify clusters of samples with similar gene expression patterns.

First, import the dataset using pandas.read_csv. The clustermap function of seaborn takes the following arguments and options.

data: hbr_uhr_top_deg_normalized_counts
z_score: z-score scale the gene expression counts (0 scales each row, 1 scales each column)
cmap: specify a coloring scheme (e.g., viridis)
figsize: specify figure size
vmin, vmax: specify the lower and upper limits of the color scale
cbar_kws: specify the title for the heatmap color bar using a key-value pair
cbar_pos: specify coordinate to place the heatmap color bar
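
Z-score scaling re-expresses each value as the number of standard deviations it lies from its row mean, so genes with very different expression levels can share one color scale. The sketch below is a plain-Python illustration with made-up values; the use of the sample standard deviation is an assumption about the exact formula, and the clipping step only mimics how the vmin and vmax values passed to clustermap in this exercise cap the color range.

```python
import statistics


def z_score(values):
    """Center values on their mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


# Illustrative expression values for one gene across six samples.
expression = [375.0, 343.6, 339.4, 3.5, 6.9, 2.6]
scaled = z_score(expression)

# Values beyond vmin/vmax are drawn with the end colors of the color map.
vmin, vmax = -1.5, 1.5
clipped = [min(max(z, vmin), vmax) for z in scaled]

print([round(z, 2) for z in scaled])
```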

# Import the data

hbr_uhr_top_deg_normalized_counts=pandas.read_csv("./jupyter_summer_series_2023_data/hbr_uhr_top_deg_normalized_counts.csv", index_col=0)

seaborn.clustermap(hbr_uhr_top_deg_normalized_counts,z_score=0,cmap="viridis",
                   figsize=(8,8),vmin=-1.5, vmax=1.5,cbar_kws=({"label": "z score"}),
                   cbar_pos=(0.855,0.8,0.025,0.15))

<seaborn.matrix.ClusterGrid at 0x1a3bd2190>

R code in a Python Jupyter Notebook

Using the rpy2.ipython package, users can run R code inside a Python Jupyter Notebook.

# Load rpy2.ipython
%load_ext rpy2.ipython

Using R to generate a principal components plot

Here, R will be used to generate a principal components plot for the HBR and UHR study. Principal components plots are a popular way to visualize how samples in RNA sequencing cluster based on gene expression.

%%R
# Load packages using the library command
library(ggfortify)

Loading required package: ggplot2

%%R
# Import gene expression data using read.csv and store it as variable counts
counts <- read.csv("./jupyter_summer_series_2023_data/hbr_uhr_normalized_counts_pca.csv")

%%R
# Look at the first few lines of counts
head(counts)

    Samples Treatment SULT4A1 MPPED1 PRAME IGLC2 IGLC3 CDC45 CLDN5 PCAT14
1 HBR_1.bam       HBR   375.0  157.8   0.0   0.0   0.0   2.6  77.6    0.0
2 HBR_2.bam       HBR   343.6  158.4   0.0   0.0   0.0   1.0  88.5    0.0
3 HBR_3.bam       HBR   339.4  162.6   0.0   0.0   0.0   0.0  67.2    1.2
4 UHR_1.bam       UHR     3.5    0.7 568.9 488.6 809.7 155.0   1.4  139.8
5 UHR_2.bam       UHR     6.9    3.0 467.3 498.0 313.8 152.5   2.0  154.4
6 UHR_3.bam       UHR     2.6    2.6 519.2 457.5 688.0 149.9   0.0  155.1
  RP5.1119A7.17 MYO18B RP3.323A16.1 CACNG2
1          53.0    0.0          0.0   42.7
2          57.6    0.0          0.0   35.0
3          51.9    0.0          1.2   56.6
4           0.0   59.5         51.9    0.0
5           0.0   84.2         76.2    1.0
6           0.0   56.5         53.1    0.0

The autoplot command takes the following arguments

Principal components analysis results, which are stored in hbr_uhr_pca
data: Expression counts table, which is stored as counts
colour: column in the expression counts table to color the samples by (here color by Treatment)
size: specify size of the dots

A theme layer was added to the principal components plot to customize the font sizes.

%%R

# Run principal components analysis on counts using the prcomp function
hbr_uhr_pca <- prcomp(counts[3:14],scale.=TRUE,center=TRUE)

# Construct principal components plot.

autoplot(hbr_uhr_pca,data=counts,colour="Treatment",size=5)+
        theme(axis.title=element_text(size=20),
              axis.text=element_text(size=15),
              legend.title=element_text(size=15),
              legend.text=element_text(size=15))

Running Unix commands

Users can run Unix commands within a Python Jupyter Notebook. To do this, start a line in a code block with "!" followed by the Unix command. For instance, the pwd command in the code block below prints the present working directory.

!pwd

/Users/wuz8/Documents/jupyter_summer_series_2023
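
The "!" prefix is notebook syntax and does not work in a plain Python script. A rough equivalent, assuming a Unix-like system where the pwd command is available, is to call the command through the standard library's subprocess module, as sketched here:

```python
import os
import subprocess

# Run the Unix pwd command and capture what it prints.
result = subprocess.run(["pwd"], capture_output=True, text=True, check=True)
working_dir = result.stdout.strip()

# os.getcwd() is the pure-Python way to ask the same question.
print(working_dir)
print(os.getcwd())
```

Capturing the output in a variable this way makes it possible to reuse the result later in the analysis, for example to build file paths.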

Exporting Jupyter Notebook using GUI

Exporting Jupyter Notebook using command line

Use the jupyter nbconvert command at the command prompt to convert a Jupyter Notebook to one of the available formats, including html, pdf, and slides. The format is specified after the --to option, followed by the notebook file (notebook.ipynb below is a placeholder).

jupyter nbconvert --to format notebook.ipynb
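
When the same notebook needs to be exported to several formats, the commands can be generated rather than typed one by one. The sketch below only builds and prints the command lines and does not run them. The notebook name analysis.ipynb is hypothetical, and the notebook file is given as the last argument of each command.

```python
import shlex

# Hypothetical notebook name; replace with your own file.
notebook = "analysis.ipynb"

# Formats mentioned above.
formats = ["html", "pdf", "slides"]

# Build one nbconvert command line per output format.
commands = [
    shlex.join(["jupyter", "nbconvert", "--to", fmt, notebook])
    for fmt in formats
]

for command in commands:
    print(command)
```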

GitHub
- Static notebook (i.e., users will not be able to run it)
Binder
- Provide data
- Provide list of packages
- Users can run the notebook
- Example

Download example data and Jupyter Notebook

The example data and Jupyter Notebook are inside a zip file, so unzip it after downloading to access the content.

Example data