Lesson 4: Data visualization using Python

Learning objectives

This lesson will provide participants with enough knowledge to start using Python for data visualization. Specifically, participants should be able to use the Seaborn package to
- Construct plots that range from very basic to elegant as well as biologically relevant
- Customize plots including altering font size and adding custom annotations

Python data visualization tools

Seaborn, a popular Python plotting package, is the tool introduced in this lesson. It builds on Matplotlib and is oriented towards statistical data visualization. Other packages exist as well, including domain-specific packages, packages that implement the grammar of graphics, and packages for creating web-based visualization dashboards. A non-exhaustive list of Python plotting packages is shown below.

Visualization using Seaborn

Load packages

import pandas
import numpy
import matplotlib.pyplot as plt
import seaborn

Modify the basic plot elements with Seaborn.

To plot using Seaborn, start the command with seaborn followed by the plot type, separated by a period.

seaborn.plot_type

This section will use Seaborn's scatterplot to explore how to work with and modify basic elements of plotting. The foundations learned in this section form the basis for creating advanced and elegant plots.

The data to be plotted is a single point located at 5 on the x axis and 5 on the y axis. Both x and y are generated with numpy.array, so each is a single-element array that stores the number 5.

x=numpy.array([5])
y=numpy.array([5])

Plot x and y using Seaborn's scatterplot function (see Figure 1 for results), which takes data frames or NumPy arrays as input. Here, x will be plotted on the x axis, and y will be plotted on the y axis. The plot can be stored as a variable, which in this example is plot0.

plot0=seaborn.scatterplot(x=x, y=y)
plt.show()

Figure 1
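Because scatterplot also accepts a data frame, the same point can be plotted by naming columns instead of passing arrays. The following is a minimal sketch; the data frame df_point and the variable plot_df are illustrative names that are not part of the lesson.

```python
import pandas
import seaborn

# Illustrative data frame holding the same single point (5, 5)
df_point = pandas.DataFrame({"x": [5], "y": [5]})

# With a data frame, x and y are given as column names
plot_df = seaborn.scatterplot(data=df_point, x="x", y="y")
```

When a data frame is used, Seaborn labels the axes with the column names. Call plt.show() afterwards to display the figure, as in the lesson's own code.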

The plot in Figure 1 has no axis labels. Axis labels are an integral part of an informative data visualization. It might also be useful to include meaningful x and y limits. To do this, call the various set_* methods on the plot. See Figure 2 for the result.

set_xlabel: specifies the x axis label, size is used to set the label font size
set_ylabel: specifies the y axis label, size is used to set the label font size
set_xlim: sets the x axis limits
set_ylim: sets the y axis limits
set_xticks: sets the location of x axis tick marks
set_xticklabels: sets the x axis tick mark labels, size is used to set the tick mark label font size
set_yticks: sets the location of y axis tick marks
set_yticklabels: sets the y axis tick mark labels, size is used to set the tick mark label font size

plot0=seaborn.scatterplot(x=x, y=y)
plot0.set_xlabel("x axis", size=14)
plot0.set_ylabel("y axis", size=14)
plot0.set_xlim(0,10)
plot0.set_ylim(0,10)
plot0.set_xticks([0,2,4,6,8,10])
plot0.set_xticklabels(labels=["0","2","4","6","8","10"], size=15)
plot0.set_yticks([0,2,4,6,8,10])
plot0.set_yticklabels(labels=["0","2","4","6","8","10"], size=15)
plt.show()

Figure 2
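As an alternative to calling each set_* method separately, Matplotlib's Axes.set accepts the same properties as keyword arguments. This is only a sketch of an equivalent approach, not the lesson's method. Font sizes are not handled by Axes.set, so they are applied separately here.

```python
import numpy
import seaborn

x = numpy.array([5])
y = numpy.array([5])

plot0 = seaborn.scatterplot(x=x, y=y)

# Axes.set forwards each keyword to the matching set_* method
plot0.set(xlabel="x axis", ylabel="y axis", xlim=(0, 10), ylim=(0, 10),
          xticks=[0, 2, 4, 6, 8, 10], yticks=[0, 2, 4, 6, 8, 10])

# Font sizes are applied separately
plot0.xaxis.label.set_size(14)
plot0.yaxis.label.set_size(14)
plot0.tick_params(axis="both", labelsize=15)
```

The individual set_* calls remain easier to read when only one or two properties change.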

The plotting_context of a Seaborn plot contains parameters that determine scaling of plot elements (see https://seaborn.pydata.org/generated/seaborn.plotting_context.html). To view these parameters, do the following, which will return the plot scaling parameters as a dictionary.

print(seaborn.plotting_context())

{'font.size': 12.0, 'axes.labelsize': 12.0, 'axes.titlesize': 12.0, 'xtick.labelsize': 11.0, 'ytick.labelsize': 11.0, 'legend.fontsize': 11.0, 'legend.title_fontsize': 12.0, 'axes.linewidth': 1.25, 'grid.linewidth': 1.0, 'lines.linewidth': 1.5, 'lines.markersize': 6.0, 'patch.linewidth': 1.0, 'xtick.major.width': 1.25, 'ytick.major.width': 1.25, 'xtick.minor.width': 1.0, 'ytick.minor.width': 1.0, 'xtick.major.size': 6.0, 'ytick.major.size': 6.0, 'xtick.minor.size': 4.0, 'ytick.minor.size': 4.0}
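plotting_context also accepts the name of a preset context, which makes it possible to compare presets without changing any settings. A short sketch, assuming the preset names "notebook" and "talk" listed in Seaborn's documentation:

```python
import seaborn

# Parameters for the default "notebook" context and the larger "talk" context
notebook = seaborn.plotting_context("notebook")
talk = seaborn.plotting_context("talk")

# Print a few font-related entries side by side
for key in ["font.size", "axes.labelsize", "xtick.labelsize"]:
    print(key, notebook[key], talk[key])
```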

These parameters can be changed using the set_context function by providing a customized dictionary and assigning it to the rc argument.

help(seaborn.set_context)

Help on function set_context in module seaborn.rcmod:

set_context(context=None, font_scale=1, rc=None)
    Set the parameters that control the scaling of plot elements.

    This affects things like the size of the labels, lines, and other elements
    of the plot, but not the overall style. This is accomplished using the
    matplotlib rcParams system.

    The base context is "notebook", and the other contexts are "paper", "talk",
    and "poster", which are version of the notebook parameters scaled by different
    values. Font elements can also be scaled independently of (but relative to)
    the other values.

    See :func:`plotting_context` to get the parameter values.

    Parameters
    ----------
    context : dict, or one of {paper, notebook, talk, poster}
        A dictionary of parameters or the name of a preconfigured set.
    font_scale : float, optional
        Separate scaling factor to independently scale the size of the
        font elements.
    rc : dict, optional
        Parameter mappings to override the values in the preset seaborn
        context dictionaries. This only updates parameters that are
        considered part of the context definition.

To change the x and y axes tick label font size to 20, use seaborn.set_context(rc={'xtick.labelsize': 20, 'ytick.labelsize': 20}) prior to constructing a Seaborn plot.
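Written out as a script, this might look like the sketch below. It reuses the single-point data from earlier and prints the values from Matplotlib's rcParams to confirm that the change took effect.

```python
import matplotlib.pyplot as plt
import numpy
import seaborn

# Set tick label sizes before any plot is constructed
seaborn.set_context(rc={"xtick.labelsize": 20, "ytick.labelsize": 20})

x = numpy.array([5])
y = numpy.array([5])
plot0 = seaborn.scatterplot(x=x, y=y)

# The updated values are visible in Matplotlib's rcParams
print(plt.rcParams["xtick.labelsize"], plt.rcParams["ytick.labelsize"])
```

set_context changes global settings, so the new sizes also apply to plots created later in the same session.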

The code above can be modified to generate a more complex scatter plot that has more points. For instance, the inputs for x and y can be changed to numeric arrays of six elements each.

x=numpy.array([0,1,2,3,4,5])
y=numpy.multiply(2,x)
print("x is a numeric array composed of: ", x)
print("y is a numeric array composed of: ", y)

x is a numeric array composed of:  [0 1 2 3 4 5]
y is a numeric array composed of:  [ 0  2  4  6  8 10]

The code used to generate Figure 2 can then be run again with modifications to the x and y axes limits to generate the plot shown in Figure 3. To produce a line plot representation of Figure 3, simply change the plot type to lineplot (seaborn.lineplot).

plot0=seaborn.scatterplot(x=x, y=y)
plot0.set_xlabel("x axis", size=14)
plot0.set_ylabel("y axis", size=14)
plot0.set_xlim(0,6)
plot0.set_ylim(0,12)
plot0.set_xticks([0,2,4,6])
plot0.set_xticklabels(labels=["0","2","4","6"], size=15)
plot0.set_yticks([0,2,4,6,8,10,12])
plot0.set_yticklabels(labels=["0","2","4","6","8","10","12"], size=15)
plt.show()

Figure 3

Constructing biologically relevant plots

The next exercise is to practice creating a scatter plot from a biologically relevant dataset. The differential expression results from the HBR and UHR RNA sequencing study will be used to create a scatter plot depicting log2 fold change of gene expression on the x axis and negative log10 of the adjusted p-values on the y axis. This special case of a scatter plot is called a volcano plot.

Step one is to import the data using Pandas' read_csv command.

hbr_uhr_deg_chr22=pandas.read_csv("./hbr_uhr_deg_chr22_with_significance.csv")

Now, review the contents of this data table by doing the following.

hbr_uhr_deg_chr22.head(4)

    name    log2FoldChange  PAdj           -log10PAdj  significance
0   SYNGR1  -4.6            5.200000e-217  216.283997  down
1   SEPT3   -4.6            4.500000e-204  203.346787  down
2   YWHAH   -2.5            4.700000e-191  190.327902  down
3   RPL3     1.7            5.400000e-134  133.267606  up
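The -log10PAdj column is the negative base-10 logarithm of PAdj, which is what the volcano plot shows on the y axis. The sketch below reproduces that arithmetic with NumPy, using the PAdj values copied from the table above. The Series name padj is illustrative.

```python
import numpy
import pandas

# PAdj values copied from the first rows of the table above
padj = pandas.Series([5.2e-217, 4.5e-204, 4.7e-191, 5.4e-134],
                     index=["SYNGR1", "SEPT3", "YWHAH", "RPL3"])

# Smaller adjusted p-values give larger -log10 values
neg_log10_padj = -numpy.log10(padj)
print(neg_log10_padj)
```

This transformation places the most significant genes at the top of the plot.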

To create the volcano plot, provide the following arguments. See Figure 4 for result.

The data frame (i.e. hbr_uhr_deg_chr22)
What to plot on the x axis (i.e. "log2FoldChange")
What to plot on the y axis (i.e. "-log10PAdj")

plot1=seaborn.scatterplot(data=hbr_uhr_deg_chr22, x="log2FoldChange", y="-log10PAdj")
plt.show()

Figure 4

The volcano plot in Figure 4 does not distinguish the up, down, and non-significant genes. Fortunately, the hue option can be used to color the points by the significance column. See Figure 5.

plot1=seaborn.scatterplot(data=hbr_uhr_deg_chr22, x="log2FoldChange", y="-log10PAdj", hue="significance")
plt.show()

Figure 5
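The colors that hue assigns can be controlled with scatterplot's palette option, which accepts a dictionary mapping each label to a color. The sketch below uses a small stand-in data frame built from two rows of the lesson's tables. The names demo, colors, and plot_demo and the color choices are illustrative assumptions.

```python
import pandas
import seaborn

# Stand-in rows copied from the lesson's tables; the real file has more genes
demo = pandas.DataFrame({
    "name": ["SYNGR1", "XBP1"],
    "log2FoldChange": [-4.6, 2.8],
    "-log10PAdj": [216.283997, 89.136677],
    "significance": ["down", "up"],
})

# Map each significance label to a color of choice (colors are arbitrary)
colors = {"down": "blue", "up": "red"}
plot_demo = seaborn.scatterplot(data=demo, x="log2FoldChange", y="-log10PAdj",
                                hue="significance", palette=colors)
```

With the full dataset, the dictionary needs an entry for every label that occurs in the significance column, including whichever label marks non-significant genes.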

It would be informative to label some of the top significant differentially expressed genes in the volcano plot. To do this, import the file hbr_uhr_deg_chr22_top_genes.csv and assign it to the data frame hbr_uhr_deg_chr22_top_genes.

hbr_uhr_deg_chr22_top_genes=pandas.read_csv("./hbr_uhr_deg_chr22_top_genes.csv")

hbr_uhr_deg_chr22_top_genes

The table contains two of the top differentially expressed genes, one up-regulated (XBP1) and one down-regulated (SYNGR1). The task is to label the points corresponding to these two genes on the volcano plot. The values for log2FoldChange and -log10PAdj will serve as the x and y coordinates for placing each gene name.

    name    log2FoldChange      PAdj        -log10PAdj  significance
0   XBP1        2.8         7.300000e-90    89.136677   up
1   SYNGR1      -4.6        5.200000e-217   216.283997  down

To label these two genes, start by constructing the volcano plot from Figure 5. Then, use a for loop to iterate through the name column in the data frame hbr_uhr_deg_chr22_top_genes. In the for loop

i: the number that tracks the current row in the data frame hbr_uhr_deg_chr22_top_genes. It starts at 0 and is used to
- reference the x coordinate (the log2FoldChange value) in that row
- reference the y coordinate (the -log10PAdj value) in that row
enumerate: iterates through the name column in hbr_uhr_deg_chr22_top_genes and stores each name in the variable gene_name. i is incremented by one on each pass through the loop
plot1.text: writes gene_name on the plot at the x and y coordinates for that row

plot1=seaborn.scatterplot(data=hbr_uhr_deg_chr22, x="log2FoldChange", y="-log10PAdj", hue="significance")
for i, gene_name in enumerate(hbr_uhr_deg_chr22_top_genes["name"]):
    plot1.text(hbr_uhr_deg_chr22_top_genes["log2FoldChange"][i],
               hbr_uhr_deg_chr22_top_genes["-log10PAdj"][i], gene_name)
plt.show()

Figure 6
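The loop above looks up coordinates with [i], which matches row positions only when the data frame has the default 0, 1, 2, ... index. As an alternative sketch that does not depend on the index, zip can pair each name directly with its coordinates. The data frame top_genes below is a stand-in built from the top-gene table shown above, and plot_lab is an illustrative name.

```python
import pandas
import seaborn

# Stand-in for hbr_uhr_deg_chr22_top_genes, copied from the table above
top_genes = pandas.DataFrame({
    "name": ["XBP1", "SYNGR1"],
    "log2FoldChange": [2.8, -4.6],
    "-log10PAdj": [89.136677, 216.283997],
    "significance": ["up", "down"],
})

plot_lab = seaborn.scatterplot(data=top_genes, x="log2FoldChange",
                               y="-log10PAdj", hue="significance")

# zip pairs each name with its coordinates, so no row counter is needed
for gene_name, x_pos, y_pos in zip(top_genes["name"],
                                   top_genes["log2FoldChange"],
                                   top_genes["-log10PAdj"]):
    plot_lab.text(x_pos, y_pos, gene_name)
```

Both versions place the same labels. The zip form keeps working after the data frame has been filtered or sorted.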

The next visualization is the heatmap and dendrogram combination, which helps with visualizing clusters and patterns. Heatmaps and dendrograms can be used in RNA sequencing studies to inspect whether there are clusters of genes with similar expression patterns among treatment groups. The normalized counts for the top differentially expressed genes in the HBR and UHR study will be used to construct a heatmap/dendrogram using Seaborn's clustermap.

Import the data.

hbr_uhr_top_deg_normalized_counts=pandas.read_csv("./hbr_uhr_top_deg_normalized_counts.csv", index_col=0)

The seaborn.clustermap command below generates a clustermap of the top differentially expressed genes in the HBR and UHR study. The arguments and options are as follows.

Argument: The dataset (i.e. hbr_uhr_top_deg_normalized_counts)
Options:
- z_score=0: scale the rows by z-score
- cmap: specify color palette (i.e. viridis)
- figsize: specify figure size
- vmin: minimum value on the color scale bar
- vmax: maximum value on the color scale bar
- cbar_kws: dictionary whose key-value pair specifies the label of the color scale bar
- cbar_pos: coordinates for placement of the color scale bar

plot4=seaborn.clustermap(hbr_uhr_top_deg_normalized_counts,z_score=0,cmap="viridis",
                        figsize=(8,8),vmin=-1.5, vmax=1.5,cbar_kws=({"label": "z score"}),
                        cbar_pos=(0.855,0.8,0.025,0.15))

Figure 9: Expression heatmap of the top 12 differentially expressed genes in the HBR and UHR study
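To make the row scaling concrete, the sketch below computes a row-wise z-score by hand with pandas. The counts are made up for illustration and are not from the HBR and UHR data. The sample standard deviation (the pandas default) is used, which is an assumption about how the scaling is done.

```python
import pandas

# Made-up counts for illustration only (not from the HBR/UHR data)
counts = pandas.DataFrame(
    {"S1": [10.0, 200.0], "S2": [20.0, 100.0], "S3": [30.0, 300.0]},
    index=["geneA", "geneB"],
)

# Row-wise z-score: subtract each row's mean, divide by its standard deviation
row_mean = counts.mean(axis=1)
row_std = counts.std(axis=1)  # pandas default: sample standard deviation
z = counts.sub(row_mean, axis=0).div(row_std, axis=0)
print(z)
```

After scaling, every gene has a mean of 0 across samples, so genes with very different absolute counts share one color scale. Setting vmin and vmax to -1.5 and 1.5 then caps the color scale at those values.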

Below, a Pandas Series called samples is created. It maps each study sample to a color, with one color per treatment group.

samples=pandas.Series({"HBR_1":"orangered", "HBR_2":"orangered", "HBR_3":"orangered", "UHR_1":"blue", "UHR_2":"blue", "UHR_3":"blue"})

Then a variable, column_colors, is created that maps the column headings of hbr_uhr_top_deg_normalized_counts to the colors specified in samples. This is accomplished using the map command.

column_colors=hbr_uhr_top_deg_normalized_counts.columns.map(samples)
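The effect of map can be checked on its own. In the sketch below, columns is a stand-in for hbr_uhr_top_deg_normalized_counts.columns. It assumes that the column headings are the six sample names used in samples, in this order.

```python
import pandas

samples = pandas.Series({"HBR_1": "orangered", "HBR_2": "orangered",
                         "HBR_3": "orangered", "UHR_1": "blue",
                         "UHR_2": "blue", "UHR_3": "blue"})

# Stand-in for hbr_uhr_top_deg_normalized_counts.columns (assumed order)
columns = pandas.Index(["HBR_1", "HBR_2", "HBR_3", "UHR_1", "UHR_2", "UHR_3"])

# Each column name is looked up in samples and replaced by its color
column_colors = columns.map(samples)
print(list(column_colors))
```

The result has one color per column, in column order, which is the form that col_colors expects.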

The option col_colors, which is set to column_colors, is added to display a color bar at the top of the heatmap that helps to distinguish treatment groups (i.e. HBR or UHR).

Other options added include

ax_heatmap.set_xticklabels: allows for customizing the font size and rotation of the x axis labels. This requires using ax_heatmap.get_xmajorticklabels() to get the x axis tick labels
ax_cbar.tick_params: sets the size of the color scale bar labels
ax_col_colors.set_title: sets the title, and its position, for the bar that displays the treatment group to color mapping

plot4=seaborn.clustermap(hbr_uhr_top_deg_normalized_counts,z_score=0,cmap="viridis",
                        figsize=(8,8),vmin=-1.5, vmax=1.5,cbar_kws=({"label": "z score"}),
                        col_colors=column_colors, cbar_pos=(0.855,0.8,0.025,0.15))
plot4.ax_heatmap.set_xticklabels(plot4.ax_heatmap.get_xmajorticklabels(),fontsize=12,rotation=90)
plot4.ax_cbar.tick_params(labelsize=12)
plot4.ax_col_colors.set_title("treatment",x=-0.1,y=0.01)
plt.show()

Figure 10: Expression heatmap of the top 12 differentially expressed genes in the HBR and UHR study with treatment group annotations.