Lesson 4: Data visualization using Python
Learning objectives
This lesson will provide participants with enough knowledge to start using Python for data visualization. Specifically, participants should
- Be able to use the package Seaborn to
- Construct plots that range from very basic to elegant as well as biologically relevant
- Customize plots including altering font size and adding custom annotations
Python data visualization tools
Seaborn, the tool introduced in this lesson, is a popular Python plotting package. It builds on Matplotlib and is oriented towards statistical data visualization. Other packages exist as well, including ones that are domain specific, implement a grammar of graphics, or are used for creating web-based visualization dashboards. A non-exhaustive list of Python plotting packages is shown below.
- Matplotlib
- Plotnine: implements grammar of graphics for those familiar with R's ggplot2
- bioinfokit: genomic data visualization
- pygenomeviz: visualize comparative genomics data
- Dash bio: create interactive data visualizations and web dashboards
Visualization using Seaborn
Load packages
import pandas
import numpy
import matplotlib.pyplot as plt
import seaborn
Modify the basic plot elements with Seaborn.
To plot using Seaborn, start the command with seaborn followed by the plot type, separated by a period.
seaborn.plot_type
This section will use Seaborn's scatterplot to explore how to work with and modify basic elements of plotting. The foundations learned in this section form the basis for creating advanced and elegant plots.
The data that will be plotted is a point located at 5 on the x axis and 5 on the y axis. numpy.array is used to generate x and y. Here, x and y are single element arrays that store the number 5.
x=numpy.array([5])
y=numpy.array([5])
Plot x and y using Seaborn's scatterplot function (see Figure 1 for results), which takes data frames or NumPy arrays as input. Here, x will be plotted on the x axis, and y will be plotted on the y axis. The plot can be stored as a variable, which in this example is plot0.
plot0=seaborn.scatterplot(x=x, y=y)
plt.show()
Figure 1
The plot in Figure 1 has no axis labels. Axis labels are an integral part of an informative data visualization. It might also be useful to include meaningful x and y limits. To do this, call the various set_* methods on the plot. See Figure 2 for the result.
- set_xlabel: specifies the x axis label (size is used to set the label font size)
- set_ylabel: specifies the y axis label (size is used to set the label font size)
- set_xlim: sets the x axis limits
- set_ylim: sets the y axis limits
- set_xticks: sets the location of x axis tick marks
- set_xticklabels: sets the x axis tick mark labels (size is used to set the tick mark label font size)
- set_yticks: sets the location of y axis tick marks
- set_yticklabels: sets the y axis tick mark labels (size is used to set the tick mark label font size)
plot0=seaborn.scatterplot(x=x, y=y)
plot0.set_xlabel("x axis", size=14)
plot0.set_ylabel("y axis", size=14)
plot0.set_xlim(0,10)
plot0.set_ylim(0,10)
plot0.set_xticks([0,2,4,6,8,10])
plot0.set_xticklabels(labels=["0","2","4","6","8","10"], size=15)
plot0.set_yticks([0,2,4,6,8,10])
plot0.set_yticklabels(labels=["0","2","4","6","8","10"], size=15)
plt.show()
Figure 2
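The set_* calls work because seaborn.scatterplot returns a Matplotlib Axes object. As a minimal sketch, the same labels, limits, and ticks can also be passed in a single call to the Axes set method, with tick_params handling the tick label font size. The Agg backend line is an assumption made so the sketch runs without a display; it is not part of the lesson's code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy
import seaborn

x = numpy.array([5])
y = numpy.array([5])

plt.figure()  # start from a fresh figure
plot0 = seaborn.scatterplot(x=x, y=y)

# scatterplot hands back the Matplotlib Axes it drew on
print(type(plot0))

# Axes.set accepts the same properties as the individual set_* methods
plot0.set(xlabel="x axis", ylabel="y axis",
          xlim=(0, 10), ylim=(0, 10),
          xticks=[0, 2, 4, 6, 8, 10], yticks=[0, 2, 4, 6, 8, 10])

# tick_params sets the tick label font size on both axes
plot0.tick_params(labelsize=15)
```

Calling plt.show() afterwards displays the figure in an interactive session. The individual set_* methods remain the better choice when the label font size needs to be set in the same call as the label.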
Seaborn's plotting_context function returns the parameters that determine the scaling of plot elements (see https://seaborn.pydata.org/generated/seaborn.plotting_context.html). To view these parameters, run the following, which prints them as a dictionary.
print(seaborn.plotting_context())
{'font.size': 12.0, 'axes.labelsize': 12.0, 'axes.titlesize': 12.0, 'xtick.labelsize': 11.0, 'ytick.labelsize': 11.0, 'legend.fontsize': 11.0, 'legend.title_fontsize': 12.0, 'axes.linewidth': 1.25, 'grid.linewidth': 1.0, 'lines.linewidth': 1.5, 'lines.markersize': 6.0, 'patch.linewidth': 1.0, 'xtick.major.width': 1.25, 'ytick.major.width': 1.25, 'xtick.minor.width': 1.0, 'ytick.minor.width': 1.0, 'xtick.major.size': 6.0, 'ytick.major.size': 6.0, 'xtick.minor.size': 4.0, 'ytick.minor.size': 4.0}
These parameters can be changed using the set_context function by providing a customized dictionary and assigning it to the rc argument.
help(seaborn.set_context)
Help on function set_context in module seaborn.rcmod:
set_context(context=None, font_scale=1, rc=None)
Set the parameters that control the scaling of plot elements.
This affects things like the size of the labels, lines, and other elements
of the plot, but not the overall style. This is accomplished using the
matplotlib rcParams system.
The base context is "notebook", and the other contexts are "paper", "talk",
and "poster", which are version of the notebook parameters scaled by different
values. Font elements can also be scaled independently of (but relative to)
the other values.
See :func:`plotting_context` to get the parameter values.
Parameters
----------
context : dict, or one of {paper, notebook, talk, poster}
A dictionary of parameters or the name of a preconfigured set.
font_scale : float, optional
Separate scaling factor to independently scale the size of the
font elements.
rc : dict, optional
Parameter mappings to override the values in the preset seaborn
context dictionaries. This only updates parameters that are
considered part of the context definition.
To change the x and y axes tick label font size to 20, use seaborn.set_context(rc={'xtick.labelsize': 20, 'ytick.labelsize': 20}) prior to constructing a Seaborn plot.
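The effect of set_context can be checked directly. The following small sketch overrides the two tick label sizes and reads them back with plotting_context. The final switch to the preset "notebook" context is an addition here, included so that later plots are not affected by the override.

```python
import seaborn

# Override only the tick label sizes; other context parameters are untouched
seaborn.set_context(rc={"xtick.labelsize": 20, "ytick.labelsize": 20})

# plotting_context returns a snapshot of the current scaling parameters
context = seaborn.plotting_context()
print(context["xtick.labelsize"], context["ytick.labelsize"])

# Any Seaborn plot constructed at this point uses the larger tick labels.
# Switch to the preset "notebook" context when the override is no longer wanted.
seaborn.set_context("notebook")
```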
The code above can be modified to generate a more complex scatter plot that has more points. For instance, the inputs for x and y can be changed to numeric arrays of six elements each.
x=numpy.array([0,1,2,3,4,5])
y=numpy.multiply(2,x)
print("x is a numeric array composed of: ", x)
print("y is a numeric array composed of: ", y)
x is a numeric array composed of: [0 1 2 3 4 5]
y is a numeric array composed of: [ 0 2 4 6 8 10]
The code used to generate Figure 2 can then be run again with modifications to the x and y axes limits to generate the plot shown in Figure 3. To produce a line plot representation of Figure 3, simply change the plot type to lineplot (seaborn.lineplot).
plot0=seaborn.scatterplot(x=x, y=y)
plot0.set_xlabel("x axis", size=14)
plot0.set_ylabel("y axis", size=14)
plot0.set_xlim(0,6)
plot0.set_ylim(0,12)
plot0.set_xticks([0,2,4,6])
plot0.set_xticklabels(labels=["0","2","4","6"], size=15)
plot0.set_yticks([0,2,4,6,8,10,12])
plot0.set_yticklabels(labels=["0","2","4","6","8","10","12"], size=15)
plt.show()
Figure 3
Constructing biologically relevant plots
The next exercise is to practice creating a scatter plot on a biologically relevant dataset. Namely, the differential expression results from the hbr and uhr RNA sequencing study will be used to create a scatter plot depicting log2 fold change of gene expression on the x axis and negative log10 of the adjusted p-values on the y axis. This special case of scatter plot is called a volcano plot.
Step one is to import the data using Pandas' read_csv function.
hbr_uhr_deg_chr22=pandas.read_csv("./hbr_uhr_deg_chr22_with_significance.csv")
Now, review the contents of this data table by doing the following.
hbr_uhr_deg_chr22.head(4)
name log2FoldChange PAdj -log10PAdj significance
0 SYNGR1 -4.6 5.200000e-217 216.283997 down
1 SEPT3 -4.6 4.500000e-204 203.346787 down
2 YWHAH -2.5 4.700000e-191 190.327902 down
3 RPL3 1.7 5.400000e-134 133.267606 down
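The -log10PAdj column is the negative log10 of the adjusted p-value, as described earlier. The short check below recomputes it from PAdj for the first two rows shown. Whether the column in the file was produced exactly this way is an assumption, but the recomputed values agree with the table.

```python
import numpy
import pandas

# Adjusted p-values copied from the first two rows of the table above
deg = pandas.DataFrame({"name": ["SYNGR1", "SEPT3"],
                        "PAdj": [5.2e-217, 4.5e-204]})

# Negative base-10 logarithm: smaller p-values give larger values,
# which is why the most significant genes sit at the top of a volcano plot
deg["-log10PAdj"] = -numpy.log10(deg["PAdj"])
print(deg)
```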
To create the volcano plot, provide the following arguments. See Figure 4 for result.
- The data frame (i.e. hbr_uhr_deg_chr22)
- What to plot on the x axis (i.e. "log2FoldChange")
- What to plot on the y axis (i.e. "-log10PAdj")
plot1=seaborn.scatterplot(hbr_uhr_deg_chr22,x="log2FoldChange", y="-log10PAdj")
Figure 4
The volcano plot in Figure 4 does not help with visualizing the up, down, and non-significant genes. Fortunately, the hue option can be used to distinguish these. See Figure 5.
plot1=seaborn.scatterplot(hbr_uhr_deg_chr22,x="log2FoldChange", y="-log10PAdj", hue="significance")
Figure 5
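hue can be tried out on a small data frame before running it on the full table. The sketch below builds a four-row data frame from rows printed elsewhere in this lesson (it is not the full CSV) and additionally passes a palette dictionary to pick the color for each significance level. The specific colors are arbitrary choices, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas
import seaborn

# A few rows copied from the tables in this lesson, not the complete dataset
deg = pandas.DataFrame({
    "name": ["SYNGR1", "SEPT3", "YWHAH", "XBP1"],
    "log2FoldChange": [-4.6, -4.6, -2.5, 2.8],
    "-log10PAdj": [216.283997, 203.346787, 190.327902, 89.136677],
    "significance": ["down", "down", "down", "up"],
})

plt.figure()  # start from a fresh figure
# hue colors each point by its significance value;
# palette maps each value to a color (arbitrary colors chosen here)
plot1 = seaborn.scatterplot(data=deg, x="log2FoldChange", y="-log10PAdj",
                            hue="significance",
                            palette={"down": "blue", "up": "red"})

# The legend lists one entry per significance level
legend_labels = [t.get_text() for t in plot1.get_legend().get_texts()]
print(legend_labels)
```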
It would be informative to label some of the top significant differentially expressed genes in the volcano plot. To do this, import the file hbr_uhr_deg_chr22_top_genes.csv and assign it to the data frame hbr_uhr_deg_chr22_top_genes.
hbr_uhr_deg_chr22_top_genes=pandas.read_csv("./hbr_uhr_deg_chr22_top_genes.csv")
hbr_uhr_deg_chr22_top_genes
The table contains the top two differentially expressed genes according to the adjusted p-value (PAdj). The task is to label the points corresponding to these two genes on the volcano plot. The values for log2FoldChange and -log10PAdj will serve as the x and y coordinates for plotting the gene name.
name log2FoldChange PAdj -log10PAdj significance
0 XBP1 2.8 7.300000e-90 89.136677 up
1 SYNGR1 -4.6 5.200000e-217 216.283997 down
To label the two top differentially expressed genes, start by constructing the volcano plot from Figure 5. Then, use a for loop to iterate through the name column in the data frame hbr_uhr_deg_chr22_top_genes. In the for loop
- i: the number that keeps track of the row number in the data frame hbr_uhr_deg_chr22_top_genes and is used to
  - reference the x coordinate or log2FoldChange value in that row
  - reference the y coordinate or -log10PAdj value in that row
- enumerate: iterates through the name column in hbr_uhr_deg_chr22_top_genes and stores each name in the variable gene_name. i is incremented on each pass through the for loop
plot1=seaborn.scatterplot(hbr_uhr_deg_chr22,x="log2FoldChange", y="-log10PAdj", hue="significance")
for i, gene_name in enumerate(hbr_uhr_deg_chr22_top_genes["name"]):
    plot1.text(hbr_uhr_deg_chr22_top_genes["log2FoldChange"][i],
               hbr_uhr_deg_chr22_top_genes["-log10PAdj"][i], gene_name)
Figure 6
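To see what the loop above works with on each pass, the sketch below rebuilds the two-row top genes table by hand (values copied from the table shown earlier) and prints the index, gene name, and coordinates that would be handed to plot1.text. The names top_genes and labels are introduced here for illustration only.

```python
import pandas

# The two rows of the top genes table, typed in by hand
top_genes = pandas.DataFrame({
    "name": ["XBP1", "SYNGR1"],
    "log2FoldChange": [2.8, -4.6],
    "-log10PAdj": [89.136677, 216.283997],
})

labels = []
# enumerate yields (0, "XBP1") and then (1, "SYNGR1")
for i, gene_name in enumerate(top_genes["name"]):
    x_coord = top_genes["log2FoldChange"][i]  # x position of the label
    y_coord = top_genes["-log10PAdj"][i]      # y position of the label
    labels.append((i, gene_name, x_coord, y_coord))
    print(i, gene_name, x_coord, y_coord)
```

Because [i] looks values up by index label, this pattern relies on the default 0, 1, ... index that read_csv produces. If the data frame had been filtered or re-indexed, .iloc[i] would select by position instead.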
The next visualization is the heatmap and dendrogram combination, which helps with visualizing clusters and patterns. In RNA sequencing studies, a heatmap with dendrograms can be used to inspect whether there are clusters of genes with similar expression patterns among treatment groups. The normalized counts for the top differentially expressed genes in the hbr and uhr study will be used to construct a heatmap/dendrogram using Seaborn's clustermap.
Import the data.
hbr_uhr_top_deg_normalized_counts=pandas.read_csv("./hbr_uhr_top_deg_normalized_counts.csv", index_col=0)
The seaborn.clustermap command below generates a clustermap of the top differentially expressed genes in the hbr and uhr study. The arguments and options are as follows.
- Argument: The dataset (i.e. hbr_uhr_top_deg_normalized_counts)
- Options:
  - z_score=0: scale the rows by z-score
  - cmap: specify color palette (i.e. viridis)
  - figsize: specify figure size
  - vmin: minimum value on the color scale bar
  - vmax: maximum value on the color scale bar
  - cbar_kws: dictionary containing a key value pair that specifies the title of the color scale bar
  - cbar_pos: coordinates for placement of the color scale bar
plot4=seaborn.clustermap(hbr_uhr_top_deg_normalized_counts,z_score=0,cmap="viridis",
figsize=(8,8),vmin=-1.5, vmax=1.5,cbar_kws=({"label": "z score"}),
cbar_pos=(0.855,0.8,0.025,0.15))
Figure 9: Expression heatmap of the top 12 differentially expressed genes in the HBR and UHR study
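With z_score=0, each row (gene) is rescaled so that its values are comparable across genes regardless of absolute expression level. A rough sketch of that kind of row-wise scaling in plain pandas follows. The counts and gene names are made up for illustration, and Seaborn's internal computation may differ in detail.

```python
import pandas

# Hypothetical normalized counts for two genes across four samples
counts = pandas.DataFrame(
    {"HBR_1": [10.0, 200.0], "HBR_2": [12.0, 220.0],
     "UHR_1": [30.0, 100.0], "UHR_2": [28.0, 80.0]},
    index=["gene_a", "gene_b"],
)

# Row-wise z-score: subtract each gene's mean, then divide by its standard deviation
row_mean = counts.mean(axis=1)
row_std = counts.std(axis=1)  # pandas default: sample standard deviation
z_scores = counts.sub(row_mean, axis=0).div(row_std, axis=0)
print(z_scores.round(2))
```

After scaling, every row has mean 0 and standard deviation 1, so a highly expressed gene no longer dominates the color scale. This is also why vmin and vmax are given in z-score units (-1.5 and 1.5) in the clustermap call.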
Below, a Pandas Series called samples is created that maps each study sample to a color.
samples=pandas.Series({"HBR_1":"orangered", "HBR_2":"orangered", "HBR_3":"orangered", "UHR_1":"blue", "UHR_2":"blue", "UHR_3":"blue"})
Then a variable called column_colors is created that maps the column headings of hbr_uhr_top_deg_normalized_counts to the colors specified in samples. This is accomplished using the map command.
column_colors=hbr_uhr_top_deg_normalized_counts.columns.map(samples)
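To illustrate what map returns, the sketch below applies it to a stand-in table whose column headings are deliberately in a different order from samples. The empty counts data frame is a hypothetical placeholder; only its column headings matter here.

```python
import pandas

samples = pandas.Series({"HBR_1": "orangered", "HBR_2": "orangered", "HBR_3": "orangered",
                         "UHR_1": "blue", "UHR_2": "blue", "UHR_3": "blue"})

# Stand-in for the counts table (hypothetical): headings only, shuffled order
counts = pandas.DataFrame(columns=["UHR_1", "HBR_1", "UHR_2", "HBR_2", "UHR_3", "HBR_3"])

# map looks up each column heading in samples and returns the matching color
column_colors = counts.columns.map(samples)
print(list(column_colors))
```

The result follows the order of the column headings, not the order in which samples was written, so each column gets the color of its own treatment group.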
The option col_colors, which is set to column_colors, is added to display a color bar on the top of the heatmap that helps to distinguish treatment groups (i.e. hbr or uhr).
Other options added include
- ax_heatmap.set_xticklabels: allows for customizing the x axis labels' font size and rotation. This requires using ax_heatmap.get_xmajorticklabels() to get the x axis tick labels
- ax_cbar.tick_params: sets the size for the color scale bar labels
- ax_col_colors.set_title: sets the title, and its location, for the bar displaying the treatment group to color mapping
plot4=seaborn.clustermap(hbr_uhr_top_deg_normalized_counts,z_score=0,cmap="viridis",
figsize=(8,8),vmin=-1.5, vmax=1.5,cbar_kws=({"label": "z score"}),
col_colors=column_colors, cbar_pos=(0.855,0.8,0.025,0.15))
plot4.ax_heatmap.set_xticklabels(plot4.ax_heatmap.get_xmajorticklabels(),fontsize=12,rotation=90)
plot4.ax_cbar.tick_params(labelsize=12)
plot4.ax_col_colors.set_title("treatment",x=-0.1,y=0.01)
plt.show()
Figure 10: Expression heatmap of the top 12 differentially expressed genes in the HBR and UHR study with treatment group annotations.