Lesson 3: Data wrangling using Python

Learning objectives

After this lesson, participants will

Be able to import tabular data into Python using Pandas
Be able to explore and modify tabular data through various data wrangling approaches, including
- retrieving dimensions
- subsetting
- obtaining column statistics
- replacing column names
- performing mathematical operations
- filtering
- removing and adding columns

Importing tabular data using Pandas

Pandas is a popular Python package used to work with tabular data.

To work with Pandas, first activate it using the import command.

import pandas

Sometimes the name of the package is long, so users might want to shorten it by creating an alias. The alias "pd" is often used for the Pandas package. To add an alias, just append as followed by the user defined alias to the package import command. If importing a package using an alias, then the package needes to be called using the assigned alias. For instance, if pd was used to import pandas, then use pd.read_csv to import a csv file.

import pandas as pd

This exercise will use the read_csv function of Pandas to import a comma separated value (csv) file called hbr_uhr_chr22_rna_seq_counts.csv, which contains RNA sequencing gene expression counts from the Human Brain Reference (hbr) and Universal Human Reference (uhr) study. This data will be stored as the variable hbr_uhr_chr22_counts.

hbr_uhr_chr22_counts=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv")

Take a look at the first few rows of hbr_uhr_chr22_counts by appending the head attribute to hbr_uhr_chr22_counts.

hbr_uhr_chr22_counts.head()

Figure 1: The first five rows of hbr_uhr_chr22_counts. The first column contains genes and the subsequent columns contain gene expression counts for each of the samples. The left most column of this data frame contains the row indices or names.

Because hbr_uhr_chr22_counts is a Pandas data frame (type(hbr_uhr_chr22_counts), see lesson 2), it is possible to append one of the many Pandas commands to it. For instance, the head function was appended to display the first five rows of hbr_uhr_chr22_counts. The data frame name and function is separated by a period. This is perhaps one of the most appealing aspects of Python syntax. Note that the head function was followed by (). If the parentheses are blank, then the default first five lines will be shown. To view the first 10 rows of hbr_uhr_chr22_counts do the following.

hbr_uhr_chr22_counts.head(10)

Figure 2: Include an integer inside the parentheses of pandas.dataframe.head() function to view the specified number of lines in a tabular dataset.

The function tail can be used to view by default the bottom five lines of a tabular dataset. Similar to head, the number of lines shown can be customized by specifying an integer inside the parentheses.

hbr_uhr_chr22_counts.tail()

Get dimensions of a data frame

Pandas data frames have a function shape that informs of the number of rows and number of columns in a data frame (in other words the dimensions). To get the dimensions for hbr_uhr_chr22_counts, do the following

hbr_uhr_chr22_counts.shape

The hbr_uhr_chr22_counts data frame has 1335 rows and 7 columns.

(1335, 7)

Note

The elements in tabular data can be referred to by their row and column positions.

The size function returns the number elements in a data frame. For instance, hbr_uhr_chr22_counts has 1335 rows and 7 columns, which means that it has 1335 times 7 elements (or 9345).

Row indices/names

Figure 2 shows the first 10 rows of hbr_uhr_chr22_counts. The left most column, which contains labels starting with "0" is referred to as the row indices or row names. Users can specify a column in the dataset as the row indices or row names using the index_col options in read_csv. For instance, the hbr_uhr_chr22_rna_seq_counts.csv dataset could be imported with gene names as the row indices. To do this, add the index_col=0 option to read_csv. Gene names in hbr_uhr_chr22_rna_seq_counts.csv is the first column and is denoted as column "0" in Python. Thus, setting index_col=0 ensures that the gene names will be set as the row indices or row names (see Figure 3).

hbr_uhr_chr22_counts_1=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv", index_col=0)

Figure 3. The index_col=0 option in pandas.read_csv sets the gene names as row names in the imported data frame.

Data wrangling

Subsetting

The command below will subset the expression counts for the RABL2B gene.

hbr_uhr_chr22_counts[hbr_uhr_chr22_counts["Geneid"]=="RABL2B"]

        Geneid  HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
1334    RABL2B      74          62          54          68          50          47

The "|" symbol can be used as the "or" operator so to also subset the counts for RPL23AP82

hbr_uhr_chr22_counts[(hbr_uhr_chr22_counts["Geneid"]=="RABL2B") | (hbr_uhr_chr22_counts["Geneid"]=="RPL23AP82")]

        Geneid      HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
1333    RPL23AP82       41          59          54          32          23          34
1334    RABL2B          74          62          54          68          50          47

Alternatively, use the isin function and provide a list of genes to retrieve.

hbr_uhr_chr22_counts[hbr_uhr_chr22_counts["Geneid"].isin(["RABL2B", "RPL23AP82"])]

Use "." to reference a column.

hbr_uhr_chr22_counts[hbr_uhr_chr22_counts.Geneid=="RABL2B"]

Subsetting by integer positions

Given that the elements in a data frame are referenced by its row and column positions, what would be the approach for extracting the element in row 60 and column 5? The solution is the command below, which returns a result of 2. The row and column numbers are enclosed in "[]" and separated by a comma.

hbr_uhr_chr22_counts.iloc[60,5]

The above method for subsetting the element in row 60 and column 5 of hbr_uhr_chr22_counts is great if the goal is to extract the value and do numeric operation on it. But what if the user wants to return the element along with the corresponding gene in data frame format?

To do this, enclose the row and column indices to extract in their own inner set of square brackets as shown below. Column 0, which contains the gene name is also included in the brackets containing the column indices of interest.

hbr_uhr_chr22_counts.iloc[[60],[0,5]]

    Geneid  UHR_2.bam
60  CCT8L2      2

Pandas offers different approaches for subsetting rectangular data. One method is iloc.

iloc is a "purely integer-location based indexing for selection by position" -- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#. The row and column positions are enclosed in "[]".

iloc allows for retrieval of elements in multiple rows and columns. For instance, the following can be used to retrieve the elements in rows 60 and 65 and columns 0, 4, 5, and 6 in hbr_uhr_chr22_counts. Note that the row and column positions are enclosed in an outer set of "[]". Within the outer set of "[]" the first set of "[]" enclose a comma separated list of row positions while the second set of "[]" enclose a comma separated list of column positions.

hbr_uhr_chr22_counts.iloc[[60,65],[0,4,5,6]]

    Geneid      UHR_1.bam   UHR_2.bam   UHR_3.bam
60  CCT8L2          1           2           0
65  SLC25A15P5      2           2           4

To get the first three rows of hbr_uhr_chr22_counts do the following. Note that it retrieves the rows with indices 0, 1, and 2.

hbr_uhr_chr22_counts.iloc[:3]

    Geneid  HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
0   U2          0           0           0           0           0           0
1   CU459211.1  0           0           0           0           0           0
2   CU104787.1  0           0           0           0           0           0

What will be the output for hbr_uhr_chr22_counts.iloc[[3],:]?

Solution

The row with an index of 3 will be retrieved.

    Geneid  HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
3   BAGE5       0           0           0           0           0           0

Subsetting using column names

Panda's loc function allows for subsetting by row or column names. For instance, to retrieve the gene id column, do the following. The ":" denotes get every row.

hbr_uhr_chr22_counts.loc[:,['Geneid']]

        Geneid
0       U2
1       CU459211.1
2       CU104787.1
3       BAGE5
4       ACTR3BP6
...     ...
1330    ACR
1331    AC002056.5
1332    AC002056.3
1333    RPL23AP82
1334    RABL2B

To retrieve the counts for the gene SLC25A15P5, use the following where SLC25A15P5 is the subsetting criteria, where

hbr_uhr_chr22_counts.loc[:,'Geneid'] extracts the Geneid column.
=="SLC25A15P5" will filter out the row with the SLC25A15P5 gene.

hbr_uhr_chr22_counts[hbr_uhr_chr22_counts.loc[:,'Geneid']=="SLC25A15P5"]

    Geneid      HBR_1.bam    HBR_2.bam  HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
65  SLC25A15P5      0           0           0           2           2           4

To retrieve counts for more than one gene, enclose the genes of interest in a list and use the isin function to filter out the rows containing the genes in the list.

hbr_uhr_chr22_counts[hbr_uhr_chr22_counts.loc[:,'Geneid'].isin(["SLC25A15P5", "CCT8L2"])]

    Geneid      HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
60  CCT8L2          0           0           0           1           2           0
65  SLC25A15P5      0           0           0           2           2           4

To find all of the SLC genes in hbr_uhr_chr22_counts, the following could be used where str.startswith searches for text that starts a pattern (ie. "SLC"). Other options for pattern matching include str.endwith and str.contains.

hbr_uhr_chr22_counts.loc[hbr_uhr_chr22_counts.loc[:,'Geneid'].str.startswith("SLC")]

        Geneid      HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
54      SLC9B1P4        0           0           0           0           1           0
65      SLC25A15P5      0           0           0           2           2           4
109     SLC25A18        100         111         74          6           8           7
181     SLC25A1         32          50          41          226         138         216
249     SLC9A3P2        0           0           0           0           0           2
268     SLC7A4          19          25          14          9           4           3
494     SLC2A11         54          63          46          28          34          27
726     SLC35E4         18          32          26          21          12          13
783     SLC5A1          0           0           0           0           6           0
795     SLC5A4          7           12          5           13          9           4
955     SLC16A8         9           13          11          11          5           6
1046    SLC25A17        39          39          40          119         96          116
1099    SLC25A5P1       0           0           1           0           1           0

Summary statistics of data frames

hbr_uhr_chr22_counts.describe()

        HBR_1.bam   HBR_2.bam   HBR_3.bam   UHR_1.bam   UHR_2.bam   UHR_3.bam
count   1335.000000 1335.000000 1335.000000 1335.000000 1335.000000 1335.000000
mean    29.530337   36.264419   32.084644   50.694382   33.419476   40.334831
std     99.177874   120.617793  108.237694  197.575081  122.598310  154.455918
min     0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%     0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
50%     0.000000    0.000000    0.000000    1.000000    1.000000    1.000000
75%     8.000000    10.000000   9.000000    13.000000   12.000000   11.000000
max     1532.000000 1797.000000 1637.000000 4027.000000 2406.000000 3047.000000

Replacing column names

To view the column headings of a data frame use the column function. For instance,

hbr_uhr_chr22_counts.columns

HBR_1.bam   
HBR_2.bam   
HBR_3.bam   
UHR_1.bam   
UHR_2.bam   
UHR_3.bam

The str.replace function can be used to replace a string with something else. Here, it used to remove ".bam" from the sample names in the column heading.

hbr_uhr_chr22_counts.columns=hbr_uhr_chr22_counts.columns.str.replace(".bam", "")

Mathematical operations on data frames and filtering

Pandas enables mathematical operations on data frames. For instance, one might want to sum the total counts across all samples for each gene. The sum function can be used to this. Setting axis=1 will sum up the counts for each row or gene. Because the Geneid column is a string, it is necessary to first subset only the sample columns.

hbr_uhr_chr22_counts.loc[:, ['HBR_1', 'HBR_2', 'HBR_3', 'UHR_1', 'UHR_2', 'UHR_3']].sum(axis=1)

Below, genes with zero counts across all samples are removed from hbr_uhr_chr22_counts and stored as hbr_uhr_chr22_counts_filtered. To accomplish this set hbr_uhr_chr22_counts.loc[:, ['HBR_1', 'HBR_2', 'HBR_3', 'UHR_1', 'UHR_2', 'UHR_3']].sum(axis=1) !=0 and use as a filter criteria.

hbr_uhr_chr22_counts_filtered=hbr_uhr_chr22_counts.loc[hbr_uhr_chr22_counts.loc[:, ['HBR_1', 'HBR_2', 'HBR_3', 'UHR_1', 'UHR_2', 'UHR_3']].sum(axis=1)!=0]

Removing and adding columns to a data frame

For this exercise, stay in the /data/username/pies_2023 folder, which should be the present working directory (use pwd to check). If not in the /data/username/pies_2023 folder, change into it. Copy the hbr_uhr_deg_chr22.csv and hcc1395_deg_chr22.csv files from /data/classes/BTEP/pies_2023_data to the /data/username/pies_2023 directory.

cp /data/classes/BTEP/pies_2023_data/hbr_uhr_deg_chr22.csv .

cp /data/classes/BTEP/pies_2023_data/hcc1395_deg_chr22.csv .

The file hcc1395_deg_chr22.csv will be needed for the practice questions.

This exercise will use the differential gene expression analysis table from the hbr and uhr study.

hbr_uhr_deg_chr22=pandas.read_csv("./hbr_uhr_deg_chr22.csv")

The info() function will retrieve information regarding the hbr_uhr_deg_chr22 data frame, which includes the column names.

hbr_uhr_deg_chr22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1335 entries, 0 to 1334
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            1335 non-null   object 
 1   baseMean        1335 non-null   float64
 2   baseMeanA       1335 non-null   float64
 3   baseMeanB       1335 non-null   float64
 4   foldChange      971 non-null    float64
 5   log2FoldChange  971 non-null    float64
 6   lfcSE           971 non-null    float64
 7   stat            971 non-null    float64
 8   PValue          971 non-null    float64
 9   PAdj            971 non-null    float64
 10  FDR             639 non-null    float64
 11  falsePos        639 non-null    float64
 12  HBR_1.bam       1335 non-null   float64
 13  HBR_2.bam       1335 non-null   float64
 14  HBR_3.bam       1335 non-null   float64
 15  UHR_1.bam       1335 non-null   float64
 16  UHR_2.bam       1335 non-null   float64
 17  UHR_3.bam       1335 non-null   float64
dtypes: float64(17), object(1)
memory usage: 187.9+ KB

The hbr_uhr_deg_chr22 table contains differential gene expression analysis results. Relevant columns include

name: gene names
log2FoldChange: the gene expression change between the two treatment groups
PAdj: the adjusted p-value associated with statistical confidence of the expression change
The columns labeled with the sample names (ie. columns 12 through 17) are the normalized gene expression counts

Use str.replace to remove ".bam" from the sample names in columns 12 through 17.

hbr_uhr_deg_chr22.columns=hbr_uhr_deg_chr22.columns.str.replace(".bam", "")

To drop columns in a Pandas data frame, use the .drop function and specify the name(s) of the column(s) to remove. The example below removes columns baseMean, baseMeanA,and baseMeanB

hbr_uhr_deg_chr22.drop(columns=["baseMean","baseMeanA", "baseMeanB"])

Subset the name, log2FoldChange, and PAdj columns in hbr_uhr_deg_chr22 and save to a new data frame hbr_uhr_deg_chr22_1.

hbr_uhr_deg_chr22_1=hbr_uhr_deg_chr22.loc[:,["name", "log2FoldChange", "PAdj"]]

hbr_uhr_deg_chr22_1.head()

    name    log2FoldChange      PAdj        
0   SYNGR1      -4.6        5.200000e-217
1   SEPT3       -4.6        4.500000e-204   
2   YWHAH       -2.5        4.700000e-191   
3   RPL3        1.7         5.400000e-134
4   PI4KA       -2.0        2.900000e-118

Next, add a column called "-log10PAdj" to hbr_uhr_deg_chr22_1, which will contain the negative of log10 of the values in the PAdj column. "-log10PAdj" is used in volcano plots that depict gene expression change versus statistical confidence. To calculate -log10PAdj, the package numpy will be used. Numpy enables scientific calculations.

import numpy

hbr_uhr_deg_chr22_1["-log10PAdj"]=numpy.negative(numpy.log10(hbr_uhr_deg_chr22_1.loc[:,"PAdj"]))

Take a look at the first several lines of hbr_uhr_deg_chr22_1

hbr_uhr_deg_chr22_1.head()

    name    log2FoldChange      PAdj            -log10PAdj
0   SYNGR1  -4.6                5.200000e-217   216.283997
1   SEPT3   -4.6                4.500000e-204   203.346787
2   YWHAH   -2.5                4.700000e-191   190.327902
3   RPL3    1.7                 5.400000e-134   133.267606
4   PI4KA   -2.0                2.900000e-118   117.537602

Other methods for adding new column to a Pandas data frame include insert and assign.

The final task for this lesson is to add a column that indicates whether a gene is up regulated, down regulated, or has no change based on the log2FoldChange and PAdj values. The criteria are as follows.

PAdj >= 0.01: no change (marked as ns in the column)
Absolute value of log2FoldChange <2: no change (marked as ns in the column)
log2FoldChange >= 2 and PAdj < 0.01: (up regulated)
log2FoldChange <=2 and PAdj < 0.01: (down regulated)

To code this in Python, the first step is to drop the NA values from the hbr_uhr_deg_chr22_1 using dropna.

hbr_uhr_deg_chr22_1=hbr_uhr_deg_chr22_1.dropna()

Next, create a list called significance_criteria that contains the criteria shown above. In the criteria list below, "&" is the Boolean for "and". To calculate the absolute value of log2FoldChange, numpy.absolute is used.

significance_criteria=[(hbr_uhr_deg_chr22_1["PAdj"]>=0.01),
                       (numpy.absolute(hbr_uhr_deg_chr22_1["log2FoldChange"])<2),
                      (hbr_uhr_deg_chr22_1["log2FoldChange"]>=2) & (hbr_uhr_deg_chr22_1["PAdj"]<0.01),
                      (hbr_uhr_deg_chr22_1["log2FoldChange"]<=-2) & (hbr_uhr_deg_chr22_1["PAdj"]<0.01)]

Then, create a list called significance_status that indicates whether the criteria are ns (not significant), up, or down. These statuses have to correspond to the order in which the criteria were listed in significance_criteria.

significance_status=["ns","ns","up","down"]

Finally, numpy.select will be used to assign values to the significance column.

hbr_uhr_deg_chr22_1["significance"]=numpy.select(significance_criteria,significance_status)

hbr_uhr_deg_chr22_1.head(4)

    name    log2FoldChange          PAdj        -log10PAdj  significance
0   SYNGR1      -4.6            5.200000e-217   216.283997  down
1   SEPT3       -4.6            4.500000e-204   203.346787  down
2   YWHAH       -2.5            4.700000e-191   190.327902  down
3   RPL3         1.7            5.400000e-134   133.267606  ns

Write this data frame to a csv file in the /data/username/pies_2023 folder, which should be the present working directory. Replace username with the user's Biowulf account ID. The to_csv command in Pandas is used to write data frames to csv files. Setting index=False ensures that the csv file will not have row names.

hbr_uhr_deg_chr22_1.to_csv("./hbr_uhr_deg_chr22_with_significance_lesson3.csv",index=False)

This lesson has shown the participants various data wrangling approaches using the Python package Pandas. The capabability of Pandas expand to more than what is covered here, participants are encouraged to check out the Pandas documentations to learn more.