Lesson 3: Data wrangling using Python
Learning objectives
After this lesson, participants will
- Be able to import tabular data into Python using Pandas
- Be able to explore and modify tabular data through various data wrangling approaches, including
- retrieving dimensions
- subsetting
- obtaining column statistics
- replacing column names
- performing mathematical operations
- filtering
- removing and adding columns
Importing tabular data using Pandas
Pandas is a popular Python package used to work with tabular data.
To work with Pandas, first activate it using the import
command.
import pandas
Sometimes the name of the package is long, so users might want to shorten it by creating an alias. The alias "pd" is often used for the Pandas package. To add an alias, just append as
followed by the user defined alias to the package import command. If importing a package using an alias, then the package needes to be called using the assigned alias. For instance, if pd
was used to import pandas, then use pd.read_csv
to import a csv file.
import pandas as pd
This exercise will use the read_csv
function of Pandas to import a comma separated value (csv) file called hbr_uhr_chr22_rna_seq_counts.csv, which contains RNA sequencing gene expression counts from the Human Brain Reference (hbr) and Universal Human Reference (uhr) study. This data will be stored as the variable hbr_uhr_chr22_counts.
hbr_uhr_chr22_counts=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv")
Take a look at the first few rows of hbr_uhr_chr22_counts by appending the head
attribute to hbr_uhr_chr22_counts.
hbr_uhr_chr22_counts.head()
Figure 1: The first five rows of hbr_uhr_chr22_counts. The first column contains genes and the subsequent columns contain gene expression counts for each of the samples. The left most column of this data frame contains the row indices or names.
Because hbr_uhr_chr22_counts is a Pandas data frame (type(hbr_uhr_chr22_counts)
, see lesson 2), it is possible to append one of the many Pandas commands to it. For instance, the head
function was appended to display the first five rows of hbr_uhr_chr22_counts. The data frame name and function is separated by a period. This is perhaps one of the most appealing aspects of Python syntax. Note that the head
function was followed by ()
. If the parentheses are blank, then the default first five lines will be shown. To view the first 10 rows of hbr_uhr_chr22_counts do the following.
hbr_uhr_chr22_counts.head(10)
Figure 2: Include an integer inside the parentheses of pandas.dataframe.head()
function to view the specified number of lines in a tabular dataset.
The function tail
can be used to view by default the bottom five lines of a tabular dataset. Similar to head
, the number of lines shown can be customized by specifying an integer inside the parentheses.
hbr_uhr_chr22_counts.tail()
Get dimensions of a data frame
Pandas data frames have a function shape
that informs of the number of rows and number of columns in a data frame (in other words the dimensions). To get the dimensions for hbr_uhr_chr22_counts, do the following
hbr_uhr_chr22_counts.shape
The hbr_uhr_chr22_counts data frame has 1335 rows and 7 columns.
(1335, 7)
Note
The elements in tabular data can be referred to by their row and column positions.
The size
function returns the number elements in a data frame. For instance, hbr_uhr_chr22_counts has 1335 rows and 7 columns, which means that it has 1335 times 7 elements (or 9345).
Row indices/names
Figure 2 shows the first 10 rows of hbr_uhr_chr22_counts. The left most column, which contains labels starting with "0" is referred to as the row indices or row names. Users can specify a column in the dataset as the row indices or row names using the index_col
options in read_csv
. For instance, the hbr_uhr_chr22_rna_seq_counts.csv dataset could be imported with gene names as the row indices. To do this, add the index_col=0
option to read_csv
. Gene names in hbr_uhr_chr22_rna_seq_counts.csv is the first column and is denoted as column "0" in Python. Thus, setting index_col=0
ensures that the gene names will be set as the row indices or row names (see Figure 3).
hbr_uhr_chr22_counts_1=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv", index_col=0)
Figure 3. The index_col=0
option in pandas.read_csv
sets the gene names as row names in the imported data frame.
Data wrangling
Subsetting
The command below will subset the expression counts for the RABL2B gene.
hbr_uhr_chr22_counts[hbr_uhr_chr22_counts["Geneid"]=="RABL2B"]
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
1334 RABL2B 74 62 54 68 50 47
The "|" symbol can be used as the "or" operator so to also subset the counts for RPL23AP82
hbr_uhr_chr22_counts[(hbr_uhr_chr22_counts["Geneid"]=="RABL2B") | (hbr_uhr_chr22_counts["Geneid"]=="RPL23AP82")]
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
1333 RPL23AP82 41 59 54 32 23 34
1334 RABL2B 74 62 54 68 50 47
Alternatively, use the isin
function and provide a list of genes to retrieve.
hbr_uhr_chr22_counts[hbr_uhr_chr22_counts["Geneid"].isin(["RABL2B", "RPL23AP82"])]
Use "." to reference a column.
hbr_uhr_chr22_counts[hbr_uhr_chr22_counts.Geneid=="RABL2B"]
Subsetting by integer positions
Given that the elements in a data frame are referenced by its row and column positions, what would be the approach for extracting the element in row 60 and column 5? The solution is the command below, which returns a result of 2. The row and column numbers are enclosed in "[]" and separated by a comma.
hbr_uhr_chr22_counts.iloc[60,5]
2
The above method for subsetting the element in row 60 and column 5 of hbr_uhr_chr22_counts is great if the goal is to extract the value and do numeric operation on it. But what if the user wants to return the element along with the corresponding gene in data frame format?
To do this, enclose the row and column indices to extract in their own inner set of square brackets as shown below. Column 0, which contains the gene name is also included in the brackets containing the column indices of interest.
hbr_uhr_chr22_counts.iloc[[60],[0,5]]
Geneid UHR_2.bam
60 CCT8L2 2
Pandas offers different approaches for subsetting rectangular data. One method is iloc
.
iloc
is a "purely integer-location based indexing for selection by position" -- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#. The row and column positions are enclosed in "[]".
iloc
allows for retrieval of elements in multiple rows and columns. For instance, the following can be used to retrieve the elements in rows 60 and 65 and columns 0, 4, 5, and 6 in hbr_uhr_chr22_counts. Note that the row and column positions are enclosed in an outer set of "[]". Within the outer set of "[]" the first set of "[]" enclose a comma separated list of row positions while the second set of "[]" enclose a comma separated list of column positions.
hbr_uhr_chr22_counts.iloc[[60,65],[0,4,5,6]]
Geneid UHR_1.bam UHR_2.bam UHR_3.bam
60 CCT8L2 1 2 0
65 SLC25A15P5 2 2 4
To get the first three rows of hbr_uhr_chr22_counts do the following. Note that it retrieves the rows with indices 0, 1, and 2.
hbr_uhr_chr22_counts.iloc[:3]
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
0 U2 0 0 0 0 0 0
1 CU459211.1 0 0 0 0 0 0
2 CU104787.1 0 0 0 0 0 0
What will be the output for hbr_uhr_chr22_counts.iloc[[3],:]
?
Solution
The row with an index of 3 will be retrieved.
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
3 BAGE5 0 0 0 0 0 0
Subsetting using column names
Panda's loc
function allows for subsetting by row or column names. For instance, to retrieve the gene id column, do the following. The ":" denotes get every row.
hbr_uhr_chr22_counts.loc[:,['Geneid']]
Geneid
0 U2
1 CU459211.1
2 CU104787.1
3 BAGE5
4 ACTR3BP6
... ...
1330 ACR
1331 AC002056.5
1332 AC002056.3
1333 RPL23AP82
1334 RABL2B
To retrieve the counts for the gene SLC25A15P5, use the following where SLC25A15P5 is the subsetting criteria, where
hbr_uhr_chr22_counts.loc[:,'Geneid']
extracts the Geneid column.=="SLC25A15P5"
will filter out the row with the SLC25A15P5 gene.
hbr_uhr_chr22_counts[hbr_uhr_chr22_counts.loc[:,'Geneid']=="SLC25A15P5"]
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
65 SLC25A15P5 0 0 0 2 2 4
To retrieve counts for more than one gene, enclose the genes of interest in a list and use the isin
function to filter out the rows containing the genes in the list.
hbr_uhr_chr22_counts[hbr_uhr_chr22_counts.loc[:,'Geneid'].isin(["SLC25A15P5", "CCT8L2"])]
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
60 CCT8L2 0 0 0 1 2 0
65 SLC25A15P5 0 0 0 2 2 4
To find all of the SLC genes in hbr_uhr_chr22_counts, the following could be used where str.startswith
searches for text that starts a pattern (ie. "SLC"). Other options for pattern matching include str.endwith
and str.contains
.
hbr_uhr_chr22_counts.loc[hbr_uhr_chr22_counts.loc[:,'Geneid'].str.startswith("SLC")]
Geneid HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
54 SLC9B1P4 0 0 0 0 1 0
65 SLC25A15P5 0 0 0 2 2 4
109 SLC25A18 100 111 74 6 8 7
181 SLC25A1 32 50 41 226 138 216
249 SLC9A3P2 0 0 0 0 0 2
268 SLC7A4 19 25 14 9 4 3
494 SLC2A11 54 63 46 28 34 27
726 SLC35E4 18 32 26 21 12 13
783 SLC5A1 0 0 0 0 6 0
795 SLC5A4 7 12 5 13 9 4
955 SLC16A8 9 13 11 11 5 6
1046 SLC25A17 39 39 40 119 96 116
1099 SLC25A5P1 0 0 1 0 1 0
Summary statistics of data frames
hbr_uhr_chr22_counts.describe()
HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
count 1335.000000 1335.000000 1335.000000 1335.000000 1335.000000 1335.000000
mean 29.530337 36.264419 32.084644 50.694382 33.419476 40.334831
std 99.177874 120.617793 108.237694 197.575081 122.598310 154.455918
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
75% 8.000000 10.000000 9.000000 13.000000 12.000000 11.000000
max 1532.000000 1797.000000 1637.000000 4027.000000 2406.000000 3047.000000
Replacing column names
To view the column headings of a data frame use the column
function. For instance,
hbr_uhr_chr22_counts.columns
HBR_1.bam
HBR_2.bam
HBR_3.bam
UHR_1.bam
UHR_2.bam
UHR_3.bam
The str.replace
function can be used to replace a string with something else. Here, it used to remove ".bam" from the sample names in the column heading.
hbr_uhr_chr22_counts.columns=hbr_uhr_chr22_counts.columns.str.replace(".bam", "")
Mathematical operations on data frames and filtering
Pandas enables mathematical operations on data frames. For instance, one might want to sum the total counts across all samples for each gene. The sum
function can be used to this. Setting axis=1
will sum up the counts for each row or gene. Because the Geneid column is a string, it is necessary to first subset only the sample columns.
hbr_uhr_chr22_counts.loc[:, ['HBR_1', 'HBR_2', 'HBR_3', 'UHR_1', 'UHR_2', 'UHR_3']].sum(axis=1)
Below, genes with zero counts across all samples are removed from hbr_uhr_chr22_counts and stored as hbr_uhr_chr22_counts_filtered. To accomplish this set hbr_uhr_chr22_counts.loc[:, ['HBR_1', 'HBR_2', 'HBR_3', 'UHR_1', 'UHR_2', 'UHR_3']].sum(axis=1) !=0
and use as a filter criteria.
hbr_uhr_chr22_counts_filtered=hbr_uhr_chr22_counts.loc[hbr_uhr_chr22_counts.loc[:, ['HBR_1', 'HBR_2', 'HBR_3', 'UHR_1', 'UHR_2', 'UHR_3']].sum(axis=1)!=0]
Removing and adding columns to a data frame
For this exercise, stay in the /data/username/pies_2023 folder, which should be the present working directory (use pwd
to check). If not in the /data/username/pies_2023 folder, change into it. Copy the hbr_uhr_deg_chr22.csv and hcc1395_deg_chr22.csv files from /data/classes/BTEP/pies_2023_data to the /data/username/pies_2023 directory.
cp /data/classes/BTEP/pies_2023_data/hbr_uhr_deg_chr22.csv .
cp /data/classes/BTEP/pies_2023_data/hcc1395_deg_chr22.csv .
The file hcc1395_deg_chr22.csv will be needed for the practice questions.
This exercise will use the differential gene expression analysis table from the hbr and uhr study.
hbr_uhr_deg_chr22=pandas.read_csv("./hbr_uhr_deg_chr22.csv")
The info()
function will retrieve information regarding the hbr_uhr_deg_chr22 data frame, which includes the column names.
hbr_uhr_deg_chr22.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1335 entries, 0 to 1334
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1335 non-null object
1 baseMean 1335 non-null float64
2 baseMeanA 1335 non-null float64
3 baseMeanB 1335 non-null float64
4 foldChange 971 non-null float64
5 log2FoldChange 971 non-null float64
6 lfcSE 971 non-null float64
7 stat 971 non-null float64
8 PValue 971 non-null float64
9 PAdj 971 non-null float64
10 FDR 639 non-null float64
11 falsePos 639 non-null float64
12 HBR_1.bam 1335 non-null float64
13 HBR_2.bam 1335 non-null float64
14 HBR_3.bam 1335 non-null float64
15 UHR_1.bam 1335 non-null float64
16 UHR_2.bam 1335 non-null float64
17 UHR_3.bam 1335 non-null float64
dtypes: float64(17), object(1)
memory usage: 187.9+ KB
The hbr_uhr_deg_chr22 table contains differential gene expression analysis results. Relevant columns include
- name: gene names
- log2FoldChange: the gene expression change between the two treatment groups
- PAdj: the adjusted p-value associated with statistical confidence of the expression change
- The columns labeled with the sample names (ie. columns 12 through 17) are the normalized gene expression counts
Use str.replace
to remove ".bam" from the sample names in columns 12 through 17.
hbr_uhr_deg_chr22.columns=hbr_uhr_deg_chr22.columns.str.replace(".bam", "")
To drop columns in a Pandas data frame, use the .drop
function and specify the name(s) of the column(s) to remove. The example below removes columns baseMean, baseMeanA,and baseMeanB
hbr_uhr_deg_chr22.drop(columns=["baseMean","baseMeanA", "baseMeanB"])
Subset the name, log2FoldChange, and PAdj columns in hbr_uhr_deg_chr22 and save to a new data frame hbr_uhr_deg_chr22_1.
hbr_uhr_deg_chr22_1=hbr_uhr_deg_chr22.loc[:,["name", "log2FoldChange", "PAdj"]]
hbr_uhr_deg_chr22_1.head()
name log2FoldChange PAdj
0 SYNGR1 -4.6 5.200000e-217
1 SEPT3 -4.6 4.500000e-204
2 YWHAH -2.5 4.700000e-191
3 RPL3 1.7 5.400000e-134
4 PI4KA -2.0 2.900000e-118
Next, add a column called "-log10PAdj" to hbr_uhr_deg_chr22_1, which will contain the negative of log10 of the values in the PAdj column. "-log10PAdj" is used in volcano plots that depict gene expression change versus statistical confidence. To calculate -log10PAdj, the package numpy will be used. Numpy enables scientific calculations.
import numpy
hbr_uhr_deg_chr22_1["-log10PAdj"]=numpy.negative(numpy.log10(hbr_uhr_deg_chr22_1.loc[:,"PAdj"]))
Take a look at the first several lines of hbr_uhr_deg_chr22_1
hbr_uhr_deg_chr22_1.head()
name log2FoldChange PAdj -log10PAdj
0 SYNGR1 -4.6 5.200000e-217 216.283997
1 SEPT3 -4.6 4.500000e-204 203.346787
2 YWHAH -2.5 4.700000e-191 190.327902
3 RPL3 1.7 5.400000e-134 133.267606
4 PI4KA -2.0 2.900000e-118 117.537602
Other methods for adding new column to a Pandas data frame include insert
and assign
.
The final task for this lesson is to add a column that indicates whether a gene is up regulated, down regulated, or has no change based on the log2FoldChange and PAdj values. The criteria are as follows.
- PAdj >= 0.01: no change (marked as ns in the column)
- Absolute value of log2FoldChange <2: no change (marked as ns in the column)
- log2FoldChange >= 2 and PAdj < 0.01: (up regulated)
- log2FoldChange <=2 and PAdj < 0.01: (down regulated)
To code this in Python, the first step is to drop the NA values from the hbr_uhr_deg_chr22_1 using dropna
.
hbr_uhr_deg_chr22_1=hbr_uhr_deg_chr22_1.dropna()
Next, create a list called significance_criteria that contains the criteria shown above. In the criteria list below, "&" is the Boolean for "and". To calculate the absolute value of log2FoldChange, numpy.absolute
is used.
significance_criteria=[(hbr_uhr_deg_chr22_1["PAdj"]>=0.01),
(numpy.absolute(hbr_uhr_deg_chr22_1["log2FoldChange"])<2),
(hbr_uhr_deg_chr22_1["log2FoldChange"]>=2) & (hbr_uhr_deg_chr22_1["PAdj"]<0.01),
(hbr_uhr_deg_chr22_1["log2FoldChange"]<=-2) & (hbr_uhr_deg_chr22_1["PAdj"]<0.01)]
Then, create a list called significance_status that indicates whether the criteria are ns (not significant), up, or down. These statuses have to correspond to the order in which the criteria were listed in significance_criteria.
significance_status=["ns","ns","up","down"]
Finally, numpy.select
will be used to assign values to the significance column.
hbr_uhr_deg_chr22_1["significance"]=numpy.select(significance_criteria,significance_status)
hbr_uhr_deg_chr22_1.head(4)
name log2FoldChange PAdj -log10PAdj significance
0 SYNGR1 -4.6 5.200000e-217 216.283997 down
1 SEPT3 -4.6 4.500000e-204 203.346787 down
2 YWHAH -2.5 4.700000e-191 190.327902 down
3 RPL3 1.7 5.400000e-134 133.267606 ns
Write this data frame to a csv file in the /data/username/pies_2023 folder, which should be the present working directory. Replace username with the user's Biowulf account ID. The to_csv
command in Pandas is used to write data frames to csv files. Setting index=False
ensures that the csv file will not have row names.
hbr_uhr_deg_chr22_1.to_csv("./hbr_uhr_deg_chr22_with_significance_lesson3.csv",index=False)
This lesson has shown the participants various data wrangling approaches using the Python package Pandas. The capabability of Pandas expand to more than what is covered here, participants are encouraged to check out the Pandas documentations to learn more.