Lesson 2: Python data types and structures
Learning objectives
After this class, participants will
- Be able to describe some common Python data types and structures
- Be able to identify Python data types
- Become familiar with variable assignment
- Be able to use conditional operators and if-else statements
- Be able to load packages
- Know how to import tabular data
- Know how to view tabular data
- Become familiar with constructing a for loop in Python
Signing onto Biowulf
Sign onto Biowulf using the ssh command. Replace username with the user's Biowulf ID.
ssh username@biowulf.nih.gov
Change into data directory and copy course data
Replace username with user's Biowulf ID.
cd /data/username
The cp command below will copy pies_2023_data from /data/classes/BTEP/ to the user's data directory (denoted as "." because this should be the present working directory) and save it as a folder called pies_2023.
cp -r /data/classes/BTEP/pies_2023_data ./pies_2023
Change into pies_2023.
cd pies_2023
Request interactive session
Stay in the /data/username/pies_2023 folder and request an interactive session using sinteractive with the following options.
- --gres=lscratch:5: to allocate 5 GB of local temporary/scratch storage space
- --mem=2g: to request 2 GB of memory (RAM)
- --tunnel: to open up a channel of communication between the local machine and Biowulf to allow interaction with applications like Jupyter Lab
sinteractive --gres=lscratch:5 --mem=2g --tunnel
After resources for the interactive session have been granted, users will see information similar to that shown in Figure 1.
Figure 1: After interactive session resources have been allocated, users will see an ssh command that looks like the one enclosed in the red rectangle. Open a new terminal (if working on a Mac) or command prompt (if working on a Windows computer) and then copy and paste this ssh command into the new terminal.
After copying and pasting the ssh command shown in Figure 1 into a new terminal or command prompt, press Enter and supply the password to log in to Biowulf. This will complete the tunnel.
Figure 2: Press Enter after copying and pasting the ssh command into a new terminal, then provide the password to log in to Biowulf. This will complete the tunnel.
Figure 3: In the ssh command shown in Figure 1 and Figure 2, the numbers preceding and following "localhost" will differ depending on the user. The Biowulf username will also differ for each user (wuz8 is the instructor's Biowulf username).
Load Jupyter
Warning
Make sure to stay in the /data/username/pies_2023 folder for this step.
After the tunnel has been created, go back to the terminal (Mac) or command prompt (Windows) with the Biowulf interactive session and load Jupyter (see Figure 4).
module load jupyter
Figure 4: Go back to the terminal (Mac) or command prompt (Windows) with the interactive session (look for cn#### at the prompt). Run module load jupyter from here.
Start Jupyter Lab
Warning
Make sure to stay in the /data/username/pies_2023 folder for this step.
Use the command below to start a Jupyter Lab session. Copy and paste either of the http links to a local browser to interact with Jupyter (see Figure 5).
jupyter lab --ip localhost --port $PORT1 --no-browser
Figure 5: Start a Jupyter Lab session using jupyter lab --ip localhost --port $PORT1 --no-browser and copy and paste either one of the http links into a local browser.
Python data types and data structures
An important step in learning any new programming language for data analysis is understanding its data types and data structures. Common data types and structures that will be encountered include the following.
- Text (str)
- Numeric
  - int (i.e. integers)
  - float (i.e. decimals)
- Boolean (True or False), used in
  - conditionals
  - filtering criteria
  - command options
- Data frames
- Lists
- Arrays
- Tuples
- Range
- Dictionaries
Identifying data type and structure in Python
The type command can be used to identify data types and structures in Python.
type(100)
int
type(3.1415926)
float
type("bioinformatics")
str
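As a small sketch, the same type check can be applied to the other structures listed above. The example values here are made up for illustration, and isinstance is shown as one common alternative way to test for a specific type.

```python
# Example values (hypothetical) covering several types from the list above.
examples = [100, 3.1415926, "bioinformatics", True, [1, 2, 3], (1, 2, 3), {"a": 1}, range(3)]

# type(value).__name__ gives the short name of the type as text.
type_names = [type(value).__name__ for value in examples]
for value, name in zip(examples, type_names):
    print(repr(value), "->", name)

# isinstance() asks whether a value belongs to a given type.
is_text = isinstance("bioinformatics", str)
print(is_text)
```

Lists, tuples, ranges, and dictionaries are described later in this lesson.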
Variable assignments
In Python, values are assigned to variables using "=". Variables can hold integers, floats, or strings.
perfect=100
perfect
100
mole=6.02e23
mole
6.02e+23
btep_class="Python Introductory Education Series"
btep_class
'Python Introductory Education Series'
The command type(btep_class) will return str because the variable btep_class is text.
type(btep_class)
str
Conditionals
Conditionals evaluate whether certain conditions are true. Operators include:
- ==: is equal to?
- >: is greater than?
- >=: is greater than or equal to?
- <: is less than?
- <=: is less than or equal to?
- !=: is not equal to?
- and: are both conditions true?
- or: is at least one condition true?
The command below evaluates whether the variable perfect is equal to the variable mole and returns the Boolean value False.
perfect==mole
False
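The remaining operators can be tried in the same way. The sketch below reuses the values of perfect and mole from this lesson; the result variable names are chosen only for illustration.

```python
perfect = 100
mole = 6.02e23

# Each comparison returns a Boolean (True or False).
not_equal = perfect != mole        # True
greater = perfect > mole           # False
less_or_equal = perfect <= mole    # True

# "and" is True only when both sides are True.
both = (perfect < mole) and (perfect == 100)    # True
# "or" is True when at least one side is True.
either = (perfect > mole) or (perfect == 100)   # True

print(not_equal, greater, less_or_equal, both, either)
```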
If statements are also conditionals and are used to instruct the computer to do something if a condition is met. To have the computer do something when the condition is not met, use elif (else if) or else.
The command below will accomplish the following:
- Use if to evaluate whether perfect==mole; if yes, use print to indicate that the two variables are equal
- In the case that perfect does not equal mole, use elif (which stands for else if) to evaluate whether perfect>mole; if yes, use the print statement to indicate that perfect is greater than mole
- Use else, when the previous two conditions are not met, to print that perfect is less than mole
if perfect==mole:
    print(perfect, "is equal to", mole)
elif perfect>mole:
    print(perfect, "is greater than", mole)
else:
    print(perfect, "is less than", mole)
100 is less than 6.02e+23
Note
The print command prints the value of a variable when the variable name is not enclosed in quotes.
A ":" is required after if, elif, and else. The command(s) to execute when a condition is met are placed on the following line(s) and indented (four spaces by convention; a tab also works as long as the indentation is consistent).
Data frames
Often, in bioinformatics and data science, data comes in the form of rectangular tables, which are referred to as data frames. Data frames have the following properties.
- Study variable(s) form the columns
- Observation(s) form rows
- Can have a mix of data types (strings and numeric) but each column/study variable can contain only one data type
- Limited to one value per cell
A popular package for working with data frames in Python is Pandas.
To load a Python package, use the import command followed by the package name (i.e. pandas).
import pandas
Sometimes the name of the package is long, so users might want to shorten it by creating an alias. The alias "pd" is often used for the Pandas package. To add an alias, append as followed by the user-defined alias to the package import command.
import pandas as pd
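With the pd alias available, the data frame properties listed earlier can be checked on a small hand-made table. This is only a sketch: the gene names, sample names, and values are invented for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical toy data: each row is one gene (an observation),
# and each column is one study variable.
toy_counts = pd.DataFrame({
    "gene": ["geneA", "geneB", "geneC"],
    "sample_1": [12, 0, 7],          # whole numbers
    "sample_2": [3.5, 1.0, 9.25],    # decimals
})

# shape is (number of rows, number of columns).
print(toy_counts.shape)

# dtypes lists one data type per column.
print(toy_counts.dtypes)
```

The dtypes output reports a single type for each column, in line with the "one data type per column" property.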
Importing tabular data with Pandas
This exercise will use the read_csv function of Pandas to import a comma separated value (csv) file called hbr_uhr_chr22_rna_seq_counts.csv, which contains RNA sequencing gene expression counts from the Human Brain Reference (hbr) and Universal Human Reference (uhr) study.
hbr_uhr_chr22_counts=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv")
Note
If a Python package was imported using an alias (i.e. pd for Pandas), then use the alias to call the package. For instance, use pd.read_csv rather than pandas.read_csv when the pd alias is used for Pandas.
Take note of the way the csv import command is constructed. First the user specifies the name of the package (i.e. pandas) and then the function within the package (i.e. read_csv). The package name and function name are separated by a period.
Next, use type to find out the data type or structure for hbr_uhr_chr22_counts.
type(hbr_uhr_chr22_counts)
pandas.core.frame.DataFrame
Take a look at the first few rows of hbr_uhr_chr22_counts.
hbr_uhr_chr22_counts.head()
Figure 1: Example of a data frame.
Because hbr_uhr_chr22_counts is a Pandas data frame, it is possible to append one of the many Pandas commands to it. For instance, the head function was appended to display the first five rows of hbr_uhr_chr22_counts. The data frame name and function are separated by a period. This is perhaps one of the most appealing aspects of Python syntax. Note that the head function was followed by (). If the parentheses are empty, then by default the first five rows will be shown. There will be more examples of the Pandas head function in a subsequent lesson.
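As a brief sketch of what goes inside the parentheses, head also accepts the number of rows to show. To keep the example self-contained, it reads a few lines of made-up CSV text through io.StringIO instead of the course file; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """gene,sample_1,sample_2
geneA,12,3
geneB,0,1
geneC,7,9
geneD,5,5
geneE,2,8
geneF,4,6
"""

# read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

default_rows = toy.head()   # empty parentheses: first five rows
first_two = toy.head(2)     # a number inside: that many rows

print(default_rows)
print(first_two)
```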
Lists and tuples
Lists and tuples are one dimensional collections of data. A tuple behaves like an immutable list: its elements cannot be modified after it is created.
To create a list, enclose the contents in square brackets.
sequencing_list=["whole genome", "rna", "whole exome"]
To create a tuple, enclose the contents in parentheses.
sequencing_tuple=("whole genome", "rna", "whole exome")
Lists and tuples are indexed and can contain duplicates. The first item in a list or tuple has an index of 0, the second item has an index of 1, and the last item has an index of n-1 where n is the number of items. Indices can be used to recall items in a list or tuple.
sequencing_list[1]
'rna'
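The indexing rules above can be sketched as follows, using the same list and tuple. The len function and the negative index -1 are additions for illustration; both are standard Python.

```python
sequencing_list = ["whole genome", "rna", "whole exome"]
sequencing_tuple = ("whole genome", "rna", "whole exome")

n_items = len(sequencing_list)             # number of items, n
first_item = sequencing_list[0]            # index 0 is the first item
last_item = sequencing_list[n_items - 1]   # index n-1 is the last item
also_last = sequencing_tuple[-1]           # -1 counts back from the end

print(n_items, first_item, last_item, also_last)
```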
Lists versus tuples (mutable versus immutable)
sequencing_list[1]="single cell RNA"
sequencing_list
['whole genome', 'single cell RNA', 'whole exome']
sequencing_tuple[1]="single cell RNA"
TypeError Traceback (most recent call last)
Cell In[48], line 1
----> 1 sequencing_tuple[1]="single cell RNA"
TypeError: 'tuple' object does not support item assignment
Instructions for modifying Python lists can be found at W3Schools.
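A few common list modifications are sketched below, along with one way to handle the tuple error without stopping the notebook. The try/except construct is not covered in this lesson and is used here only so that the example runs from top to bottom.

```python
sequencing_list = ["whole genome", "rna", "whole exome"]
sequencing_tuple = ("whole genome", "rna", "whole exome")

# Lists are mutable: items can be added, removed, or inserted in place.
sequencing_list.append("chip")       # add to the end
sequencing_list.remove("rna")        # remove the first matching item
sequencing_list.insert(0, "atac")    # insert at index 0
print(sequencing_list)
# ['atac', 'whole genome', 'whole exome', 'chip']

# Tuples are immutable: assigning to an index raises a TypeError.
try:
    sequencing_tuple[1] = "single cell RNA"
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# A new tuple can still be built from an existing one.
new_tuple = sequencing_tuple + ("chip",)
print(new_tuple)
```

The original tuple is unchanged after these steps; only new_tuple contains the extra item.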
Arrays
Performing element-wise mathematical operations directly on a list of numbers is difficult. For instance:
list_of_numbers=[1,2,3,4,5]
Multiplying list_of_numbers by 2 repeats the list rather than doubling each number, as shown below. The expected result of doubling every number is [2,4,6,8,10]. To get this result, convert the list to an array using the package numpy.
list_of_numbers*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Use the array function of numpy to convert list_of_numbers to an array called array_of_numbers.
import numpy
array_of_numbers=numpy.array(list_of_numbers)
array_of_numbers*2
array([ 2, 4, 6, 8, 10])
The array of numbers shown here is a one dimensional array. A special case of arrays is the matrix, which is two dimensional. Like data frames, matrices store values in columns and rows. Matrices are encountered in computation and are used to store numeric values (see here for more on matrices).
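The difference between one and two dimensional arrays can be sketched with numpy as follows. The matrix values are made up for illustration; shape is a standard numpy attribute that reports (rows, columns).

```python
import numpy

# One dimensional array: arithmetic applies to every element.
array_of_numbers = numpy.array([1, 2, 3, 4, 5])
doubled = array_of_numbers * 2
print(doubled)

# Two dimensional array (matrix): a list of rows, each row a list of numbers.
matrix = numpy.array([[1, 2, 3],
                      [4, 5, 6]])
print(matrix.shape)       # (2, 3): 2 rows and 3 columns

# Arithmetic is still element-wise on a matrix.
doubled_matrix = matrix * 2
print(doubled_matrix)
```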
Range
Ranges can be used for subsetting data (i.e. extracting rows 5 through 10 of a data frame) or for iterating over a task in constructs like a for loop.
For instance, a for loop can be used to iterate over sequencing_list_new and print the 3rd to 5th entries.
sequencing_list_new=["whole genome", "rna", "whole exome","single cell rna", "chip", "atac", "cite", "single cell chip", "single cell atac"]
for i in range(2,5):
    print(sequencing_list_new[i])
whole exome
single cell rna
chip
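To see why range(2,5) gives the 3rd to 5th entries, it can help to look at the numbers a range produces. The sketch below converts ranges to lists for display; the slice and the step argument are shown as related options and are not part of the loop above.

```python
sequencing_list_new = ["whole genome", "rna", "whole exome", "single cell rna", "chip",
                       "atac", "cite", "single cell chip", "single cell atac"]

# range(start, stop) begins at start and ends just before stop.
indices = list(range(2, 5))
print(indices)            # [2, 3, 4]

# A slice with the same start and stop selects the same entries without a loop.
subset = sequencing_list_new[2:5]
print(subset)             # ['whole exome', 'single cell rna', 'chip']

# An optional third value sets the step size.
every_other = list(range(0, len(sequencing_list_new), 2))
print(every_other)        # [0, 2, 4, 6, 8]
```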
Dictionaries
Dictionaries store key-value pairs and are often used to specify options in some Python packages.
my_dictionary={"apples":"red","oranges":"orange","bananas":"yellow"}
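A minimal sketch of working with this dictionary follows: looking up a value by its key, adding a new pair, and looping over the pairs. The "grapes" entry is made up for illustration.

```python
my_dictionary = {"apples": "red", "oranges": "orange", "bananas": "yellow"}

# Look up a value by placing its key in square brackets.
apple_color = my_dictionary["apples"]
print(apple_color)

# Assigning to a new key adds a key-value pair.
my_dictionary["grapes"] = "purple"

# items() yields each key together with its value.
for fruit, color in my_dictionary.items():
    print(fruit, "->", color)

keys = list(my_dictionary.keys())
```

Unlike lists and tuples, dictionary entries are retrieved by key rather than by position.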