Python data types, loops and iterators

Learning objectives

After this class, participants will

  • Be able to describe Python data types and structures
  • Become familiar with variable assignment
  • Be able to use conditional operators and if-else statements
  • Understand how loops and iterators can be used to automate processes
  • Be able to load packages
  • Know how to import tabular data
  • Know how to view tabular data

Start a Jupyter Lab session

Before getting started, make sure to start a Jupyter Lab session with the default resources via HPC OnDemand.

Hint

Be sure to start the Jupyter Lab session in /data/$USER/pies_class_2025, where $USER is the environment variable that holds the participant's Biowulf user ID.

Next, click on pies_class_2025.ipynb in the file explorer to open it.

Python data types and data structures

An important step in learning any new programming language for data analysis is to understand its data types and data structures. Common data types and structures that will be encountered include the following.

  • Text (str)
  • Numeric
    • int (ie. integers)
    • float (ie. decimals)
  • Boolean (True or False)
    • conditionals
    • filtering criteria
    • command options
  • Data frames
  • Lists
  • Arrays
  • Tuples
  • Range
  • Dictionaries

Identifying data type and structure in Python

The command type can be used to identify data types and structures in Python.

type(100)
int
type(3.1415926)
float
type("bioinformatics")
str
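
The same check works for the data structures listed earlier. The sketch below uses made-up values chosen only for illustration (they are not part of the class data) and collects the short name of each type.

```python
# Illustrative values only; none of these come from the class data set.
examples = [True, [1, 2, 3], (1, 2, 3), range(3), {"apples": "red"}]

# type() returns the class of each object; __name__ gives its short name.
type_names = [type(value).__name__ for value in examples]
print(type_names)
# ['bool', 'list', 'tuple', 'range', 'dict']

# isinstance() answers the yes/no question "is this value of that type?"
is_float = isinstance(3.1415926, float)
print(is_float)
# True
```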

Variable assignments

In Python, values are assigned to variables using "=".

test1_score=100
test1_score
100
mole=6.02e23
mole
6.02e+23
btep_class="Python Introductory Education Series"
btep_class
'Python Introductory Education Series'

The command type(btep_class) will return str because the variable btep_class is text.

type(btep_class)
str

It is also possible to assign a variable to another variable.

test2_score=test1_score
test2_score
100

Change the value of test2_score to 60.

test2_score=60
test2_score
60
test1_score
100
print("The student got a", test2_score, "on exam 2.")
The student got a 60 on exam 2.

Definition

Immutable objects in Python are objects whose values cannot be changed after they have been created. These include integers, floats, strings, and tuples. In the above example, test2_score was initially set to test1_score, so both names referred to the same integer, 100. Assigning 60 to test2_score did not alter that integer; it pointed test2_score at a new value, so test1_score is still 100.
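
The difference between re-pointing a name and changing an object can be sketched as follows. The string "ACGT" is a hypothetical value used only to show what happens when code tries to modify an immutable object in place.

```python
test1_score = 100
test2_score = test1_score            # both names refer to the same integer
same_object = test2_score is test1_score
print(same_object)
# True

test2_score = 60                     # re-points test2_score; 100 is untouched
print(test1_score, test2_score)
# 100 60

# Trying to change an immutable object in place raises an error.
sequence = "ACGT"                    # hypothetical example string
try:
    sequence[0] = "T"
except TypeError as error:
    message = str(error)
print(message)
# 'str' object does not support item assignment
```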

Conditionals

Conditionals evaluate whether a condition is true or false. Comparison and logical operators include:

  • ==: is equal to?
  • >: is greater than?
  • >=: is greater than or equal to?
  • <: is less than?
  • <=: is less than or equal to?
  • !=: is not equal to?
  • and
  • or

The command below will evaluate if test1_score is equal to test2_score.

test1_score==test2_score

Because test1_score is 100 and test2_score is 60, the result from the above command will be False.

False
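
The and and or operators combine two comparisons into a single True or False result. A minimal sketch, assuming a hypothetical passing threshold of 70 that is not part of the lesson:

```python
test1_score = 100
test2_score = 60
passing = 70   # hypothetical threshold, chosen only for this example

# and is True only when both comparisons are True.
both_passed = test1_score >= passing and test2_score >= passing
print(both_passed)
# False

# or is True when at least one comparison is True.
either_passed = test1_score >= passing or test2_score >= passing
print(either_passed)
# True

# != is True when the two values differ.
scores_differ = test1_score != test2_score
print(scores_differ)
# True
```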

If statements are also conditionals and are used to instruct the computer to do something if a condition is met. To have the computer do something when the condition is not met, use elif (else if) or else.

The command below will accomplish the following:

  • Use if to evaluate if test1_score>=90, if yes then indicate using print that someone got an A!
  • Use elif (which stands for else if) to evaluate if test2_score>=80, if yes then use the print statement to indicate that someone does not have to take the final!
  • Finally, else will print for all other conditions that someone failed the class.
if test1_score>=90:
    print("You get an A!")
elif test2_score>=80:
    print("You don't have to take the final!")
else:
    print("You failed the class!")

Tip

The print command prints the value of a variable when the variable name is given without quotes.

A ":" is required after if, elif, and else. The command(s) to execute when a condition is met go on the following line(s) and must be indented.
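
Only the first branch whose condition is met runs. The sketch below repeats the same chain with different, hypothetical scores so that the elif branch is the one that executes; the result is stored in a variable before printing.

```python
# Hypothetical scores, chosen so that the first condition is not met.
test1_score = 85
test2_score = 82

if test1_score >= 90:
    message = "You get an A!"
elif test2_score >= 80:
    # Reached only because the if condition above was False.
    message = "You don't have to take the final!"
else:
    message = "You failed the class!"

print(message)
# You don't have to take the final!
```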

Data frames

Often, in bioinformatics and data science, data comes in the form of rectangular tables, which are referred to as data frames. Data frames have the following properties.

  • Study variable(s) form the columns
  • Observation(s) form rows
  • Can have a mix of data types (strings and numeric) but each column/study variable can contain only one data type
  • Limited to one value per cell

A popular package for working with data frames in Python is Pandas.

To load a Python package use the import command followed by the package name (ie. pandas).

import pandas

Sometimes the name of the package is long, so users might want to shorten it by creating an alias. The alias "pd" is often used for the Pandas package. To add an alias, just append as followed by the user defined alias to the package import command.

import pandas as pd

Importing tabular data with Pandas

This exercise will use the read_csv function of Pandas to import a comma separated value (csv) file called hbr_uhr_chr22_rna_seq_counts.csv, which contains RNA sequencing gene expression counts from the Human Brain Reference (hbr) and Universal Human Reference (uhr) study.

hbr_uhr_chr22_counts=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv")

Note

If a Python package was imported using an alias (ie. pd for Pandas) then use the alias to call the package. For instance, pd.read_csv rather than pandas.read_csv when the pd alias is used for Pandas.

Take note of the way the csv import command is constructed. First the user specifies the name of the package (i.e. pandas) and then the function within the package (i.e. read_csv). The package name and function name are separated by a period.

Next, use type to find out the data type or structure for hbr_uhr_chr22_counts.

type(hbr_uhr_chr22_counts)
pandas.core.frame.DataFrame

Take a look at the first few rows of hbr_uhr_chr22_counts.

hbr_uhr_chr22_counts.head()

Figure 1: Example of a data frame.

Because hbr_uhr_chr22_counts is a Pandas data frame, it is possible to append one of the many Pandas functions to it. For instance, the head function was appended to display the first five rows of hbr_uhr_chr22_counts. The data frame name and function are separated by a period. This is perhaps one of the most appealing aspects of Python syntax. Note that the head function was followed by (). If the parentheses are left empty, then by default the first five rows will be shown. There will be more examples of the Pandas head function in a subsequent lesson.

Lists and tuples

Lists and tuples are one dimensional collections of data. A tuple behaves like an immutable list: its elements cannot be modified after it is created. Lists, by contrast, are mutable.

To create a list, enclose the contents in square brackets.

sequencing_list=["whole genome", "rna", "whole exome"]

To create a tuple, enclose the contents in parentheses.

sequencing_tuple=("whole genome", "rna", "whole exome")

Lists and tuples are indexed and can contain duplicates. The first item in a list or tuple has an index of 0 (ie. Python uses a 0 based indexing system), the second item has an index of 1, and the last item has an index of n-1 where n is the number of items. Indices can be used to recall items in a list or tuple.

sequencing_list[1]
'rna'

What if users wanted to extract the first two items in sequencing_list?

sequencing_list[0:2]
['whole genome', 'rna']

But will the following work?

sequencing_list[0,1]

No, there is an error. More on this in the section that covers loops and iterators.

TypeError                                 Traceback (most recent call last)
Cell In[61], line 1
----> 1 sequencing_list[0,1]

TypeError: list indices must be integers or slices, not tuple
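
As the error message says, a list index must be a single integer or a slice. The sketch below shows a few more valid forms on the same three-item list: a negative index, an open-ended slice, and a slice with a step.

```python
sequencing_list = ["whole genome", "rna", "whole exome"]

# A negative index counts from the end of the list.
last_item = sequencing_list[-1]
print(last_item)
# whole exome

# Leaving out the end of a slice runs to the end of the list.
from_second = sequencing_list[1:]
print(from_second)
# ['rna', 'whole exome']

# A third value in the slice sets the step size.
every_other = sequencing_list[::2]
print(every_other)
# ['whole genome', 'whole exome']
```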

Lists versus tuples (mutable versus immutable)

sequencing_list[1]="single cell RNA"
sequencing_list
['whole genome', 'single cell RNA', 'whole exome']
sequencing_tuple[1]="single cell RNA"
TypeError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 sequencing_tuple[1]="single cell RNA"

TypeError: 'tuple' object does not support item assignment
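
When a changed version of a tuple is needed, one possible workaround is to convert it to a list, edit the list, and convert the result back. This is only a sketch of that idea; the original tuple itself remains unchanged.

```python
sequencing_tuple = ("whole genome", "rna", "whole exome")

# Convert to a list, which is mutable, and change the second item.
as_list = list(sequencing_tuple)
as_list[1] = "single cell RNA"

# Convert back to build a new tuple.
updated_tuple = tuple(as_list)
print(updated_tuple)
# ('whole genome', 'single cell RNA', 'whole exome')

# The original tuple was never modified.
print(sequencing_tuple)
# ('whole genome', 'rna', 'whole exome')
```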

Making a copy of a list

Suppose there is a list called list1 that contains the following numbers.

list1=[1,2,3,4,5]
list1
[1, 2, 3, 4, 5]

Next, attempt to make a copy of list1 by assigning it to the variable list2.

list2=list1
list2
[1, 2, 3, 4, 5]

Then insert 0 as the first item in list2 using the list's .insert method.

list2.insert(0,0)
list2
[0, 1, 2, 3, 4, 5]

When assigning list1 to list2 using =, Python points list2 to the same list that list1 refers to (i.e. list1 and list2 are referencing the same list). Because lists are mutable, the changes made through list2 are reflected in list1 as well.

list1
[0, 1, 2, 3, 4, 5]

Set list1 back to [1,2,3,4,5].

list1=[1,2,3,4,5]

Next, use the deepcopy function from the Python module copy to make a copy of list1 called list2. To call a function within a Python module, follow the general syntax module.function. For instance, to call deepcopy use copy.deepcopy.

import copy
list2=copy.deepcopy(list1)
list2
[1, 2, 3, 4, 5]

Set the first element of list2 to 0.

list2[0]=0
list2
[0, 2, 3, 4, 5]

Finally, recall list1.

list1
[1, 2, 3, 4, 5]

There are actually two types of copies in Python: the shallow copy and the deep copy. Note that list2=list1 does not make a copy at all; it only gives the same list a second name. To create a shallow copy of list1 and store it in list2, use list2=list1.copy(). Caution is still needed with shallow copies: any mutable items nested inside the list (for example, lists within a list) are shared with the original, so changing them can lead to unintended changes to the original variable. To create a fully independent copy of a variable, use a deep copy. See https://www.geeksforgeeks.org/copy-python-deep-copy-shallow-copy/# to learn more.
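
The difference only shows up when a list contains other mutable objects. The sketch below uses a small nested list, a hypothetical example that is not part of the lesson data, to compare copy.copy (a shallow copy, equivalent to the list's .copy method) with copy.deepcopy.

```python
import copy

nested = [[1, 2], [3, 4]]        # hypothetical list of lists

shallow = copy.copy(nested)      # new outer list, inner lists are shared
deep = copy.deepcopy(nested)     # new outer list and new inner lists

shallow[0][0] = 99               # changes an inner list shared with nested
deep[1][1] = -1                  # changes only the deep copy

print(nested)
# [[99, 2], [3, 4]]
print(shallow)
# [[99, 2], [3, 4]]
print(deep)
# [[1, 2], [3, -1]]
```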

Source: https://www.geeksforgeeks.org/copy-python-deep-copy-shallow-copy/#

Instructions for modifying Python lists can be found at W3Schools.

Arrays

Performing mathematical operations directly on a list of numbers is difficult. For instance, consider the following list.

list_of_numbers=[1,2,3,4,5]

Multiplying list_of_numbers by 2 will duplicate this list. However, multiplying a list of numbers by two should double every number in that list. Thus, the expected result is [2,4,6,8,10]. To resolve this, convert the list to an array using the package numpy.

list_of_numbers*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Use the array function of numpy to convert list_of_numbers to an array called array_of_numbers.

import numpy
array_of_numbers=numpy.array(list_of_numbers)
array_of_numbers*2
array([ 2,  4,  6,  8, 10])
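
For comparison, the same element-wise result can be produced from a plain list with a one-line loop, without numpy. This is only a sketch of that alternative; the array approach above handles this kind of arithmetic more directly.

```python
list_of_numbers = [1, 2, 3, 4, 5]

# Multiplying the list itself repeats it rather than doubling the values.
repeated = list_of_numbers * 2
print(repeated)
# [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

# Doubling each value requires visiting every item in turn.
doubled = [number * 2 for number in list_of_numbers]
print(doubled)
# [2, 4, 6, 8, 10]
```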

The array of numbers shown here is a one dimensional array. A special case of arrays is the matrix, which is two dimensional. Like data frames, matrices store values in columns and rows. Matrices are encountered in computation and are used to store numeric values.

Loops and iterators

Loops and iterators are great for performing repeated tasks. In Python, users will see for and while loops. To learn about loops, first reset sequencing_list to its original contents and then add a few more items to it. To add multiple items to a Python list, use the .extend method.

sequencing_list=["whole genome", "rna", "whole exome"]
sequencing_list.extend(["chip", "atac"])
sequencing_list
['whole genome', 'rna', 'whole exome', 'chip', 'atac']

The following for loop will print elements with index 2, 3, and 4 from sequencing_list and can be explained as follows.

  • for is a type of loop to iterate over repetitive tasks in Python. To use the for loop,
    • An index is needed to keep track of where the loop is in the repetitive task. For instance, this index can tell the loop which item in a list it is currently working on. The index can be named anything. This example will use i as it is very common across computing.
    • Next, the loop needs to know the starting and ending point for the repetitive task. The example below uses range(2,5), which produces 2, 3, and 4; the end value 5 is not included. Thus, the index i will initially take on the value of 2, then increment by 1 in each pass of the loop, and the loop stops before i reaches 5.
    • A ":" follows the for loop line. The action for the for loop is written on the next line and indented. In the example below, the action is to print the ith item in sequencing_list.
for i in range(2,5):
    print(sequencing_list[i])
whole exome
chip
atac
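
Two details from the loop above can be checked directly: which values range(2,5) produces, and how an index lines up with each item. The sketch below uses the built-in enumerate function, which is not used elsewhere in this lesson, to pair every item with its index.

```python
sequencing_list = ["whole genome", "rna", "whole exome", "chip", "atac"]

# range(2, 5) produces 2, 3 and 4; the end value 5 is excluded.
indices = list(range(2, 5))
print(indices)
# [2, 3, 4]

# enumerate yields (index, item) pairs, so no manual indexing is needed.
labeled = []
for i, sequence_type in enumerate(sequencing_list):
    labeled.append((i, sequence_type))
print(labeled[2:5])
# [(2, 'whole exome'), (3, 'chip'), (4, 'atac')]
```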

The start and end in a for loop do not necessarily need to be numeric. The following will loop through sequencing_list and print each element. In the loop below, sequence_type takes the place of the index and holds each element of the list in turn.

for sequence_type in sequencing_list:
    print(sequence_type)
whole genome
rna
whole exome
chip
atac

There is also the while loop. The example below will print the first four items in sequencing_list using while. Just like the for loop, the while loop needs an index to help it keep track of where it is in the task. Here, the index is i and it is initialized with the value 0 outside the while loop. Next, the while loop will proceed to print the ith item in sequencing_list as long as i is less than 4. The index i is incremented by 1 inside the while loop.

i=0
while i < 4:
    print(sequencing_list[i])
    i=i+1
whole genome
rna
whole exome
chip

What would happen if i was initialized to 4 and the while loop iterated until i reached 0?

i=4
while i >= 0:
    print(sequencing_list[i])
    i=i-1

The above while loop will just print the items in sequencing_list in reverse order.

atac
chip
whole exome
rna
whole genome
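
The same reverse-order result can be reached without managing an index by hand. A short sketch using the built-in reversed function and a slice with a step of -1, neither of which is used elsewhere in this lesson:

```python
sequencing_list = ["whole genome", "rna", "whole exome", "chip", "atac"]

# reversed() walks the list from the last item to the first.
reversed_items = []
for sequence_type in reversed(sequencing_list):
    reversed_items.append(sequence_type)
print(reversed_items)
# ['atac', 'chip', 'whole exome', 'rna', 'whole genome']

# A slice with a step of -1 builds a reversed copy in one step.
reversed_slice = sequencing_list[::-1]
print(reversed_slice == reversed_items)
# True
```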

A for loop can be used to work around the fact that sequencing_list[0,1] did not subset the first and second items in sequencing_list. In the construct below, to_subset holds a list containing 0 and 1, which correspond to the indices of the first and second items in sequencing_list. In the following line, the for loop iterates through to_subset and sequencing_list[i] retrieves the item at each of those indices.

to_subset=[0,1]
[sequencing_list[i] for i in to_subset]
['whole genome', 'rna']

The map command can also be used to subset the first and second items in sequencing_list.

Definition

"The map() function is used to apply a given function to every item of an iterable, such as a list or tuple, and returns a map object (which is an iterator)." -- https://www.geeksforgeeks.org/python-map-function/?ref=lbp

list(map(sequencing_list.__getitem__, [0,1]))
['whole genome', 'rna']
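
Another option from the Python standard library is operator.itemgetter, which builds a function that retrieves the items at the given indices. This is offered only as an alternative sketch; note that it returns a tuple rather than a list.

```python
from operator import itemgetter

sequencing_list = ["whole genome", "rna", "whole exome", "chip", "atac"]

# itemgetter(0, 1) creates a function that fetches indices 0 and 1.
first_two = itemgetter(0, 1)(sequencing_list)
print(first_two)
# ('whole genome', 'rna')

# Convert to a list if a list is needed.
first_two_list = list(first_two)
print(first_two_list)
# ['whole genome', 'rna']
```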

What if the user wanted to add the word "sequencing" at the end of each sequencing type in sequencing_list? To do this, the map function can be used to iterate through sequencing_list and lambda can be used to define the function that adds " sequencing" to the end of every item in sequencing_list.

Definition

"A lambda function is a small anonymous function. A lambda function can take any number of arguments, but can only have one expression." -- https://www.w3schools.com/python/python_lambda.asp

In the example below, lambda is used to define a function that adds " sequencing" to whatever value is passed to the variable sl. In this instance, each item of sequencing_list, the last argument in the map function, is passed to sl in turn.

list(map(lambda sl: sl+" sequencing", sequencing_list))
['whole genome sequencing', 'rna sequencing', 'whole exome sequencing', 'chip sequencing',
 'atac sequencing']

Another example of combining map and lambda to iterate over a task is shown in the commands below, where every entry in numbers_list1 will be squared. The result is assigned back to numbers_list1 so that the list holds the squared values.

numbers_list1=[1,2,3,4,5,6]
numbers_list1=list(map(lambda j: j**2, numbers_list1))
numbers_list1
[1, 4, 9, 16, 25, 36]

An alternative for squaring every element in numbers_list1 is to use a list comprehension, which essentially allows a one-line for loop to complete the task.

numbers_list1=[1,2,3,4,5,6]
numbers_list1=[j**2 for j in numbers_list1]
numbers_list1
[1, 4, 9, 16, 25, 36]
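
A list comprehension can also carry a condition, so that only some items are processed. The sketch below squares only the even numbers; the even-number filter is an illustrative choice and not part of the original exercise.

```python
numbers_list1 = [1, 2, 3, 4, 5, 6]

# The trailing if keeps only the items for which the condition is True.
even_squares = [j**2 for j in numbers_list1 if j % 2 == 0]
print(even_squares)
# [4, 16, 36]

# The original list is left unchanged because the result was
# assigned to a new variable.
print(numbers_list1)
# [1, 2, 3, 4, 5, 6]
```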

Dictionaries

Dictionaries store key-value pairs and are often used to specify options in Python packages.

my_dictionary={"apples":"red","oranges":"orange","bananas":"yellow"}

Subsetting a dictionary

There are several methods for subsetting a dictionary. See https://www.geeksforgeeks.org/get-a-subset-of-dict-in-python/.

First, just enclosing one of the keys in square brackets will retrieve its associated value.

my_dictionary['bananas']
'yellow'
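
Square brackets raise a KeyError when the key is absent. The dictionary's .get method is a gentler option that returns None, or a chosen default, instead. In the sketch below, "grapes" and "unknown" are hypothetical values used only to show the missing-key case.

```python
my_dictionary = {"apples": "red", "oranges": "orange", "bananas": "yellow"}

# .get behaves like square brackets when the key exists.
banana_color = my_dictionary.get("bananas")
print(banana_color)
# yellow

# For a missing key, .get returns None unless a default is supplied.
missing = my_dictionary.get("grapes")
with_default = my_dictionary.get("grapes", "unknown")
print(missing, with_default)
# None unknown

# The in operator checks whether a key is present.
has_grapes = "grapes" in my_dictionary
print(has_grapes)
# False
```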

A for loop can be used to subset a dictionary as well. In the example below, a new dictionary called apples_bananas is created just to hold the key and value pairs for apples and bananas in my_dictionary. To do this, follow the steps below.

  1. Create a variable holding a list of the dictionary keys to extract. In this example, the variable will be named keys_to_extract and the list will contain apples and bananas, which are keys in my_dictionary.
  2. Next, create an empty dictionary called apples_bananas by setting it to {}.
  3. In the for loop, iterate through keys_to_extract using the variable k to keep track of progress. If k is in my_dictionary, then use the dictionary's .update method to write the key and its value into apples_bananas. apples_bananas can be written to because Python dictionaries are mutable.
keys_to_extract = ['apples', 'bananas']
apples_bananas={}
for k in keys_to_extract:
    if k in my_dictionary:
        apples_bananas.update({k: my_dictionary[k]})
apples_bananas
{'apples': 'red', 'bananas': 'yellow'}

The above for loop can be condensed to a one liner using dictionary comprehension.

keys_to_extract = ['apples', 'bananas']
apples_bananas={k: my_dictionary[k] for k in keys_to_extract if k in my_dictionary}
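
The if k in my_dictionary part of the comprehension matters when a requested key does not exist. The sketch below adds a hypothetical key, "grapes", to the list of keys to show that it is skipped rather than causing a KeyError.

```python
my_dictionary = {"apples": "red", "oranges": "orange", "bananas": "yellow"}

# "grapes" is a hypothetical key that is not in my_dictionary.
keys_to_extract = ["apples", "bananas", "grapes"]

# Keys that are missing from my_dictionary are skipped by the if clause.
apples_bananas = {k: my_dictionary[k] for k in keys_to_extract if k in my_dictionary}
print(apples_bananas)
# {'apples': 'red', 'bananas': 'yellow'}
```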

An alternative to using a for loop is to use Python's zip and map commands.

Definition

"The zip() function in Python combines multiple iterables such as lists, tuples, strings, dict etc, into a single iterator of tuples. Each tuple contains elements from the input iterables that are at the same position." -- https://www.geeksforgeeks.org/zip-in-python/

To demonstrate zip, consider the lists below.

a1=[1,2,3]
a2=[3,4,5]
list(zip(a1,a2))

This returns a list in which the first, second, and third items in a1 and a2 are paired together.

[(1, 3), (2, 4), (3, 5)]
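
Two further behaviors of zip are worth a quick sketch: it stops at the end of the shortest input, and its pairs can be passed straight to dict. The lists fruits and colors below are hypothetical examples.

```python
fruits = ["apples", "oranges", "bananas"]
colors = ["red", "orange", "yellow", "green"]   # one extra item

# zip stops when the shortest input runs out, so "green" is dropped.
pairs = list(zip(fruits, colors))
print(pairs)
# [('apples', 'red'), ('oranges', 'orange'), ('bananas', 'yellow')]

# dict turns each (key, value) pair into a dictionary entry.
fruit_colors = dict(zip(fruits, colors))
print(fruit_colors)
# {'apples': 'red', 'oranges': 'orange', 'bananas': 'yellow'}
```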

Next, recall that the map command takes an iterable item like a list and performs a certain function with it.

keys_to_extract = ['apples', 'bananas']
list(map(my_dictionary.get,keys_to_extract))

The above commands will return a list with the values for apples and bananas in my_dictionary. The map function uses the dictionary's .get method to retrieve the values for the keys listed in keys_to_extract.

['red', 'yellow']

Given that zip performs element-wise combination on iterables such as lists, it can be used to generate key and value pairs from keys_to_extract and my_dictionary using the command below, where dict is used to create a dictionary from those pairs.

dict(zip(keys_to_extract, map(my_dictionary.get, keys_to_extract)))
{'apples': 'red', 'bananas': 'yellow'}

Updating a dictionary

Use a dictionary's .update method to add values.

my_dictionary.update({'pears': 'green'})

OR

my_dictionary['pears']='green'
my_dictionary
{'apples': 'red', 'oranges': 'orange', 'bananas': 'yellow', 'pears': 'green'}

To add multiple items to a dictionary, use .update.

my_dictionary.update({'avocado': 'green', 'kiwis': 'brown'})
my_dictionary
{'apples': 'red', 'oranges': 'orange', 'bananas': 'yellow', 'pears': 'green', 'avocado': 'green', 'kiwis': 'brown'}

The dictionary's .pop method can be used to remove an item. It returns the value of the key that was removed.

my_dictionary.pop('pears')
'green'
my_dictionary
{'apples': 'red', 'oranges': 'orange', 'bananas': 'yellow', 'avocado': 'green', 'kiwis': 'brown'}

To delete multiple items, just create a list of keys to remove and assign this list to a variable. Below, keys_to_remove will be used to store avocado and kiwis, which are keys from my_dictionary to remove.

keys_to_remove=['avocado', 'kiwis']
list(map(my_dictionary.pop, keys_to_remove))
['green', 'brown']
my_dictionary
{'apples': 'red', 'oranges': 'orange', 'bananas': 'yellow'}
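
Calling .pop with a key that is not in the dictionary raises a KeyError unless a default value is given as the second argument. A minimal sketch, in which "grapes" is a hypothetical key used only to show the missing-key case:

```python
my_dictionary = {"apples": "red", "oranges": "orange", "bananas": "yellow"}

# pop removes the key and returns its value.
removed = my_dictionary.pop("oranges")
print(removed)
# orange

# With a default, a missing key returns the default instead of an error.
missing = my_dictionary.pop("grapes", None)
print(missing)
# None

print(my_dictionary)
# {'apples': 'red', 'bananas': 'yellow'}
```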