Lesson 2: Python data types and structures

Learning objectives

After this class, participants will

  • Be able to describe some common Python data types and structures
  • Be able to identify Python data types
  • Become familiar with variable assignment
  • Be able to use conditional operators and if-else statements
  • Be able to load packages
  • Know how to import tabular data
  • Know how to view tabular data
  • Become familiar with constructing a for loop in Python

Signing onto Biowulf

Sign onto Biowulf using the ssh command. Replace username with user's Biowulf ID.

ssh username@biowulf.nih.gov

Change into data directory and copy course data

Replace username with user's Biowulf ID.

cd /data/username

The cp command below copies the pies_2023_data folder from /data/classes/BTEP/ to the user's data directory (denoted by ".", which should be the present working directory) and saves it as a folder called pies_2023.

cp -r /data/classes/BTEP/pies_2023_data ./pies_2023

Change into pies_2023.

cd pies_2023

Request interactive session

Stay in the /data/username/pies_2023 folder and request an interactive session using sinteractive with the following options.

  • --gres=lscratch:5: to allocate 5 GB of local temporary/scratch storage space
  • --mem=2g: to request 2 GB of memory (RAM)
  • --tunnel: to open a channel of communication between the local machine and Biowulf, which allows interaction with applications like Jupyter Lab

sinteractive --gres=lscratch:5 --mem=2g --tunnel

After resources for the interactive session have been granted, users will see information similar to that shown in Figure 1.

Figure 1: After interactive session resources have been allocated, users will see a ssh command that looks like that enclosed in the red rectangle. Open a new terminal (if working on a Mac) or command prompt (if working on a Windows computer) and then copy and paste this ssh command into the new terminal.

After copying and pasting the ssh command shown in Figure 1 to a new terminal or command prompt, hit enter to supply password and log in to Biowulf. This will complete the tunnel.

Figure 2: Hit enter after copying and pasting the ssh command to a new terminal to provide password and log into Biowulf. This will complete the tunnel.

Figure 3: In the ssh command shown in Figure 1 and Figure 2, the numbers preceding and following "localhost" will differ depending on user. Also, the Biowulf username will differ for each user (wuz8 is the instructor's Biowulf username).

Load Jupyter

Warning

Make sure to stay in the /data/username/pies_2023 folder for this step.

After the tunnel has been created, go back to the terminal (Mac) or command prompt (Windows) with the Biowulf interactive session and load Jupyter (see Figure 4).

module load jupyter

Figure 4: Go back to the terminal (Mac) or command prompt (Windows) with the interactive session (look for cn#### at the prompt). Do module load jupyter from here.

Start Jupyter Lab

Warning

Make sure to stay in the /data/username/pies_2023 folder for this step.

Use the command below to start a Jupyter Lab session. Copy and paste either of the http links to a local browser to interact with Jupyter (see Figure 5).

jupyter lab --ip localhost --port $PORT1 --no-browser

Figure 5: Start a Jupyter lab session using jupyter lab --ip localhost --port $PORT1 --no-browser and copy and paste either one of the http links to a local browser.

Python data types and data structures

An important step in learning any new programming language for data analysis is understanding its data types and data structures. Common data types and structures include the following.

  • Text (str)
  • Numeric
    • int (i.e., integers)
    • float (i.e., decimals)
  • Boolean (True or False)
    • conditionals
    • filtering criteria
    • command options
  • Data frames
  • Lists
  • Arrays
  • Tuples
  • Range
  • Dictionaries

Identifying data type and structure in Python

The command type can be used to identify data types and structures in Python.

type(100)
int
type(3.1415926)
float
type("bioinformatics")
str
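
The type command works the same way on the data structures listed above. The short sketch below is only an illustration; the values in examples are made up for this purpose and do not come from the course data.

```python
# type() reports the class of any object, including data structures.
# The values below are arbitrary examples.
examples = [True, [1, 2, 3], (1, 2, 3), range(3), {"a": 1}]

# __name__ gives the short name of each type (for instance "list").
type_names = [type(item).__name__ for item in examples]

for item, name in zip(examples, type_names):
    print(repr(item), "is of type", name)
```

Each of these types is described in more detail later in this lesson.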

Variable assignments

In Python, values are assigned to variables using "=", with the variable name on the left and the value on the right. Variables can hold integers, floats, or strings.

perfect=100
perfect
100
mole=6.02e23
mole
6.02e+23
btep_class="Python Introductory Education Series"
btep_class
'Python Introductory Education Series'

The command type(btep_class) will return str because the variable btep_class is text.

type(btep_class)
str

Conditionals

Conditionals evaluate whether a condition is true and return a Boolean value. Comparison and logical operators include:

  • ==: is equal to?
  • >: is greater than?
  • >=: is greater than or equal to?
  • <: is less than?
  • <=: is less than or equal to?
  • !=: is not equal to?
  • and: are both conditions true?
  • or: is at least one condition true?

The command below evaluates whether the variable perfect is equal to the variable mole and returns the Boolean value False.

perfect==mole
False
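
The and and or operators combine several comparisons into one Boolean value. The sketch below reuses the values of perfect and mole from the variable assignment section; the variable names both, neither, either, and chained are introduced here only for illustration.

```python
perfect = 100
mole = 6.02e23

# "and" is True only when both comparisons are True.
both = perfect > 0 and perfect < mole
neither = perfect > mole and perfect > 0

# "or" is True when at least one comparison is True.
either = perfect == mole or perfect != mole

# Python also allows comparisons to be chained.
chained = 0 < perfect < mole

print(both, neither, either, chained)
```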

If statements are also conditionals and are used to instruct the computer to do something if a condition is met. To have the computer do something when the condition is not met, use elif (else if) or else.

The command below will accomplish the following:

  • Use if to evaluate whether perfect==mole. If so, use print to indicate that the two variables are equal.
  • If perfect does not equal mole, use elif (which stands for else if) to evaluate whether perfect>mole. If so, use print to indicate that perfect is greater than mole.
  • Use else for the case where neither of the previous two conditions is met, and use print to indicate that perfect is less than mole.

if perfect==mole:
    print(perfect, "is equal to", mole)
elif perfect>mole:
    print(perfect, "is greater than", mole)
else:
    print(perfect, "is less than", mole)
100 is less than 6.02e+23

Note

The print command prints the value of a variable when the variable name is not enclosed in quotes. Text enclosed in quotes is printed as is.

A ":" is required after if, elif, and else. The command(s) to execute when a condition is met are placed on the following line(s) and must be indented (four spaces by convention).

Data frames

Often, in bioinformatics and data science, data comes in the form of rectangular tables, which are referred to as data frames. Data frames have the following properties.

  • Study variable(s) form the columns
  • Observation(s) form rows
  • Can have a mix of data types (strings and numeric) but each column/study variable can contain only one data type
  • Limited to one value per cell

A popular package for working with data frames in Python is Pandas.

To load a Python package, use the import command followed by the package name (i.e., pandas).

import pandas

Package names can be long, so users often shorten them with an alias. The alias "pd" is commonly used for the Pandas package. To add an alias, append as followed by the chosen alias to the package import command.

import pandas as pd

Importing tabular data with Pandas

This exercise will use the read_csv function of Pandas to import a comma separated value (csv) file called hbr_uhr_chr22_rna_seq_counts.csv, which contains RNA sequencing gene expression counts from the Human Brain Reference (hbr) and Universal Human Reference (uhr) study.

hbr_uhr_chr22_counts=pandas.read_csv("./hbr_uhr_chr22_rna_seq_counts.csv")

Note

If a Python package was imported using an alias (i.e., pd for Pandas), then use the alias to call the package. For instance, use pd.read_csv rather than pandas.read_csv when the pd alias is used for Pandas.

Take note of the way the csv import command is constructed. First the user specifies the name of the package (i.e., pandas) and then the function within the package (i.e., read_csv). The package name and function name are separated by a period.

Next, use type to find out the data type or structure for hbr_uhr_chr22_counts.

type(hbr_uhr_chr22_counts)
pandas.core.frame.DataFrame

Take a look at the first few rows of hbr_uhr_chr22_counts.

hbr_uhr_chr22_counts.head()

Figure 1: Example of a data frame.

Because hbr_uhr_chr22_counts is a Pandas data frame, it is possible to append one of the many Pandas functions to it. For instance, the head function was appended to display the first five rows of hbr_uhr_chr22_counts. The data frame name and the function are separated by a period. This dot notation is perhaps one of the most appealing aspects of Python syntax. Note that the head function is followed by (). If the parentheses are empty, the first five rows are shown by default. There will be more examples of the Pandas head function in a subsequent lesson.
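
The import and viewing steps can be tried without the course file. The sketch below is a minimal illustration that assumes Pandas is installed; the csv text, the column names (gene, sample_1, sample_2), and the counts are hypothetical and are not taken from hbr_uhr_chr22_rna_seq_counts.csv.

```python
import io
import pandas as pd

# Hypothetical csv content standing in for a real file on disk.
csv_text = """gene,sample_1,sample_2
gene_a,10,12
gene_b,0,3
gene_c,25,30
gene_d,7,5
"""

# read_csv accepts a file path or a file-like object such as StringIO.
counts = pd.read_csv(io.StringIO(csv_text))

# Passing a number to head changes how many rows are returned.
first_two = counts.head(2)
print(first_two)

print(counts.shape)   # (number of rows, number of columns)
print(counts.dtypes)  # one data type per column
```

Here the pd alias is used, so the function is called as pd.read_csv.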

Lists and tuples

Lists and tuples are one dimensional collections of data. The tuple is an immutable list, in which the elements cannot be modified.

To create a list, enclose the contents in square brackets.

sequencing_list=["whole genome", "rna", "whole exome"]

To create a tuple, enclose the contents in parentheses.

sequencing_tuple=("whole genome", "rna", "whole exome")

Lists and tuples are indexed and can contain duplicates. The first item in a list or tuple has an index of 0, the second item has an index of 1, and the last item has an index of n-1 where n is the number of items. Indices can be used to recall items in a list or tuple.

sequencing_list[1]
'rna'
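
The indexing rules described above can be checked with the len command, which returns the number of items. The sketch below is an illustration; it also uses the negative index -1, which Python accepts as a shortcut for the last item.

```python
sequencing_list = ["whole genome", "rna", "whole exome"]

n_items = len(sequencing_list)            # number of items (n)
first_item = sequencing_list[0]           # index 0 is the first item
last_item = sequencing_list[n_items - 1]  # index n-1 is the last item
also_last = sequencing_list[-1]           # -1 counts from the end

print(n_items, first_item, last_item, also_last)
```

The same indexing works on tuples.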

Lists versus tuples (mutable versus immutable)

sequencing_list[1]="single cell RNA"
sequencing_list
['whole genome', 'single cell RNA', 'whole exome']
sequencing_tuple[1]="single cell RNA"
TypeError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 sequencing_tuple[1]="single cell RNA"

TypeError: 'tuple' object does not support item assignment
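
Since a tuple cannot be changed in place, one possible workaround is to build a new tuple from a modified copy. The sketch below shows this approach under that assumption; the names as_list and updated_tuple are introduced here for illustration.

```python
sequencing_tuple = ("whole genome", "rna", "whole exome")

# Catch the error instead of stopping the program.
try:
    sequencing_tuple[1] = "single cell RNA"
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Convert to a list, modify the copy, then convert back to a tuple.
as_list = list(sequencing_tuple)
as_list[1] = "single cell RNA"
updated_tuple = tuple(as_list)

print(sequencing_tuple)  # the original tuple is unchanged
print(updated_tuple)
```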

Instructions for modifying Python lists can be found at W3Schools.

Arrays

Mathematical operations on a list of numbers do not always behave as one might expect. For instance, consider the following list.

list_of_numbers=[1,2,3,4,5]

Multiplying list_of_numbers by 2 repeats the list instead of doubling each number, so the result is not the expected [2,4,6,8,10]. To double every number, convert the list to an array using the package numpy.

list_of_numbers*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Import numpy, then use its array function to convert list_of_numbers to an array called array_of_numbers.

import numpy
array_of_numbers=numpy.array(list_of_numbers)
array_of_numbers*2
array([ 2,  4,  6,  8, 10])

The array of numbers shown here is a one dimensional array. A special case of arrays is the matrix, which is two dimensional. Like data frames, matrices store values in rows and columns. Matrices are common in computation and are used to store numeric values.
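
A two dimensional array can be created by passing a list of lists to the array function. The sketch below assumes numpy is installed; the numbers and the names matrix, doubled, and row_sums are made up for illustration.

```python
import numpy

# Each inner list becomes one row of the two dimensional array.
matrix = numpy.array([[1, 2, 3],
                      [4, 5, 6]])

print(matrix.shape)  # (rows, columns)

# Arithmetic is applied to every element, as with the one dimensional array.
doubled = matrix * 2
print(doubled)

# axis=1 sums across the columns, giving one total per row.
row_sums = matrix.sum(axis=1)
print(row_sums)
```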

Range

Ranges can be used for subsetting data (i.e., extracting rows 5 through 10 of a data frame) or for iterating over a task, as in a for loop.

For instance, a for loop can iterate over sequencing_list_new and print the 3rd to 5th entries. Because indexing starts at 0 and range excludes its end value, range(2,5) produces the indices 2, 3, and 4.

sequencing_list_new=["whole genome", "rna", "whole exome","single cell rna", "chip", "atac", "cite", "single cell chip", "single cell atac"]
for i in range(2,5):
    print(sequencing_list_new[i])
whole exome
single cell rna
chip
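
To see which numbers a range produces, convert it to a list. The sketch below is an illustration using the same sequencing_list_new; the names indices, sliced, and every_other are introduced here. It also shows slicing, which follows the same start and end rule as range.

```python
sequencing_list_new = ["whole genome", "rna", "whole exome", "single cell rna",
                       "chip", "atac", "cite", "single cell chip", "single cell atac"]

# range(start, end) includes start but excludes end.
indices = list(range(2, 5))
print(indices)

# A slice uses the same convention and returns the 3rd to 5th entries.
sliced = sequencing_list_new[2:5]
print(sliced)

# An optional third value sets the step size.
every_other = list(range(0, len(sequencing_list_new), 2))
print(every_other)
```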

Dictionaries

Dictionaries store key-value pairs. They are often encountered as a way to specify options in some Python packages.

my_dictionary={"apples":"red","oranges":"orange","bananas":"yellow"}
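
Values in a dictionary are retrieved by key rather than by position. The sketch below is an illustration using my_dictionary; the added "grapes" entry and the sort_options dictionary are made up here. The last step shows one way a dictionary can carry options, by unpacking it into the keyword arguments of the built-in sorted function.

```python
my_dictionary = {"apples": "red", "oranges": "orange", "bananas": "yellow"}

# Look up a value by its key.
apple_color = my_dictionary["apples"]

# get returns a fallback value instead of an error when a key is missing.
kiwi_color = my_dictionary.get("kiwis", "unknown")

# Add a new key-value pair, then loop over all pairs.
my_dictionary["grapes"] = "purple"
for fruit, color in my_dictionary.items():
    print(fruit, "are", color)

# A dictionary of options can be unpacked into keyword arguments with **.
sort_options = {"reverse": True}
sorted_fruits = sorted(my_dictionary, **sort_options)
print(sorted_fruits)
```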