Getting Started with Python
Joe Wu, PhD
NCI/CCR Bioinformatics Training and Education Program
ncibtep@nih.gov
Lesson 1 Learning Objectives
After this class, participants will be able to:
- Describe Python and provide rationale for using Python
- List tools for interacting with Python
- Sign onto Biowulf, start a Jupyter Lab session, and become familiar with the Jupyter Lab interface.
- Describe Python command syntax
- Describe where to get and how to install external packages
- Get help for Python commands
Why use Python?
- General purpose scripting language
- Analyze and visualize large datasets
- Reusability and reproducibility
- Versioning and keeping track of changes is possible when analyzing data using scripts
- Easy to learn
- External packages that enhance functionality
- Large community support
Python enables Elegant Data Visualization
An abundance of external packages makes scientific computing and data presentation easy. For instance, the packages matplotlib and seaborn are good tools for generating data visualizations. With a few lines of code, scientists can generate scatter plots to view relationships between variables and/or heatmaps that can reveal distinct clusters in a dataset.
Generating a Scatter Plot using Matplotlib
import matplotlib.pyplot as plt
import numpy
x=numpy.array([0,1,2,3,4,5,6,7,8])
y=numpy.array([0.5,2,5,6,7,10,13,14,16])
plt.scatter(x,y)
slope, intercept=numpy.polyfit(x,y,1)
plt.plot(x,slope*x+intercept)
plt.text(1,14,f"y={round(slope,3)}x + {round(intercept,3)}")
plt.xlabel('x')
plt.ylabel('y')
Generating a Gene Expression Heatmap using Seaborn
import pandas
import seaborn
import matplotlib.pyplot as plt
counts1=pandas.read_csv("./hbr_uhr_top_deg_normalized_counts.csv", index_col=[0])
seaborn.clustermap(counts1,z_score=0,cmap="viridis", figsize=(5,5))
plt.suptitle("Gene expression heatmap",y=1.1)
plt.show()
Tools for Interacting with Python
- Python can be run at the command prompt
- IPython
- Run a Python script at the command prompt
- Integrated Development Environments such as:
- Visual Studio Code from Microsoft has extensions that support Python scripting
- RStudio
- Jupyter Lab/Notebook
Python at the Command Prompt
Assuming Python is installed, just type python at the command prompt to start using Python. Hit control-d to exit back to the command prompt. The downside to this is that users cannot save the commands into a script.
IPython
IPython enables users to run Python commands interactively at the terminal. It features autocompletion of commands and allows commands to be saved to a Python script using %save followed by the name of the script.
Hit control-d to exit IPython and return to the command prompt.
While IPython is better than running commands at the plain Python prompt, it is still not very efficient for saving work. Also, users will not be able to view plots on HPC systems such as Biowulf, since these do not support inspection of graphical output.
Note
The pies_class_2025_ipython.py script can be run from the command line. To run a Python script from the command line, type python followed by the name of the script. Python scripts can also be submitted as a job to the Biowulf batch system.
python pies_class_2025_ipython.py
hello
3.141592653589793
5.0
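The contents of pies_class_2025_ipython.py are not shown in this lesson. As a minimal sketch, assuming the script simply prints a string, the value of pi, and a square root, a file along these lines would produce output like that shown above. The function name main and the specific calls are hypothetical, and the real script may differ.

```python
# Hypothetical stand-in for a small script saved from an IPython session.
# The actual pies_class_2025_ipython.py may contain different commands.
import math


def main():
    print("hello")        # a plain string
    print(math.pi)        # the constant pi from the standard library
    print(math.sqrt(25))  # square root, returned as a float


# Run main() only when the file is executed as a script, e.g.
# python name_of_script.py
if __name__ == "__main__":
    main()
```

Saving this text to a file ending in .py and running it with python followed by the file name prints the three values, one per line.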
Using Python through an IDE
Integrated Development Environments, or IDEs, are ideal for scripting in Python as well as other languages. See https://ritza.co/comparisons/pycharm-vs-spyder-vs-jupyter-vs-visual-studio-vs-anaconda-vs-intellij.html for a breakdown of common ones such as Spyder, PyCharm, VS Code, RStudio, and Jupyter Lab. Essentially, IDEs enable users to write scripts, access and view data, and view plots. Some IDEs enable users to generate an analysis report that details the steps of an analysis as well as the tools and code used.
Accessing Python at NIH
- Biowulf (HPC OnDemand is recommended).
- Use Python locally on a government-furnished personal computer via the NIH Anaconda Professional License. This requires users to install Anaconda on the local computer. See BTEP's Topic Spotlight on NIH's Anaconda Professional License to learn more.
- NCI scientists also can use Python through Posit Workbench. Fill out the form at https://forms.office.com/pages/responsepage.aspx?id=eHW3FHOX1UKFByUcotwrBnYgWNrH6QdOsCsoiQ9eiaZUQ1ZZODJKT0FERUdHOVZYUkJaMzA2UDAxSi4u&route=shorturl to request access.
Signing onto Biowulf HPC OnDemand
- Open a web browser on a local computer (Google Chrome is recommended) and go to https://hpcondemand.nih.gov/, which is the URL for Biowulf's HPC OnDemand.
- Once at HPC OnDemand, sign in with participant's NIH PIV card credentials.
- After signing in, users will see links to applications available through HPC OnDemand such as Jupyter. User's Biowulf file system can be accessed via OnDemand as well (just click Files in the menu bar).
Get the Example Data
Go to the course overview section in the class documents, scroll to the bottom, and click on "Download data used in this course" to download the example data to a local computer. Take note of where it downloads; typically it goes into the Downloads folder on the user's local computer. The data comes as a zip file. While Macs will automatically unzip it, Windows users may need to right click on the file to uncompress it. After the download and uncompressing are finished, participants will see a folder called pies_data.
Next, click "Files" in the HPC OnDemand menu and choose the folder labeled /data/user, where user is the participant's Biowulf user ID. The subsequent page will show the contents (i.e., files and folders) of the participant's Biowulf data directory. Click on the "New Directory" tab to make a folder to store the example data and Jupyter Notebook for this course series.
Name the class directory pies_data. Click "Ok" when ready.
Start a Jupyter Lab Session
Navigate back to the HPC OnDemand website by clicking "HPC OnDemand" at the top left corner. Then click on the "Jupyter" tab to launch a Jupyter Lab session.
- The subsequent page allows users to specify compute resources. Leave these as is for this class. Also, be sure that the radio button for Jupyter Lab is selected.
- Make sure to specify for Jupyter to start in the /data/$USER/pies_data directory, where $USER is a variable that points to the participant's Biowulf user ID.
Click on "Connect to Jupyter" when the Jupyter Lab session has been granted.
Users will see an interface that looks like the image shown below. The left hand panel is the file explorer. Users can navigate through files and folders that are available in the directory in which Jupyter Lab was started. The launcher panel contains quick links for initiating a Jupyter Notebook in the user's language of choice.
Note that the file explorer is empty. The next step is to open the pies_data folder in the participant's local Downloads directory, select all of the csv files in the folder, and drag and drop them into the Jupyter Lab file explorer, or use the upload button.
Create a new Jupyter Notebook
Create a new Jupyter Notebook in Python 3.12 (click on the "python/3.12" tile). The new notebook has the name "Untitled.ipynb". Click on the disk icon in the notebook menu bar to rename it pies_class_2025.
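As a first cell in the new notebook, it can be useful to confirm that the uploaded csv files are visible to Python. The following is a minimal sketch, assuming the notebook was started in the directory that holds the uploaded files; the helper name list_csv_files is hypothetical and only uses the standard library.

```python
from pathlib import Path


def list_csv_files(folder="."):
    """Return the sorted names of the csv files found directly in folder."""
    return sorted(p.name for p in Path(folder).glob("*.csv"))


# "." is the directory the notebook is running in
# (assumed here to be the pies_data directory).
for name in list_csv_files("."):
    print(name)
```

If nothing is printed, the upload may have gone to a different folder than the one Jupyter Lab was started in.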
Tip
For a detailed overview of Jupyter Lab, see BTEP's Documenting Analysis Steps using Jupyter Lab
Python Command Syntax
Arguments and options for Python commands are enclosed in parentheses. In general, the anatomy is command(argument, option).
For example, the command below is print, and it will display the argument, "Hello BTEP", as output.
print("Hello BTEP")
Hello BTEP
To get help for a Python command, use help.
For instance:
help(print)
From the print command's help information, line breaks can be added using \n. Try the following to print four sentences, one on each line.
print("Python can make data analysis more efficient.\n"
"It helps with reusability and reproducibility.\n"
"There is strong community support.\n"
"External packages are available for data wrangling and visualization.")
Python can make data analysis more efficient.
It helps with reusability and reproducibility.
There is strong community support.
External packages are available for data wrangling and visualization.
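The command(argument, option) pattern can also be seen with print itself. The sketch below uses sep and end, two options listed in the output of help(print); the sample strings are only illustrative.

```python
# Arguments come first; options are given by name after them.
print("Python", "is", "reusable")           # Python is reusable
print("Python", "is", "reusable", sep="-")  # Python-is-reusable

# end replaces the line break that print normally adds after its output,
# so the next print continues on the same line.
print("No line break here.", end=" ")
print("Same line.")                         # No line break here. Same line.
```

Here the quoted strings are the arguments, while sep and end are options that change how the arguments are displayed.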
What is different with numpy.array, used in the earlier example to generate numeric arrays?
Answer
numpy is a Python package that has many subcommands. To call a subcommand from a package, use the general syntax of package.subcommand.
numpy has a subcommand divide. How can that be called?
Answer
numpy.divide
What does the divide subcommand from numpy do?
Answer
help(numpy.divide)
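As a small illustration of the package.subcommand syntax, the sketch below calls numpy.divide on two short arrays. The array values are made up for this example, and numpy is assumed to be installed.

```python
import numpy

numerator = numpy.array([2.0, 9.0, 10.0])
denominator = numpy.array([1.0, 3.0, 4.0])

# numpy.divide divides the first array by the second, element by element.
result = numpy.divide(numerator, denominator)

# tolist() converts the numpy array to a plain Python list for display.
print(result.tolist())  # [2.0, 3.0, 2.5]
```

Each value in the result comes from dividing the values at the same position in the two input arrays.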
Installing external packages
Python external packages are found at the Python Package Index (PyPI). To install a package from PyPI, use pip install package_name, where package_name is the package of choice. For instance, to install scipy, do:
pip install scipy
Note
Package management for Python needs to be done at the terminal.
pip is the package installer for Python. If pip is not available with the user's Python installation, see https://pip.pypa.io/en/stable/installation/ to learn how to get it.
To uninstall a package, do pip uninstall package_name.
To update a package, use pip install --upgrade package_name.
pip freeze prints a list of the currently installed packages. To find out whether a specific package is installed, do pip freeze | grep package_name.
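A similar check can be made from inside Python. The following is a minimal sketch, assuming Python 3.8 or later, where the standard library module importlib.metadata is available; the helper name installed_version is hypothetical. Unlike the terminal commands above, it reports any installed package that Python can see, regardless of which installer was used.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The second name is a placeholder that is not expected to be installed.
for name in ["scipy", "a-package-that-is-not-installed"]:
    found = installed_version(name)
    print(name, "->", found if found else "not installed")
```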
Those who choose to use the package manager Anaconda can install packages from the command line using conda install package_name. Again, package_name is the user's package of choice. Package managers offer the benefit of reducing issues that arise from versioning, dependencies, and security when installing software. See https://docs.conda.io/projects/conda/en/stable/user-guide/tasks/manage-pkgs.html to learn more about installing, updating, and uninstalling packages using Conda. For working locally on a government-furnished personal computer, researchers are encouraged to use the NIH Anaconda Professional License. Biowulf also has a guide on managing Anaconda environments on the cluster. See https://hpc.nih.gov/docs/diy_installation/conda.html.