
Python

Description

Python is a general-purpose programming language used in many different applications, including data science. It is a high-level language whose syntax is easy to read and understand, which makes it beginner-friendly. Python packages are also available for machine learning. See Datacamp for more information about Python.

Listing of Analysis Functions

Python is open source and has extensive community support, and many external packages extend its functionality.

Data wrangling

  • Python's built-in functions and standard library (for example, the csv module) can import and work with tabular data stored in CSV or TXT files.
  • Pandas is an external package for importing and working with tabular data. Supported file formats include comma-separated values (CSV), TXT, and XLS/XLSX. Pandas makes tabular data easier to work with than the built-in Python functions do.
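
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not a recommended workflow: the table contents, column names, and variable names are hypothetical, and the CSV is held in memory rather than read from a file on disk.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = "sample,gene,count\nS1,TP53,12\nS2,TP53,30\nS1,BRCA1,7\n"

# Built-in approach: csv.DictReader yields one dict per row,
# and every value is read as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_builtin = sum(int(row["count"]) for row in rows)

# Pandas approach: read_csv infers column types and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))
total_pandas = int(df["count"].sum())

# Grouping and summarizing takes one line with Pandas.
counts_per_gene = df.groupby("gene")["count"].sum()
```

With the built-in reader, type conversion and grouping have to be written by hand; Pandas handles both, which is why it is usually the more convenient choice for tabular data.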

Computing

  • NumPy is a Python package for scientific computing. NumPy allows users to perform tasks such as basic arithmetic operations, array and matrix operations, and linear algebra.
  • Math is a built-in module for basic math operations on real numbers; its companion module, cmath, handles complex numbers. Math also contains several relevant mathematical constants such as pi.
  • SciPy is another package for computing in Python. Its functions include differentiation, integration, interpolation, optimization, and image processing. Importantly, SciPy contains an extensive list of scientific constants.
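
As a brief sketch of how these packages divide the work, the example below solves a small linear system with NumPy, uses the math and cmath modules for real and complex values, and integrates a function with SciPy. The numbers and variable names are chosen only for illustration.

```python
import cmath
import math

import numpy as np
from scipy import constants, integrate

# NumPy: array operations and linear algebra.
a = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(a, b)  # solves a @ x = b

# math: real-valued functions and constants such as pi.
circle_area = math.pi * 1.5 ** 2

# cmath: the complex-number counterpart of math.
root = cmath.sqrt(-4)

# SciPy: numerical integration of sin(x) from 0 to pi.
area, abs_err = integrate.quad(math.sin, 0, math.pi)

# SciPy: scientific constants, here the Avogadro constant.
avogadro = constants.Avogadro
```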

Data visualization

  • Matplotlib is a capable and popular data visualization tool for Python.
  • Seaborn is built on top of Matplotlib and offers a higher-level, often simpler syntax. See here for some differences between Seaborn and Matplotlib.
  • Plotly makes interactive plots.
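
A minimal Matplotlib sketch is shown below. It draws a line plot and saves it as a PNG file. The data, labels, and output file name are hypothetical, and the non-interactive "Agg" backend is selected only so the example runs without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.set_title("Example line plot")
ax.legend()

# Save to a temporary directory; the output format follows the file extension.
out_path = os.path.join(tempfile.mkdtemp(), "sine.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```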

Machine learning

Molecular biosciences

  • ACTINN can be used for automated identification of cell types in single cell RNA sequencing studies. This package uses PyTorch.
  • scDeepCluster is a tool for single cell clustering that utilizes deep learning approaches using TensorFlow and Keras.
  • Scanpy is a package for single cell RNA sequencing analysis.
  • scvelo can be used for single cell velocity analysis.
  • Biopython is a package of tools for molecular biology analysis. It contains modules for sequence alignment, exploring protein 3D structure, population genetics, interfacing with databases housed at NCBI, and many more.
  • PyPop is a package for population genetics.
  • simuPOP is used for forward-time population genetics analysis.

Recommendations

Things to know

Python can be accessed via either the command line or an Integrated Development Environment (IDE) that provides a graphical user interface. Available IDEs for Python include Spyder, PyCharm, RStudio, and Microsoft's Visual Studio Code (which is also available on Biowulf).

Using a Jupyter Notebook is another way to interface with Python. A Jupyter Notebook can be viewed as a lab notebook for data analysis: it can include text-based descriptions of analysis procedures along with code. Like an IDE, a Jupyter Notebook displays outputs and visualizations alongside the code, and it is easily accessible via a web browser.

Input Data Types

Python programs can take many file formats as input, including CSV, TXT, and XLS/XLSX.

Output Data Types

Python can produce tabular data and data visualizations. Tabular data can be exported into various formats such as CSV, TXT, and XLSX, and visualizations can be exported as PNG, JPG, or TIF.
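
Exporting tabular data can be sketched with Pandas as follows. This is only an illustration: the table contents and file names are hypothetical, and the files are written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical results table.
results = pd.DataFrame({"gene": ["TP53", "BRCA1"], "mean_count": [21.0, 7.0]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "results.csv")
txt_path = os.path.join(out_dir, "results.txt")

# Comma-separated output.
results.to_csv(csv_path, index=False)

# Tab-delimited text output: same method, different separator.
results.to_csv(txt_path, sep="\t", index=False)

# Read the tab-delimited file back to confirm the export.
round_trip = pd.read_csv(txt_path, sep="\t")
```

Writing XLSX works the same way through `DataFrame.to_excel`, which requires an additional Excel engine such as openpyxl to be installed.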

Access Information

Python 2 was pre-installed on older versions of macOS; Apple removed it starting with macOS 12.3. In either case, the current version, Python 3, will need to be installed.

Python is also accessible on the NIH high performance Unix cluster Biowulf.

For Python installations on NIH laptops, please submit a ticket to service.cancer.gov.

Getting Help

The online learning platforms Coursera and Dataquest both offer Python classes. To request a license, see https://bioinformatics.ccr.cancer.gov/btep/self-learning/. Below are some recommended courses from Coursera and Dataquest for those who wish to begin learning Python.

Coursera suggestions:

Dataquest suggestions: