
Learning Objectives

  1. Learn about popular programming languages in bioinformatics
  2. Compare advantages and disadvantages of Python and R
  3. Discuss what you will need to learn to use these languages
  4. Discuss learning resources

Choosing a programming language


What is a programming language?

A programming language is a formal language that specifies a set of instructions for a computer to perform specific tasks. It's used to write software programs and applications, and to control and manipulate computer systems.

Key features of programming languages include:
* Syntax (rules and structure used to write code)
* Data types (type of values that can be stored in a program)
* Variables (named memory locations that can store values)
* Operators (symbols used to perform operations on values)
* Control Structures (statements used to control the flow of a program)
* Libraries (collections of pre-written code used to perform common tasks and speed up development)
* Paradigms (programming styles / philosophies)

Source: GeeksforGeeks

Examples include C++, C#, Perl, Java, Ruby, Python, Julia, and R.
More on paradigms, here.

Why learn programming?

Do all molecular scientists need to learn a programming language?

  • Absolutely not.

BUT

  • We are in a big data era, and learning to code can be extremely beneficial, especially if you have access to neither bioinformatics analysts who can analyze the data for you nor expensive licensed software.

Which programming language should I learn?

  1. Bash

    • Most of bioinformatics can be done by understanding specific software applications and running those applications in a pipeline, usually with some form of Bash scripting. Bash is fairly important as a scripting language for processing biological data, though it is arguably not a formal programming language.
  2. Python or R

    • Depending on your goals, you may lean toward one programming language over another. For example:
      • Interested in statistics and data visualization? R may be for you.
      • Interested in software development and machine learning? A more general language like Python may be a better fit.

Check out this video!

Tip

Ultimately, which language you choose to learn will depend on what you actually want to do with your skills. If you want to become a bioinformatics analyst and are interested in developing scripts / programs for the greater community, you should probably learn Bash, R, and Python. If you are not developing pipelines or scripts for others to use, you can probably pick your poison, though you will likely still need to know some degree of all three.


What is R?

  • released in 1993
  • a computational language and environment for statistical computing and graphics.

    • complex statistical functions easily accessible
    • easy to get started, but more difficult to master
  • Key features:

    • open-source
    • extensible (packages on CRAN (> 19,000), GitHub, and Bioconductor)
    • wide community
    • Maintained by a network of collaborators - The R Core Team

Check out more on The R Project for Statistical Computing website.


What is Python?

  • first released in 1991
  • high-level, popular, general-purpose programming language with a readable, easy-to-learn syntax
  • Key features:

    • easy to read
    • easy to learn
    • interpreted
    • multi-platform
    • wide community
    • open source libraries (> 300,000)
  • Two major versions (Python 2 and Python 3); Python 2 reached end of life in 2020, so new code should target Python 3

  • Not as easy to just start analyzing data

Check out more at https://www.python.org/.
Also, check out this primer for biologists.


Advantages of R and Python

R Programming

  • Data Visualization (Base R and ggplot2)

    • additional packages that enhance these, especially for -omics data
  • More packages for data science / bioinformatics

    • Bioconductor
  • Report generation

    • R Markdown
    • Quarto
  • more popular among scientists and academics (i.e., non-programmers)

Python

  • More consistent syntax (there is generally one preferred way to do something)
  • Large data manipulation (generally more efficient)

  • Report Generation

    • Jupyter Notebook
  • More popular among software developers and across multiple domains

Tip

There is no right answer to the question, "which programming language should I learn, R or Python?". They are both valuable programming languages with different strengths and weaknesses. Choosing one or the other will come down to several factors such as your analysis goals, the time you have to learn, and what those around you are using.

Check out this comparison from Towards Data Science, Python vs R: The Basics, author Sidney Kung:

Image from Towards Data Science, Python vs R: The Basics, author Sidney Kung


What do you need to know to learn R or Python?


Installation

If you intend to use R or Python through Biowulf, no installation is necessary.

R:

Python:


How do we execute our code?

With both R and Python, code is executed

  • interactively line by line from the command line
  • interactively in an IDE
  • as a script submitted from the command line or in an IDE

For python, to get started from the command line:

python
quit()

For R, to get started from the command line:

R
q()
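
To run code as a script rather than interactively, save it to a file and pass the file name to the interpreter. The sketch below assumes a hypothetical file named gc_content.py; the file name and the gc_content function are illustrative and not part of the original lesson.

```python
# gc_content.py (hypothetical file name)
# Run from the command line with: python gc_content.py


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


if __name__ == "__main__":
    # This block runs only when the file is executed as a script,
    # not when it is imported from another file.
    print(gc_content("ATGCGC"))
```

The same file can also be opened and run from within an IDE, which is usually more convenient while the script is still being developed.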

What is an IDE?

An IDE is an integrated development environment. IDEs generally include features such as:

  • Console
  • File access
  • Environment / variable view
  • Data view
  • Plotting window
  • History
  • Autocomplete
  • Debugging
  • Markdown

IDEs make coding easier. They increase productivity and facilitate project management. Using an IDE will allow you to more effectively organize code and results as you tackle data analysis problems.


IDEs for R and Python

R

Python


Elements of programming with Python or R

  • libraries
  • syntax
  • variables
  • functions
  • data types
  • loops and conditionals

Libraries

R Packages can be found at:

Python


Bioconductor

  • A repository for R packages related to biological data analysis, primarily bioinformatics and computational biology.
  • a great place to search for -omics packages and pipelines.
  • Released every 6 months; each release works with a specific version of R.

    • included packages are "mutually compatible, traceable, and guaranteed to function for the associated version of R"
    • Package types: Software, annotation, experimental data, workflows

  • Biopython
  • Bioconda

    • Conda, a package and environment management system, was created for Python but can now be used for any language.
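
Once a library is installed, Python code loads it with an import statement. The following is a minimal sketch that uses only modules from the Python standard library, so nothing needs to be installed; the DNA sequence is made up for illustration.

```python
# Import a whole module, a module under an alias, and a single name.
import math
import statistics as stats
from collections import Counter

# Count each nucleotide in a short, made-up DNA sequence.
sequence = "ATGCGCTA"
base_counts = Counter(sequence)
print(base_counts["G"])

# Call functions through the module name or its alias.
print(math.sqrt(16))
print(stats.mean([2, 4, 6]))
```

Third-party libraries such as NumPy are imported the same way after they have been installed.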

R Syntax

  • more functional

    • built around functions (function_name())
  • Case sensitive

  • white space insensitive (rules for line continuation)
  • <- or = assignment operators
  • # used for comments
  • keywords or words with special meaning (?Reserved)

    • for example, if, else, repeat, while, function, for, in, next, and break are used for control-flow statements and declaring user-defined functions.
  • statement grouping with {}

  • indexing starts with 1

Python Syntax

  • more object oriented (. is an operator and cannot be used in variable names)
  • = assignment operator
  • 35 reserved words in recent Python 3 releases (the exact list depends on the version; help("keywords") prints it)
  • lists use brackets [], dictionaries use {}
  • indentation is significant: it defines blocks of code (4 spaces per level by convention)
  • indexing starts with 0
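
The Python syntax points above can be seen together in a short sketch. The variable names and values are made up for illustration; the keyword module is part of the standard library.

```python
import keyword

# Assignment uses "=".
gene = "geneA"

# Lists use square brackets; dictionaries use curly braces.
samples = ["control", "treated_1", "treated_2"]
read_counts = {"control": 120, "treated_1": 340}

# Indexing starts at 0, so index 0 is the first element.
first_sample = samples[0]
last_sample = samples[-1]

# Indentation defines the block that belongs to the "if" statement.
if read_counts[first_sample] < read_counts["treated_1"]:
    print(first_sample, "has fewer reads")

# The keyword module lists the reserved words of the running interpreter.
print(len(keyword.kwlist))
print(keyword.iskeyword("for"))
```

In R, by contrast, samples[1] would be the first element, because R indexing starts at 1.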

Compare the code

A syntax comparison from Dataquest: https://www.dataquest.io/blog/python-vs-r/.

Note

R code can be run from Python with the rpy2 library. Python code can be executed from R with the reticulate package.


Variables

Essentially named storage that can be manipulated.

Rules for R variables:

  1. Avoid spaces or special characters EXCEPT '_' and '.'
  2. No numbers or underscores at the beginning of an object name.
  3. Avoid common names with special meanings (See ?Reserved) or assigned to existing functions (These will auto complete).
  4. Case sensitive

Rules for Python variables:

  1. Contains alpha-numeric characters and underscores
  2. Must start with a letter or the underscore character
  3. cannot start with a number
  4. case-sensitive
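
The Python naming rules can be checked programmatically. The sketch below uses a hypothetical helper, is_valid_variable_name, built on the standard str.isidentifier() method and the keyword module; the candidate names are made up.

```python
import keyword


def is_valid_variable_name(name):
    """Return True if name is a legal, non-reserved Python variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


candidates = ["read_count", "_tmp", "2nd_sample", "gene.name", "class", "Gene"]
for name in candidates:
    print(name, is_valid_variable_name(name))
```

Note that "gene.name" is rejected here even though it would be a valid object name in R, and that Gene and gene would be two different variables because names are case-sensitive.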

Functions

Used to perform specific tasks.

R:

product <- function(a, b) {
  # avoid naming the result "c", which would mask the built-in c() function
  result <- a * b
  result
}
product(5, 7)
## [1] 35

Python:

def product(a, b):
    c = a * b
    return c

print(product(5, 7))
## 35

Code example from https://www.r-bloggers.com/2017/05/r-vs-python-different-similarities-and-similar-differences/


Data Types

R:
Data types: integer, numeric, character, and logical
Data structures: vectors, lists, data frames, matrices.

x <- c(1,2,3)
typeof(x)
## [1] "double"
class(x)
## [1] "numeric"
is.vector(x)
## [1] TRUE

Python:
Data types: integers, floats, complex numbers, strings, Booleans (True, False). Python 2 also had a separate long type; in Python 3 it is merged into int.
Data structures: lists, tuples, dictionaries, arrays (via NumPy)

import numpy as np
x = [1, 2, 3]
x = np.array(x)
print(type(x))
## <class 'numpy.ndarray'>
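
The built-in types need no imports. A small sketch with made-up values, printing the type name of each one:

```python
# Basic data types
count = 42            # int
big = 2 ** 100        # still an int: Python 3 integers have arbitrary precision
ratio = 0.75          # float
z = 2 + 3j            # complex
gene = "geneA"        # str
is_coding = True      # bool (note the capitalization: True / False)

# Built-in data structures
exons = [1, 2, 3]                # list: ordered, mutable
position = ("chr1", 1000)        # tuple: ordered, immutable
lengths = {"geneA": 1500}        # dict: key-value pairs

for value in (count, big, ratio, z, gene, is_coding, exons, position, lengths):
    print(type(value).__name__)
```

Unlike R, where a single number is already a vector of length one, a Python scalar and a Python list are different kinds of objects.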

Loops and conditionals

Loops - used to iterate over a sequence

R:

fruit <- c('apples','bananas','cantaloupe')

for(i in fruit) {
  print(i)
}
## [1] "apples"
## [1] "bananas"
## [1] "cantaloupe"

Python:

fruit=['apples', 'bananas', 'cantaloupe'] #Loop for a list of fruits

for i in fruit:
    print(i)
## apples
## bananas
## cantaloupe
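
Beyond iterating directly over a list, a few other loop patterns come up often in Python. The sketch below reuses the fruit list and shows enumerate(), range(), and a while loop; it is meant as an illustration rather than a complete survey.

```python
fruit = ["apples", "bananas", "cantaloupe"]

# enumerate() yields (index, item) pairs; the index starts at 0.
for index, item in enumerate(fruit):
    print(index, item)

# range(3) yields 0, 1, 2.
squares = []
for n in range(3):
    squares.append(n * n)
print(squares)

# A while loop repeats until its condition becomes false.
remaining = len(fruit)
while remaining > 0:
    remaining -= 1
print(remaining)
```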

Conditionals - code is executed based on conditions

R:

x<-3
y<-5

if(x<y){
  print(paste(x, 'is less than', y))
} else{
  print(paste(x, 'is not less than', y))
}
## [1] "3 is less than 5"

Python:

x = 3
y = 5

if x < y:
    print(x, 'is less than', y)
else:
    print(x, 'is not less than', y)
## 3 is less than 5
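
When there are more than two cases, Python chains conditions with elif. A minimal sketch, using a hypothetical compare function that is not part of the original lesson:

```python
def compare(x, y):
    """Describe how x relates to y."""
    if x < y:
        return f"{x} is less than {y}"
    elif x == y:
        return f"{x} is equal to {y}"
    else:
        return f"{x} is greater than {y}"


print(compare(3, 5))
print(compare(5, 5))
print(compare(7, 5))
```

Conditions are tested from top to bottom, and only the first branch whose condition is true is executed. The R equivalent of elif is else if.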

Resources to learn


BTEP and Others


Dataquest and Coursera

  • Dataquest - great for learning programming skills
  • Coursera - great for learning more specific skills

    Click here for license information.

Books and other resources:


Sources

  1. https://www.datacamp.com/blog/python-vs-r-for-data-science-whats-the-difference#gs.JrY_3bk
  2. https://shiring.github.io/r_vs_python/2017/01/22/R_vs_Py_post
  3. https://realpython.com/python-ides-code-editors-guide/
  4. https://medium.com/@hamza_33678/programming-for-bioinformatics-r-vs-python-52969a1f7a49
  5. https://towardsdatascience.com/python-vs-r-the-basics-d754c45c1596
  6. https://www.dataquest.io/blog/python-vs-r/
  7. Learning Python for Data Science: What to Learn and Why, Cindy Sheffield, NIH Library