Learning Objectives
- Learn about popular programming languages in bioinformatics
- Compare advantages and disadvantages of Python and R
- Discuss what you will need to learn to use these languages
- Discuss learning resources
Choosing a programming language
What is a programming language?
A programming language is a formal language that specifies a set of instructions for a computer to perform specific tasks. It's used to write software programs and applications, and to control and manipulate computer systems.
Key features of programming languages include:
* Syntax (rules and structure used to write code)
* Data types (type of values that can be stored in a program)
* Variables (named memory locations that can store values)
* Operators (symbols used to perform operations on values)
* Control Structures (statements used to control the flow of a program)
* Libraries (collections of pre-written code used to perform common tasks and speed up development)
* Paradigms (programming styles / philosophies)
Source: GeeksforGeeks
Examples include C++, C#, Perl, Java, Ruby, Python, Julia, and R.
More on paradigms, here.
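As a rough illustration of the features listed above, the short Python sketch below touches on a library, variables, data types, operators, and a control structure. The gene name, read counts, and the threshold of 100 are made-up values chosen only for the example.

```python
import statistics  # a library: pre-written code from Python's standard library

# Variables holding values of different data types (all values are made up)
gene_name = "TP53"            # string
read_counts = [120, 85, 240]  # list of integers
is_expressed = True           # boolean

# Operators: arithmetic (+) here, comparison (>) further down
total_reads = read_counts[0] + read_counts[1] + read_counts[2]
mean_reads = statistics.mean(read_counts)

# Control structure: choose a branch based on a condition
if mean_reads > 100:  # 100 is an arbitrary, hypothetical threshold
    label = "high"
else:
    label = "low"

print(gene_name, total_reads, label)
```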
Why learn programming?
Do all molecular scientists need to learn a programming language?
- Absolutely not.
BUT
- We are in a big data era, and learning to code can be extremely beneficial, especially if you do not have access to bioinformatics analysts to analyze the data for you or expensive licensed software.
Which programming language should I learn?
- Bash
  - Most of bioinformatics can be done by understanding specific software applications and running those applications in a pipeline, usually using some form of bash scripting. Bash is therefore fairly important for processing biological data, though it is arguably a scripting language rather than a formal programming language.
- Python or R
  - Depending on your goals, you may lean toward one programming language over another. For example:
    - Interested in statistics and data visualization? R may be for you.
    - Interested in software development and machine learning? A more general language like Python may be a better fit.
Check out this video!
Tip
Ultimately, the language you choose to learn will depend on what you actually want to do with your skills. If you want to become a bioinformatics analyst and are interested in developing scripts / programs for the greater community, you should probably learn Bash, R, and Python. If you are not developing pipelines or scripts for others to use, you can probably pick your poison, though you will likely still need to know some of all three.
What is R?
- released in 1993
- a computational language and environment for statistical computing and graphics
  - complex statistical functions easily accessible
  - easy to get started, but more difficult to master
- Key features:
  - open-source
  - extensible (packages on CRAN (> 19,000 packages), GitHub, Bioconductor)
  - wide community
  - maintained by a network of collaborators, the R Core Team
Check out more on The R Project for Statistical Computing website.
What is Python?
- first released in 1991
- high-level, popular, general-purpose programming language that has a readable and easy to learn syntax
- Key features:
  - easy to read
  - easy to learn
  - interpreted
  - multi-platform
  - wide community
  - open source libraries (> 300,000)
- Two major versions (python2 and python3)
- Not as easy to just start analyzing data
Check out more at https://www.python.org/.
Also, check out this primer for biologists.
Advantages of R and Python
R Programming
- Data Visualization (Base R and ggplot2)
  - additional packages that enhance these, especially for -omics data
- More packages for data science / bioinformatics
  - Bioconductor
- Report generation
  - R Markdown
  - Quarto
- More popular among scientists and academics (i.e., non-programmers)
Python
- More consistent syntax (generally one right way to do something)
- Large data manipulation (generally more efficient)
  - shines in machine learning (scikit-learn)
- Report generation
  - Jupyter Notebook
- More popular among software developers and across multiple domains
Tip
There is no right answer to the question, "which programming language should I learn, R or Python?". They are both valuable programming languages with different strengths and weaknesses. Choosing one or the other will come down to several factors such as your analysis goals, the time you have to learn, and what those around you are using.
Check out this comparison from Towards Data Science: "Python vs R: The Basics" by Sidney Kung.
What do you need to know to learn R or Python?
Installation
If you intend to use R or Python through Biowulf, no installation is necessary.
R:
- Use this guide.
Python:
- You can download directly from https://www.python.org/downloads/.
How do we execute our code?
With both R and Python, code is executed
- interactively line by line from the command line
- interactively in an IDE
- as a script submitted from the command line or in an IDE
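For Python, to get started from the command line:
python    # start the interactive Python interpreter
quit()    # exit the interpreter
For R, to get started from the command line:
R      # start an interactive R session
q()    # quit the session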
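The third option, running code as a script, could look something like the minimal Python sketch below. The file name `hello_bioinformatics.py` and the function `main` are hypothetical names chosen for illustration.

```python
# Save as hello_bioinformatics.py (hypothetical file name)
import sys


def main(args):
    """Build a greeting from an optional command-line argument."""
    name = args[0] if args else "world"
    return f"Hello, {name}!"


# This block runs only when the file is executed as a script,
# not when it is imported from another Python file.
if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Assuming the file is saved under that name, it could be run from the command line with `python hello_bioinformatics.py Biowulf`, or opened and run from within an IDE.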
What is an IDE?
An IDE is an integrated development environment. IDEs generally include features such as:
- Console
- File access
- Environment / variable view
- Data view
- Plotting window
- History
- Autocomplete
- Debugging
- Markdown
IDEs make coding easier. They increase productivity and facilitate project management. Using an IDE will allow you to more effectively organize code and results as you tackle data analysis problems.
IDEs for R and Python
R
Python
- JupyterLab / Jupyter Notebook
  - Can be used with C++, Julia, GNU Octave, R, Ruby, and Scheme
- IPython
- Google Colab
Elements of programming with python or R
- libraries
- syntax
- variables
- functions
- data types
- loops and conditionals
Libraries
R Packages can be found at:
- CRAN
  - METACRAN, to search for packages
- GitHub
Python
Bioconductor
- A repository for R packages related to biological data analysis, primarily bioinformatics and computational biology.
- a great place to search for -omics packages and pipelines.
- Released every 6 months; each release works with a specific version of R.
  - included packages are "mutually compatible, traceable, and guaranteed to function for the associated version of R"
- Package types: Software, annotation, experimental data, workflows
Bioinformatics related python packages
- Biopython
- Conda, a package and environment management system, was created for Python but can now be used for any language.
R Syntax
- more functional
  - built around functions (function_name())
- case sensitive
- white space insensitive (rules for line continuation)
- <- or = assignment operators
- # used for comments
- keywords or words with special meaning (?Reserved)
  - for example, if, else, repeat, while, function, for, in, next, and break are used for control-flow statements and declaring user-defined functions.
- statement grouping with {}
- indexing starts with 1
Python Syntax
- more object oriented (. is an operator and should not be used to name variables)
- = assignment operator
- reserved words: 35 in Python 3.7 and later (33 in earlier Python 3 releases); list them with help("keywords")
- lists use brackets [], dictionaries use {}
- indentation is important (4 spaces) - defines blocks of code
- indexing starts with 0
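The Python syntax points above can be sketched in a few lines. The fruit list is reused from elsewhere in this lesson, while the two-entry codon dictionary is a small illustrative subset, not a complete table.

```python
import keyword  # standard-library module that lists Python's reserved words

# Reserved words cannot be used as variable names
print(len(keyword.kwlist))       # the count depends on the Python version
print(keyword.iskeyword("for"))  # check a single word

# Lists use brackets; indexing starts at 0
fruit = ["apples", "bananas", "cantaloupe"]
first = fruit[0]   # first element
last = fruit[-1]   # negative indices count from the end

# Dictionaries use braces and map keys to values
codon_table = {"ATG": "Met", "TGG": "Trp"}  # illustrative subset only
amino_acid = codon_table["ATG"]

# The dot operator accesses a method that belongs to an object
shout = first.upper()

# Indentation (4 spaces) defines the body of the loop
for item in fruit:
    print(item)
```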
Compare the code
A syntax comparison from Dataquest: https://www.dataquest.io/blog/python-vs-r/.
Note
R code can be run using Python with the rpy2 library. Python code can be executed through R using the reticulate package.
Variables
Essentially named storage that can be manipulated.
Rules for R variables:
- Avoid spaces or special characters EXCEPT '_' and '.'
- No numbers or underscores at the beginning of an object name.
- Avoid common names with special meanings (See ?Reserved) or assigned to existing functions (These will auto complete).
- Case sensitive
Rules for Python variables:
- Can contain only alphanumeric characters and underscores
- Must start with a letter or the underscore character
- cannot start with a number
- case-sensitive
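One way to check a candidate name against the Python rules is sketched below. The helper `is_valid_name` and the candidate names are hypothetical. Note that `str.isidentifier()` is slightly more permissive than the rules above, because it also accepts non-ASCII letters.

```python
import keyword


def is_valid_name(name):
    """Return True if name is a legal, non-reserved Python variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Made-up candidate names covering the rules listed above
candidates = ["gene_count", "_tmp", "2nd_sample", "sample.id", "for"]
for name in candidates:
    print(name, is_valid_name(name))
```

Here `2nd_sample` fails because it starts with a number, `sample.id` fails because `.` is an operator in Python (although it would be allowed in R), and `for` fails because it is a reserved word.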
Functions
Used to perform specific tasks.
R:
# Define a function that multiplies two numbers
product <- function(a, b) {
  c <- a * b
  c  # the last evaluated expression is returned
}
product(5, 7)
[1] 35
Python:
# Define a function that multiplies two numbers
def product(a, b):
    c = a * b
    return c

print(product(5, 7))
35
Code example from https://www.r-bloggers.com/2017/05/r-vs-python-different-similarities-and-similar-differences/
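For a more biology-flavored illustration (not taken from the source above), here is a minimal sketch of a function that computes the GC fraction of a DNA sequence. The function name `gc_content` and the example sequence are made up for this lesson.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()  # accept lower-case input as well
    if not sequence:             # avoid dividing by zero for an empty string
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


print(gc_content("ATGCGC"))  # 4 of the 6 bases are G or C
```

Wrapping the calculation in a function means it can be reused on any sequence without rewriting the logic.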
Data Types
R:
Data types: integer, numeric, character, and logical
Data structures: vectors, lists, data frames, matrices.
x <- c(1,2,3)
typeof(x)
## [1] "double"
class(x)
## [1] "numeric"
is.vector(x)
## [1] TRUE
Python:
Data types: integers, floats, complex numbers, strings, booleans (True, False). Python 2 also had a separate long integer type, which Python 3 merged into int.
Data structures: arrays (via NumPy), tuples, lists, dictionaries
import numpy as np
x = [1,2,3]
x = np.array(x)
print(type(x))
<class 'numpy.ndarray'>
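The built-in types and structures can be inspected the same way without NumPy. The sketch below uses only the standard library, and the example values are arbitrary.

```python
# One example value per built-in data type or data structure (arbitrary values)
values = {
    "integer": 42,
    "float": 3.14,
    "complex": 2 + 3j,
    "string": "ATGC",
    "boolean": True,
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "dictionary": {"gene": "TP53"},
}

# type() returns the class of an object; __name__ gives its short name
for label, value in values.items():
    print(label, type(value).__name__)
```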
Loops and conditionals
Loops - used to iterate over a sequence
R:
fruit <- c('apples', 'bananas', 'cantaloupe')
# Loop over each element of the vector
for (i in fruit) {
  print(i)
}
[1] "apples"
[1] "bananas"
[1] "cantaloupe"
Python:
fruit = ['apples', 'bananas', 'cantaloupe']  # a list of fruits
for i in fruit:  # loop over each item in the list
    print(i)
apples
bananas
cantaloupe
Conditionals - code is executed based on conditions
R:
x <- 3
y <- 5
# Print a different message depending on the comparison
if (x < y) {
  print(paste(x, 'is less than', y))
} else {
  print(paste(x, 'is not less than', y))
}
[1] "3 is less than 5"
Python:
x = 3
y = 5
# Print a different message depending on the comparison
if x < y:
    print(x, 'is less than', y)
else:
    print(x, 'is not less than', y)
3 is less than 5
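Loops and conditionals are usually combined in practice. The following sketch filters a handful of made-up sequences by length; the sequence names, the sequences themselves, and the cutoff `min_length` are all hypothetical.

```python
# Made-up sequences keyed by a made-up identifier
sequences = {"seq1": "ATGC", "seq2": "ATGCATGCATGC", "seq3": "AT"}
min_length = 4  # hypothetical length cutoff

passed = []
for name, seq in sequences.items():  # loop over each name/sequence pair
    if len(seq) >= min_length:       # conditional: keep long-enough sequences
        passed.append(name)
    else:
        print(name, "is shorter than", min_length, "bases")

print(passed)
```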
Resources to learn
BTEP and Others
- Check the NIH Bioinformatics Calendar for upcoming events including courses or lessons on Python and R.
- Past BTEP courses
- NIAID Bioinformatics Resources
Dataquest and Coursera
- Dataquest - great for learning programming skills
- Coursera - great for learning more specific skills
Click here for license information.
Books and other resources:
- See this list for introductory R material.
- A Primer for Computational Biology, Shawn T. O'Neil
Sources
- https://www.datacamp.com/blog/python-vs-r-for-data-science-whats-the-difference
- https://shiring.github.io/r_vs_python/2017/01/22/R_vs_Py_post
- https://realpython.com/python-ides-code-editors-guide/
- https://medium.com/@hamza_33678/programming-for-bioinformatics-r-vs-python-52969a1f7a49
- https://towardsdatascience.com/python-vs-r-the-basics-d754c45c1596
- https://www.dataquest.io/blog/python-vs-r/
- Learning Python for Data Science: What to Learn and Why, Cindy Sheffield, NIH Library