Introduction to Data Wrangling with R

Lesson Objectives

  • Introduce the course
  • Introduce concepts related to data wrangling and tidy data
  • Get familiar with R and RStudio Server on Biowulf

This lesson will introduce the philosophy of tidy data and key concepts and packages used for data wrangling with R.

Introduction to Data Wrangling

This course, designed for novices, will introduce the essential R packages and functions often used to explore, clean, transform, and summarize data. The content for this course is similar to past introductory R courses, but the pace of the course will be much slower to benefit novices.

Course Documentation: https://bioinformatics.ccr.cancer.gov/docs/r_for_novices/

::: {.notes} Course material will be updated prior to each lesson.

What is data wrangling? By data wrangling, I mean the steps that need to be taken before downstream analyses such as modeling and visualization. These steps often include cleaning, transforming, and summarizing data. The two packages we will focus on most for this purpose are tidyr and dplyr; both are core tidyverse packages included in a tidyverse install. :::
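As a rough preview of what cleaning, transforming, and summarizing can look like with dplyr, here is a minimal sketch. The data frame `raw`, its columns, and the assumption that `weight` is recorded in grams are all invented for illustration; the functions themselves are covered in later lessons.

```r
library(dplyr)

# Hypothetical raw measurements; all names and values are invented
raw <- data.frame(
  group  = c("ctrl", "ctrl", "trt", "trt", "trt"),
  weight = c(20.1, NA, 25.3, 24.8, 26.0)
)

result <- raw |>
  filter(!is.na(weight)) |>              # clean: drop missing measurements
  mutate(weight_kg = weight / 1000) |>   # transform: grams to kilograms (assumed units)
  group_by(group) |>
  summarize(mean_kg = mean(weight_kg))   # summarize: one row per group

result
```

The native pipe `|>` needs R 4.1 or later; with older versions, the magrittr pipe `%>%` loaded by dplyr should work the same way here.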

Lessons

  1. June 17, 2025 - Introduction to Data Wrangling with R
  2. June 24, 2025 - Introducing Tidyr for Reshaping and Formatting Data
  3. July 1, 2025 - Subsetting Data with dplyr
  4. July 8, 2025 - Summarizing Data with dplyr
  5. July 15, 2025 - Joining and Transforming Data with dplyr

::: {.notes} There will be an optional help session following each lesson. :::

Prerequisites

This course is recommended for attendees familiar with the skills learned in Part 1: Getting Started with R.

Course materials

  • All material used for this course can be found in these pages.
  • For hands-on lessons, we will explore this material using R on Biowulf.
      • Requires an NIH HPC account.
      • Tidyverse packages come pre-installed.
  • Local R installations can be used, but package installation issues will have to be handled outside of class time.
  • R / RStudio installation instructions can be found here.

::: {.notes} If you need help either with your R or RStudio installation or installing tidyverse packages, feel free to stay after class for help.
:::

Best Practices for Data Analysis

  1. Keep raw data separate from analyzed data.
  2. Keep spreadsheet data Tidy (or as tidy as possible).
  3. Trust but Verify. - R Basics (Intro to R and RStudio for Genomics)

::: {.notes} Before getting into tidy data and wrangling, I just want to review a few best practices for data analysis.

  1. You should always know whether you are handling raw data or manipulated data. If the data has been modified, you should know how it has been modified.
  2. We will talk about what tidy data is in a moment.
  3. Always examine the structure of your data.
    Top three troubleshooting tips (relevant to number 3):
      • Google the error message (and check out warnings).
      • Check the output to make sure it is what you think it is.
      • Use a test data set.
:::
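One way to put "trust but verify" into practice is to inspect an object before working with it. The sketch below uses only base R on a small test data set; the data frame `expr` and its columns are hypothetical, not course data.

```r
# Hypothetical test data set; names and values are invented for illustration
expr <- data.frame(
  sample = c("A", "B", "C"),
  count  = c(10, 25, NA)
)

str(expr)             # column types and a preview of the values
dim(expr)             # number of rows and columns
summary(expr$count)   # numeric summary, including the number of NAs
colSums(is.na(expr))  # missing values per column
```

Running the same checks again after each wrangling step is a simple way to confirm that the output is what you expect.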

What is Tidy data?

Tidy data is an approach (or philosophy) to data organization and management. 

The 3 Rules of Tidy Data. Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration

::: {.callout-note} Having tidy data is useful but not always necessary. Do not worry about strict adherence to the rules. Your data should be in whatever format that makes your life easier for analysis. :::

::: {.notes}

There are three rules to tidy data: (1) each variable forms its own column, (2) each observation forms a row, and (3) each value has its own cell. One advantage of following these rules is that the data structure remains consistent, which makes it easier to learn the tools that work with that structure, and many tools in R are built specifically to work with tidy data. Being equipped with the right tools makes data analysis more efficient. :::
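To make the three rules concrete, the sketch below reshapes a small untidy table with `pivot_longer()` from tidyr. The table `untidy`, the gene names, and the counts are made up for illustration; this is one possible way to tidy such a table, not the only one.

```r
library(tidyr)

# Hypothetical untidy table: the variable "year" is spread across column names
untidy <- data.frame(
  gene   = c("geneA", "geneB"),
  `2023` = c(5, 8),
  `2024` = c(7, 6),
  check.names = FALSE   # keep "2023" and "2024" as column names
)

# pivot_longer() gathers the year columns so each variable has its own column
tidy <- pivot_longer(
  untidy,
  cols = c("2023", "2024"),
  names_to = "year",
  values_to = "count"
)

tidy  # one row per gene-year observation, one value per cell
```

In `untidy`, the year values sit in the column names; in `tidy`, `gene`, `year`, and `count` are each a column, and every row is a single observation.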

Guidelines for Keeping Data Tidy

  • Be consistent.
  • Choose meaningful names for things; no spaces.
  • Write dates as YYYY-MM-DD.
  • No empty cells.
  • Put just one thing in a cell.
  • Don’t use font color or highlighting as data.
  • Save the data as plain text files.

From https://jhudatascience.org/tidyversecourse/intro.html
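The date guideline is easy to see in practice. In this minimal base R sketch (the vector name `visit` is made up, and the dates are borrowed from the lesson schedule), dates written as YYYY-MM-DD parse without extra instructions and sort chronologically even as plain text.

```r
# Dates written as YYYY-MM-DD parse without a format string
visit <- as.Date(c("2025-07-01", "2025-06-17", "2025-06-24"))
sort(visit)         # chronological order
diff(sort(visit))   # days between consecutive dates

# The same text also sorts chronologically as plain strings
sort(c("2025-07-01", "2025-06-17", "2025-06-24"))

# Other layouts need an explicit format to be parsed correctly
as.Date("06/17/2025", format = "%m/%d/%Y")
```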

What is the Tidyverse?

An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures ---tidyverse.org

The core packages, that is, the packages loaded by library(tidyverse), include:

dplyr - functions for data manipulation.
ggplot2 - a system for data visualization.
forcats - functions for handling factors.
tibble - includes tibble constructor and helper functions.
readr - functions for data import.
stringr - functions for manipulating strings.
tidyr - includes functions for tidying data.
purrr - functions for working with functions and vectors, often used in place of for loops.
lubridate - functions for working with dates and times.

::: {.notes} The tidyverse packages work exceptionally well with tidy data. They are also fairly user friendly and can be a lot easier to use for beginners than base R solutions. There tends to be a tidyverse solution to most data wrangling problems.

:::

Connect to HPC Open OnDemand

To connect to RStudio via NIH HPC Open OnDemand, follow the instructions outlined here.

NIH HPC OnDemand https://hpcondemand.nih.gov/

::: {.notes} Let's go ahead and get connected and reacquainted with RStudio.
:::

Install and load tidyverse packages

Packages from the Tidyverse are available from CRAN.

::: {.cell}

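# Install once; not needed on Biowulf, where tidyverse is pre-installed
install.packages("tidyverse")
# Load the core tidyverse packages; repeat in every new R session
library(tidyverse)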
:::

:::{.notes} CRAN stands for the Comprehensive R Archive Network. It is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). CRAN is the primary repository for R packages. To install a CRAN package, we use install.packages("packageName"). A package only needs to be installed once, but remember that we have to load tidyverse with library() in each new R session.
:::
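If you are not sure whether a package is already installed, you can check before loading it. The sketch below is one possible pattern using base R's `requireNamespace()`; the variable names `pkg` and `installed` are made up for illustration.

```r
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package can be found, FALSE otherwise
installed <- requireNamespace(pkg, quietly = TRUE)

if (!installed) {
  # Nothing is installed here; this only reports what to run
  message("Not installed; run install.packages(\"", pkg, "\") first.")
} else {
  library(pkg, character.only = TRUE)  # load for the current session
  print(packageVersion(pkg))           # confirm which version was loaded
}
```

Checking first avoids reinstalling a package that is already available, for example on Biowulf.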

Terms to Know

Function - code written to perform a specific task

  • Example: getwd()

String – a sequence of one or more characters

  • Enclosed by quotation marks

Data frame – object that stores tabular data; all variables are of the same length

Directory – location where files are stored

Working directory – your current directory

Package – the fundamental unit of shareable code, bundling together code, data, documentation, and tests. This is how we extend the use of R.

Library – a directory of installed packages

  • Example: .libPaths()
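Several of these terms can be tried directly in the console. This is a minimal base R sketch; the objects `greeting` and `df` are made-up examples.

```r
# Function: code that performs a specific task
getwd()            # returns the working directory as a string

# String: characters enclosed in quotation marks
greeting <- "data wrangling"
nchar(greeting)    # number of characters in the string

# Data frame: tabular data whose columns all have the same length
df <- data.frame(id = 1:3, group = c("a", "b", "a"))
nrow(df)           # number of rows
str(df)            # structure: one line per column

# Library: the directories where installed packages live
.libPaths()
```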

Directory Structures

  • A "path" is a string describing the location of a file / directory. Directories are nested, so a path lists each directory from the outermost one down to the target (e.g., the output of getwd()).

  • Absolute file path - the complete file path

  • Relative file path - a shorter path written relative to another directory, usually the working directory.
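The difference between the two kinds of path can be sketched in base R. The folder `data` and the file `counts.csv` below are hypothetical and do not need to exist for the code to run.

```r
wd <- getwd()   # absolute path of the working directory

# Hypothetical relative path: a file named counts.csv inside a data/ folder
rel <- file.path("data", "counts.csv")

# Joining the two gives the absolute path that the relative path refers to
abs_path <- file.path(wd, rel)

rel
abs_path
file.exists(abs_path)   # checks whether the file is actually there
```

A relative path is only meaningful with respect to the working directory, so the same `rel` points somewhere else if the working directory changes.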

R Crash Course

Crash course: https://bioinformatics.ccr.cancer.gov/docs/data-wrangle-with-r/Lesson2/.

Getting Help

  • Web search (Google error messages)
  • Check out online tutorials
  • GitHub
  • Forums such as Stack Overflow, Bioconductor, etc.
  • Email us at ncibtep@nih.gov.

Learning Resources

::: {.notes} There are many excellent resources available for learning R. You can check out our class materials on the BTEP website. You can also find specific resources through a web search or browse the repositories available on GitHub.

You may also be interested in Coursera licenses to access R programming classes, including many on the tidyverse, as well as a number of other courses related to bioinformatics. There are also so many tutorials available through a simple web search. :::