Lesson 4: Introduction to R Data Structures - Data Import
Learning Objectives
- Learn about data structures including factors, lists, matrices, and data frames.
- Learn how to import data in a tabular format (data frames).
- Learn to write out (export) data from the R environment.
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Installing and Loading Packages
In this lesson, we will learn how to import data with different file extensions, including Excel files. We will make use of Base R functions for data import as well as popular functions from readr and readxl.
So far we have only worked with objects that we created in RStudio. We have not installed or loaded any packages. R packages extend the use of R programming beyond base R.
Where do we get R packages?
As a reminder, R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. The primary repository for R packages is the Comprehensive R Archive Network (CRAN). CRAN is a global network of servers that store identical versions of R code, packages, documentation, etc. (cran.r-project.org). To install a CRAN package, use install.packages().
GitHub is another common source used to store R packages, though these packages do not necessarily meet CRAN standards, so approach with caution. To install a GitHub package, use library(devtools) followed by install_github(). devtools is a CRAN package; if you have not installed it, run install.packages("devtools") first.
Many genomics packages and other packages useful to biologists / molecular biologists can be found on Bioconductor. To install a Bioconductor package, you will first need to install BiocManager, a CRAN package (install.packages("BiocManager")). You can then use BiocManager to install the Bioconductor core packages or any specific package (e.g., BiocManager::install("DESeq2")).
Packages are installed into your file system at a location given by .libPaths(). This is your R library, a directory of installed R packages. To use one or more packages, you have to load them within your R session, and this must be done with each new R session.
Key functions:
- install.packages() - install packages from CRAN.
- library() - load packages in your R session.
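Pulling the installation commands above together, a one-time setup for this lesson might look like the following sketch (installation only needs to happen once per R library; the GitHub repository shown is a placeholder):
#One-time installation (not needed every session)
install.packages(c("readr", "readxl"))     #CRAN packages used in this lesson
#install.packages("BiocManager")           #once, to enable Bioconductor installs
#BiocManager::install("DESeq2")            #example Bioconductor package
#install.packages("devtools")              #once, to enable GitHub installs
#devtools::install_github("user/repo")     #"user/repo" is a placeholder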
Load the libraries:
library(readxl)
library(readr)
Tip
It is good practice to load libraries needed for a script at the beginning of the script.
Data Structures
Data structures are objects that store data.
Previously, we learned that vectors are collections of values of the same type. A vector is also one of the most basic data structures.
Other common data structures in R include:
- factors
- lists
- data frames
- matrices
What are factors?
Factors are an important data structure in statistical computing. They are specialized vectors (ordered or unordered) for the storage of categorical data (data with fixed values). While they appear to be character vectors, data in factors are stored as integers. These integers are associated with pre-defined levels, which represent the different groups or categories in the vector.
Reference level
Generally for statistical models, the reference or control level is set to level 1. You can reorder the levels using factor() or forcats::fct_relevel().
Important functions
- factor() - create a factor and reorder levels
- as.factor() - coerce to a factor
- levels() - view and/or rename the levels of a factor
- nlevels() - return the number of levels
For example:
sex <- factor(c("M","F","F","M","M","M"))
levels(sex)
[1] "F" "M"
Check out the package forcats for managing and reordering factors.
Note
R will organize factor levels alphabetically by default. This will be especially noticeable when plotting.
Warning
Pay attention when coercing from a factor to a numeric. To do this, you should first convert to a character vector. Otherwise, you will get back the underlying integer level codes rather than the numbers shown as the factor labels.
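A minimal illustration of this warning, using a small hypothetical factor f whose labels happen to look like numbers:
f <- factor(c("10", "20", "20", "30"))
as.numeric(f)                 #returns the underlying level codes: 1 2 2 3
as.numeric(as.character(f))   #convert to character first to get 10 20 20 30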
See more about working with factors here.
Lists
Unlike an atomic vector, a list can contain multiple elements of different types (e.g., character vector, numeric vector, list, data frame, matrix). Lists are not the focus of this lesson, but you should be aware of them, as many functions, including those specific to bioinformatics, output data in the form of a list.
Important functions
- list() - create a list
- names() - create named elements (also useful for vectors)
- lapply(), sapply() - for looping over elements of a list
Example
#Create a list
My_exp <- list(c("N052611", "N061011", "N080611", "N61311" ),
c("SRR1039508", "SRR1039509", "SRR1039512",
"SRR1039513", "SRR1039516", "SRR1039517",
"SRR1039520", "SRR1039521"),c(100,200,300,400))
#Look at the structure
str(My_exp)
List of 3
$ : chr [1:4] "N052611" "N061011" "N080611" "N61311"
$ : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
$ : num [1:4] 100 200 300 400
#Name the elements of the list
names(My_exp)<-c("cell_lines","sample_id","counts")
#See how the structure changes
str(My_exp)
List of 3
$ cell_lines: chr [1:4] "N052611" "N061011" "N080611" "N61311"
$ sample_id : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
$ counts : num [1:4] 100 200 300 400
#Subset the list
My_exp[[1]][2]
[1] "N061011"
My_exp$cell_lines[2]
[1] "N061011"
#Apply a function (remove the first element from each vector)
lapply(My_exp,function(x){x[-1]})
$cell_lines
[1] "N061011" "N080611" "N61311"
$sample_id
[1] "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516" "SRR1039517"
[6] "SRR1039520" "SRR1039521"
$counts
[1] 200 300 400
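sapply() works like lapply() but simplifies the result where possible; for example, applying length() to My_exp returns a named vector rather than a list:
#Count the elements in each component of the list
sapply(My_exp, length)   #cell_lines = 4, sample_id = 8, counts = 4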
We are not going to spend a lot of time on lists, but you should consider learning more about them in the future, as many functions return their output in the form of a list.
Data Matrices
Another important data structure in R is the data matrix. Data frames and data matrices are similar in that both are tabular in nature and are defined by dimensions (i.e., rows (m) and columns (n), commonly denoted m x n). However, a matrix contains only values of a single type (e.g., numeric, character, or logical).
Note
A vector can be viewed as a one-dimensional matrix.
Elements in a matrix and a data frame can be referenced by using their row and column indices (for example, a[1,1] references the element in row 1 and column 1).
Below, we create the object a1, a 3-row by 4-column matrix.
a1 <- matrix(c(3,4,2,4,6,3,8,1,7,5,3,2), ncol=4)
a1
[,1] [,2] [,3] [,4]
[1,] 3 4 8 5
[2,] 4 6 1 3
[3,] 2 3 7 2
Using the typeof() and class() commands, we see that the elements in a1 are doubles and that a1 is a matrix, respectively.
typeof(a1)
[1] "double"
class(a1)
[1] "matrix" "array"
As with lists, we aren't going to focus much on matrices.
Data Frames: Working with Tabular Data
In genomics, we work with a lot of tabular data - data organized in rows and columns. The data structure that stores this type of data is a data frame. Data frames are collections of vectors of the same length but can be of different types. Because we often have data of multiple types, it is natural to examine that data in a data frame.
You may be tempted to open and manually work with these data in Excel. However, there are a number of reasons why this can be to your detriment. First, it is very easy to make mistakes when working with large amounts of tabular data in Excel. Have you ever mistakenly left out a column or row while sorting data? Second, many of the files that we work with are so large (big data) that Excel and your local machine do not have the bandwidth to handle them. Third, you will likely need to apply analyses that are unavailable in Excel. Lastly, it is difficult to keep track of any data manipulation steps or analyses in a point-and-click environment like Excel.
R, on the other hand, can make analyzing tabular data more efficient and reproducible. But before getting into working with this data in R, let's review some best practices for data management.
Best Practices for organizing genomic data
- "Keep raw data separate from analyzed data" -- datacarpentry.org
For large genomic data sets, you may want to include a project folder with two main subdirectories (i.e., raw_data and data_analysis). You may even consider changing the permissions (check out the unix command chmod) in your raw directory to make those files read-only. Keeping raw data separate is not a problem in R, as one must explicitly import and export data.
- "Keep spreadsheet data Tidy" -- datacarpentry.org
Data organization can be frustrating, and many scientists devote a great deal of time and energy toward this task. Keeping data tidy can make data science more efficient, effective, and reproducible. There is a collection of packages in R that embrace the philosophy of tidy data and facilitate working with data frames. That collection is known as the tidyverse.
- "Trust but verify" -- datacarpentry.org
R makes data analysis more reproducible and can eliminate some mistakes from human error. However, you should approach data analysis with a plan and make sure you understand what a function is doing before applying it to your data. Testing a function on a small subset of your data is a useful form of debugging to confirm that it produces the expected result.
Some functions for creating practice data include data.frame(), rep(), seq(), rnorm(), sample(), and others. See some examples here.
Let's use some of these to create a data frame.
df <- data.frame(Samples = 1:10,
                 Counts = sample(1:5000, size = 10, replace = TRUE),
                 Treatment = rep(c("control", "treated"), each = 5))
df
Samples Counts Treatment
1 1 4939 control
2 2 191 control
3 3 3697 control
4 4 4933 control
5 5 2938 control
6 6 1721 treated
7 7 214 treated
8 8 2999 treated
9 9 2084 treated
10 10 2196 treated
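Note that sample() draws random values, so your Counts column will not match the output shown above. To make the draw reproducible, set a seed first; a quick sketch (the seed value 123 is an arbitrary choice):
set.seed(123)  #any integer works; this makes sample() return the same values each run
df <- data.frame(Samples = 1:10,
                 Counts = sample(1:5000, size = 10, replace = TRUE),
                 Treatment = rep(c("control", "treated"), each = 5))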
Example Data
There are data sets available in R to practice with or showcase different packages; for example, see library(help = "datasets"). For the next two lessons, we will use data derived from the Bioconductor package airway as well as data internal to or derived from Base R and packages within the tidyverse. Check out the Acknowledgements section for additional data sources.
Obtaining the data
- To download the data used in this lesson to your local computer, click here. You can then move the downloaded directory to your working directory in R.
- To use the data on Biowulf, open the Terminal in RStudio and follow these steps:
cd /data/$USER/Getting_Started_with_R
wget https://bioinformatics.ccr.cancer.gov/docs/r_for_novices/Getting_Started_with_R/data.zip
unzip data.zip
Note
"Getting_Started_with_R" is the name of the project directory I created in Lesson 1. If you do not have this directory, make sure you change directories to your working directory in R.
Importing Data
Before we can do anything with our data, we need to first import it into R. There are several ways to do this.
First, the RStudio IDE has a drop-down menu for data import. Simply go to File > Import Dataset, select one of the options, and follow the prompts.
Pay close attention to the import functions and their arguments. Using the import arguments correctly can save you from a headache later down the road. You will notice two types of import functions under Import Dataset "from text": base R import functions and readr import functions. We will use both in this course.
Row names
Tidyverse packages are generally against assigning row names (rownames) and instead prefer that all column data be treated the same. However, there are times when row names are beneficial, and they are required for some genomics data structures (e.g., see SummarizedExperiment from Bioconductor).
What is a tibble?
When loading tabular data with readr, the default object created will be a tibble. Tibbles are like data frames with some small but noticeable differences. For example, they can have numbers for column names, and the column types are displayed when the object is printed. Additionally, when you call a tibble by running the object name, the entire data frame does not print to the screen; rather, the first ten rows along with the columns that fit the screen are shown.
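One quick way to see these differences is to convert the df object created earlier with as_tibble() from the tibble package (installed as a dependency of readr); a short sketch:
library(tibble)
df_tbl <- as_tibble(df)
df_tbl          #prints dimensions, column types, and at most the first ten rows
class(df_tbl)   #"tbl_df" "tbl" "data.frame" - a tibble is still a data frame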
Reasons to use readr functions
Compared to the corresponding base functions, readr functions:
- Use a consistent naming scheme for the parameters (e.g., col_names and col_types, not header and colClasses).
- Are generally much faster (up to 10x-100x) depending on the dataset.
- Leave strings as is by default, and automatically parse common date/time formats.
- Have a helpful progress bar if loading is going to take a while.
- Work exactly the same way regardless of the current locale. To override the US-centric defaults, use locale(). -- readr.tidyverse.org
Excel files (.xls, .xlsx)
Excel files are the primary means by which many people save spreadsheet data. .xls or .xlsx files store workbooks composed of one or more spreadsheets.
Importing Excel files requires the R package readxl. While this is a tidyverse package, it is not a core package and must be loaded separately. We loaded it above.
The functions to import Excel files are read_excel(), read_xls(), and read_xlsx(). The latter two are more specific based on file format, whereas the first will guess which format (.xls or .xlsx) we are working with.
Let's look at its basic usage using an example data set from the readxl package. To access the example data, we use readxl_example().
#makes example data accessible by storing the path
ex_xl<-readxl_example("datasets.xlsx")
ex_xl
[1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/readxl/extdata/datasets.xlsx"
Now, let's read in the data. The only required argument is a path to the file to be imported.
irisdata<-read_excel(ex_xl)
irisdata
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
Notice that the resulting imported data is a tibble. This is a feature specific to the tidyverse. Now, let's check out some of the additional arguments. We can view the help information using ?read_excel().
The arguments likely to be most pertinent to you are:
- sheet - the name or numeric position of the Excel sheet to read.
- col_names - defaults to TRUE, which uses the first row read in for the column names. You can also provide a vector of names to name the columns.
- skip - allows us to skip rows that we do not wish to read in.
- .name_repair - set to "unique" by default, which makes sure that the column names are not empty and are all unique. read_excel() and readr functions will not correct column names to make them syntactic. If you want corrected names, use .name_repair = "universal".
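For example, the sheet argument pairs nicely with readxl's excel_sheets(), which lists the sheets in a workbook; a quick sketch using the example workbook from above:
excel_sheets(ex_xl)                                #list the sheet names in the workbook
read_excel(ex_xl, sheet = 2)                       #read a sheet by position
read_excel(ex_xl, sheet = excel_sheets(ex_xl)[2])  #or by name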
Let's check out another example:
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx")
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
sum_air
# A tibble: 11 × 4
`Uses Airway Data` ...2 ...3 ...4
<chr> <chr> <chr> <chr>
1 Some RNA-Seq summary information <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA>
3 Sample Name Treatment Number of Transcripts Total C…
4 GSM1275863 Dexamethasone 10768 18783120
5 GSM1275867 Dexamethasone 10051 15144524
6 GSM1275871 Dexamethasone 11658 30776089
7 GSM1275875 Dexamethasone 10900 21135511
8 GSM1275862 None 11177 20608402
9 GSM1275866 None 11526 25311320
10 GSM1275870 None 11425 24411867
11 GSM1275874 None 11000 19094104
Upon importing these data, we can immediately see that something is wrong with the column names.
colnames(sum_air)
[1] "Uses Airway Data" "...2" "...3" "...4"
There are some extra rows of information at the beginning of the data frame that should be excluded. We can take advantage of additional arguments to load only the data we are interested in. We are also going to tell read_excel() that we want the names repaired to eliminate spaces.
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx",
skip=3,.name_repair = "universal")
New names:
• `Sample Name` -> `Sample.Name`
• `Number of Transcripts` -> `Number.of.Transcripts`
• `Total Counts` -> `Total.Counts`
sum_air
# A tibble: 8 × 4
Sample.Name Treatment Number.of.Transcripts Total.Counts
<chr> <chr> <dbl> <dbl>
1 GSM1275863 Dexamethasone 10768 18783120
2 GSM1275867 Dexamethasone 10051 15144524
3 GSM1275871 Dexamethasone 11658 30776089
4 GSM1275875 Dexamethasone 10900 21135511
5 GSM1275862 None 11177 20608402
6 GSM1275866 None 11526 25311320
7 GSM1275870 None 11425 24411867
8 GSM1275874 None 11000 19094104
Tab-delimited files (.tsv, .txt)
In tab-delimited files, data columns are separated by tabs.
To import tab-delimited files there are several options. There are base R functions such as read.delim() and read.table(), as well as the readr functions read_delim(), read_tsv(), and read_table().
Let's take a look at ?read.delim() and ?read_delim(), which are most appropriate if you are working with tab-delimited data stored in a .txt file.
For read.delim(), you will notice that the default separator (sep) is the tab character ("\t"), while read.table() defaults to white space (one or more spaces, tabs, or newlines). However, you could use these functions to load a comma separated file as well; you simply need to use sep = ",". The same is true of read_delim(), except the argument is delim rather than sep.
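As a quick sketch, the two calls below read the same comma-separated file (surveys_datacarpentry.csv from the downloaded data directory, which we use again later in this lesson); only the delimiter argument differs:
#base R: general-purpose function with an explicit separator
surveys_base <- read.delim("./data/surveys_datacarpentry.csv", sep = ",")
#readr: same idea, but the argument is named delim
surveys_readr <- read_delim("./data/surveys_datacarpentry.csv", delim = ",")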
Let's load sample information from the RNA-Seq project airway. We will refer back to some of these data frequently throughout our lessons. The airway data are from Himes et al. (2014). These data, which are available in R as a RangedSummarizedExperiment object, are from a bulk RNA-Seq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNA-Seq count data from 8 airway smooth muscle cell samples: a dexamethasone-treated sample and an untreated negative control for each of the four cell lines.
Using read.delim():
smeta<-read.delim("./data/airway_sampleinfo.txt")
head(smeta)
SampleName cell dex albut Run avgLength Experiment Sample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576
BioSample
1 SAMN02422669
2 SAMN02422675
3 SAMN02422678
4 SAMN02422670
5 SAMN02422682
6 SAMN02422673
Some other arguments of interest for read.delim():
- row.names - used to specify row names.
- col.names - used to specify column names if header = FALSE.
- skip - similar to read_excel(), used to skip a number of lines preceding the data we are interested in importing.
- check.names - makes names syntactically valid and unique.
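A sketch combining header, skip, and col.names to rename columns on import (the replacement column names below are hypothetical, chosen only for illustration):
#skip the original header row and supply our own column names
smeta_renamed <- read.delim("./data/airway_sampleinfo.txt",
                            header = FALSE, skip = 1,
                            col.names = c("geo_id", "cell_line", "treatment",
                                          "albuterol", "run", "avg_length",
                                          "experiment", "sample", "biosample"))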
Using read_delim():
smeta2<-read_delim("./data/airway_sampleinfo.txt")
Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smeta2
# A tibble: 8 × 9
SampleName cell dex albut Run avgLength Experiment Sample BioSample
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
What if we want to retain row names?
Let's load in a count matrix from airway.
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt")
head(aircount)
X Accession.SRR1039508 Accession.SRR1039509
1 ENSG00000000003.TSPAN6 679 448
2 ENSG00000000005.TNMD 0 0
3 ENSG00000000419.DPM1 467 515
4 ENSG00000000457.SCYL3 260 211
5 ENSG00000000460.C1orf112 60 55
6 ENSG00000000938.FGR 0 0
Accession.SRR1039512 Accession.SRR1039513 Accession.SRR1039516
1 873 408 1138
2 0 0 0
3 621 365 587
4 263 164 245
5 40 35 78
6 2 0 1
Accession.SRR1039517 Accession.SRR1039520 Accession.SRR1039521
1 1047 770 572
2 0 0 0
3 799 417 508
4 331 233 229
5 63 76 60
6 0 0 0
Because this is a count matrix, we want to store column 'X', which was automatically named, as row names rather than as a column. Remember, readr is part of the tidyverse and does not play well with row names. Therefore, we will use read.delim() with the argument row.names.
Let's reload and overwrite the previous object:
aircount<-read.delim("./data/head50_airway_nonnorm_count.txt",
row.names = 1)
head(aircount)
Accession.SRR1039508 Accession.SRR1039509
ENSG00000000003.TSPAN6 679 448
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 467 515
ENSG00000000457.SCYL3 260 211
ENSG00000000460.C1orf112 60 55
ENSG00000000938.FGR 0 0
Accession.SRR1039512 Accession.SRR1039513
ENSG00000000003.TSPAN6 873 408
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 621 365
ENSG00000000457.SCYL3 263 164
ENSG00000000460.C1orf112 40 35
ENSG00000000938.FGR 2 0
Accession.SRR1039516 Accession.SRR1039517
ENSG00000000003.TSPAN6 1138 1047
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 587 799
ENSG00000000457.SCYL3 245 331
ENSG00000000460.C1orf112 78 63
ENSG00000000938.FGR 1 0
Accession.SRR1039520 Accession.SRR1039521
ENSG00000000003.TSPAN6 770 572
ENSG00000000005.TNMD 0 0
ENSG00000000419.DPM1 417 508
ENSG00000000457.SCYL3 233 229
ENSG00000000460.C1orf112 76 60
ENSG00000000938.FGR 0 0
Comma separated files (.csv)
In comma-separated files, the columns are separated by commas and the rows are separated by new lines.
To read comma-separated files, we can use the more specific functions read.csv() and read_csv() (see ?read.csv and ?read_csv).
Let's see this in action:
cexamp<-read.csv("./data/surveys_datacarpentry.csv")
head(cexamp)
record_id month day year plot_id species_id sex hindfoot_length weight
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
The arguments are the same as those of read.delim().
Let's check out read_csv():
cexamp2<-read_csv("./data/surveys_datacarpentry.csv")
Rows: 35549 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): species_id, sex
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cexamp2
# A tibble: 35,549 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
Other file types
There are a number of other file types you may be interested in. For genomics-specific formats, you will likely need to install specific packages; check out Bioconductor for packages relevant to bioinformatics.
For information on importing other file types (e.g., json, xml, google sheets), check out this chapter from Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks, and Roger D. Peng.
Data Export
To export data to a file, you will use similar functions (write.table(), write.csv(), readr::write_csv(), saveRDS(), etc.).
For example, let's save df to a csv file.
write_csv(df,"./data/small_df_example.csv")
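If you need to preserve row names (e.g., for the aircount matrix imported above) or save an R object exactly as is, the following sketches may help (the output file names are just examples):
#write.table() keeps row names by default; col.names = NA writes a blank header
#cell above the row names so the columns line up when the file is read back in
write.table(aircount, "./data/aircount_export.txt",
            sep = "\t", quote = FALSE, col.names = NA)
#saveRDS() stores a single R object (row names, factors, and all) in a binary
#format that readRDS() restores exactly
saveRDS(smeta, "./data/airway_sampleinfo.rds")
#smeta_again <- readRDS("./data/airway_sampleinfo.rds")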
Acknowledgements
Some material from this lesson was either taken directly or adapted from Intro to R and RStudio for Genomics provided by datacarpentry.org. Other material from this lesson was inspired by R4DS and Tidyverse Skills for Data Science. The survey data loaded in this lesson was taken from datacarpentry.org.