`dplyr`: joining, tranforming, and summarizing data frames

Objectives
Today we will continue to wrangle data using the tidyverse package, dplyr. We will learn:

how to join data frames using dplyr
how to transform and create new variables using mutate()
how to summarize variables using group_by() and summarize()

Loading `dplyr`

In this lesson, we are continuing with the package dplyr. We do not need to load the dplyr package separately, as it is a core tidyverse package. Again, if you need to install and load only dplyr, use install.packages("dplyr") and library(dplyr).

Load the package:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data

Let's load in some data to work with. In this lesson, we will continue to use sample metadata, raw count data, and differential expression results from the airway RNA-Seq project.

Load the data:

#sample information  
smeta<-read_delim("./data/airway_sampleinfo.txt")

## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

smeta

## # A tibble: 8 × 9
##   SampleName cell    dex   albut Run       avgLength Experiment Sample BioSample
##   <chr>      <chr>   <chr> <chr> <chr>         <dbl> <chr>      <chr>  <chr>    
## 1 GSM1275862 N61311  untrt untrt SRR10395…       126 SRX384345  SRS50… SAMN0242…
## 2 GSM1275863 N61311  trt   untrt SRR10395…       126 SRX384346  SRS50… SAMN0242…
## 3 GSM1275866 N052611 untrt untrt SRR10395…       126 SRX384349  SRS50… SAMN0242…
## 4 GSM1275867 N052611 trt   untrt SRR10395…        87 SRX384350  SRS50… SAMN0242…
## 5 GSM1275870 N080611 untrt untrt SRR10395…       120 SRX384353  SRS50… SAMN0242…
## 6 GSM1275871 N080611 trt   untrt SRR10395…       126 SRX384354  SRS50… SAMN0242…
## 7 GSM1275874 N061011 untrt untrt SRR10395…       101 SRX384357  SRS50… SAMN0242…
## 8 GSM1275875 N061011 trt   untrt SRR10395…        98 SRX384358  SRS50… SAMN0242…

#raw count data
acount<-read_csv("./data/airway_rawcount.csv") %>%
  dplyr::rename("Feature" = "...1")

## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

acount

## # A tibble: 64,102 × 9
##    Feature     SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 ENSG000000…        679        448        873        408       1138       1047
##  2 ENSG000000…          0          0          0          0          0          0
##  3 ENSG000000…        467        515        621        365        587        799
##  4 ENSG000000…        260        211        263        164        245        331
##  5 ENSG000000…         60         55         40         35         78         63
##  6 ENSG000000…          0          0          2          0          1          0
##  7 ENSG000000…       3251       3679       6177       4252       6721      11027
##  8 ENSG000000…       1433       1062       1733        881       1424       1439
##  9 ENSG000000…        519        380        595        493        820        714
## 10 ENSG000000…        394        236        464        175        658        584
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>

#differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")

## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dexp

## # A tibble: 15,926 × 10
##    feature  albut transcript ref_genome .abundant   logFC logCPM       F  PValue
##    <chr>    <chr> <chr>      <chr>      <lgl>       <dbl>  <dbl>   <dbl>   <dbl>
##  1 ENSG000… untrt TSPAN6     hg38       TRUE      -0.390    5.06 32.8    3.12e-4
##  2 ENSG000… untrt DPM1       hg38       TRUE       0.198    4.61  6.90   2.81e-2
##  3 ENSG000… untrt SCYL3      hg38       TRUE       0.0292   3.48  0.0969 7.63e-1
##  4 ENSG000… untrt C1orf112   hg38       TRUE      -0.124    1.47  0.377  5.55e-1
##  5 ENSG000… untrt CFH        hg38       TRUE       0.417    8.09 29.3    4.63e-4
##  6 ENSG000… untrt FUCA2      hg38       TRUE      -0.250    5.91 14.9    4.05e-3
##  7 ENSG000… untrt GCLC       hg38       TRUE      -0.0581   4.84  0.167  6.92e-1
##  8 ENSG000… untrt NFYA       hg38       TRUE      -0.509    4.13 44.9    1.00e-4
##  9 ENSG000… untrt STPG1      hg38       TRUE      -0.136    3.12  1.04   3.35e-1
## 10 ENSG000… untrt NIPAL3     hg38       TRUE      -0.0500   7.04  0.350  5.69e-1
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>

Joining data frames

Often related data is stored across multiple data frames. In such cases, while each data frame likely contains different types of data, an identifier column or key (e.g., sampleID) can be used to unite or combine aspects of the data.

There are a series of functions from dplyr devoted to the purpose of joining data frames. There are two types of joins: mutating joins and filtering joins.

Mutating joins

Imagine we have two data frames x and y. A mutating join will keep all columns from x and y by adding columns from y to x.

left_join() - Output contains all rows from x

return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8)

right_join() - Output contains all rows from y

return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8)

inner_join() - Output contains matched rows from x

return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. --- R documentation, dplyr (version 0.7.8)

Unmatched values from x will be dropped.

full_join() - Output contains all rows from x and y

return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. --- R documentation, dplyr (version 0.7.8)

Note

The R documentation for dplyr has been updated with dplyr v1.0.9. However, these descriptions still stand and are clearer (in my opinion) than the new documentation.

The most common type of join is the left_join(). Let's see this in action:

#reshape acount
acount_smeta<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
                    values_to= "Count") %>% left_join(smeta, by=c("Sample"="Run"))
acount_smeta

## # A tibble: 512,816 × 11
##    Feature        Sample Count SampleName cell  dex   albut avgLength Experiment
##    <chr>          <chr>  <dbl> <chr>      <chr> <chr> <chr>     <dbl> <chr>     
##  1 ENSG000000000… SRR10…   679 GSM1275862 N613… untrt untrt       126 SRX384345 
##  2 ENSG000000000… SRR10…   448 GSM1275863 N613… trt   untrt       126 SRX384346 
##  3 ENSG000000000… SRR10…   873 GSM1275866 N052… untrt untrt       126 SRX384349 
##  4 ENSG000000000… SRR10…   408 GSM1275867 N052… trt   untrt        87 SRX384350 
##  5 ENSG000000000… SRR10…  1138 GSM1275870 N080… untrt untrt       120 SRX384353 
##  6 ENSG000000000… SRR10…  1047 GSM1275871 N080… trt   untrt       126 SRX384354 
##  7 ENSG000000000… SRR10…   770 GSM1275874 N061… untrt untrt       101 SRX384357 
##  8 ENSG000000000… SRR10…   572 GSM1275875 N061… trt   untrt        98 SRX384358 
##  9 ENSG000000000… SRR10…     0 GSM1275862 N613… untrt untrt       126 SRX384345 
## 10 ENSG000000000… SRR10…     0 GSM1275863 N613… trt   untrt       126 SRX384346 
## # ℹ 512,806 more rows
## # ℹ 2 more variables: Sample.y <chr>, BioSample <chr>

Notice the use of by in left_join. The argument by requires the column or columns that we want to join by. If the column we want to join by has a different name, we can use the notation above, which says to match Sample from acount to Run from smeta.

Filtering joins

Filtering joins result in filtered x data based on matching or non-matching with y. These joins do not add columns from y to x.

semi_join()

return all rows from x where there are matching values in y, keeping just columns from x.
A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. --- R documentation, dplyr (version 0.7.8)

anti_join()

return all rows from x where there are not matching values in y, keeping just columns from x. --- R documentation, dplyr (version 0.7.8)

Let's see a brief example of semi-join:

#reshape acount
smeta_f<-smeta %>% filter(Run %in% c("SRR1039512","SRR1039508"))  

acount_L<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
                    values_to= "Count")

semi_join(acount_L,smeta_f, by=c("Sample"="Run"))

Note

This example does not use the %>%. This was simply to demonstrate the different "strategies" that can be used to set up and run code.

## # A tibble: 128,204 × 3
##    Feature         Sample     Count
##    <chr>           <chr>      <dbl>
##  1 ENSG00000000003 SRR1039508   679
##  2 ENSG00000000003 SRR1039512   873
##  3 ENSG00000000005 SRR1039508     0
##  4 ENSG00000000005 SRR1039512     0
##  5 ENSG00000000419 SRR1039508   467
##  6 ENSG00000000419 SRR1039512   621
##  7 ENSG00000000457 SRR1039508   260
##  8 ENSG00000000457 SRR1039512   263
##  9 ENSG00000000460 SRR1039508    60
## 10 ENSG00000000460 SRR1039512    40
## # ℹ 128,194 more rows

In this case, we could have used filter. However, if we were interested in filtering by multiple variables simultaneously, it would be easier to employ a semi-join.

Transforming variables

Data wrangling often involves transforming one variable to another. For example, we may be interested in log transforming a variable or adding two variables to create a third. In dplyr this can be done with mutate() and transmute(). These functions allow us to create a new variable from existing variables.

`mutate()`

mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name. --- dplyr.tidyverse.org

Let's create a column in our original differential expression data frame denoting significant transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater than or equal to 2).

dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)
head(dexp_sigtrnsc$Significant)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

This creates a column named Significant that contains TRUE values where the expression above was true (meaning significant in this case) and FALSE where the expression was FALSE.

Let's look at another example. This time let's log transform our FDR corrected p-values.

exmut<-dexp %>% mutate(logFDR = log10(FDR)) 
exmut["logFDR"]

## # A tibble: 15,926 × 1
##     logFDR
##      <dbl>
##  1 -2.55  
##  2 -1.11  
##  3 -0.0735
##  4 -0.166 
##  5 -2.42  
##  6 -1.73  
##  7 -0.100 
##  8 -2.90  
##  9 -0.320 
## 10 -0.158 
## # ℹ 15,916 more rows

Mutating several variables at once

What if we want to transform all of our counts spread across multiple columns in acount using scale(), which applies a z-score transformation? In this case we use across() within mutate(), which has replaced the scoped verbs (mutate_if,mutate_at, and mutate_all).

Z-score

A z score tells us the number of standard deviations from the mean of a given value. This can be achieved by scale(x, center = TRUE, scale = TRUE).

Let's see this in action.

acount %>% mutate(across(where(is.numeric),scale))

## # A tibble: 64,102 × 9
##    Feature         SRR1039508[,1] SRR1039509[,1] SRR1039512[,1] SRR1039513[,1]
##    <chr>                    <dbl>          <dbl>          <dbl>          <dbl>
##  1 ENSG00000000003         0.103          0.0527         0.0991         0.0643
##  2 ENSG00000000005        -0.0929        -0.100         -0.0821        -0.0887
##  3 ENSG00000000419         0.0418         0.0756         0.0468         0.0482
##  4 ENSG00000000457        -0.0179        -0.0281        -0.0275        -0.0272
##  5 ENSG00000000460        -0.0756        -0.0814        -0.0738        -0.0756
##  6 ENSG00000000938        -0.0929        -0.100         -0.0817        -0.0887
##  7 ENSG00000000971         0.845          1.16           1.20           1.51  
##  8 ENSG00000001036         0.321          0.262          0.278          0.242 
##  9 ENSG00000001084         0.0568         0.0295         0.0414         0.0962
## 10 ENSG00000001167         0.0208        -0.0196         0.0142        -0.0231
## # ℹ 64,092 more rows
## # ℹ 4 more variables: SRR1039516 <dbl[,1]>, SRR1039517 <dbl[,1]>,
## #   SRR1039520 <dbl[,1]>, SRR1039521 <dbl[,1]>

For further information on across, check out this great tutorial here.

Coercing variables with `mutate`

Mutate can also be used to coerce variables. Again, we need to use across() and where(). Let's also check out the difference between mutate() and transmute().

#compare transmute to mutate
ex_coerce<-acount_smeta %>% transmute(across(where(is.character),as.factor)) 
glimpse(ex_coerce)

## Rows: 512,816
## Columns: 9
## $ Feature    <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG0000…
## $ Sample     <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,…
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM1275870,…
## $ cell       <fct> N61311, N61311, N052611, N052611, N080611, N080611, N061011…
## $ dex        <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, trt,…
## $ albut      <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SRX3…
## $ Sample.y   <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SRS5…
## $ BioSample  <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, SAM…

#mutate
ex_coerce<-acount_smeta %>% mutate(across(where(is.character),as.factor)) 
glimpse(ex_coerce)

## Rows: 512,816
## Columns: 11
## $ Feature    <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG0000…
## $ Sample     <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,…
## $ Count      <dbl> 679, 448, 873, 408, 1138, 1047, 770, 572, 0, 0, 0, 0, 0, 0,…
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM1275870,…
## $ cell       <fct> N61311, N61311, N052611, N052611, N080611, N080611, N061011…
## $ dex        <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, trt,…
## $ albut      <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
## $ avgLength  <dbl> 126, 126, 126, 87, 120, 126, 101, 98, 126, 126, 126, 87, 12…
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SRX3…
## $ Sample.y   <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SRS5…
## $ BioSample  <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, SAM…

Notice that transmute dropped all columns that were not mutated.

Using `rowwise()` and `mutate()`

What if we wanted a new column that stored the mean of each row in our data frame?

Let's create a small data frame, and use mutate() to get the mean(). What happens when we use mean as is?

df<-data.frame(A=c(1,2,3),B=c(4,5,6),C=c(7,8,9))
df

##   A B C
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

df %>% mutate(D= mean(c(A,B,C)))

##   A B C D
## 1 1 4 7 5
## 2 2 5 8 5
## 3 3 6 9 5

df %>% mutate(D = (A+B+C)/3)

##   A B C D
## 1 1 4 7 4
## 2 2 5 8 5
## 3 3 6 9 6

The first example simply gives us the mean of A, B, and C (not row wise). The second example gave us what we wanted, but was more complicated. The alternative is to first group by row using rowwise() and then use mutate().

df %>% rowwise() %>% mutate(D= mean(c(A,B,C)))

## # A tibble: 3 × 4
## # Rowwise: 
##       A     B     C     D
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     7     4
## 2     2     5     8     5
## 3     3     6     9     6

See more uses of rowwise() operations here.

Group_by and summarize

There is an approach to data analysis known as "split-apply-combine", in which the data is split into smaller components, some type of analysis is applied to each component, and the results are combined. The dplyr functions including group_by() and summarize() are key players in this type of workflow.

group_by() allows us to group a data frame by a categorical variable so that a given operation can be performed per group / category.

Let's get the top five genes with the greatest median raw counts by treatment.

#Call the data 
acount_smeta %>% 
  # group_by dex and Feature (Feature nested within treatment)
  group_by(dex,Feature) %>%
  #for each group calculate the median value of raw counts
  summarize(median_counts=median(Count)) %>% 
  #arrange in descending order
  arrange(desc(median_counts),.by_group = TRUE) %>% 
  #return the top 5 values for each group
  slice_head(n=5)

## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.

## # A tibble: 10 × 3
## # Groups:   dex [2]
##    dex   Feature         median_counts
##    <chr> <chr>                   <dbl>
##  1 trt   ENSG00000115414       322164 
##  2 trt   ENSG00000011465       263587 
##  3 trt   ENSG00000156508       239676.
##  4 trt   ENSG00000198804       230992 
##  5 trt   ENSG00000116260       187288.
##  6 untrt ENSG00000011465       336076 
##  7 untrt ENSG00000115414       302956.
##  8 untrt ENSG00000156508       294097 
##  9 untrt ENSG00000164692       249846 
## 10 untrt ENSG00000198804       249206

#can skip arrange and use slice_max
acount_smeta %>% 
  group_by(dex,Feature) %>% 
  summarize(median_counts=median(Count))  %>% 
  slice_max(n=5, order_by=median_counts) #notice use of slice_max

## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.

## # A tibble: 10 × 3
## # Groups:   dex [2]
##    dex   Feature         median_counts
##    <chr> <chr>                   <dbl>
##  1 trt   ENSG00000115414       322164 
##  2 trt   ENSG00000011465       263587 
##  3 trt   ENSG00000156508       239676.
##  4 trt   ENSG00000198804       230992 
##  5 trt   ENSG00000116260       187288.
##  6 untrt ENSG00000011465       336076 
##  7 untrt ENSG00000115414       302956.
##  8 untrt ENSG00000156508       294097 
##  9 untrt ENSG00000164692       249846 
## 10 untrt ENSG00000198804       249206

How many rows per sample are in the acount_smeta data frame? We can use count() or summarize() paired with other functions (e.g., n(),tally()).

acount_smeta  %>% 
  count(dex, Sample)

## # A tibble: 8 × 3
##   dex   Sample         n
##   <chr> <chr>      <int>
## 1 trt   SRR1039509 64102
## 2 trt   SRR1039513 64102
## 3 trt   SRR1039517 64102
## 4 trt   SRR1039521 64102
## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102

acount_smeta  %>% 
  group_by(dex, Sample) %>% 
  summarize(n=n()) #there are multiple functions that can be used here

## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.

## # A tibble: 8 × 3
## # Groups:   dex [2]
##   dex   Sample         n
##   <chr> <chr>      <int>
## 1 trt   SRR1039509 64102
## 2 trt   SRR1039513 64102
## 3 trt   SRR1039517 64102
## 4 trt   SRR1039521 64102
## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102

This output makes sense, as there are 64,102 unique Ensembl ids (n_distinct(acount_smeta$Feature))

Missing Values

By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org

Let's see this in practice

#This is used to get the same result
#with a pseudorandom number generator like sample()
set.seed(138) 

#make mock data frame
fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
                   counts=sample(1:500,12,TRUE)) %>% 
  #Assign NAs if the value is less than 100. This is arbitrary. 
  mutate(counts=replace(counts, counts<100, NA))

#let's view
fun_df

##    genes counts
## 1      A     NA
## 2      A    214
## 3      A     NA
## 4      B    352
## 5      B    256
## 6      B     NA
## 7      C    400
## 8      C    381
## 9      C    250
## 10     D    278
## 11     D     NA
## 12     D    169

#Summarize mean, median, min, and max
fun_df %>% 
  group_by(genes) %>%
  summarize(
    mean_count = mean(counts),
    median_count = median(counts),
    min_count = min(counts),
    max_count = max(counts))

## # A tibble: 4 × 5
##   genes mean_count median_count min_count max_count
##   <chr>      <dbl>        <int>     <int>     <int>
## 1 A            NA            NA        NA        NA
## 2 B            NA            NA        NA        NA
## 3 C           344.          381       250       400
## 4 D            NA            NA        NA        NA

#use na.rm
fun_df %>% 
  group_by(genes) %>%
  summarize(
    mean_count = mean(counts, na.rm=TRUE),
    median_count = median(counts, na.rm=TRUE),
    min_count = min(counts, na.rm=TRUE),
    max_count = max(counts, na.rm=TRUE))

## # A tibble: 4 × 5
##   genes mean_count median_count min_count max_count
##   <chr>      <dbl>        <dbl>     <int>     <int>
## 1 A           214          214        214       214
## 2 B           304          304        256       352
## 3 C           344.         381        250       400
## 4 D           224.         224.       169       278

Similar to mutate, we can summarize across multiple columns at once using across(). For example, let's get the mean of logFC and logCPM.

dexp %>% 
  summarize(across(starts_with("Log"), mean))

## # A tibble: 1 × 2
##      logFC logCPM
##      <dbl>  <dbl>
## 1 -0.00859   3.71

Exporting files using the `write` functions

We have learned how to import data using the read functions, but how can we export / write out data? We can use a series of write functions. Some examples from readr include write_csv(), write_delim(), write_tsv(). Some examples from base R include write.csv(), write.table(), writeLines().

Let's export a tab delimited file containing acount_smeta.

write_tsv(acount_smeta, "countsANDmeta.txt", quote="none")

Acknowledgments

Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Chapter 13, Relational Data, from R for Data Science and Suzan Baert's dplyr tutorials.

dplyr: joining, tranforming, and summarizing data frames

Loading dplyr