Workshop 2: Loading and Manipulating Data
QCBS R Workshop Series
Québec Centre for Biodiversity Science
1 / 124

About this workshop

2 / 124

Required packages

install.packages(c('dplyr', 'tidyr', 'magrittr'))

3 / 124

Learning Objectives

1. Creating an R project

2. Writing a script

3. Loading, exploring, and saving data

4. Learn to manipulate data frames with `tidyr`, `dplyr`, and `maggritr`

4 / 124

Creating an R project5 / 124

RStudio projects

Projects make easy to keep your work organized.
All files, scripts, and documentation related to a specific project are bound together with a .Rproj file

Encourages reproducible code!

6 / 124

Creating a new project

Use the New project command

(available in the global toolbar: File -> New Project)

7 / 124

One project = one folder

Keep your files organized!

8 / 124

Preparing data for R

Dataset should be stored as comma-separated value files (.csv) in the working directory

Can be created from almost all applications (Excel, LibreOffice, Google Docs)
File -> Save as .csv…

9 / 124

Naming files

Avoid using spaces, accents or special characters within filenames.

Recommended

rawDatasetAgo2017.csv
co2_concentrations_QB.csv
01_figIntro.R

Not recommended

final.csv
safnnejs.csv
1-4.csv

10 / 124

Naming variables

Avoid using spaces, accents or special characters within variable names.

Use separators (e.g. "_") to add details

Recommended

Measurements
speciesNames
Site_001_Rep_002

Not recommended

a
3
complicatedverylongname

Check the tidyverse style guide for recommended coding habits.

11 / 124

Common data preparation mistakes

Here are a few examples of mistakes that originate confusion when preparing and manipulating data:

text in numeric columns;
typos;
numeric names for non-numeric data;

inconsistent formats for dates, numbers, metrics, and other units;
incorrect headings;
merged cells.

Read the publication by Broman & Woo (2017) on useful tips on data organization in spreadsheets.

12 / 124

Non-recommended data tables

13 / 124

Horrible data sheets

Not the best practices for data preservation and interoperability

14 / 124

Preparing data in `R`

It is possible to prepare your data entirely within R as it:

Saves time for large datasets;
Keeps original data intact;
Keeps track of the manipulation and transformation you did; and,
Can switch between long- and wide-formats easily (more details later).

Useful tips on data preparation can be found here: https://www.zoology.ubc.ca/~schluter/R/data/

15 / 124

Writing a Script16 / 124

`R` Scripts

What is a script?
- A text file storing the commands used for a particular analysis
Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!
- Just highlight text and click "Run" or press Command + Enter (Apple) or Ctrl + Enter / Ctrl + R (Windows and Linux)

17 / 124

Creating an `R` script

18 / 124

Creating an `R` script

19 / 124

Commands and comments

The # symbol in a script tells R to not run and ignore anything written past it within the same line

a <- 1 + 1
# This is a comment

b <- 2 + 2 # This is a comment

20 / 124

Why should I use `#`?

Annotating someone’s script is a good way to learn;
Tell collaborators what you did;
Good step towards reproducible science.

Be as detailed as possible!

21 / 124

It is recommended that you start your script with a header using comments:

22 / 124

Section Headings

You can make a section heading in R Studio with four # signs

# You can comment using this. 
# Look below how to create section headers:
#### Heading name ####
# You can also do this:
# Section One ---------------------------------
# Section Two =================================
### Section Three #############################

23 / 124

Section Headings

This allows you to move quickly between sections and hide sections

24 / 124

Preparing your `R` Environment: Housekeeping

The first command at the top of all scripts may be: rm(list = ls()).

This command:

Clears objects from the memory and the disk;
Prevents errors related to the use of out-dated data.

# Clear the R Workspace
rm(list = ls())
?rm
?ls

25 / 124

Preparing your `R` Environment: Housekeeping

DIY: Try adding some test data to R and then see how rm(list = ls()) removes it

A<-"Test" # Put some data in workspace
A <- "Test" # Add some spaces to organize your data!
A = "Test" # You can do this, but it does not mean you should
# Check objects in the workspace
ls()
# [1] "A"
A
# [1] "Test"
# Clean Workspace
rm(list=ls())
A

# Error in eval(expr, envir, enclos): object 'A' not found

26 / 124

Remember

R is ready for commands when you see the chevron > in the console. If you do not see it, press Esc
R is case sensitive!

27 / 124

Loading, exploring and saving data28 / 124

Download today's data

You can download the data for this workshop here:

https://github.com/QCBSRworkshops/workshop02/tree/main/pres-en/data

And the script here:

https://r.qcbs.ca/workshop02/book-en/workshop02-script-en.R

Save the files in the folder where your created your R project.

29 / 124

Download today's data

You can download the data for this workshop here:

https://github.com/QCBSRworkshops/workshop02/tree/main/pres-en/data

And the script here:

https://r.qcbs.ca/workshop02/book-en/workshop02-script-en.R

Save the files in the folder where your created your R project.

NOTE Many databases are already available within R

# Complete list of available data on base R
library(help = "datasets")

30 / 124

Working Directory

If you are not using an RStudio project, you have to tell R where your scripts and data are.Type getwd() in the console to see your current working directory:

getwd()

If you want to change your working directory use setwd():

setwd("C:/Users/my_new_path")

It is recommended to create an RStudio project and avoid using setwd() to promote reproducibility

31 / 124

Display contents of the directory

You can display contents of the working directory using dir():

dir()
#  [1] "assets"                   "data"                    
#  [3] "images"                   "qcbsR-fonts.css"         
#  [5] "qcbsR-header.html"        "qcbsR-macros.js"         
#  [7] "qcbsR.css"                "workshop02-en.Rproj"     
#  [9] "workshop02-pres-en_cache" "workshop02-pres-en.Rmd"

It helps to:

Check that the file you plan to open is present in the folder that R is currently working in
Check for correct spelling (e.g. co2_good.csv instead of CO2_good.csv)

32 / 124

Importing Data

Import data into R using read.csv():

CO2 <- read.csv("data/co2_good.csv", header=TRUE)

Note that this will:

Create a new object in R called CO2;
The file name is written within quotation marks ('file' or "file");
If you needed to fetch the file name from another directory that is not yours, you would have to write the full extension (e.g., "C:/Users/Mario/Downloads/co2_good.csv")
header=TRUE tells R that the first line of your dataset contains column names

33 / 124

Importing Data

Recall to find out what arguments the function requires, use help “?” I

?read.csv

Note that if your operating system or CSV editor (e.g. Excel) is in French, you may have to use read.csv2()

?read.csv2

34 / 124

Importing Data

Notice that RStudio now provides information on the CO2 data in your workspace.

The workspace refers to all the objects that you create during an R session.

35 / 124

Inspecting the data

R Command	Action
`CO2`	look at the whole dataframe
`view(CO2)`	look data as a spreadsheet
`head(CO2)`	look at the first few rows
`tail(CO2)`	look at the last few rows
`names(CO2)`	names of the columns in the dataframe
`attributes(CO2)`	attributes of the dataframe
`dim(CO2)`	dimensions of the dataframe
`ncol(CO2)`	number of columns
`nrow(CO2)`	number of rows
`summary(CO2)`	basic statistics for each variables

NOTE These functions could also be used with other objects such as vector, matrix, list, array, etc.

36 / 124

Inspecting the data

str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : chr  "Qn1" "Qn1" "Qn1" "Qn1" ...
#  $ Type     : chr  "Quebec" "Quebec" "Quebec" "Quebec" ...
#  $ Treatment: chr  "nonchilled" "nonchilled" "nonchilled" "nonchilled" ...
#  $ conc     : int  95 175 250 350 500 675 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

This shows the structure of the dataframe. Very useful to check data type (mode) of all columns to make sure R loaded data properly.

Note: the CO2 dataset includes repeated measurements of CO2 uptake from 6 plants from Quebec and 6 plants from Mississippi at several levels of CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.

37 / 124

Inspecting the data

Common problems:

Factors loaded as text (character) or viceversa
Factors includes too many (and unexpected) levels because of typos
Data (integer or numeric) is loaded as character because of typos (e.g. a space or a "," instead of a "." to separate decimal numbers)

38 / 124

Exercise

Load the data with:

CO2 <- read.csv("data/co2_good.csv", header = FALSE)

Check data types with str() again.

What is wrong here?

39 / 124

Exercise

Load the data with:

CO2 <- read.csv("data/co2_good.csv", header = FALSE)

Check data types with str() again.

What is wrong here?

Do not forget to re-load data with header = TRUE afterwards

40 / 124

Reminder from Workshop 1: Accessing data

Imagine a data frame called mydata:

mydata[2,3] # Extracts the content of row 2 / column 3
mydata[1,] # Extracts the first row
mydata[,1] # Extracts the first column
mydata[,1][2] # [...] can be also be used recursively
mydata$Variable1 # Also extracts the first column

41 / 124

Renaming variables

Variables names can be changed within R.

# First lets make a copy of the dataset to play with
CO2copy <- CO2
# names() gives you the names of
# the variables present in the data frame
names(CO2copy)
# [1] "Plant"     "Type"      "Treatment" "conc"      "uptake"
# Changing from English to French names 
# (make sure you have the same levels!)
names(CO2copy) <- c("Plante","Categorie", 
                    "Traitement", "conc",
                    "absortion")
names(CO2copy)
# [1] "Plante"     "Categorie"  "Traitement" "conc"       "absortion"

42 / 124

Creating new variables

Variables and strings can be concatenated together. The function paste() is very useful for concatenating. See ?paste and ?paste0.

# Let us create an unique id for our samples:
# Do not forget to use "" for strings
CO2copy$uniqueID <- paste0(CO2copy$Plante,
                           "_",CO2copy$Categorie,
                           "_", CO2copy$Traitement)
# observe the results
head(CO2copy$uniqueID)
# [1] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
# [4] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"

43 / 124

Creating new variables

Creating new variables works for numbers and mathematical operations as well!

# Let us standardize our variable "absortion" to relative values
CO2copy$absortionRel <- CO2copy$absortion/max(CO2copy$absortion)
# Observe the results
head(CO2copy$absortionRel)

44 / 124

Subsetting data

There are many ways to subset a data frame

# Let us keep working with our CO2copy data frame
# Select only "Plante" and "absortionRel" columns.
# (Do not forget the ","!)
CO2copy[,c("Plante", "absortionRel")]
# Subset data frame from rows from 1 to 50
CO2copy[1:50,]

45 / 124

Subsetting data

# Select observations matching only the nonchilled Traitement.
CO2copy[CO2copy$Traitement == "nonchilled",]
# Select observations with absortion 
# higher or equal to 20
CO2copy[CO2copy$absortion >= 20, ]
# Select observations with absortion 
# higher or equal to 20
CO2copy[CO2copy$Traitement  == "nonchilled" &
          CO2copy$absortion >= 20, ]
# We are done playing with the Dataset copy, lets erase it.
rm(CO2copy)

Go here to check all the logical operators you can use

46 / 124

Data exploration

A good way to start your data exploration is to look at some basic statistics of your dataset.

Use the summary() function to do that!

summary(CO2)

This is also useful to spot some errors you might have missed!

47 / 124

Data exploration

You can also use other functions to calculate basic statistics on parts of your data frame. Let us try the mean(), sd() and hist() functions:

# Calculate the mean and the standard deviation 
# of the CO2 concentration:
# Assign them to new variables
meanConc <- mean(CO2$conc)
sdConc <- sd(CO2$conc)

# print() prints any given value to the R console
print(paste("the mean of concentration is:", meanConc))
# [1] "the mean of concentration is: 435"
print(paste("the standard deviation of concentration is:", sdConc))
# [1] "the standard deviation of concentration is: 295.924119222056"

48 / 124

Data exploration

# Let us plot a histogram to explore the distribution of "uptake"
hist(CO2$uptake)

49 / 124

Data exploration

# Increasing the number of bins to observe better the pattern
hist(CO2$uptake, breaks = 40)

50 / 124

The `apply()` function

The function apply() can be used to apply a function to multiple columns of your data simultaneously. Use the ?apply command to get more information about apply().

?apply

To use apply, you have to specify three arguments. The first argument is the data you would like to apply the function to; the second argument is whether you would like to calculate based on rows (1) or columns (2) of your dataset; the third argument is the function you would like to apply. For example:

# Calculate mean of the two columns in the data frame that contain continuous data
apply(CO2[,4:5], MARGIN = 2, FUN = mean)

51 / 124

Saving your Workspace

# Saving an R workspace file that stores all your objects
save.image(file="data/co2_project_Data.RData")
# Clear your memory
rm(list = ls())
# Reload your data
load("data/co2_project_Data.RData")
head(CO2) # Looking good!

#   Plant   Type  Treatment conc uptake
# 1   Qn1 Quebec nonchilled   95   16.0
# 2   Qn1 Quebec nonchilled  175   30.4
# 3   Qn1 Quebec nonchilled  250   34.8
# 4   Qn1 Quebec nonchilled  350   37.2
# 5   Qn1 Quebec nonchilled  500   35.3
# 6   Qn1 Quebec nonchilled  675   39.2

52 / 124

Exporting data

R disposes of write functions that allow you to write objects directly to files in your computer. Let us use the write.csv() function to save our CO2 data into a .CSV file:

write.csv(CO2, file = "data/co2_new.csv")

Note that our arguments are both:

CO2 Object (name)

"co2_new.csv" File to write (name)

53 / 124

Challenge: Use your data

Try to load, explore, plot, and save your own data in R;
If it does not load properly, try to make the appropriate changes;
When you are finished, try opening your exported data in Excel, Numbers, Notepad or TextEdit.

If you don’t have your own data, work with your neighbor

Remember to clean your workspace

54 / 124

Database repair55 / 124

Database repair

Getting your data working properly can be tougher than you think!

For example, sharing data from an Apple computer to Windows, or between computers set up in different continents can lead to incompatible files (e.g. different decimal separators).

Let us practice how to solve these common errors.

56 / 124

Database repair

Exercise:

Read the file co2_broken.csv

CO2 <- read.csv("data/co2_broken.csv")
head(CO2)
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling due.to.excessive  X
# 1 falling on my notebook numerous values can't be read rain               NA NA
# 2                      Plant\tType\tTreatment\tconc\tuptake               NA NA
# 3                           Qn1\tQuebec\tnonchilled\t95\t16               NA NA
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4               NA NA
# 5           Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes               NA NA
# 6                        Qn1\tQuebec\tnonchilled\t350\t37.2               NA NA
#   X.1 X.2 X.3
# 1  NA  NA  NA
# 2  NA  NA  NA
# 3  NA  NA  NA
# 4  NA  NA  NA
# 5  NA  NA  NA
# 6  NA  NA  NA

57 / 124

Database repair

Some useful functions:

head() - first few rows
str() - structure of data
class() - class of the object
unique() - unique observations
levels() - levels of a factor
which() - ask a question to your data frame
droplevels() - get rid of undesired levels after sub setting factors

add "?" (e.g. ?read.csv to the function name to access the function help page)

HINT There are 4 problems!

58 / 124

Database repair

ERROR 1 The data appears to be lumped into one column

head(CO2)
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling due.to.excessive  X
# 1 falling on my notebook numerous values can't be read rain               NA NA
# 2                      Plant\tType\tTreatment\tconc\tuptake               NA NA
# 3                           Qn1\tQuebec\tnonchilled\t95\t16               NA NA
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4               NA NA
# 5           Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes               NA NA
# 6                        Qn1\tQuebec\tnonchilled\t350\t37.2               NA NA
#   X.1 X.2 X.3
# 1  NA  NA  NA
# 2  NA  NA  NA
# 3  NA  NA  NA
# 4  NA  NA  NA
# 5  NA  NA  NA
# 6  NA  NA  NA

59 / 124

Database repair

ERROR 1 - Solution

Re-import the data, but specify the separation among entries
The sep argument tells R what character separates the values on each line of the file
Here, "TAB" was used instead of ","

CO2 <- read.csv("data/co2_broken.csv",sep = "")

60 / 124

Database repair

ERROR 2 The data does not start until the third line of the file, so you end up with notes on the file as the headings.

head(CO2)
#     NOTE.     It       rain        a               lot    in. Quebec during
# 1 falling     on         my notebook          numerous values  can't     be
# 2   Plant   Type  Treatment     conc            uptake                     
# 3     Qn1 Quebec nonchilled       95                16                     
# 4     Qn1 Quebec nonchilled      175              30.4                     
# 5     Qn1 Quebec nonchilled      250 cannot_read_notes                     
# 6     Qn1 Quebec nonchilled      350              37.2                     
#   sampling.      due to excessive X....
# 1      read rain,,,, NA        NA    NA
# 2                    NA        NA    NA
# 3                    NA        NA    NA
# 4                    NA        NA    NA
# 5                    NA        NA    NA
# 6                    NA        NA    NA

61 / 124

Database repair

ERROR 2 - Solution

Skip two lines when loading the file using the "skip" argument:

CO2 <- read.csv("data/co2_broken.csv", sep = "", skip = 2)
head(CO2)
#   Plant   Type  Treatment              conc            uptake
# 1   Qn1 Quebec nonchilled                95                16
# 2   Qn1 Quebec nonchilled               175              30.4
# 3   Qn1 Quebec nonchilled               250 cannot_read_notes
# 4   Qn1 Quebec nonchilled               350              37.2
# 5   Qn1 Quebec nonchilled               500              35.3
# 6   Qn1 Quebec nonchilled cannot_read_notes              39.2

62 / 124

Database repair

ERROR 3 conc and uptake variables are considered factors instead of numbers, because there are comments in the numeric columns

str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : chr  "Qn1" "Qn1" "Qn1" "Qn1" ...
#  $ Type     : chr  "Quebec" "Quebec" "Quebec" "Quebec" ...
#  $ Treatment: chr  "nonchilled" "nonchilled" "nonchilled" "nonchilled" ...
#  $ conc     : chr  "95" "175" "250" "350" ...
#  $ uptake   : chr  "16" "30.4" "cannot_read_notes" "37.2" ...

unique(CO2$conc)
# [1] "95"                "175"               "250"              
# [4] "350"               "500"               "cannot_read_notes"
# [7] "1000"              "675"

Due to missing values entered as "cannot_read_notes" and "na"
Recall that R only recognizes "NA" (capital)

63 / 124

`?read.csv`

64 / 124

Database repair

ERROR 3 - Solution

Tell R that all of NA, "na", and "cannot_read_notes" should be considered NA. Then because all other values in those columns are numbers, conc and uptake will be loaded as numeric/integer.

CO2 <- read.csv("data/co2_broken.csv",
                sep = "", skip = 2,
                na.strings = c("NA","na",
                               "cannot_read_notes"))
str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : chr  "Qn1" "Qn1" "Qn1" "Qn1" ...
#  $ Type     : chr  "Quebec" "Quebec" "Quebec" "Quebec" ...
#  $ Treatment: chr  "nonchilled" "nonchilled" "nonchilled" "nonchilled" ...
#  $ conc     : int  95 175 250 350 500 NA 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

65 / 124

Database repair

ERROR 4

There are only 2 treatments (chilled and non-chilled) but there are spelling errors causing it to look like 4 different treatments.

str(CO2)

levels(CO2$Treatment)
# NULL
unique(CO2$Treatment)
# [1] "nonchilled" "nnchilled"  "chilled"    "chiled"

66 / 124

Database repair

ERROR 4 - Solution

# Identify all rows that contain "nnchilled" 
# and replace with "nonchilled"
CO2$Treatment[CO2$Treatment=="nnchilled"] <- "nonchilled"
# Identify all rows that contain "chiled" 
# and replace with "chilled"
CO2$Treatment[CO2$Treatment=="chiled"] <- "chilled"

67 / 124

Database repair

ERROR 4 - Solution

After repairing the factors, we need to remove the unused factor levels

If not :

boxplot(uptake ~ Treatment, data = CO2)

68 / 124

Database repair

ERROR 4 - Solution

CO2 <- droplevels(CO2)
str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : chr  "Qn1" "Qn1" "Qn1" "Qn1" ...
#  $ Type     : chr  "Quebec" "Quebec" "Quebec" "Quebec" ...
#  $ Treatment: chr  "nonchilled" "nonchilled" "nonchilled" "nonchilled" ...
#  $ conc     : int  95 175 250 350 500 NA 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

69 / 124

Database repair

boxplot(uptake ~ Treatment, data = CO2)

70 / 124

Welcome to the tidyverse world!71 / 124

What is Tidyverse?

Tidyverse is a collection of packages designed for data science.

Tidyverse provides more efficient and user-friendly tools that you are likely to use in everyday data analyses.

All packages included in tidyverse are automatically installed when installing the tidyverse package: install.packages("tidyverse")

Tidyverse includes:

tidyr: tidy data
dplyr: data manipulation
maggritr: provide piping operators
ggplot2: make graphics
readr: read data (faster!)
lubridate: manipulate data-time data
and more!

72 / 124

Learn to manipulate data withtidyr, dyplr, and maggritr73 / 124

Using `tidyr` to reshape data frames

library(tidyr)

74 / 124

Data formats

Wide format

Data format has a separate column for each variable or each factor in your study

#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55

Long format

Data format has a column for possible variables and a column for the values of those variables

# # A tibble: 6 × 3
#   Species name   value
#   <chr>   <chr>  <dbl>
# 1 Oak     DBH       12
# 2 Oak     Height    56
# 3 Elm     DBH       20
# 4 Elm     Height    85
# 5 Ash     DBH       13
# 6 Ash     Height    55

Wide data frame can be used for some basic plotting in ggplot2, but more complex plots require long format (example to come)

dplyr, lm(), glm(), gam() all require long data format

75 / 124

Tidying your data

Tidying allows you to manipulate the structure of your data while preserving all original information

pivot_longer() - convert from wide to long format (gather)

pivot_wider() - convert from long to wide format (spread)

scale:90%

76 / 124

`tidyr` installation

install.packages("tidyr")
library(tidyr)

77 / 124

Gather columns into rows

pivot_longer(data, cols, names_to, values_to, ...)

data A data frame (e.g. wide)
cols name or numeric index of the columns we wish to gather (e.g. DBH, Height)
names_to name of the new column containing variable names (e.g. Dimension)
values_to name of the new column containing variable values (e.g. cm)

78 / 124

Example of `pivot_longer`

wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                   DBH     = c(12, 20, 13),
                   Height  = c(56, 85, 55))
wide
#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55

long <- pivot_longer(data      = wide, 
                     cols      = c("DBH", "Height"),
                     names_to  = "dimension", 
                     values_to = "cm")
long
# # A tibble: 6 × 3
#   Species dimension    cm
#   <chr>   <chr>     <dbl>
# 1 Oak     DBH          12
# 2 Oak     Height       56
# 3 Elm     DBH          20
# 4 Elm     Height       85
# 5 Ash     DBH          13
# 6 Ash     Height       55

79 / 124

Spread rows into columns

pivot_wider(data, names_from, values_from, ...)

data A data frame (e.g. long)
names_from name of the column containing the variable names (e.g. dimension)
values_from name of the column containing variable values (e.g. cm)

80 / 124

Disperser avec `pivot_wider()`

long
# # A tibble: 6 × 3
#   Species dimension    cm
#   <chr>   <chr>     <dbl>
# 1 Oak     DBH          12
# 2 Oak     Height       56
# 3 Elm     DBH          20
# 4 Elm     Height       85
# 5 Ash     DBH          13
# 6 Ash     Height       55

large2 <- pivot_wider(data        = long, 
                      names_from  = "dimension", 
                      values_from = "cm")
large2
# # A tibble: 3 × 3
#   Species   DBH Height
#   <chr>   <dbl>  <dbl>
# 1 Oak        12     56
# 2 Elm        20     85
# 3 Ash        13     55

81 / 124

A `tibble` structure for your data

Tibble is an alternate, more convenient, version for a data frame. Using tibble ensures good coding practices that can be missed using data.frame. For instance, it does not change the type of inputs (e.g. string to factors).

tibble(x = 1:3, y = c("a","b","c"))
# # A tibble: 3 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2     2 b    
# 3     3 c

82 / 124

A `tibble` structure for your data

In addition, tibble simplifies the use of nested lists.

Example:

tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
# # A tibble: 3 × 2
#       x y         
#   <int> <list>    
# 1     1 <int [5]> 
# 2     2 <int [10]>
# 3     3 <int [20]>

You can use with a tibble all the functions applied to a data.frame.

83 / 124

`separate` columns

separate() splits a columns by a character string separator

separate(data, col, into, sep)

data A data frame (e.g. long)
col Name of the column you wish to separate
into Names of new variables to create
sep Character which indicates where to separate

84 / 124

Using `separate()` example

Create a fictional dataset about fish and plankton

set.seed(8)
messy <- data.frame(id = 1:4,
                    trt = sample(rep(c('control', 'farm'), each = 2)),
                    zooplankton.T1 = runif(4),
                    fish.T1 = runif(4),
                    zooplankton.T2 = runif(4),
                    fish.T2 = runif(4))
messy
#   id     trt zooplankton.T1    fish.T1 zooplankton.T2   fish.T2
# 1  1    farm      0.7189275 0.64449114    0.544962116 0.2644589
# 2  2    farm      0.2908734 0.45704489    0.138224346 0.2765322
# 3  3 control      0.9322698 0.08930101    0.927812252 0.5211070
# 4  4 control      0.7691470 0.43239137    0.001301721 0.2236889

85 / 124

Using `separate()` example

First convert the messy data frame from wide to long format

messy.long <- pivot_longer(messy,
                           names_to = "taxa",
                           cols     = c("zooplankton.T1", 
                                        "fish.T1",
                                        "zooplankton.T2",
                                        "fish.T2"))
head(messy.long)
# # A tibble: 6 × 4
#      id trt   taxa           value
#   <int> <chr> <chr>          <dbl>
# 1     1 farm  zooplankton.T1 0.719
# 2     1 farm  fish.T1        0.644
# 3     1 farm  zooplankton.T2 0.545
# 4     1 farm  fish.T2        0.264
# 5     2 farm  zooplankton.T1 0.291
# 6     2 farm  fish.T1        0.457

86 / 124

Using `separate()` example

Then we want to split the 2 sampling time (T1 and T2).

messy.long.sep <- separate(messy.long, taxa,
                           into = c("species", "time"), sep = "\\.")
head(messy.long.sep)
# # A tibble: 6 × 5
#      id trt   species     time  value
#   <int> <chr> <chr>       <chr> <dbl>
# 1     1 farm  zooplankton T1    0.719
# 2     1 farm  fish        T1    0.644
# 3     1 farm  zooplankton T2    0.545
# 4     1 farm  fish        T2    0.264
# 5     2 farm  zooplankton T1    0.291
# 6     2 farm  fish        T1    0.457

The argument sep = "\\." tells R to splits the character string around the period (.). We cannot type directly "." because it is a regular expression that matches any single character.

87 / 124

Recap of `tidyr`

A package that reshapes the layout of data sets.

Converting from wide to long format using pivot_longer()

Converting from long format to wide format using pivot_wider()

Split and merge columns with unite() and separate()

Data Wrangling with dplyr and tidyr Cheat Sheet

88 / 124

Challenge #1

Using the airquality dataset, gather all the columns (except Month and Day) into rows.

?airquality
data(airquality)

89 / 124

Solution #1

Using the airquality dataset, gather all the columns (except Month and Day) into rows.

air.long <- pivot_longer(airquality,
                         cols     =  c("Ozone", "Solar.R", "Wind", "Temp"), 
                         names_to = c("variable") )
head(air.long)
# # A tibble: 6 × 4
#   Month   Day variable value
#   <int> <int> <chr>    <dbl>
# 1     5     1 Ozone     41  
# 2     5     1 Solar.R  190  
# 3     5     1 Wind       7.4
# 4     5     1 Temp      67  
# 5     5     2 Ozone     36  
# 6     5     2 Solar.R  118

90 / 124

Challenge #2

Spread the resulting data frame to return to the original data format.

91 / 124

Solution #2

Spread the resulting data frame to return to the original data format.

air.wide <- pivot_wider(air.long,
                        values_from = "value",
                        names_from  = "variable")
head(air.wide)
# # A tibble: 6 × 6
#   Month   Day Ozone Solar.R  Wind  Temp
#   <int> <int> <dbl>   <dbl> <dbl> <dbl>
# 1     5     1    41     190   7.4    67
# 2     5     2    36     118   8      72
# 3     5     3    12     149  12.6    74
# 4     5     4    18     313  11.5    62
# 5     5     5    NA      NA  14.3    56
# 6     5     6    28      NA  14.9    66

92 / 124

Data manipulation with `dplyr`

93 / 124

Intro to `dplyr`

Package that contains a set of functions (or “verbs”) for data manipulation such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data;
Easy and intuitive functions
Fast and efficient
Can interface with external databases and translate your R code into SQL queries

Some corresponding R base functions: split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate()

94 / 124

Intro to `dplyr`

library(dplyr)

95 / 124

Basic functions in `dplyr`

These 4 core functions tackle the most common manipulations when working with data frames

select(): select columns from a data frame
filter(): filter rows according to defined criteria
arrange(): re-order data based on criteria (e.g. ascending, descending)
mutate(): create or transform values in a column

96 / 124

`select()` columns

select(data, ...)

... Can be column names or positions or complex expressions separated by commas

Examples:

select(data, column1, column2) # select columns 1 and 2
select(data, c(2:4,6)) # select columns 2 to 4 and 6
select(data, -column1) # select all columns except column 1
select(data, start_with(x.)) # select all columns that start with "x."

97 / 124

`select()` columns

98 / 124

`select()` columns

Example: suppose we are only interested in the variation of Ozone over time within the airquality dataset

ozone <- select(airquality, Ozone, Month, Day)
head(ozone)
#   Ozone Month Day
# 1    41     5   1
# 2    36     5   2
# 3    12     5   3
# 4    18     5   4
# 5    NA     5   5
# 6    28     5   6

99 / 124

`filter()` rows

Extract a subset of rows that meet one or more specific conditions

filter(dataframe, logical statement 1, logical statement 2, ...)

100 / 124

`filter()` rows

Example: we are interested in analyses that focus on the month of August during high temperature events

august <- filter(airquality, Month == 8, Temp >= 90)
# same as: filter(airquality, Month == 8 & Temp >= 90)
head(august)
#   Ozone Solar.R Wind Temp Month Day
# 1    89     229 10.3   90     8   8
# 2   110     207  8.0   90     8   9
# 3    NA     222  8.6   92     8  10
# 4    76     203  9.7   97     8  28
# 5   118     225  2.3   94     8  29
# 6    84     237  6.3   96     8  30

101 / 124

Sort rows with `arrange`

Re-order rows by a particular column, by default in ascending order

Use desc() for descending order.

arrange(data, variable1, desc(variable2), ...)

102 / 124

Sort rows with `arrange()`

Example:

Let us use the following code to create a scrambled version of the airquality dataset

air_mess <- sample_frac(airquality, 1)
head(air_mess)
#   Ozone Solar.R Wind Temp Month Day
# 1    23     115  7.4   76     8  18
# 2    28     273 11.5   82     8  13
# 3     8      19 20.1   61     5   9
# 4   135     269  4.1   84     7   1
# 5    23     299  8.6   65     5   7
# 6    30     322 11.5   68     5  19

103 / 124

Sort rows with `arrange()`

Example:

Now let us arrange the data frame back into chronological order, sorting by Month then Day

air_chron <- arrange(air_mess, Month, Day)
head(air_chron)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

Try : arrange(air_mess, Day, Month) and see the difference.

104 / 124

Create new columns using `mutate()`

Compute and add new columns

mutate(data, newVar1 = expression1, newVar2 = expression2, ...)

105 / 124

Create new columns using `mutate()`

Example: we want to convert the temperature variable form degrees Fahrenheit to degrees Celsius

airquality_C <- mutate(airquality,
                       Temp_C = (Temp-32)*(5/9))
head(airquality_C)
#   Ozone Solar.R Wind Temp Month Day   Temp_C
# 1    41     190  7.4   67     5   1 19.44444
# 2    36     118  8.0   72     5   2 22.22222
# 3    12     149 12.6   74     5   3 23.33333
# 4    18     313 11.5   62     5   4 16.66667
# 5    NA      NA 14.3   56     5   5 13.33333
# 6    28      NA 14.9   66     5   6 18.88889

106 / 124

`magrittr`

Usually data manipulation require multiple steps, the magrittr package offers a pipe operator %>% which allows us to link multiple operations

107 / 124

`magrittr`

library(magrittr)

108 / 124

`magrittr`

Suppose we want to analyse only the month of June, then convert the temperature variable to degrees Celsius. We can create the required data frame by combining 2 dplyr verbs we learned

june_C <- mutate(filter(airquality, 
                        Month == 6),
                 Temp_C = (Temp-32)*(5/9))

As we add more operations, wrapping functions one inside the other becomes increasingly illegible. But, step by step would be redundant and write a lot of objects to the workspace.

109 / 124

`magrittr`

Alternatively, we can use maggritr's pipe operator to link these successive operations

june_C <- airquality %>%
          filter(Month == 6) %>%
          mutate(Temp_C = (Temp-32)*(5/9))

Advantages :

less redundant code
easy to read and write because functions are executed in order

110 / 124

`group_by()` and `summarise()`

The dplyr verbs become especially powerful when they are are combined using the pipe operator %>%. The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually

group_by() : group data frame by a factor for downstream operations (usually summarise)

summarise() : summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc…)

111 / 124

`dplyr` - Split-Apply-Combine

The group_by function is key to the Split-Apply-Combine strategy

112 / 124

`dplyr` - Split-Apply-Combine

113 / 124

`dplyr` - Split-Apply-Combine

Example: we are interested in the mean temperature and standard deviation within each month if the airquality dataset

month_sum <- airquality %>%
      group_by(Month) %>%
      summarise(mean_temp = mean(Temp),
                sd_temp = sd(Temp))
month_sum
# # A tibble: 5 × 3
#   Month mean_temp sd_temp
#   <int>     <dbl>   <dbl>
# 1     5      65.5    6.85
# 2     6      79.1    6.60
# 3     7      83.9    4.32
# 4     8      84.0    6.59
# 5     9      76.9    8.36

114 / 124

Challenge - `dplyr` and `magrittr`

Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study.

Employ dplyr verbs and the %>% operator.

## ?ChickWeight
data(ChickWeight)

115 / 124

Solution

Use group_by() to divide the dataset by Chick
Use summarise() to calculate the weight gain within each group

weight_diff <- ChickWeight %>%
               group_by(Chick) %>%
               summarise(weight_diff = max(weight) - min(weight))

head(weight_diff)
# # A tibble: 6 × 2
#   Chick weight_diff
#   <ord>       <dbl>
# 1 18              4
# 2 16             16
# 3 15             27
# 4 13             55
# 5 9              58
# 6 20             76

116 / 124

Ninja challenge

Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study.
Employ dplyr verbs and the %>% operator.

Hint: first() and last() may be useful here

117 / 124

Ninja hint

Note that we can group the data frame using more than one factor, using the general syntax as follows: group_by(group1, group2, ...)

Within group_by(), the multiple groups create a layered onion, and each subsequent single use of the summarise() function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on group2, the resulting data set would remain grouped by group1 for downstream operations.

118 / 124

Ninja solution

diet_summ <- ChickWeight %>%
             group_by(Diet, Chick) %>%
             summarise(weight_gain = last(weight) - first(weight)) %>%
             group_by(Diet) %>%
             summarise(mean_gain = mean(weight_gain))

diet_summ
# # A tibble: 4 × 2
#   Diet  mean_gain
#   <fct>     <dbl>
# 1 1          115.
# 2 2          174 
# 3 3          230.
# 4 4          188.

119 / 124

R base challenge

Now try to redo the previous challenge only with functions from the R base!

Note that several solutions are possible.

Hint: the function ?aggregate() or ?by() may be useful here.

120 / 124

R base solution


weight_diff_rbase <- aggregate(weight ~ Chick, 
                               data    = ChickWeight,
                               FUN     = function(x) weight_diff = max(x) - min(x))
names(weight_diff_rbase) <- c("Chick", "weight_diff")

# Is the r base output identical to the dplyr output?
table(weight_diff_rbase ==  as.data.frame(weight_diff))
# 
# TRUE 
#  100

121 / 124

dplyr - Merging data frames

In addition to all the operations we have explored, dplyr also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other R packages:

left_join()
right_join()
inner_join()
anti_join()

These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs.

122 / 124

More on data manipulation

Learn more on dplyr

dplyr and tidyr cheatsheet

123 / 124

Thank you for attending!

124 / 124

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Workshop 2: Loading and Manipulating Data

QCBS R Workshop Series

Québec Centre for Biodiversity Science

About this workshop

Required packages

Learning Objectives

1. Creating an R project

2. Writing a script

3. Loading, exploring, and saving data

4. Learn to manipulate data frames with tidyr, dplyr, and maggritr

Creating an R project

RStudio projects

Creating a new project

One project = one folder

Preparing data for R

Naming files

Naming variables

Common data preparation mistakes

Non-recommended data tables

Horrible data sheets

Preparing data in R

Writing a Script

R Scripts

Creating an R script

Creating an R script

Commands and comments

Why should I use #?

Header

Section Headings

Section Headings

Preparing your R Environment: Housekeeping

Preparing your R Environment: Housekeeping

Remember

Loading, exploring and saving data

Download today's data

Download today's data

Working Directory

Display contents of the directory

Importing Data

Importing Data

Importing Data

Inspecting the data

Inspecting the data

Inspecting the data

Exercise

Exercise

Reminder from Workshop 1: Accessing data

Renaming variables

Creating new variables

Creating new variables

Subsetting data

Subsetting data

Data exploration

Data exploration

Data exploration

Data exploration

The apply() function

Saving your Workspace

Exporting data

Challenge: Use your data

Database repair

Database repair

Database repair

Database repair

Database repair

Database repair

Database repair

Database repair

Database repair

?read.csv

Database repair

Database repair

Database repair

Database repair

Database repair

Database repair

Welcome to the tidyverse world!

What is Tidyverse?

Learn to manipulate data with

tidyr, dyplr, and maggritr

4. Learn to manipulate data frames with `tidyr`, `dplyr`, and `maggritr`

Preparing data in `R`

`R` Scripts

Creating an `R` script

Creating an `R` script

Why should I use `#`?

Preparing your `R` Environment: Housekeeping

Preparing your `R` Environment: Housekeeping

The `apply()` function

`?read.csv`

Using `tidyr` to reshape data frames

`tidyr` installation

Example of `pivot_longer`

Disperser avec `pivot_wider()`

A `tibble` structure for your data

A `tibble` structure for your data

`separate` columns

Using `separate()` example

Using `separate()` example

Using `separate()` example

Recap of `tidyr`

Data manipulation with `dplyr`

Intro to `dplyr`

Intro to `dplyr`

Basic functions in `dplyr`

`select()` columns

`select()` columns

`select()` columns

`filter()` rows

`filter()` rows

Sort rows with `arrange`

Sort rows with `arrange()`

Sort rows with `arrange()`

Create new columns using `mutate()`

Create new columns using `mutate()`

`magrittr`

`magrittr`

`magrittr`

`magrittr`

`group_by()` and `summarise()`

`dplyr` - Split-Apply-Combine

`dplyr` - Split-Apply-Combine

`dplyr` - Split-Apply-Combine

Challenge - `dplyr` and `magrittr`