Workshop 2: Loading and Manipulating Data

class: center, middle, inverse, title-slide

# Workshop 2: Loading and Manipulating Data
## QCBS R Workshop Series
### Qu?bec Centre for Biodiversity Science

---

## Learning Objectives

1. Creating an R project
2. Writing a script
3. Loading, exploring and saving data
4. Learn to manipulate data frames with tidyr, dplyr, maggritr

---
## RStudio projects

* What is this?
  - Within RStudio, Projects make it easy to separate and keep your work organized.  
  - All files, scripts, documentation related to a specific project are bound together

* Encourages reproducibility and easy sharing.

---
## Create a new project

Use the **Create project** command (available in the Projects menu and the global toolbar)

.center[
![:scale 60%](images/0_Create_a_new_project.png)]

---
## Keep your files organized!

One project = one folder :)

.center[
![:scale 60%](images/0_folderData1.png)]

---
## Preparing data for R

Dataset should be stored as comma-separated value files (.csv) in the working directory

* Can be created from almost all applications (Excel, LibreOffice, GoogleDocs)
* File -> Save as **.csv**…

.center[
![:scale 45%](images/Excel_csv_good_bad/excel_vs_csv.png)
]

---
## Choose filenames names wisely

.pull-left[
**Good!**
- rawDatasetAgo2017.csv
- co2_concentrations_QB.csv
- 01_figIntro.R]

.pull-right[
**Bad!**
- final.csv (*Uninformative!*)
- safnnejs.csv (*Random!*)
- 1-4.csv (*Avoid using only numbers!*)
- Dont.separate.names.with.dots.csv (*Because it may lead to reading file errors!*)]

---
## Choose variable names wisely

.pull-left[
**Good!**
  - Measurements
  - SpeciesNames
  - Site ]

.pull-right[
**Bad!**
 - a
 - 3
 - supercomplicatedverylongname]

---
## Things to consider with your data input

.pull-left[
* No text in numeric columns
* Do not include spaces!
* NA (not available) is allowed
* Blank entries will automatically be replaced with NA
* Name your variables informatively
* Look for typos]

.pull-right[
* Avoid numeric values for data that are not numeric
* Use CONSISTENT formats for dates, numbers, metrics, etc.
* No notes in .CSV files
* No additional headings
* No merged cells]

---
## Bad data examples:

.pull-left[
![:scale 120%](images/Excel_csv_good_bad/excel_values.png)]

.pull-right[
![:scale 120%](images/Excel_csv_good_bad/excel_notes.png)
]

---
## Horrible data examples:

Bad data preservation practices

.center[
![:scale 85%](images/HorribleData.png)]

---
## Preparing data in R

It is possible to do all your data preparation work within R

* Saves time for large datasets
* Keeps original data intact
* Keeps track of the manipulation and transformation you did
* Can switch between long- and wide-formats easily (more on this in future workshops)

* Useful tips on data preparation can be found here: https://www.zoology.ubc.ca/~schluter/R/data/

---
class: inverse, center, middle

# Writing a Script

---
## R Scripts

* What is this?
   - A text file that contains all of the commands that you will use

* Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!
  - Just highlight text and click "Run" or press Command + Enter (Apple) or Ctrl + Enter (Windows and Linux)

---
## Create an R script

.center[
![:scale 80%](images/1_Create_an_R_script.png)
]

---
## Create an R script

.center[
![:scale 90%](images/2_Create_an_R_script2.Mod_arrow.png)
]

---
## Commands and comments
The `#` symbol in a script tells R to ignore anything remaining on this line of the script when running commands

```r
# This is a comment not a command
```

---
## Why should I use `#`?

* Annotating someone’s script is a good way to learn
* Remember what you did
* Tell collaborators what you did
* Good step towards reproducible science
* Be as detailed as possible

---
## Header

It is recommended that you start your script with a header using comments:

.center[
![:scale 75%](images/3_Header.Mod_arrow.png)
]

---
## Section Headings

You can make a section heading in R Studio four `#` signs

```r
# You can coment using this, but look below how to create section headers:

## Heading name ####
```

This allows you to move quickly between sections and hide sections

.center[
![:scale 70%](images/4_Section_headings_Mod_arrow 2.png)
]

---
## Housekeeping

The first command at the top of all scripts may be: `rm(list = ls())`.
This command:

* Clears R memory
* Prevents errors related to the use of older data

---
## Clearing the workspace

```r
# Clear the R workspace
rm(list = ls())
?rm
?ls
```

---
## Housekeeping

Demo – Try to add some test data to R and then see how `rm(list = ls())` removes it

```r
A<-"Test" # Put some data in workspace
A <- "Test" # Note that you can use spaces!
A = "Test" # <- or = can be used equally

#Note that it is best practice to use "<-" for assigment instead of "="
A
# [1] "Test"
rm(list=ls())
A
# Error in eval(expr, envir, enclos): object 'A' not found
```

---
## Remember

* R is ready for commands when you see the chevron `>` in the console. If you don't see it, press ESC
* R is case sensitive!

---
class: inverse, center, middle

# Loading, exploring and saving data

---
## Download today's data

You can download the data and the script for this workshop from the wiki:

.center[http://qcbs.ca/wiki/r/workshop2]

Save the files in the folder where your created your R project.

---
## Working Directory

Tells R where your scripts and data are. You need to set the right working directory to load a data file. Type `getwd()` in the console to see your working directory:

```r
getwd()
```

If this is not the directory you would like to work with, you can set your own using:

```r
setwd()
```

Within the parenthesis, you will write the *extension of the directory* you would like to work with. See the example below:

```r
setwd("C:/Users/Luigi/Documents") # We use slashes "/", and not backslashes "\"
```

---
## Display contents of the directory

You can display contents of the working directory using `dir()`:

```r
dir()
# [1] "co2_broken.csv"         "co2_good.csv"          
# [3] "images"                 "qcbsR-fonts.css"       
# [5] "qcbsR-header.html"      "qcbsR-macros.js"       
# [7] "qcbsR.css"              "script_workshop02-en.R"
# [9] "workshop02-en.Rmd"
```

It helps to:

* Check that the file you plan to open is present in the folder that R is currently working in
* Check for correct spelling (e.g. "co2_good.csv" instead of "CO2_good.csv")

---
## Importing Data

Import data into R using `read.csv`:

```r
CO2 <- read.csv("co2_good.csv", header=TRUE)
```

Note that this will:
* Create a new object in R called `CO2`;
* The filename is written within quotation marks ('file' or "file");
* If you needed to fetch the filename from another directory that is not yours, you would have to write the full extension (e.g., `"C:/Users/Mario/Downloads/co2_good.csv"`)
* `header=TRUE` tells R that the first line of your dataset contains column names

---
## Importing Data

.alert[Recall] to find out what arguments the function requires, use help “?”
I

```r
?read.csv
```

.alert[Note] that if your operating system or .CSV editor (e.g. Excel) is in French, you may have to use `read.csv2`

```r
?read.csv2
```

---
## Importing Data

.center[
![](images/5_Importing_data_Mod_arrow.png)
]

Notice that RStudio now provides information on the CO2 data in your **workspace**.

The **workspace** refers to all the objects that you create during an R session.

---
## Looking at Data

R Command  					| Action
----------------- | -------------
`CO2`					  		| look at the whole dataframe
`head(CO2)`		  		| look at the first few rows
`tail(CO2)`         | look at the last few rows
`names(CO2)`				| names of the columns in the dataframe
`attributes(CO2)`		| attributes of the dataframe
`dim(CO2)`          | dimensions of the dataframe
`ncol(CO2)`					| number of columns
`nrow(CO2)`					| number of rows

---
## Inspecting the data

```r
str(CO2)
# 'data.frame':	84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
#  $ conc     : int  95 175 250 350 500 675 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
```
This shows the structure of the dataframe.
Very useful to check data type (mode) of all columns to make sure R loaded data properly.

.small[Note: the CO2 dataset includes repeated measurements of CO2 uptake from 6 plants from Quebec and 6 plants from Mississippi at several levels of CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.]

---
## Inspecting the data

**Common problems:**

* Factors loaded as text (character) or vice versa;
* Factors includes too many (and unexpected) levels because of typos
* Data (integer or numeric) is loaded as character because of typos (*e.g.* a space or a "," instead of a "." to separate decimal numbers)

---
## Looking at Data

.pull-left[

.pull-left[
`data()`

`head()`
]
.pull-right[

`tail()`

`str()`
]
]

.pull-right[
.pull-left[
`names()`

`attributes()`
]
.pull-right[

`dim()`

`ncol()`

`nrow()`
]
]

Load the data with:

```r
CO2 <- read.csv("co2_good.csv", header = FALSE)
```

Check data types with `str()` again. What is wrong here?
.comment[Do not forget to re-load data with* `header = T` *afterwards]

---
## Reminder from Workshop 1: Accessing data

Imagine a data frame called `mydata`:

.center[
![:scale 45%](images/Table_Reminder_from_Workshop_1_Accessing_data.png)
]

```r
mydata[1,] # Extracts the first row
mydata[2,3] # Extracts the content of row 2 / column 3
mydata[,1] # Extracts the first column
mydata[,1][2] # [...] can be also be used recursively
mydata$Variable1 # Also extracts the first column
```

---
## Renaming variables

Variables names can be changed within R.

```r
# First lets make a copy of the dataset to play with
CO2copy <- CO2
# names() gives you the names of the variables present in the data frame
names(CO2copy)
# [1] "Plant"     "Type"      "Treatment" "conc"      "uptake"

# Changing from English to French names (make sure you have the same levels!)
names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc", "absortion")
names(CO2copy)
# [1] "Plante"     "Categorie"  "Traitement" "conc"       "absortion"
```

---
## Creating new variables

Variables and strings can be concatenated together.
The function `paste()` is very useful for concatenating.
See `?paste` and `?paste0`.

```r
# Let's create an unique id for our samples:
# Don't forget to use "" for strings
CO2copy$uniqueID <- paste0(CO2copy$Plante,
                           "_",CO2copy$Categorie,
                           "_", CO2copy$Traitement)

# observe the results
head(CO2copy$uniqueID)
# [1] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
# [4] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
```

---
## Creating new variables

Creating new variables works for numbers and mathematical operations as well!

```r
# Let's standardize our variable "absortion" to relative values
CO2copy$absortionRel <- CO2copy$absortion/max(CO2copy$absortion)

# Observe the results
head(CO2copy$absortionRel)
```

---
## Subsetting data

There are many ways to subset a data frame

```r
# Let's keep working with our CO2copy data frame

# Select only "Plante" and "absortionRel" columns. (Don't forget the ","!)
CO2copy[,c("Plante", "absortionRel")]

# Subset data frame from rows from 1 to 50
CO2copy[1:50,]
```

---
## Subsetting data

```r
# Select observations matching only the nonchilled Traitement.
CO2copy[CO2copy$Traitement == "nonchilled",]

# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$absortion >= 20, ]

# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$Traitement  == "nonchilled" & CO2copy$absortion >= 20, ]

# We are done playing with the Dataset copy, lets erase it.
CO2copy <- NULL
```

Go [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html) to check all the logical operators you can use

---
## Data exploration

A good way to start your data exploration is to look at some basic statistics of your dataset.

Use the `summary()` function to do that!

```r
summary(CO2)
```

This is also useful to spot some errors you might have missed!

---
## Data exploration

You can also use other functions to calculate basic statistics on parts of your data frame. Let's try the `mean()`, `sd()` and `hist()` functions:

```r
# Calculate the mean and the standard deviation of the CO2 concentration:
# Assign them to new variables
meanConc <- mean(CO2$conc)
sdConc <- sd(CO2$conc)
# print() prints any given value to the R console
print(paste("the mean of concentration is:", meanConc))
print(paste("the standard deviation of concentration is:", sdConc))
```

```r
# Let's plot a histogram to explore the distribution of "uptake"
hist(CO2$uptake)

# Increasing the number of bins to observe better the pattern
hist(CO2$uptake, breaks = 40)
```

---
## Data exploration

Use `apply()` to calculate the means of the last two columns of the data frame (*i.e.* the columns that contain continuous data).

```r
?apply # Let's see how apply works!
```

```r
apply(CO2[,4:5],
      MARGIN = 2,
      FUN = mean)
#     conc   uptake 
# 435.0000  27.2131
```

---
## Saving your Workspace

```r
# Saving an R workspace file that stores all your objects
save.image(file="co2_project_Data.RData")

# Clear your memory
rm(list = ls())

# Reload your data
load("co2_project_Data.RData")
head(CO2) # Looking good!
```

```
#   Plant   Type  Treatment conc uptake
# 1   Qn1 Quebec nonchilled   95   16.0
# 2   Qn1 Quebec nonchilled  175   30.4
# 3   Qn1 Quebec nonchilled  250   34.8
# 4   Qn1 Quebec nonchilled  350   37.2
# 5   Qn1 Quebec nonchilled  500   35.3
# 6   Qn1 Quebec nonchilled  675   39.2
```

---
## Exporting data

R disposes of `write` functions that allow you to write objects directly to files in your computer. Let us use the `write.csv` function to save our CO2 data into a .CSV file:

```r
write.csv(CO2, file = "co2_new.csv")
```

Note that our arguments are both:

`CO2` ![:faic](arrow-right) Object (name)

`"co2_new.csv"` ![:faic](arrow-right) File to write (name)

---
## Challenge: Use your data ![:cube]()

* Try to load, explore, plot, and save your own data in R;*
* If it does not load properly, try to make the appropriate changes;
* When you are finished, try opening your exported data in Excel, Numbers, Notepad or TextEdit.

.comment[*If you don’t have your own data, work with your neighbour]
.comment[*Remember to clean your workspace]

---
## Fixing a broken data frame

Getting your data working properly can be tougher than you think!

Sometimes, one may find **compatibility issues**.

For example, sharing data from an Apple computer to Windows, or between computers set up in different continents can lead to incompatible files (e.g. different decimal separators).

Let's practice how to solve these common errors.

---
## Harder Challenge![:cube]()

Read the file `co2_broken.csv`

```r
CO2 <- read.csv("co2_broken.csv")
head(CO2)
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2                      Plant\tType\tTreatment\tconc\tuptake
# 3                           Qn1\tQuebec\tnonchilled\t95\t16
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4
# 5           Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6                        Qn1\tQuebec\tnonchilled\t350\t37.2
#   due.to.excessive  X X.1 X.2 X.3
# 1               NA NA  NA  NA  NA
# 2               NA NA  NA  NA  NA
# 3               NA NA  NA  NA  NA
# 4               NA NA  NA  NA  NA
# 5               NA NA  NA  NA  NA
# 6               NA NA  NA  NA  NA
```

---
## Harder Challenge ![:cube]()

* This is probably what the data you are using looks like;
* You can fix the data frame in R or other software;
* Please give it a try before looking at the script provided;
* Work with your neighbours and have FUN!.

```r
CO2[1:4,]
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2                      Plant\tType\tTreatment\tconc\tuptake
# 3                           Qn1\tQuebec\tnonchilled\t95\t16
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4
#   due.to.excessive  X X.1 X.2 X.3
# 1               NA NA  NA  NA  NA
# 2               NA NA  NA  NA  NA
# 3               NA NA  NA  NA  NA
# 4               NA NA  NA  NA  NA
```

---
## Fixing co2_broken ![:cube]()

Some useful functions:

* `?read.csv` - look at some of the options for how to load a .csv
* `head()` - first few rows
* `str()` - structure of data
* `class()` - class of the object
* `unique()` - unique observations
* `levels()` - levels of a factor
* `which()` - ask a question to your data frame
* `droplevels()` - get rid of undesired levels after subsetting factors

.alert[HINT] There are 4 problems!

---
## Broken data

.alert[ERROR 1] The data appears to be lumped into one column

```r
head(CO2)
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2                      Plant\tType\tTreatment\tconc\tuptake
# 3                           Qn1\tQuebec\tnonchilled\t95\t16
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4
# 5           Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6                        Qn1\tQuebec\tnonchilled\t350\t37.2
#   due.to.excessive  X X.1 X.2 X.3
# 1               NA NA  NA  NA  NA
# 2               NA NA  NA  NA  NA
# 3               NA NA  NA  NA  NA
# 4               NA NA  NA  NA  NA
# 5               NA NA  NA  NA  NA
# 6               NA NA  NA  NA  NA
```

---
## Broken data

.alert[ERROR 1] - Solution

* Re-import the data, but specify the separation among entries
* The sep argument tells R what character separates the values on each line of the file
* Here, "TAB" was used instead of ","

```r
CO2 <- read.csv("co2_broken.csv",sep = "")
```

---
## Broken data

.alert[ERROR 2] The data does not start until the third line of the file, so you end up with notes on the file as the headings.

```r
head(CO2)
#     NOTE.     It       rain        a               lot    in. Quebec
# 1 falling     on         my notebook          numerous values  can't
# 2   Plant   Type  Treatment     conc            uptake              
# 3     Qn1 Quebec nonchilled       95                16              
# 4     Qn1 Quebec nonchilled      175              30.4              
# 5     Qn1 Quebec nonchilled      250 cannot_read_notes              
# 6     Qn1 Quebec nonchilled      350              37.2              
#   during sampling.      due to excessive X....
# 1     be      read rain,,,, NA        NA    NA
# 2                           NA        NA    NA
# 3                           NA        NA    NA
# 4                           NA        NA    NA
# 5                           NA        NA    NA
# 6                           NA        NA    NA
```

---
## Broken data

.alert[ERROR 2] - Solution

Skip two lines when loading the file using the "skip" argument:

```r
CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2)
head(CO2)
#   Plant   Type  Treatment              conc            uptake
# 1   Qn1 Quebec nonchilled                95                16
# 2   Qn1 Quebec nonchilled               175              30.4
# 3   Qn1 Quebec nonchilled               250 cannot_read_notes
# 4   Qn1 Quebec nonchilled               350              37.2
# 5   Qn1 Quebec nonchilled               500              35.3
# 6   Qn1 Quebec nonchilled cannot_read_notes              39.2
```

---
## Broken data

.alert[ERROR 3]

`conc` and `uptake` variables are considered factors instead of numbers, because there are comments in the numeric columns

```r
str(CO2)
# 'data.frame':	84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
#  $ conc     : Factor w/ 8 levels "1000","175","250",..: 7 2 3 4 5 8 1 7 2 3 ...
#  $ uptake   : Factor w/ 77 levels "10.5","10.6",..: 15 39 76 54 50 61 63 9 32 53 ...
```

* Due to missing values entered as "cannot_read_notes" and "na"
* Recall that R only recognizes "NA" (capital)

---
## `?read.csv`

![:scale 80%](images/read_table_help1.png)

![:scale 80%](images/read_table_help2.png)

---
## Broken data

.alert[ERROR 3] - Solution

Tell R that all of NA, "na", and "cannot_read_notes" should be considered NA. Then because all other values in those columns are numbers, `conc` and `uptake` will be loaded as numeric/integer.

```r
CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2,
                na.strings = c("NA","na","cannot_read_notes"))
str(CO2)
# 'data.frame':	84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
#  $ conc     : int  95 175 250 350 500 NA 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
```

---
## Broken data

.alert[ERROR 4]

There are only 2 treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.

```r
str(CO2)
```

```r
levels(CO2$Treatment)
# [1] "chiled"     "chilled"    "nnchilled"  "nonchilled"
unique(CO2$Treatment)
# [1] nonchilled nnchilled  chilled    chiled    
# Levels: chiled chilled nnchilled nonchilled
```

---
## Broken data

.alert[ERROR 4] - Solution

```r
# Identify all rows that contain "nnchilled" and replace with "nonchilled"
CO2$Treatment[CO2$Treatment=="nnchilled"] <- "nonchilled"

# Identify all rows that contain "chiled" and replace with "chilled"
CO2$Treatment[CO2$Treatment=="chiled"] <- "chilled"

# Drop unused levels from factor
CO2 <- droplevels(CO2)
str(CO2)
# 'data.frame':	84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
#  $ conc     : int  95 175 250 350 500 NA 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
```

Fixed!

---
class: inverse, center, middle

# Learn to manipulate data with tidyr, dyplr, maggritr

---
## Using `tidyr` to reshape data frames

```r
library(tidyr)
```

---
## Data formats

.pull-left[
**Wide format**

```
#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55
```
]

.pull-right[
**Long format**

```
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
```
]

---
## long vs wide format

.pull-left[
**Wide** data format has a separate column for each variable or each factor in your study]

.pull-right[
**Long** data format has a column for possible variables and a column for the values of those variables]

------------

Wide data frame can be used for some basic plotting in `ggplot2`, but more complex plots require long format (example to come)

`dplyr`, `lm()`, `glm()`, `gam()` all require long data format

---
## Tidying your data

Tidying allows you to manipulate the structure of your data while preserving all original information

`gather()` - convert from wide to long format

`spread()` - convert from long to wide format

.center[
![scale:90%](images/gather-spread.png)]

---
## `tidyr` installation

```r
install.packages("tidyr")
library(tidyr)
```

---
## `gather` columns into rows

`gather(data, key, value, ...)`

<span style="font-size: 21px;">
  - `data` A data frame (e.g. `wide`)   
  - `key` name of the new column containing variable names (e.g. `Measurement`)  
  - `value` name of the new column containing variable values (e.g. `Value`)   
  - `...` name or numeric index of the columns we wish to gather (e.g. `DBH`, `Height`)
</span>

---
## `gather` columns into rows

```r
wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                   DBH = c(12, 20, 13), Height = c(56, 85, 55))
wide
#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55
```

```r
long = gather(wide, Measurement, Value, DBH, Height)
long
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
```

---
## `spread` rows into columns

`spread(data, key, value)`

<span style="font-size: 21px;">
  - `data` A data frame (e.g. `long`)   
  - `key` Name of the column containing variable names (e.g. `Measurement`)  
  - `value` Name of the column containing variable values (e.g. `Value`)   
</span>

---
## `spread` rows into columns

```r
long
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
wide2 = spread(long, Measurement, Value)
wide2
#   Species DBH Height
# 1     Ash  13     55
# 2     Elm  20     85
# 3     Oak  12     56
```

---
## `separate` columns

`separate()` splits a columns by a character string separator

`separate(data, col, into, sep)`

- `data` A data frame (e.g. `long`)   
  - `col` Name of the column you wish to separate
  - `into` Names of new variables to create
  - `sep` Character which indicates where to separate

---
## Using `separate()` example

Create a fictional dataset about fish and plankton

```r
set.seed(8)
messy <- data.frame(id = 1:4,
                    trt = sample(rep(c('control', 'farm'), each = 2)),
                    zooplankton.T1 = runif(4),
                    fish.T1 = runif(4),
                    zooplankton.T2 = runif(4),
                    fish.T2 = runif(4))
messy
#   id     trt zooplankton.T1    fish.T1 zooplankton.T2   fish.T2
# 1  1    farm      0.7189275 0.64449114    0.544962116 0.2644589
# 2  2    farm      0.2908734 0.45704489    0.138224346 0.2765322
# 3  3 control      0.9322698 0.08930101    0.927812252 0.5211070
# 4  4 control      0.7691470 0.43239137    0.001301721 0.2236889
```

---
## Using `separate()` example

First convert the messy data frame from wide to long format

```r
messy.long <- gather(messy, taxa, count, -id, -trt)
head(messy.long)
#   id     trt           taxa     count
# 1  1    farm zooplankton.T1 0.7189275
# 2  2    farm zooplankton.T1 0.2908734
# 3  3 control zooplankton.T1 0.9322698
# 4  4 control zooplankton.T1 0.7691470
# 5  1    farm        fish.T1 0.6444911
# 6  2    farm        fish.T1 0.4570449
```

---
## Using `separate()` example

Then we want to split the 2 sampling time (T1 and T2).

```r
messy.long.sep <- separate(messy.long, taxa,
                           into = c("species", "time"), sep = "\\.")
head(messy.long.sep)
#   id     trt     species time     count
# 1  1    farm zooplankton   T1 0.7189275
# 2  2    farm zooplankton   T1 0.2908734
# 3  3 control zooplankton   T1 0.9322698
# 4  4 control zooplankton   T1 0.7691470
# 5  1    farm        fish   T1 0.6444911
# 6  2    farm        fish   T1 0.4570449
```

<span style="font-size: 18px;">
The argument `sep = "\\."` tells R to splits the character string around the period (.). We cannot type directly `"."` because it is a regular expression that matches any single character.
</span>

---
## Recap of `tidyr`

A package that reshapes the layout of data sets.

Converting from wide to long format using `gather()`

Converting from long format to wide format using `spread()`

Split and merge columns with `unite()` and `separate()`

[Data Wrangling with dplyr and tidyr Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

---
## Challenge with `tidyr` ![:cube]()

1. Using the airquality dataset, `gather` all the columns (except Month and Day) into rows.

2. Then `spread` the resulting data frame to return to the original data format.

```r
?airquality

data(airquality)
```

---
## Solution

1. Using the airquality dataset, `gather` all the columns (except Month and Day) into rows.

```r
air.long <- gather(airquality, variable, value, -Month, -Day)
head(air.long)
#   Month Day variable value
# 1     5   1    Ozone    41
# 2     5   2    Ozone    36
# 3     5   3    Ozone    12
# 4     5   4    Ozone    18
# 5     5   5    Ozone    NA
# 6     5   6    Ozone    28
```
<span style="font-size: 18px;">
Note that the syntax used here indicates that we wish to gather ALL the columns exept *Month* and *Day*. It is equivalent to: `gather(airquality, value, Ozone, Solar.R, Temp, Wind)`
</span>

---
## Solution

2. Then `spread` the resulting data frame to return to the original data format.

```r
air.wide <- spread(air.long, variable, value)
head(air.wide)
#   Month Day Ozone Solar.R Temp Wind
# 1     5   1    41     190   67  7.4
# 2     5   2    36     118   72  8.0
# 3     5   3    12     149   74 12.6
# 4     5   4    18     313   62 11.5
# 5     5   5    NA      NA   56 14.3
# 6     5   6    28      NA   66 14.9
```

---
## Data manipulation with `dplyr`

---
## Intro to `dplyr`

- Package that contains a set of functions (or “verbs”) for data manipulation such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data;
- Easy and intuitive functions
- Fast and efficient
- Can interface with external databases and translate your R code into SQL queries

Some corresponding R base functions:
`split()`, `subset()`, `apply()`, `sapply()`, `lapply()`, `tapply()` and `aggregate()`

---
## Intro to `dplyr`

```r
install.packages("dplyr")
library(dplyr)
```

---
## Basic functions in `dplyr`

These 4 core functions tackle the most common manipulations when working with data frames

* `select()`: select columns from a data frame
  * `filter()`: filter rows according to defined criteria
  * `arrange()`: re-order data based on criteria (e.g. ascending, descending)
  * `mutate()`: create or transform values in a column

---
## `select` columns

`select(data, ...)`

* `...`  Can be column names or positions or complex expressions separated by commas

`select(data, column1, column2)` select columns 1 and 2
  `select(data, c(2:4,6))` select columns 2 to 4 and 6
  `select(data, -column1)` select all columns except column 1
  `select(data, start_with(x.))` select all columns that start with "x."

---
## `select` columns

---
## `select` columns

Example: suppose we are only interested in the variation of Ozone over time within the airquality dataset

```r
ozone <- select(airquality, Ozone, Month, Day)
head(ozone)
#   Ozone Month Day
# 1    41     5   1
# 2    36     5   2
# 3    12     5   3
# 4    18     5   4
# 5    NA     5   5
# 6    28     5   6
```

---
## `filter` rows

Extract a subset of rows that meet one or more specific conditions

`filter(dataframe, logical statement 1, logical statement 2, ...)`

---
## `filter` rows

Example: we are interested in analyses that focus on the month of August during high temperature events

```r
august <- filter(airquality, Month == 8, Temp >= 90)
# same as: filter(airquality, Month == 8 & Temp >= 90)
head(august)
#   Ozone Solar.R Wind Temp Month Day
# 1    89     229 10.3   90     8   8
# 2   110     207  8.0   90     8   9
# 3    NA     222  8.6   92     8  10
# 4    76     203  9.7   97     8  28
# 5   118     225  2.3   94     8  29
# 6    84     237  6.3   96     8  30
```

---
## Sort rows with `arrange`

Re-order rows by a particular column, by default in ascending order

Use `desc()` for descending order.

`arrange(data, variable1, desc(variable2), ...)`

---
## Sort rows with `arrange`

Example:
1. Let's use the following code to create a scrambled version of the airquality dataset

```r
air_mess <- sample_frac(airquality, 1)
head(air_mess)
#   Ozone Solar.R Wind Temp Month Day
# 1    23     115  7.4   76     8  18
# 2    28     273 11.5   82     8  13
# 3     8      19 20.1   61     5   9
# 4   135     269  4.1   84     7   1
# 5    23     299  8.6   65     5   7
# 6    30     322 11.5   68     5  19
```

---
## Sort rows with `arrange`

Example:
2. Now let's arrange the data frame back into chronological order, sorting by Month then Day

```r
air_chron <- arrange(air_mess, Month, Day)
head(air_chron)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6
```

Try :  `arrange(air_mess, Day, Month)` and see the difference.

---
## Create new columns using `mutate`

Compute and add new columns

`mutate(data, newVar1 = expression1, newVar2 = expression2, ...)`

---
## Create new columns using `mutate`

Example: we want to convert the temperature variable form degrees Fahrenheit to degrees Celsius

```r
airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9))
head(airquality_C)
#   Ozone Solar.R Wind Temp Month Day   Temp_C
# 1    41     190  7.4   67     5   1 19.44444
# 2    36     118  8.0   72     5   2 22.22222
# 3    12     149 12.6   74     5   3 23.33333
# 4    18     313 11.5   62     5   4 16.66667
# 5    NA      NA 14.3   56     5   5 13.33333
# 6    28      NA 14.9   66     5   6 18.88889
```

---
## `magrittr`

Usually data manipulation require multiple steps, the magrittr package offers a pipe operator `%>%` which allows us to link multiple operations

---
## `magrittr`

```r
install.packages("magrittr")
require(magrittr)
```

---
## `magrittr`

Suppose we want to analyse only the month of June, then convert the temperature variable to degrees Celsius. We can create the required data frame by combining 2 dplyr verbs we learned

```r
june_C <- mutate(filter(airquality, Month == 6),
                 Temp_C = (Temp-32)*(5/9))
```

As we add more operations, wrapping functions one inside the other becomes increasingly illegible. But, step by step would be redundant and write a lot of objects to the workspace.

---
## `magrittr`

Alternatively, we can use maggritr's pipe operator to link these successive operations

```r
june_C <- airquality %>%
    filter(Month == 6) %>%
    mutate(Temp_C = (Temp-32)*(5/9))
```

Advantages :

* less redundant code
  * easy to read and write because functions are executed in order

---
## `dplyr::group_by` and `summarise`

The `dplyr` verbs become especially powerful when they are are combined using the pipe operator `%>%`.
The following `dplyr` functions allow us to split our data frame into groups on which we can perform operations individually

`group_by()` : group data frame by a factor for downstream operations (usually summarise)

`summarise()` : summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. `min()`, `max()`, `mean()`, etc…)

---
## `dplyr` - Split-Apply-Combine

The `group_by` function is key to the Split-Apply-Combine strategy

---
## `dplyr` - Split-Apply-Combine

---
## `dplyr` - Split-Apply-Combine

Example: we are interested in the mean temperature and standard deviation within each month if the airquality dataset

```r
month_sum <- airquality %>%
      group_by(Month) %>%
      summarise(mean_temp = mean(Temp),
                sd_temp = sd(Temp))
month_sum
# # A tibble: 5 x 3
#   Month mean_temp sd_temp
#   <int>     <dbl>   <dbl>
# 1     5      65.5    6.85
# 2     6      79.1    6.60
# 3     7      83.9    4.32
# 4     8      84.0    6.59
# 5     9      76.9    8.36
```

---
## Challenge with `dplyr`and `magrittr` ![:cube]()

Using the `ChickWeight` dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study.

Employ `dplyr` verbs and the `%>%` operator.

---
## Solution

1. Use group_by() to divide the dataset by "Chick"
2. Use summarise() to calculate the weight gain within each group

```r
weight_diff <- ChickWeight %>%
      group_by(Chick) %>%
      summarise(weight_diff = max(weight) - min(weight))
head(weight_diff)
# # A tibble: 6 x 2
#   Chick weight_diff
#   <ord>       <dbl>
# 1 18              4
# 2 16             16
# 3 15             27
# 4 13             55
# 5 9              58
# 6 20             76
```

---
## Ninja challenge

Using the `ChickWeight` dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study.

Employ `dplyr` verbs and the `%>%` operator.

(Hint: `first()` and `last()` may be useful here.)

---
## Ninja solution

```r
diet_summ <- ChickWeight %>%
      group_by(Diet, Chick) %>%
      summarise(weight_gain = last(weight) - first(weight)) %>%
      group_by(Diet) %>%
      summarise(mean_gain = mean(weight_gain))
diet_summ
# # A tibble: 4 x 2
#   Diet  mean_gain
#   <fct>     <dbl>
# 1 1          115.
# 2 2          174 
# 3 3          230.
# 4 4          188.
```

---
## More on data manipulation

[Learn more on dplyr](http://r4ds.had.co.nz/transform.html)

[dplyr and tidyr cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

---
class: inverse, center, bottom

# Thank you for attending!

![:scale 50%](images/qcbs_logo.png)