Chapter 7 Exploring the dataset
We will use two main datasets in the first part of this workshop.
They come from Verneaux’s PhD thesis (1973), in which he proposed using fish species to characterize ecological zones along European rivers and streams.
He collected data at 30 localities along the Doubs river, which runs near the France-Switzerland border, in the Jura Mountains.
He showed that fish communities were biological indicators of these water bodies.
The data are split into three matrices:
- The abundance of 27 fish species across the communities (DoubsSpe.csv, hereafter the spe object);
- The environmental variables recorded at each site (DoubsEnv.csv, hereafter the env object); and,
- The geographical coordinates of each site.
Verneaux, J. (1973) Cours d’eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse d’état, Besançon. 1–257.
7.1 Doubs river fish communities
You can download these datasets from r.qcbs.ca/workshops/r-workshop-09.
We can load the data from the data/ directory of this workshop:
spe <- read.csv("data/doubsspe.csv", row.names = 1)
env <- read.csv("data/doubsenv.csv", row.names = 1)
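The `row.names = 1` argument in the `read.csv()` calls tells R to use the first column of the file as row names (here, the site identifiers) rather than as a variable. The following is a minimal sketch of that behaviour on a small stand-in file; the `toy` data frame, the `site` column name and the `toy_spe.csv` file name are assumptions made for illustration and are not part of the workshop material.

```r
# Toy stand-in for a site-by-species table (illustrative values only)
toy <- data.frame(
  site = 1:3,          # hypothetical identifier column
  CHA  = c(0, 0, 0),
  TRU  = c(3, 5, 5),
  VAI  = c(0, 4, 5)
)

# Write the toy table to a temporary file (hypothetical file name)
toy_file <- file.path(tempdir(), "toy_spe.csv")
write.csv(toy, toy_file, row.names = FALSE)

# With row.names = 1, the first column becomes the row names
with_ids <- read.csv(toy_file, row.names = 1)

# Without it, the first column is kept as an ordinary variable
without_ids <- read.csv(toy_file)

dim(with_ids)        # sites x species only
dim(without_ids)     # one extra column for the identifier
rownames(with_ids)
```

Keeping identifiers out of the data columns matters because later steps usually treat every column of the species table as a species.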
The data can also be retrieved from the ade4 package:
library(ade4)
data(doubs)
spe <- doubs$fish
env <- doubs$env
Alternatively, from the codep package:
library(codep)
data(Doubs)
spe <- Doubs.fish
env <- Doubs.env
We can then explore the objects containing our newly loaded data.
Let us peek into the spe data:
head(spe)[, 1:8]
##   CHA TRU VAI LOC OMB BLA HOT TOX
## 1   0   3   0   0   0   0   0   0
## 2   0   5   4   3   0   0   0   0
## 3   0   5   5   5   0   0   0   0
## 4   0   4   5   5   0   0   0   0
## 5   0   2   3   2   0   0   0   0
## 6   0   3   4   5   0   0   0   0
We can also use the str() function, which we learned in Workshops 1 and 2:
str(spe)
## 'data.frame': 30 obs. of 27 variables:
## $ CHA: int 0 0 0 0 0 0 0 0 0 0 ...
## $ TRU: int 3 5 5 4 2 3 5 0 0 1 ...
## $ VAI: int 0 4 5 5 3 4 4 0 1 4 ...
## $ LOC: int 0 3 5 5 2 5 5 0 3 4 ...
## $ OMB: int 0 0 0 0 0 0 0 0 0 0 ...
## $ BLA: int 0 0 0 0 0 0 0 0 0 0 ...
## $ HOT: int 0 0 0 0 0 0 0 0 0 0 ...
## $ TOX: int 0 0 0 0 0 0 0 0 0 0 ...
## $ VAN: int 0 0 0 0 5 1 1 0 0 2 ...
## $ CHE: int 0 0 0 1 2 2 1 0 5 2 ...
## $ BAR: int 0 0 0 0 0 0 0 0 0 0 ...
## $ SPI: int 0 0 0 0 0 0 0 0 0 0 ...
## $ GOU: int 0 0 0 1 2 1 0 0 0 1 ...
## $ BRO: int 0 0 1 2 4 1 0 0 0 0 ...
## $ PER: int 0 0 0 2 4 1 0 0 0 0 ...
## $ BOU: int 0 0 0 0 0 0 0 0 0 0 ...
## $ PSO: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ROT: int 0 0 0 0 2 0 0 0 0 0 ...
## $ CAR: int 0 0 0 0 0 0 0 0 0 0 ...
## $ TAN: int 0 0 0 1 3 2 0 0 1 0 ...
## $ BCO: int 0 0 0 0 0 0 0 0 0 0 ...
## $ PCH: int 0 0 0 0 0 0 0 0 0 0 ...
## $ GRE: int 0 0 0 0 0 0 0 0 0 0 ...
## $ GAR: int 0 0 0 0 5 1 0 0 4 0 ...
## $ BBO: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ABL: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ANG: int 0 0 0 0 0 0 0 0 0 0 ...
You can also try some of these!
# Try some of these!
names(spe) # column (species) names
dim(spe) # dimensions
str(spe) # structure of objects
summary(spe) # summary statistics
head(spe) # first 6 rows
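Beyond these built-in summaries, a presence-absence view is often a quick way to get a feel for a community matrix. The sketch below rebuilds the six sites and eight species printed by `head(spe)[, 1:8]` in a data frame called `spe_toy` (a name introduced here for illustration), then counts species per site and sites per species. Because only 8 of the 27 species are included, the counts are partial and serve only to show the idea.

```r
# First six sites and first eight species, copied from the head(spe)[, 1:8] output
spe_toy <- data.frame(
  CHA = c(0, 0, 0, 0, 0, 0),
  TRU = c(3, 5, 5, 4, 2, 3),
  VAI = c(0, 4, 5, 5, 3, 4),
  LOC = c(0, 3, 5, 5, 2, 5),
  OMB = rep(0, 6),
  BLA = rep(0, 6),
  HOT = rep(0, 6),
  TOX = rep(0, 6)
)

# spe_toy > 0 is a logical presence-absence matrix
richness   <- rowSums(spe_toy > 0)   # number of species present at each site
occurrence <- colSums(spe_toy > 0)   # number of sites where each species occurs

richness
occurrence

# Species absent from all six sites in this subset
names(occurrence)[occurrence == 0]
```

The same three lines should work unchanged on the full `spe` object once it is loaded.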
7.2 Doubs river environmental data
str(env)
## 'data.frame': 30 obs. of 11 variables:
## $ das: num 0.3 2.2 10.2 18.5 21.5 32.4 36.8 49.1 70.5 99 ...
## $ alt: int 934 932 914 854 849 846 841 792 752 617 ...
## $ pen: num 48 3 3.7 3.2 2.3 3.2 6.6 2.5 1.2 9.9 ...
## $ deb: num 0.84 1 1.8 2.53 2.64 2.86 4 1.3 4.8 10 ...
## $ pH : num 7.9 8 8.3 8 8.1 7.9 8.1 8.1 8 7.7 ...
## $ dur: int 45 40 52 72 84 60 88 94 90 82 ...
## $ pho: num 0.01 0.02 0.05 0.1 0.38 0.2 0.07 0.2 0.3 0.06 ...
## $ nit: num 0.2 0.2 0.22 0.21 0.52 0.15 0.15 0.41 0.82 0.75 ...
## $ amm: num 0 0.1 0.05 0 0.2 0 0 0.12 0.12 0.01 ...
## $ oxy: num 12.2 10.3 10.5 11 8 10.2 11.1 7 7.2 10 ...
## $ dbo: num 2.7 1.9 3.5 1.3 6.2 5.3 2.2 8.1 5.2 4.3 ...
The env object contains the following variables:
Variable | Description |
---|---|
das | Distance from the source [km] |
alt | Altitude [m a.s.l.] |
pen | Slope [per thousand] |
deb | Mean min. discharge [m³ s⁻¹] |
pH | pH of water |
dur | Ca conc. (hardness) [mg L⁻¹] |
pho | Phosphate conc. [mg L⁻¹] |
nit | Nitrate conc. [mg L⁻¹] |
amm | NH₄⁺ conc. [mg L⁻¹] |
oxy | Diss. oxygen [mg L⁻¹] |
dbo | Biol. oxygen demand [mg L⁻¹] |
You can also use summary() to obtain summary statistics for the variables in env:
summary(env) # summary statistics
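One thing worth looking for in this output is how different the scales of the variables are, since they are measured in different units. As a small sketch, the block below rebuilds three of the variables from the first ten values printed by `str(env)` (the `env_toy` name is introduced here for illustration) and compares their ranges; these are only the first ten sites, not the full dataset.

```r
# First ten values of three variables, copied from the str(env) output
env_toy <- data.frame(
  das = c(0.3, 2.2, 10.2, 18.5, 21.5, 32.4, 36.8, 49.1, 70.5, 99),
  alt = c(934, 932, 914, 854, 849, 846, 841, 792, 752, 617),
  oxy = c(12.2, 10.3, 10.5, 11, 8, 10.2, 11.1, 7, 7.2, 10)
)

# Per-variable summary statistics
summary(env_toy)

# Minimum (row 1) and maximum (row 2) of each variable
ranges <- sapply(env_toy, range)
ranges
```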