# Chapter 7 Exploring the dataset

We will use two main data sets in the first part of this workshop.

They come from Verneaux’s PhD thesis (1973), in which he proposed using fish species to characterize ecological zones along European rivers and streams.

He collected data at 30 localities along the Doubs river, which runs near the France-Switzerland border, in the Jura Mountains.

He showed that fish communities were biological indicators of these water bodies.

The data are split into three matrices:

1. The abundance of 27 fish species across the communities (DoubsSpe.csv; hereafter, the spe object);
2. The environmental variables recorded at each site (DoubsEnv.csv; hereafter, the env object); and
3. The geographical coordinates of each site.

Verneaux, J. (1973) Cours d’eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse d’état, Besançon. 1–257.

## 7.1 Doubs river fish communities

We can load the data from the data/ directory of this workshop:

spe <- read.csv("data/doubsspe.csv", row.names = 1)

env <- read.csv("data/doubsenv.csv", row.names = 1)
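
The `row.names = 1` argument tells `read.csv()` to use the first column of the file as row names rather than as a variable. The minimal sketch below illustrates this on a small temporary file. The objects `toy`, `tmp`, `with_id` and `no_id` are made up for this illustration and are not part of the workshop data.

```r
# Hypothetical toy table: three sites, two species (not the real Doubs files)
toy <- data.frame(site = 1:3, CHA = c(0, 0, 0), TRU = c(3, 5, 5))

# Write it to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Without row.names, the site identifier is read as an ordinary column
with_id <- read.csv(tmp)
dim(with_id)

# With row.names = 1, the first column becomes the row names
no_id <- read.csv(tmp, row.names = 1)
dim(no_id)
rownames(no_id)

unlink(tmp)  # remove the temporary file
```

Reading the site identifiers as row names keeps them out of the species columns, so every remaining column holds abundance values only.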

The data can also be retrieved from the ade4 package:

library(ade4)
data(doubs)

spe <- doubs$fish
env <- doubs$env

Alternatively, from the codep package:

library(codep)
data(Doubs)

spe <- Doubs.fish
env <- Doubs.env

We can then explore the objects containing our newly loaded data.

Let us peek into the spe data:

head(spe)[, 1:8]
##   CHA TRU VAI LOC OMB BLA HOT TOX
## 1   0   3   0   0   0   0   0   0
## 2   0   5   4   3   0   0   0   0
## 3   0   5   5   5   0   0   0   0
## 4   0   4   5   5   0   0   0   0
## 5   0   2   3   2   0   0   0   0
## 6   0   3   4   5   0   0   0   0
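
To experiment without the data files, the rows printed above can be rebuilt by hand. The sketch below uses only the values shown in this output. The object name `spe_head` is introduced here for illustration, and it covers only these six sites and eight species, not the full dataset.

```r
# Rebuild the six rows and eight columns shown above (not the full spe object)
spe_head <- data.frame(
  CHA = c(0L, 0L, 0L, 0L, 0L, 0L),
  TRU = c(3L, 5L, 5L, 4L, 2L, 3L),
  VAI = c(0L, 4L, 5L, 5L, 3L, 4L),
  LOC = c(0L, 3L, 5L, 5L, 2L, 5L),
  OMB = c(0L, 0L, 0L, 0L, 0L, 0L),
  BLA = c(0L, 0L, 0L, 0L, 0L, 0L),
  HOT = c(0L, 0L, 0L, 0L, 0L, 0L),
  TOX = c(0L, 0L, 0L, 0L, 0L, 0L)
)

dim(spe_head)      # number of rows (sites) and columns (species)
names(spe_head)    # species codes
str(spe_head)      # type and first values of each column
summary(spe_head)  # summary statistics for each column
```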

We can also use the str() function, which we learned in Workshops 1 and 2:

str(spe)
## 'data.frame':    30 obs. of  27 variables:
##  $ CHA: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TRU: int  3 5 5 4 2 3 5 0 0 1 ...
##  $ VAI: int  0 4 5 5 3 4 4 0 1 4 ...
##  $ LOC: int  0 3 5 5 2 5 5 0 3 4 ...
##  $ OMB: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BLA: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HOT: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOX: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ VAN: int  0 0 0 0 5 1 1 0 0 2 ...
##  $ CHE: int  0 0 0 1 2 2 1 0 5 2 ...
##  $ BAR: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SPI: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GOU: int  0 0 0 1 2 1 0 0 0 1 ...
##  $ BRO: int  0 0 1 2 4 1 0 0 0 0 ...
##  $ PER: int  0 0 0 2 4 1 0 0 0 0 ...
##  $ BOU: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PSO: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ROT: int  0 0 0 0 2 0 0 0 0 0 ...
##  $ CAR: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TAN: int  0 0 0 1 3 2 0 0 1 0 ...
##  $ BCO: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PCH: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GRE: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GAR: int  0 0 0 0 5 1 0 0 4 0 ...
##  $ BBO: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ABL: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ANG: int  0 0 0 0 0 0 0 0 0 0 ...

You can also try some of these!

names(spe)    # names of objects
dim(spe)      # dimensions
str(spe)      # structure of objects
summary(spe)  # summary statistics
head(spe)     # first 6 rows

## 7.2 Doubs river environmental data

str(env)
## 'data.frame':    30 obs. of  11 variables:
##  $ das: num  0.3 2.2 10.2 18.5 21.5 32.4 36.8 49.1 70.5 99 ...
##  $ alt: int  934 932 914 854 849 846 841 792 752 617 ...
##  $ pen: num  48 3 3.7 3.2 2.3 3.2 6.6 2.5 1.2 9.9 ...
##  $ deb: num  0.84 1 1.8 2.53 2.64 2.86 4 1.3 4.8 10 ...
##  $ pH : num  7.9 8 8.3 8 8.1 7.9 8.1 8.1 8 7.7 ...
##  $ dur: int  45 40 52 72 84 60 88 94 90 82 ...
##  $ pho: num  0.01 0.02 0.05 0.1 0.38 0.2 0.07 0.2 0.3 0.06 ...
##  $ nit: num  0.2 0.2 0.22 0.21 0.52 0.15 0.15 0.41 0.82 0.75 ...
##  $ amm: num  0 0.1 0.05 0 0.2 0 0 0.12 0.12 0.01 ...
##  $ oxy: num  12.2 10.3 10.5 11 8 10.2 11.1 7 7.2 10 ...
##  $ dbo: num  2.7 1.9 3.5 1.3 6.2 5.3 2.2 8.1 5.2 4.3 ...

It contains the following variables:

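| Variable | Description |
|---|---|
| das | Distance from the source [km] |
| alt | Altitude [m a.s.l.] |
| pen | Slope [per thousand] |
| deb | Mean minimum discharge [m³ s⁻¹] |
| pH | pH of water |
| dur | Calcium concentration (hardness) [mg L⁻¹] |
| pho | Phosphate concentration [mg L⁻¹] |
| nit | Nitrate concentration [mg L⁻¹] |
| amm | Ammonium (NH₄⁺) concentration [mg L⁻¹] |
| oxy | Dissolved oxygen [mg L⁻¹] |
| dbo | Biological oxygen demand [mg L⁻¹] |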

You can also use summary() to obtain summary statistics for the variables in env:

summary(env)  # summary statistics
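
As a small illustration of what summary() returns, the sketch below rebuilds three of the variables from the first ten values listed in the str(env) output of this section. The object name `env_head` is introduced here for illustration. Because it holds only ten of the 30 sites, its statistics will differ from those of the full env object.

```r
# First ten values of three variables, copied from the str(env) output
# (a partial, hand-built subset; not the full env object)
env_head <- data.frame(
  das = c(0.3, 2.2, 10.2, 18.5, 21.5, 32.4, 36.8, 49.1, 70.5, 99),
  alt = c(934L, 932L, 914L, 854L, 849L, 846L, 841L, 792L, 752L, 617L),
  oxy = c(12.2, 10.3, 10.5, 11, 8, 10.2, 11.1, 7, 7.2, 10)
)

# On a data frame, summary() reports one block of statistics per column
summary(env_head)

# On a single numeric column, it returns the minimum, quartiles,
# mean and maximum of that variable
summary(env_head$alt)

# The extremes can also be obtained directly
range(env_head$alt)
```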