Data exploration - Stratified random surveys (StRS) of reef fish in the U.S. Pacific Islands

In this notebook we explore the NOAA Pacific Islands Fisheries Science Center, Ecosystem Sciences Division, National Coral Reef Monitoring Program: Stratified random surveys (StRS) of reef fish in the U.S. Pacific Islands dataset.

Data preparation

First load some packages, get the world polygons for creating maps, and initialize a cache.

library(robis)
library(dplyr)
library(ggplot2)
library(sf)
library(rnaturalearth)
library(memoise)
library(cachem)

world <- ne_countries(scale = "medium", returnclass = "sf")
cache <- cachem::cache_disk(dir = "cache", max_size = Inf)

Fetching data from the OBIS API

Fetch occurrence data from OBIS and make sure to include the MeasurementOrFact records. This step can take a while, so I’m using a local cache to speed up subsequent runs.

occ <- memoise(occurrence, cache = cache)(datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5", mof = TRUE)

Cleanup occurrence

individualCounts are provided as character, so first convert to numeric:

occ <- occ %>%
  mutate(individualCount = as.numeric(individualCount))

Extract measurements

The MeasurementOrFact records are nested inside the mof column, use unnest_extension() to extract them to a dedicated data frame.

mof <- unnest_extension(occ, "MeasurementOrFact", fields = c("eventID", "scientificName", "year", "individualCount", "decimalLongitude", "decimalLatitude", "island", "islandGroup", "stateProvince"))

Cleanup measurements

The consumer types are a bit cryptic, so let’s translate them to something human readable:

mof <- mof %>%
  mutate(measurementValue = recode(measurementValue, 
    APEX = "apex predator",
    Cor = "corallivore",
    H = "herbivore",
    MI = "mobile invertivore",
    Om = "omnivore",
    Par = "parrotfish",
    Pisc = "piscivore",
    Pk = "planktivore",
    SI = "sessile invertivore",
    X = "unidentified fish",
    Z = "zooplantivore"
  ))

Data exploration

Time

Let’s take a look at the time distribution of our dataset:

ggplot(data = occ) + 
  geom_bar(aes(date_year), width = 1)

Locations

Check how many locations we have at different levels (station, island, island group):

occ %>%
  group_by(decimalLongitude, decimalLatitude) %>%
  summarize(records = n()) %>%
  arrange(desc(records))
## # A tibble: 6,632 × 3
## # Groups:   decimalLongitude [6,619]
##    decimalLongitude decimalLatitude records
##               <dbl>           <dbl>   <int>
##  1            -178.          28.4       368
##  2            -176.           0.198     338
##  3             146.          17.3       286
##  4             145.          20.0       277
##  5            -162.           5.87      276
##  6             146.          17.6       271
##  7             146.          17.3       268
##  8            -162.           6.39      266
##  9             145.          20.0       266
## 10             145.          20.0       264
## # … with 6,622 more rows
occ %>%
  group_by(island) %>%
  summarize(records = n()) %>%
  arrange(desc(records))
## # A tibble: 45 × 2
##    island         records
##    <chr>            <int>
##  1 Tutuila          75141
##  2 Guam             40308
##  3 Jarvis           33195
##  4 Hawaii           31437
##  5 Palmyra          30140
##  6 Rose             25805
##  7 Pearl & Hermes   24544
##  8 Kingman          23721
##  9 Tau              23437
## 10 French Frigate   21087
## # … with 35 more rows
occ %>%
  group_by(islandGroup) %>%
  summarize(records = n()) %>%
  arrange(desc(records))
## # A tibble: 6 × 2
##   islandGroup                   records
##   <chr>                           <int>
## 1 American Samoa                 160912
## 2 Mariana Archipelago            145489
## 3 Pacific Remote Island Areas    145140
## 4 Main Hawaiian Islands          115367
## 5 Northwestern Hawaiian Islands  113250
## 6 Coral Triangle                  18742

Count data in individualCount

Let’s take a look at the counts for a single species and a single island.

thadup <- occ %>%
  filter(scientificName =="Thalassoma duperrey" & island == "Midway")

ggplot(thadup) +
  geom_jitter(aes(date_year, individualCount))

Measurement types

Let’s first take a look at all measurementTypes and measurementTypeIDs (if any):

library(knitr)

mt <- mof %>%
  group_by(measurementType, measurementTypeID) %>%
  summarize(records = n()) %>%
  arrange(desc(records))

mt %>%
  kable()
measurementType measurementTypeID records
“LW_A” Fish Species Parameter used in length-weight calculations NA 698900
“LW_B” Fish Species Parameter used in length-weight calculations NA 698900
Consumer type: APEX (apex predator), Cor (corallivore), H (herbivore), MI (mobile invertivore), Om (Omnivore), Par (parrotfish), Pisc (piscivore), Pk (planktivore), SI (sessile invertivore), X (if unidentified fish), Z (zooplantivore) NA 698900
Theoretical maximum length for fish species NA 698900
Length conversion factor used to determine fish max length NA 698899
Trophic level (Primary Consumer, Secondary Consumer, Planktivore, Piscivore) NA 698874
Number of individuals of a species of the same length observed within a given time NA 698863
Fish Total Length, TL NA 698630
Crustose Coralline Algae Substrate Cover NA 604946
Hard Coral Substrate Cover NA 604946
Macroalgae Substrate Cover NA 604946
Sand Substrate Cover NA 604916
Underwater Visibility NA 544205
Current Strength (None, Slight, Moderate, High) NA 543349
Proportion of survey area between 0 cm and 20 cm elevation above seafloor depth NA 461677
Proportion of survey area between 100 cm and 150 cm elevation above seafloor depth NA 461677
Proportion of survey area between 20 cm and 50 cm elevation above seafloor depth NA 461677
Proportion of survey area between 50 cm and 100 cm elevation above seafloor depth NA 461677
Proportion of survey area over 150 cm elevation above seafloor depth NA 461677
Highest elevation point of substrate above seafloor depth in SPC cylinder NA 461665
Boring urchin abundance, estimated by DACOR, an abundance code based on visual estimation (Dominant, Abundant, Common, Occasional, and Rare) NA 461627
Free urchin abundance, estimated by DACOR, an abundance code based on visual estimation (Dominant, Abundant, Common, Occasional, and Rare) NA 461627
Turf Algae Substrate Cover NA 405523
Soft Coral Substrate Cover NA 399476
Bivalve Substrate Cover NA 340936
Corallimorpharian Substrate Cover NA 340936
Tunicate Substrate Cover NA 340936
Zoantharian Substrate Cover NA 340936
Cyanobacteria Substrate Cover NA 318563
Sponge Substrate Cover NA 162617

Trophic level

Aggregate the count data by island and trophic level (Primary Consumer, Secondary Consumer, Planktivore, Piscivore):

mof_trophic <- mof %>%
  filter(measurementType == "Trophic level (Primary Consumer, Secondary Consumer, Planktivore, Piscivore)" & !is.na(individualCount)) %>%
  mutate(measurementValue = factor(tolower(measurementValue), levels = c("piscivore", "planktivore", "secondary", "primary")))

trophic_stats <- mof_trophic %>%
  group_by(island) %>%
  mutate(individuals = sum(individualCount)) %>%
  group_by(island, measurementValue) %>%
  summarize(fraction = sum(individualCount / individuals))

island_order <- trophic_stats %>% filter(measurementValue == "primary") %>% arrange(fraction) %>% pull(island)

trophic_stats$island <- factor(trophic_stats$island, levels = island_order)

ggplot(trophic_stats) +
  geom_bar(aes(island, fraction, fill = measurementValue), position = "stack", stat = "identity") +
  coord_flip() +
  labs(fill = "Trophic level") +
  scale_fill_brewer(palette = "Paired")

Consumer types

Aggregate the count data by island and consumer type:

mof_consumer <- mof %>%
  filter(measurementType == "Consumer type: APEX (apex predator), Cor (corallivore), H (herbivore), MI (mobile invertivore), Om (Omnivore), Par (parrotfish), Pisc (piscivore), Pk (planktivore), SI (sessile invertivore), X (if unidentified fish), Z (zooplantivore)" & !is.na(individualCount))

consumer_stats <- mof_consumer %>%
  group_by(island) %>%
  mutate(individuals = sum(individualCount)) %>%
  group_by(island, measurementValue) %>%
  summarize(fraction = sum(individualCount / individuals))

ggplot(consumer_stats) +
  geom_bar(aes(island, fraction, fill = measurementValue), position = "stack", stat = "identity") +
  coord_flip() +
  labs(fill = "Consumer type") +
  scale_fill_brewer(palette = "Paired")

Length

Calculate mean fish length by island:

mof_length <- mof %>%
  filter(measurementType == "Fish Total Length, TL") %>%
  mutate(measurementValue = as.numeric(measurementValue))

length_stats <- mof_length %>%
  group_by(island) %>%
  summarize(
    decimalLongitude = mean(decimalLongitude),
    decimalLatitude = mean(decimalLatitude),
    mean_length = sum(measurementValue * individualCount) / sum(individualCount),
    median_length = median(rep(measurementValue, individualCount)),
    q1_length = quantile(rep(measurementValue, individualCount))[2],
    q3_length = quantile(rep(measurementValue, individualCount))[4],
    sd_length = sd(rep(measurementValue, individualCount))
  ) %>%
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)

length_stats$island <- reorder(length_stats$island, length_stats$mean_length)

ggplot() + 
  geom_sf(data = world, color = NA) +
  geom_sf(data = length_stats, aes(size = mean_length), fill = NA, shape = 21) +
  scale_size(range = c(0, 6), name = "Mean fish length (cm)") +
  theme_void()

ggplot(length_stats) +
  geom_errorbarh(aes(y = island, xmin = mean_length - sd_length, xmax = mean_length + sd_length)) +
  geom_point(aes(y = island, x = mean_length), shape = 21, fill = "white", size = 3) +
  geom_point(aes(y = island, x = median_length), shape = 6, fill = "white", size = 2) +
  xlab("Fish length (cm)")