Data exploration - Stratified random surveys (StRS) of reef fish in the U.S. Pacific Islands
In this notebook we explore the NOAA Pacific Islands Fisheries Science Center, Ecosystem Sciences Division, National Coral Reef Monitoring Program: Stratified random surveys (StRS) of reef fish in the U.S. Pacific Islands dataset.
Data preparation
First load some packages, get the world polygons for creating maps, and initialize a cache.
library(robis)
library(dplyr)
library(ggplot2)
library(sf)
library(rnaturalearth)
library(memoise)
library(cachem)
<- ne_countries(scale = "medium", returnclass = "sf")
world <- cachem::cache_disk(dir = "cache", max_size = Inf) cache
Fetching data from the OBIS API
Fetch occurrence data from OBIS and make sure to include the MeasurementOrFact records. This step can take a while, so I’m using a local cache to speed up subsequent runs.
<- memoise(occurrence, cache = cache)(datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5", mof = TRUE) occ
Cleanup occurrence
individualCount
s are provided as character, so first convert to numeric:
<- occ %>%
occ mutate(individualCount = as.numeric(individualCount))
Extract measurements
The MeasurementOrFact records are nested inside the mof
column, use unnest_extension()
to extract them to a dedicated data frame.
<- unnest_extension(occ, "MeasurementOrFact", fields = c("eventID", "scientificName", "year", "individualCount", "decimalLongitude", "decimalLatitude", "island", "islandGroup", "stateProvince")) mof
Cleanup measurements
The consumer types are a bit cryptic, so let’s translate them to something human readable:
<- mof %>%
mof mutate(measurementValue = recode(measurementValue,
APEX = "apex predator",
Cor = "corallivore",
H = "herbivore",
MI = "mobile invertivore",
Om = "omnivore",
Par = "parrotfish",
Pisc = "piscivore",
Pk = "planktivore",
SI = "sessile invertivore",
X = "unidentified fish",
Z = "zooplantivore"
))
Data exploration
Time
Let’s take a look at the time distribution of our dataset:
ggplot(data = occ) +
geom_bar(aes(date_year), width = 1)
Locations
Check how many locations we have at different levels (station, island, island group):
%>%
occ group_by(decimalLongitude, decimalLatitude) %>%
summarize(records = n()) %>%
arrange(desc(records))
## # A tibble: 6,632 × 3
## # Groups: decimalLongitude [6,619]
## decimalLongitude decimalLatitude records
## <dbl> <dbl> <int>
## 1 -178. 28.4 368
## 2 -176. 0.198 338
## 3 146. 17.3 286
## 4 145. 20.0 277
## 5 -162. 5.87 276
## 6 146. 17.6 271
## 7 146. 17.3 268
## 8 -162. 6.39 266
## 9 145. 20.0 266
## 10 145. 20.0 264
## # … with 6,622 more rows
%>%
occ group_by(island) %>%
summarize(records = n()) %>%
arrange(desc(records))
## # A tibble: 45 × 2
## island records
## <chr> <int>
## 1 Tutuila 75141
## 2 Guam 40308
## 3 Jarvis 33195
## 4 Hawaii 31437
## 5 Palmyra 30140
## 6 Rose 25805
## 7 Pearl & Hermes 24544
## 8 Kingman 23721
## 9 Tau 23437
## 10 French Frigate 21087
## # … with 35 more rows
%>%
occ group_by(islandGroup) %>%
summarize(records = n()) %>%
arrange(desc(records))
## # A tibble: 6 × 2
## islandGroup records
## <chr> <int>
## 1 American Samoa 160912
## 2 Mariana Archipelago 145489
## 3 Pacific Remote Island Areas 145140
## 4 Main Hawaiian Islands 115367
## 5 Northwestern Hawaiian Islands 113250
## 6 Coral Triangle 18742
Count data in individualCount
Let’s take a look at the counts for a single species and a single island.
<- occ %>%
thadup filter(scientificName =="Thalassoma duperrey" & island == "Midway")
ggplot(thadup) +
geom_jitter(aes(date_year, individualCount))
Measurement types
Let’s first take a look at all measurementType
s and measurementTypeID
s (if any):
library(knitr)
<- mof %>%
mt group_by(measurementType, measurementTypeID) %>%
summarize(records = n()) %>%
arrange(desc(records))
%>%
mt kable()
measurementType | measurementTypeID | records |
---|---|---|
“LW_A” Fish Species Parameter used in length-weight calculations | NA | 698900 |
“LW_B” Fish Species Parameter used in length-weight calculations | NA | 698900 |
Consumer type: APEX (apex predator), Cor (corallivore), H (herbivore), MI (mobile invertivore), Om (Omnivore), Par (parrotfish), Pisc (piscivore), Pk (planktivore), SI (sessile invertivore), X (if unidentified fish), Z (zooplantivore) | NA | 698900 |
Theoretical maximum length for fish species | NA | 698900 |
Length conversion factor used to determine fish max length | NA | 698899 |
Trophic level (Primary Consumer, Secondary Consumer, Planktivore, Piscivore) | NA | 698874 |
Number of individuals of a species of the same length observed within a given time | NA | 698863 |
Fish Total Length, TL | NA | 698630 |
Crustose Coralline Algae Substrate Cover | NA | 604946 |
Hard Coral Substrate Cover | NA | 604946 |
Macroalgae Substrate Cover | NA | 604946 |
Sand Substrate Cover | NA | 604916 |
Underwater Visibility | NA | 544205 |
Current Strength (None, Slight, Moderate, High) | NA | 543349 |
Proportion of survey area between 0 cm and 20 cm elevation above seafloor depth | NA | 461677 |
Proportion of survey area between 100 cm and 150 cm elevation above seafloor depth | NA | 461677 |
Proportion of survey area between 20 cm and 50 cm elevation above seafloor depth | NA | 461677 |
Proportion of survey area between 50 cm and 100 cm elevation above seafloor depth | NA | 461677 |
Proportion of survey area over 150 cm elevation above seafloor depth | NA | 461677 |
Highest elevation point of substrate above seafloor depth in SPC cylinder | NA | 461665 |
Boring urchin abundance, estimated by DACOR, an abundance code based on visual estimation (Dominant, Abundant, Common, Occasional, and Rare) | NA | 461627 |
Free urchin abundance, estimated by DACOR, an abundance code based on visual estimation (Dominant, Abundant, Common, Occasional, and Rare) | NA | 461627 |
Turf Algae Substrate Cover | NA | 405523 |
Soft Coral Substrate Cover | NA | 399476 |
Bivalve Substrate Cover | NA | 340936 |
Corallimorpharian Substrate Cover | NA | 340936 |
Tunicate Substrate Cover | NA | 340936 |
Zoantharian Substrate Cover | NA | 340936 |
Cyanobacteria Substrate Cover | NA | 318563 |
Sponge Substrate Cover | NA | 162617 |
Trophic level
Aggregate the count data by island and trophic level (Primary Consumer, Secondary Consumer, Planktivore, Piscivore):
<- mof %>%
mof_trophic filter(measurementType == "Trophic level (Primary Consumer, Secondary Consumer, Planktivore, Piscivore)" & !is.na(individualCount)) %>%
mutate(measurementValue = factor(tolower(measurementValue), levels = c("piscivore", "planktivore", "secondary", "primary")))
<- mof_trophic %>%
trophic_stats group_by(island) %>%
mutate(individuals = sum(individualCount)) %>%
group_by(island, measurementValue) %>%
summarize(fraction = sum(individualCount / individuals))
<- trophic_stats %>% filter(measurementValue == "primary") %>% arrange(fraction) %>% pull(island)
island_order
$island <- factor(trophic_stats$island, levels = island_order)
trophic_stats
ggplot(trophic_stats) +
geom_bar(aes(island, fraction, fill = measurementValue), position = "stack", stat = "identity") +
coord_flip() +
labs(fill = "Trophic level") +
scale_fill_brewer(palette = "Paired")
Consumer types
Aggregate the count data by island and consumer type:
<- mof %>%
mof_consumer filter(measurementType == "Consumer type: APEX (apex predator), Cor (corallivore), H (herbivore), MI (mobile invertivore), Om (Omnivore), Par (parrotfish), Pisc (piscivore), Pk (planktivore), SI (sessile invertivore), X (if unidentified fish), Z (zooplantivore)" & !is.na(individualCount))
<- mof_consumer %>%
consumer_stats group_by(island) %>%
mutate(individuals = sum(individualCount)) %>%
group_by(island, measurementValue) %>%
summarize(fraction = sum(individualCount / individuals))
ggplot(consumer_stats) +
geom_bar(aes(island, fraction, fill = measurementValue), position = "stack", stat = "identity") +
coord_flip() +
labs(fill = "Consumer type") +
scale_fill_brewer(palette = "Paired")
Length
Calculate mean fish length by island:
<- mof %>%
mof_length filter(measurementType == "Fish Total Length, TL") %>%
mutate(measurementValue = as.numeric(measurementValue))
<- mof_length %>%
length_stats group_by(island) %>%
summarize(
decimalLongitude = mean(decimalLongitude),
decimalLatitude = mean(decimalLatitude),
mean_length = sum(measurementValue * individualCount) / sum(individualCount),
median_length = median(rep(measurementValue, individualCount)),
q1_length = quantile(rep(measurementValue, individualCount))[2],
q3_length = quantile(rep(measurementValue, individualCount))[4],
sd_length = sd(rep(measurementValue, individualCount))
%>%
) st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
$island <- reorder(length_stats$island, length_stats$mean_length)
length_stats
ggplot() +
geom_sf(data = world, color = NA) +
geom_sf(data = length_stats, aes(size = mean_length), fill = NA, shape = 21) +
scale_size(range = c(0, 6), name = "Mean fish length (cm)") +
theme_void()
ggplot(length_stats) +
geom_errorbarh(aes(y = island, xmin = mean_length - sd_length, xmax = mean_length + sd_length)) +
geom_point(aes(y = island, x = mean_length), shape = 21, fill = "white", size = 3) +
geom_point(aes(y = island, x = median_length), shape = 6, fill = "white", size = 2) +
xlab("Fish length (cm)")