# GBIF marine datasets
This notebook uses the GBIF API to generate statistics on marine data in GBIF. Note that this is a preliminary analysis which only takes species into account. The process looks like this:

- Get all marine species names from WoRMS
- Using the species names from WoRMS, get all profiles of marine species from GBIF
- Using the `nubKey`s from the species profiles, get record counts by dataset from GBIF
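The second and third steps can be illustrated for a single name. This is a hedged sketch, not part of the analysis itself: the species name is just an example, and it assumes the GBIF species search and occurrence counts endpoints used later in this notebook.

```r
library(jsonlite)

# Example name; any WoRMS scientific name would do
name <- "Balaenoptera musculus"

# Resolve the name to GBIF species profiles, including the backbone key (nubKey)
res <- fromJSON(URLencode(paste0("https://api.gbif.org/v1/species?name=", name)))$results
nubkey <- res$nubKey[1]

# Occurrence record counts per dataset for that nubKey,
# returned as a named list of datasetKey -> record count
counts <- fromJSON(paste0("https://api.gbif.org/v1/occurrence/counts/datasets?nubKey=", nubkey))
```

The full run below does exactly this in a loop, caching each response as a CSV file so the process can be interrupted and resumed.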
## Get all marine species from WoRMS
This uses a full export of the WoRMS database. The export file is not included in this repository; contact the WoRMS team to get access to such an export.
```r
library(dplyr)
library(data.table)

# Read the taxon and species profile tables from the WoRMS export
taxon <- fread("worms_export/taxon.txt", sep = "\t", na.strings = "", quote = "") %>%
  as_tibble()
speciesprofile <- fread("worms_export/speciesprofile.txt", sep = "\t", na.strings = "", quote = "") %>%
  as_tibble()

# Keep accepted names at species rank that are flagged as marine
taxon_marine <- taxon %>%
  left_join(speciesprofile, by = "taxonID", suffix = c("", ".y")) %>%
  filter(taxonomicStatus == "accepted" & isMarine == 1 & taxonRank == "Species")

species <- unique(taxon_marine$scientificName)
```
## Fetch species profiles from GBIF

Here we use the GBIF API to search for species by name. The results are stored as CSV files in the `profiles` folder. Once all species have been processed, the CSV files are read and combined.
```r
library(stringr)
library(jsonlite)
library(progress)
library(purrr)

if (!file.exists("profiles.rds")) {
  pb <- progress_bar$new(total = length(species), format = "[:bar] :current/:total (:percent) ETA: :eta")
  for (sp in species) {
    # One CSV per species, e.g. profiles/balaenoptera_musculus.csv
    key <- str_replace(tolower(sp), "\\s", "_")
    filename <- paste0("profiles/", key, ".csv")
    if (!file.exists(filename)) {
      url <- URLencode(paste0("https://api.gbif.org/v1/species?name=", sp))
      res <- fromJSON(url)$results
      if (length(res) > 0 & "nubKey" %in% names(res)) {
        species_names <- res %>%
          select(key, nubKey, nameKey, taxonID)
        write.csv(species_names, filename, row.names = FALSE, na = "")
      } else {
        # Write an empty file so this species is not fetched again on resume
        write.csv(data.frame(nubKey = character(0)), filename, row.names = FALSE, na = "")
      }
    }
    pb$tick()
  }
  files <- list.files(path = "profiles", pattern = "*.csv", full.names = TRUE)
  profiles <- map(files, ~read.csv(.)) %>%
    bind_rows()
} else {
  profiles <- readRDS("profiles.rds")
}
```
## Get occurrence counts by dataset for each species

Here we use another API endpoint to get the number of records per dataset for each species. Results are stored in the `statistics` folder as CSV files.
```r
if (!file.exists("statistics.rds")) {
  nubkeys <- na.omit(unique(profiles$nubKey))
  pb <- progress_bar$new(total = length(nubkeys), format = "[:bar] :current/:total (:percent) ETA: :eta")
  for (nubkey in nubkeys) {
    filename <- paste0("statistics/", nubkey, ".csv")
    if (!file.exists(filename)) {
      url <- URLencode(paste0("https://api.gbif.org/v1/occurrence/counts/datasets?nubKey=", nubkey))
      res <- fromJSON(url)
      if (length(res) > 0) {
        # The endpoint returns a named list of datasetKey -> record count
        df <- data.frame(dataset = names(res), records = unlist(res))
        write.csv(df, filename, row.names = FALSE, na = "")
      } else {
        # Write an empty file so this nubKey is not fetched again on resume
        write.csv(data.frame(dataset = character(0), records = integer(0)), filename, row.names = FALSE, na = "")
      }
    }
    pb$tick()
  }
  files <- list.files(path = "statistics", pattern = "*.csv", full.names = TRUE)
  statistics <- map(files, ~read.csv(., colClasses = c("character", "integer"))) %>%
    bind_rows()
} else {
  statistics <- readRDS("statistics.rds")
}
```
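Note that neither `profiles.rds` nor `statistics.rds` is written anywhere above, so the `else` branches only take effect if the caches are created separately. Presumably the combined tables are saved after the first complete run, along these lines (an assumption, not shown in the original):

```r
# Assumption: cache the combined tables so later runs skip the API calls entirely
saveRDS(profiles, "profiles.rds")
saveRDS(statistics, "statistics.rds")
```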
## Calculate statistics
```r
# Sum record counts per dataset across all species
stats <- statistics %>%
  group_by(dataset) %>%
  summarize(records = sum(records)) %>%
  arrange(desc(records))

n_datasets <- format(nrow(stats), big.mark = ",")
n_records <- format(sum(stats$records), big.mark = ",")

stats %>%
  rmarkdown::paged_table()
```
In total we have found 15,364 datasets containing marine species, for a total of 223,870,046 marine species records.