3 Obtaining occurrence data from OBIS and GBIF
3.1 Selecting species for modeling
The first step was to establish which species occur in the study area. For OBIS, we used the robis function dataset to retrieve all species recorded there, and then filtered the result to:
- retain only taxa at the species level
- retain only taxa with accepted taxonomic status
- remove Archaea, Bacteria, Fungi and Protozoa taxa
- include only marine or brackish species
From that we obtained a final list of 22159 species.
In the case of GBIF, we first downloaded all records within our study area. Using the worrms package, we then verified which of those species were marine and applied the same filters used for the OBIS data. This resulted in 22782 unique species.
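The filters listed above can be illustrated with a minimal sketch. It is written in Python for self-containment and is not the project's R code. The field names (rank, status, kingdom, is_marine, is_brackish) are hypothetical, and it assumes the four excluded groups are recorded at kingdom level; real OBIS, GBIF, or WoRMS exports may differ.

```python
# Illustrative only: field names below are assumptions, not the OBIS/GBIF schema.
EXCLUDED_GROUPS = {"Archaea", "Bacteria", "Fungi", "Protozoa"}


def keep_taxon(taxon: dict) -> bool:
    """Return True if a taxon record passes all four filters."""
    return bool(
        taxon.get("rank") == "Species"                 # species level only
        and taxon.get("status") == "accepted"          # accepted names only
        and taxon.get("kingdom") not in EXCLUDED_GROUPS
        and (taxon.get("is_marine") or taxon.get("is_brackish"))
    )


# Made-up example records, one per failure mode plus one that passes.
taxa = [
    {"name": "A", "rank": "Species", "status": "accepted",
     "kingdom": "Animalia", "is_marine": True, "is_brackish": False},
    {"name": "B", "rank": "Genus", "status": "accepted",
     "kingdom": "Animalia", "is_marine": True, "is_brackish": False},
    {"name": "C", "rank": "Species", "status": "unaccepted",
     "kingdom": "Animalia", "is_marine": True, "is_brackish": False},
    {"name": "D", "rank": "Species", "status": "accepted",
     "kingdom": "Bacteria", "is_marine": True, "is_brackish": False},
    {"name": "E", "rank": "Species", "status": "accepted",
     "kingdom": "Plantae", "is_marine": False, "is_brackish": False},
]

species_list = [t["name"] for t in taxa if keep_taxon(t)]
```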
3.2 Downloading data
OBIS data was obtained from the full export available at https://obis.org/data/access/. Code for downloading the same data through the robis package is also available in the obissdm package, which is being developed to support this project.
GBIF data was downloaded using the rgbif package, via the obissdm package.
3.3 Quality control steps (under development)
3.3.1 Duplicate records removal
We removed duplicated data points using GeoHash with a precision of 6 (cells at most 1.22 km wide by 0.61 km high) together with the year of the record. Thus, for each combination of GeoHash cell and year, only one record was kept. This step is implemented in the mp_dup_check function of the project package msdm.
We note that, specifically for the SDMs, we perform an additional duplicate-removal step before modeling in order to keep only a single record per cell.
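As a rough illustration of the GeoHash-plus-year rule, the sketch below encodes coordinates with the standard GeoHash algorithm and keeps the first record seen for each (cell, year) pair. It is not the mp_dup_check implementation. The record keys (lat, lon, year) and the choice to keep the first record are assumptions made for the example.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Standard GeoHash: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_lon = True  # the first bit always refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def drop_duplicates(records, precision: int = 6):
    """Keep the first record for each (GeoHash cell, year) combination."""
    seen, kept = set(), []
    for rec in records:
        key = (geohash_encode(rec["lat"], rec["lon"], precision), rec["year"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Made-up records: the first two fall in the same cell in the same year.
records = [
    {"lat": 57.64911, "lon": 10.40744, "year": 2001},
    {"lat": 57.64920, "lon": 10.40750, "year": 2001},
    {"lat": 57.64911, "lon": 10.40744, "year": 2002},
    {"lat": 0.0, "lon": 0.0, "year": 2001},
]
unique_records = drop_duplicates(records)
```

In this example the second record is dropped, while the same location in a different year is retained.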
3.3.2 Remove records on land
Records on land were removed based on openmap.
We further filtered the records for the SDMs by keeping only those overlapping the environmental variable layers, whose coverage differs slightly from the land layer used in the previous step.
3.3.3 Geographical and environmental outliers (flagging)
For the assessment of geographical outliers we implemented a method that considers the existence of barriers when calculating the distance between points. Usually, geographical outliers are identified from the straight-line distance between points. However, for marine species (and indeed for terrestrial ones) barriers are important because they constrain dispersal. Consider, for example, a species recorded on both sides of the Isthmus of Panama (Atlantic and Pacific). A straight line between the two points would be a short distance. However, if we take the land barrier into account, the animal (or its larvae) would need to travel a much longer distance through the sea to reach the other side.
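The difference between straight-line and barrier-aware distance can be shown with a toy example. The sketch below is not the project's method: it assumes a small raster where 0 is sea and 1 is land, and uses a breadth-first search over 4-connected sea cells to count the steps of the shortest in-water path. All names are hypothetical.

```python
from collections import deque


def sea_path_length(grid, start, goal):
    """Shortest number of steps between two sea cells, moving only through sea.

    grid: list of rows, 0 = sea, 1 = land. Returns None if no sea route exists.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


# A strip of land (column 2) separates two basins that connect only in the last row.
grid = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
a, b = (0, 1), (0, 3)

direct_steps = abs(a[0] - b[0]) + abs(a[1] - b[1])  # ignores the barrier
sea_steps = sea_path_length(grid, a, b)             # goes around the barrier
```

Here the two points are only two cells apart when the barrier is ignored, but the in-water route is four times longer, which is the effect described for the two sides of Panama.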
In both the geographical and the environmental (sea surface temperature, bathymetry and salinity) outlier assessments, we used a threshold of 3 times the interquartile range to identify extreme points. However, we flagged at most 1% of the points: if more points exceeded the threshold, only the most extreme ones were flagged.
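The flagging rule can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes the threshold is applied as fences at Q1 - 3 × IQR and Q3 + 3 × IQR, that quartiles use Python's "inclusive" method, and that the 1% cap is rounded down. The input values could be distances (geographical case) or an environmental variable.

```python
import statistics


def flag_outliers(values, k=3.0, max_fraction=0.01):
    """Return sorted indices of the most extreme points beyond k * IQR fences.

    At most max_fraction of the points are flagged (rounded down).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # How far each candidate lies beyond its fence.
    excess = {}
    for i, v in enumerate(values):
        if v > upper:
            excess[i] = v - upper
        elif v < lower:
            excess[i] = lower - v

    limit = int(len(values) * max_fraction)
    ranked = sorted(excess, key=excess.get, reverse=True)  # most extreme first
    return sorted(ranked[:limit])


# Made-up data: 196 ordinary values plus 4 extreme ones (indices 196 to 199).
values = [float(i % 10) for i in range(196)] + [100.0, 200.0, 300.0, 400.0]
flagged = flag_outliers(values)
```

With 200 points the cap is 2, so although four values lie beyond the upper fence, only the two most extreme (300 and 400) are flagged.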