3  Obtaining occurrence data from OBIS and GBIF

3.1 Selecting species for modeling

The first step was to establish the species occurring on the study area. We used the robis function dataset to retrieve all species occurring on the study area. Further on we filtered the dataset to:

  • retain only taxa at the species level
  • retain only taxa with accepted taxonomic status
  • remove Archaea, Bacteria, Fungi and Protozoa taxa
  • include only marine or brackish species

From that we obtained a final list of 22159 species.

In the case of GBIF, we first downloaded all data occurring within our study area. Then, using the worrms package we verified which of those species were marine. Then we performed the same filters used with the OBIS data. This resulted in 22782 unique species.

3.2 Downloading data

OBIS data was obtained from the full export available at https://obis.org/data/access/. However, code for downloading the data through the robis package is available. This is done through the obissdm package (which is being developed to support this project)

GBIF data was downloaded using the rgbif package, via the obissdm package.

3.3 Quality control steps (under development)

3.3.1 Duplicate records removal

We removed duplicated data points using GeoHash with a precision of 6 (width ≤ 1.22km X height 0.61km), and the year. Thus, for each combination of GeoHash cell and year, only one record was kept. That part is implemented in the mp_dup_check function, of the project package msdm.

Note

We note that, specifically for the SDMs, before modeling we do an additional duplicate removal in order to keep only a single record per cell.

3.3.2 Remove records on land

Records on land were removed based on openmap.

Note

We further filtered the records for the SDMs by keeping only those overlapping the environmental variable layers (which present some differences to the land layer used before).

3.3.3 Geographical and environmental outliers (flagging)

For the assessment of geographical outliers we implemented an innovative method that considers the existence of barriers when calculating the distance between points. Usually, geographical outliers are calculated based on the cartesian distance between the points. However, for marine species (indeed, also for terrestrial ones) the barriers are important because it constrains dispersal. Consider for example one species on the two sides of the Panama strait (Atlantic and Pacific). A straight line between the two points would be a short distance. However, if we take in account the barrier, then the animal (or its larvae) would need to travel a much longer distance to reach the other side of the strait.

In both the geographical and the environmental (Sea Surface Temperature, bathymetry and salinity) outlier assessment, we used a threshold of 3 times the inter quantile range to identify extreme points, but we just flagged the most extreme outlier until a limit of 1% of the points (i.e. if there were more points above the threshold, just the most extreme ones were flagged).