DNADerivedData extension data access

In this notebook we will explore data published to OBIS using the new DNADerivedData extension. This notebook is available at https://iobis.github.io/notebook-dnaderiveddata/.

Using R

Installing the robis package

First install the robis package. If the package is already installed, make sure the version is 2.10.0 or higher.

if (!"robis" %in% rownames(installed.packages())) {
  remotes::install_github("iobis/robis")
} else if (packageVersion("robis") < "2.10.0") {
  knitr::knit_exit()
}

We will also need the following packages:

library(robis)
library(dplyr)
library(ggplot2)
library(knitr)
library(ggtree)

Finding sequence based data

Before we fetch any occurrence data let’s first find out which datasets in OBIS are using the DNADerivedData extension. The dataset() function takes many of the same parameters as the occurrence() function, including hasextensions which allows us to limit our search to occurrences which have linked extension records of a specific type. Possible values include MeasurementOrFact and DNADerivedData.

dna_datasets <- dataset(hasextensions = "DNADerivedData")

dna_datasets %>%
  select(url, title, records) %>%
  kable()
url title records
https://www1.usgs.gov/obis-usa/ipt/resource?r=18s_monterey_bay_time_series_edna 18S Monterey Bay Time Series: an eDNA data set from Monterey Bay, California, including years 2006, 2013 - 2016 64837
mapview::mapview(sf::st_as_sfc(dna_datasets$extent, crs = 4326))

Fetching sequence based data

Let’s fetch all occurrences from the 18S Monterey Bay Time Series dataset by using the datasetid parameter. Alternatively, set hasextensions to DNADerivedData. Also set the extensions parameter to ensure that the nested extension records are included in the results.

occ <- occurrence(datasetid = "62b97724-da17-4ca7-9b26-b2a22aeaab51", extensions = "DNADerivedData")
occ
## # A tibble: 64,837 × 93
##    eventID   brackish date_year scientificNameID   scientificName infrakingdomid
##    <chr>     <lgl>        <int> <chr>              <chr>                   <int>
##  1 CN13Dc01… FALSE         2013 urn:lsid:marinesp… Cercozoa               582420
##  2 32414c01… NA            2014 urn:lsid:marinesp… Dolichomastig…             NA
##  3 CN13Dc01… NA            2013 urn:lsid:marinesp… Hemiaulaceae           368898
##  4 18815c01… FALSE         2015 urn:lsid:marinesp… Spirotrichea           536209
##  5 11216c01… TRUE          2016 urn:lsid:marinesp… Biota                      NA
##  6 05114c01… FALSE         2014 urn:lsid:marinesp… Cercozoa               582420
##  7 32414c01… NA            2014 urn:lsid:marinesp… Proterythrops…         536209
##  8 11216c01… TRUE          2016 urn:lsid:marinesp… Dinophyceae            536209
##  9 28215c01… TRUE          2015 urn:lsid:marinesp… Biota                      NA
## 10 30214c01… TRUE          2014 urn:lsid:marinesp… Protoperidini…         536209
## # … with 64,827 more rows, and 87 more variables: absence <lgl>, dropped <lgl>,
## #   aphiaID <int>, decimalLatitude <dbl>, taxonID <chr>,
## #   originalScientificName <chr>, marine <lgl>, phylumid <int>,
## #   basisOfRecord <chr>, terrestrial <lgl>, subkingdom <chr>, date_mid <dbl>,
## #   identificationRemarks <chr>, nameAccordingTo <chr>, id <chr>,
## #   identificationReferences <chr>, organismQuantity <chr>,
## #   sampleSizeUnit <chr>, dataset_id <chr>, decimalLongitude <dbl>, …

Let’s take a look at the taxonomic composition of our dataset:

occ %>%
  group_by(phylum) %>%
  summarize(species = length(unique(speciesid)), records = n()) %>%
  arrange(desc(species))
## # A tibble: 42 × 3
##    phylum      species records
##    <chr>         <int>   <int>
##  1 Myzozoa          60   21267
##  2 Ochrophyta       46    6285
##  3 Ciliophora       16    3425
##  4 Chlorophyta      15    2540
##  5 Cryptophyta      12    1424
##  6 Haptophyta       10    1442
##  7 Cnidaria          8     202
##  8 Arthropoda        7     848
##  9 Cercozoa          7    3279
## 10 Annelida          6     240
## # … with 32 more rows

Extracting DNADerivedData records

The DNADerivedData records are currently nested within the occurrences we fetched from OBIS. To extract the extension records we can use the unnest_extension() function. Pass any fields from the occurrence table you would like to keep using the fields parameter.

dna <- unnest_extension(occ, "DNADerivedData", fields = c("id", "phylum", "class", "order", "family", "genus", "species", "date_year"))
dna
## # A tibble: 64,837 × 128
##    id         phylum  class   order   family  genus  species date_year samp_name
##    <chr>      <chr>   <chr>   <chr>   <chr>   <chr>  <chr>       <int> <chr>    
##  1 00015cdd-… Cercoz… <NA>    <NA>    <NA>    <NA>   <NA>         2013 <NA>     
##  2 00032b86-… Chloro… Mamiel… Dolich… <NA>    <NA>   <NA>         2014 <NA>     
##  3 00069a4d-… Ochrop… Bacill… Hemiau… Hemiau… <NA>   <NA>         2013 <NA>     
##  4 0006bc59-… Ciliop… Spirot… <NA>    <NA>    <NA>   <NA>         2015 <NA>     
##  5 000a11ec-… <NA>    <NA>    <NA>    <NA>    <NA>   <NA>         2016 <NA>     
##  6 000a8381-… Cercoz… <NA>    <NA>    <NA>    <NA>   <NA>         2014 <NA>     
##  7 000c5441-… Myzozoa Dinoph… Gymnod… Warnow… Prote… <NA>         2014 <NA>     
##  8 000c8d1b-… Myzozoa Dinoph… <NA>    <NA>    <NA>   <NA>         2016 <NA>     
##  9 000dad20-… <NA>    <NA>    <NA>    <NA>    <NA>   <NA>         2015 <NA>     
## 10 000df653-… Myzozoa Dinoph… Peridi… Protop… Proto… <NA>         2014 <NA>     
## # … with 64,827 more rows, and 119 more variables: project_name <chr>,
## #   experimental_factor <chr>, env_broad_scale <chr>, env_local_scale <chr>,
## #   env_medium <chr>, subspecf_gen_lin <chr>, ploidy <chr>,
## #   num_replicons <chr>, extrachrom_elements <chr>, estimated_size <chr>,
## #   ref_biomaterial <chr>, source_mat_id <chr>, pathogenicity <chr>,
## #   biotic_relationship <chr>, specific_host <chr>, host_spec_range <chr>,
## #   host_disease_stat <chr>, trophic_level <chr>, propagation <chr>, …

Let’s take a look at some properties of the sequence records.

dna %>%
  group_by(target_gene, pcr_primer_name_forward, pcr_primer_name_reverse) %>%
  summarize(records = n()) %>%
  kable()
target_gene pcr_primer_name_forward pcr_primer_name_reverse records
18S 1391f EukBr 64837

Working with DNADerivedData records

This is a quick demo of working with sequences from the DNADerivedData extension. For the remainder of the notebook I will work with a subset of the sequence data.

dna_subset <- dna %>%
  filter(phylum == "Ochrophyta" & !is.na(species)) %>%
  head(100)

To be able to make use of existing bioinformatics packages we first need to convert our DNA_sequence column to a DNAbin object. This can be done using as.DNAbin function from the ape (Analyses of Phylogenetics and Evolution) package.

sequences <- sapply(strsplit(as.character(dna_subset$DNA_sequence), ""), tolower)
names(sequences) <- dna_subset$id
sequences <- ape::as.DNAbin(sequences)
sequences
## 100 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 126.95 
##    Shortest sequence: 125 
##     Longest sequence: 130 
## 
## Labels:
## 0051d367-7cbd-4b8e-9e2e-10c5ad2331a7
## 005a9b4a-6a8d-41b4-8c12-5f1536ecfec0
## 006c0661-e140-4318-b299-065af8eb1296
## 008f0b80-7f82-40ed-8d4c-e61c5d28dce5
## 008f9442-b53d-49ab-b8e9-5bee228b3525
## 00acb86b-b6b6-41b3-8e2c-dcb5fd8ac00f
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.247 0.197 0.268 0.288 
## (Total: 12.7 kb)

Now we can align the sequences using MAFFT. The ips (Interfaces to Phylogenetic Software in R) package provides an interface to the MAFFT software.

aligned <- ips::mafft(sequences, method = "auto")
aligned
## 100 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 131 
## 
## Labels:
## 0051d367-7cbd-4b8e-9e2e-10c5ad2331a7
## 005a9b4a-6a8d-41b4-8c12-5f1536ecfec0
## 006c0661-e140-4318-b299-065af8eb1296
## 008f0b80-7f82-40ed-8d4c-e61c5d28dce5
## 008f9442-b53d-49ab-b8e9-5bee228b3525
## 00acb86b-b6b6-41b3-8e2c-dcb5fd8ac00f
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.247 0.197 0.268 0.288 
## (Total: 13.1 kb)

Let’s visualize the aligned sequences:

ggmsa::ggmsa(aligned %>% as.list() %>% setNames(substr(names(sequences), 1, 5)), color = "Chemistry_NT", seq_name = TRUE) + theme(text = element_text(size = 5))

Finally, we can calculate a distance matrix and construct a tree using dist.dna and njs.

distances <- ape::dist.dna(aligned, model = "raw")

tree <- ape::njs(distances) %>%
  left_join(dna_subset %>% select(id, year = date_year, species), by = c("label" = "id"))

ggtree(tree, layout = "circular", cex = 0.3) +
  geom_tiplab(size = 2, aes(label = species, color = year)) +
  scale_color_viridis_c(option = "magma", end = 0.9)

Using the mapper

DNADerivedData extension records can also be included in data downloads from the mapper. To do so, tick the DNADerivedData checkbox in the download confirmation form.

mapper

The resulting archive will include CSV files for the occurrences as well as for the selected extensions. The occurrence and extension tables can be joined using the id column.

mapper