DNADerivedData extension data access
In this notebook we will explore data published to OBIS using the new DNADerivedData extension. This notebook is available at https://iobis.github.io/notebook-dnaderiveddata/.
Using R
Installing the robis package
First install the robis package. If the package is already installed, make sure the version is 2.10.0 or higher.
if (!"robis" %in% rownames(installed.packages())) {
remotes::install_github("iobis/robis")
} else if (packageVersion("robis") < "2.10.0") {
knitr::knit_exit()
}
We will also need the following packages:
Finding sequence based data
Before we fetch any occurrence data let’s first find out which datasets in OBIS are using the DNADerivedData extension. The dataset()
function takes many of the same parameters as the occurrence()
function, including hasextensions
which allows us to limit our search to occurrences which have linked extension records of a specific type. Possible values include MeasurementOrFact
and DNADerivedData
.
dna_datasets <- dataset(hasextensions = "DNADerivedData")
dna_datasets %>%
select(url, title, records) %>%
kable()
url | title | records |
---|---|---|
https://www1.usgs.gov/obis-usa/ipt/resource?r=18s_monterey_bay_time_series_edna | 18S Monterey Bay Time Series: an eDNA data set from Monterey Bay, California, including years 2006, 2013 - 2016 | 64837 |
Fetching sequence based data
Let’s fetch all occurrences from the 18S Monterey Bay Time Series dataset by using the datasetid
parameter. Alternatively, set hasextensions
to DNADerivedData
. Also set the extensions
parameter to ensure that the nested extension records are included in the results.
occ <- occurrence(datasetid = "62b97724-da17-4ca7-9b26-b2a22aeaab51", extensions = "DNADerivedData")
occ
## # A tibble: 64,837 × 93
## eventID brackish date_year scientificNameID scientificName infrakingdomid
## <chr> <lgl> <int> <chr> <chr> <int>
## 1 CN13Dc01… FALSE 2013 urn:lsid:marinesp… Cercozoa 582420
## 2 32414c01… NA 2014 urn:lsid:marinesp… Dolichomastig… NA
## 3 CN13Dc01… NA 2013 urn:lsid:marinesp… Hemiaulaceae 368898
## 4 18815c01… FALSE 2015 urn:lsid:marinesp… Spirotrichea 536209
## 5 11216c01… TRUE 2016 urn:lsid:marinesp… Biota NA
## 6 05114c01… FALSE 2014 urn:lsid:marinesp… Cercozoa 582420
## 7 32414c01… NA 2014 urn:lsid:marinesp… Proterythrops… 536209
## 8 11216c01… TRUE 2016 urn:lsid:marinesp… Dinophyceae 536209
## 9 28215c01… TRUE 2015 urn:lsid:marinesp… Biota NA
## 10 30214c01… TRUE 2014 urn:lsid:marinesp… Protoperidini… 536209
## # … with 64,827 more rows, and 87 more variables: absence <lgl>, dropped <lgl>,
## # aphiaID <int>, decimalLatitude <dbl>, taxonID <chr>,
## # originalScientificName <chr>, marine <lgl>, phylumid <int>,
## # basisOfRecord <chr>, terrestrial <lgl>, subkingdom <chr>, date_mid <dbl>,
## # identificationRemarks <chr>, nameAccordingTo <chr>, id <chr>,
## # identificationReferences <chr>, organismQuantity <chr>,
## # sampleSizeUnit <chr>, dataset_id <chr>, decimalLongitude <dbl>, …
Let’s take a look at the taxonomic composition of our dataset:
occ %>%
group_by(phylum) %>%
summarize(species = length(unique(speciesid)), records = n()) %>%
arrange(desc(species))
## # A tibble: 42 × 3
## phylum species records
## <chr> <int> <int>
## 1 Myzozoa 60 21267
## 2 Ochrophyta 46 6285
## 3 Ciliophora 16 3425
## 4 Chlorophyta 15 2540
## 5 Cryptophyta 12 1424
## 6 Haptophyta 10 1442
## 7 Cnidaria 8 202
## 8 Arthropoda 7 848
## 9 Cercozoa 7 3279
## 10 Annelida 6 240
## # … with 32 more rows
Extracting DNADerivedData records
The DNADerivedData records are currently nested within the occurrences we fetched from OBIS. To extract the extension records we can use the unnest_extension()
function. Pass any fields from the occurrence table you would like to keep using the fields
parameter.
dna <- unnest_extension(occ, "DNADerivedData", fields = c("id", "phylum", "class", "order", "family", "genus", "species", "date_year"))
dna
## # A tibble: 64,837 × 128
## id phylum class order family genus species date_year samp_name
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 00015cdd-… Cercoz… <NA> <NA> <NA> <NA> <NA> 2013 <NA>
## 2 00032b86-… Chloro… Mamiel… Dolich… <NA> <NA> <NA> 2014 <NA>
## 3 00069a4d-… Ochrop… Bacill… Hemiau… Hemiau… <NA> <NA> 2013 <NA>
## 4 0006bc59-… Ciliop… Spirot… <NA> <NA> <NA> <NA> 2015 <NA>
## 5 000a11ec-… <NA> <NA> <NA> <NA> <NA> <NA> 2016 <NA>
## 6 000a8381-… Cercoz… <NA> <NA> <NA> <NA> <NA> 2014 <NA>
## 7 000c5441-… Myzozoa Dinoph… Gymnod… Warnow… Prote… <NA> 2014 <NA>
## 8 000c8d1b-… Myzozoa Dinoph… <NA> <NA> <NA> <NA> 2016 <NA>
## 9 000dad20-… <NA> <NA> <NA> <NA> <NA> <NA> 2015 <NA>
## 10 000df653-… Myzozoa Dinoph… Peridi… Protop… Proto… <NA> 2014 <NA>
## # … with 64,827 more rows, and 119 more variables: project_name <chr>,
## # experimental_factor <chr>, env_broad_scale <chr>, env_local_scale <chr>,
## # env_medium <chr>, subspecf_gen_lin <chr>, ploidy <chr>,
## # num_replicons <chr>, extrachrom_elements <chr>, estimated_size <chr>,
## # ref_biomaterial <chr>, source_mat_id <chr>, pathogenicity <chr>,
## # biotic_relationship <chr>, specific_host <chr>, host_spec_range <chr>,
## # host_disease_stat <chr>, trophic_level <chr>, propagation <chr>, …
Let’s take a look at some properties of the sequence records.
dna %>%
group_by(target_gene, pcr_primer_name_forward, pcr_primer_name_reverse) %>%
summarize(records = n()) %>%
kable()
target_gene | pcr_primer_name_forward | pcr_primer_name_reverse | records |
---|---|---|---|
18S | 1391f | EukBr | 64837 |
Working with DNADerivedData records
This is a quick demo of working with sequences from the DNADerivedData extension. For the remainder of the notebook I will work with a subset of the sequence data.
To be able to make use of existing bioinformatics packages we first need to convert our DNA_sequence
column to a DNAbin
object. This can be done using as.DNAbin
function from the ape
(Analyses of Phylogenetics and Evolution) package.
sequences <- sapply(strsplit(as.character(dna_subset$DNA_sequence), ""), tolower)
names(sequences) <- dna_subset$id
sequences <- ape::as.DNAbin(sequences)
sequences
## 100 DNA sequences in binary format stored in a list.
##
## Mean sequence length: 126.95
## Shortest sequence: 125
## Longest sequence: 130
##
## Labels:
## 0051d367-7cbd-4b8e-9e2e-10c5ad2331a7
## 005a9b4a-6a8d-41b4-8c12-5f1536ecfec0
## 006c0661-e140-4318-b299-065af8eb1296
## 008f0b80-7f82-40ed-8d4c-e61c5d28dce5
## 008f9442-b53d-49ab-b8e9-5bee228b3525
## 00acb86b-b6b6-41b3-8e2c-dcb5fd8ac00f
## ...
##
## Base composition:
## a c g t
## 0.247 0.197 0.268 0.288
## (Total: 12.7 kb)
Now we can align the sequences using MAFFT. The ips
(Interfaces to Phylogenetic Software in R) package provides an interface to the MAFFT software.
## 100 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 131
##
## Labels:
## 0051d367-7cbd-4b8e-9e2e-10c5ad2331a7
## 005a9b4a-6a8d-41b4-8c12-5f1536ecfec0
## 006c0661-e140-4318-b299-065af8eb1296
## 008f0b80-7f82-40ed-8d4c-e61c5d28dce5
## 008f9442-b53d-49ab-b8e9-5bee228b3525
## 00acb86b-b6b6-41b3-8e2c-dcb5fd8ac00f
## ...
##
## Base composition:
## a c g t
## 0.247 0.197 0.268 0.288
## (Total: 13.1 kb)
Let’s visualize the aligned sequences:
ggmsa::ggmsa(aligned %>% as.list() %>% setNames(substr(names(sequences), 1, 5)), color = "Chemistry_NT", seq_name = TRUE) + theme(text = element_text(size = 5))
Finally, we can calculate a distance matrix and construct a tree using dist.dna
and njs
.
distances <- ape::dist.dna(aligned, model = "raw")
tree <- ape::njs(distances) %>%
left_join(dna_subset %>% select(id, year = date_year, species), by = c("label" = "id"))
ggtree(tree, layout = "circular", cex = 0.3) +
geom_tiplab(size = 2, aes(label = species, color = year)) +
scale_color_viridis_c(option = "magma", end = 0.9)
Using the mapper
DNADerivedData extension records can also be included in data downloads from the mapper. To do so, tick the DNADerivedData checkbox in the download confirmation form.
The resulting archive will include CSV files for the occurrences as well as for the selected extensions. The occurrence and extension tables can be joined using the id
column.