Darwin Core

Darwin Core is a TDWG standard, which is based on the ideas of the popular terms from the Dublin Core Metadata Initiative. A fundamental principle of the Darwin Core as a library of terms is to keep the definition of terms distinct from the technology used to share them, e.g. XML or RDF.

IPT and Darwin Core

The IPT has core biodiversity data types built-in, which are based on Darwin Core (DwC) terms. The dataset types are Occurrence, Checklist, and Metadata records, each of which has a fixed set of terms to describe it.

Darwin Core History

Until the ratification of Darwin Core as a standard it was used to describe primary species occurrence data, in particular through DiGIR and XML encoding. When we were looking for a very simple checklist data exchange format and with the rise of tagging of species on Flickr, it became apparent that simple terms for biodiversity in the tradition of Dublin Core would be very useful - and indeed very much overlapping with the Darwin Core terms in use already.

Terms

All Darwin Core terms are defined in Darwin Core Terms: a quick reference guide.

A single DwC term, in IPT often called a property, can be used once for each record. Generally it is free text, but the definition often recommends certain formats or vocabularies to use, e.g. the ISO 2 letter country codes for the dwc:countryCode term.

Patterns

ID terms

DwC provides many of terms for identifiers. Some can be used to define a record (such as occurrenceID for an Occurrence record; taxonID for a Taxon record), while others (such as higherGeographyID) refer to an identifier for information stored outside the record. For example namePublishedInID is used to refer to an identifier (perhaps a DOI or other resolvable identifier) for the publication in which a scientificName was originally established. Note that taxonID used within an occurrence dataset would function as a pointer to a taxon defined somewhere else, such as in a checklist dataset, while taxonID within a Taxon record would act as the identifier for that record.

Most ID terms have a corresponding full text term, e.g. acceptedNameUsageID and acceptedNameUsage. These serve two purposes:

  1. In the absence of an identifier they can be used to refer to another record, in this case the accepted/valid taxon.

  2. They provide a human readable context that persists even if the identifier cannot be resolved

It therefore makes sense to provide both if possible.

Denormalized Hierarchies

The geography and taxonomy can be expressed as a flexible hierarchy of places or taxa through the terms higherParentNameUsage(ID) and higherGeography(ID). In addition to this adjacency list , the most popular ranks can be published as a denormalized hierarchy for each record, effectively repeating this information across many records. But it does provide a quick, short and human readable classification for each record in isolation of the entire dataset.

  • Taxonomic denormalized classification: kingdom, phylum, class, order, family, genus, subgenus

  • Geographic denormalized classification: continent, waterBody, islandGroup, island, country / countryCode, stateProvince, county, municipality

As with full text ID terms above this introduces the possibility of data integrity problems, as the ID term might resolve into something different than the denormalized hierarchy. In this case the IPT follows the recommendation of the following precedence of terms for resolving the hierarchy:

ID term >> Text term >> Denormalized term
higherTaxonID >> higherTaxon >> kingdom,family,...

Verbatim terms

Quite a few terms have a corresponding verbatim term. This is to cater the publication of the exact verbatim transcription of certain attributes as they were found in the underlying specimen label, observation fieldbook or literature. This way the verbatimEventDate can be used to publish the exact transcription of the collecting date, while eventDate can be encoded in a standard ISO date time representation.

Primary data

All DwC terms can be used to describe an occurrence record. It is recommended to publish at least the following terms. Terms flagged with !!! have to be present to be recognized by the current GBIF indexing:

Example

occurrenceID=96db9d09-596d-409c-8626-f4460078d0eb
institutionCode=BGBM
collectionCode=B
basisOfRecord=preservedspecimen
catalogNumber=1159
eventDate=1999-08-06 00:00:00.0
collector=Markus Döring
continent=Asia
country=TR
stateProvince=Adana
locality=Aladaglari, lower Narpiz Deresi, next to fountain, 2900m
minimumElevationInMeters=2900
decimalLatitude=37.82800
decimalLongitude=35.13600
geodeticDatum=WGS84
identifiedBy=Markus Döring
scientificName=Festuca anatolica subsp. anatolica
kingdom=Plantae
phylum=Magnoliophyta
class=
order=Cyperales
family=Poaceae
genus=Festuca
specificEpithet=anatolica
infraspecificEpithet=anatolica

Checklists

Checklists are confined to ± the taxonomic subset of all Darwin Core terms.

The Darwin Core Archive

Darwin Core Archives (DwC-A) are the new, primary means of publishing data to the GBIF network. They contain an entire dataset, are based on simple text files and can be created fairly easily without the IPT with custom software.

Darwin Core Extensions

Recognizing that DwC only covers the core biodiversity metadata, extensions to Darwin core are a common need across all communities. The simplest way to do so is to create new terms in a new namespace and simply extend a regular dwc record with these terms.

Often multiple subrecords for an extension is desired, such as many common names for a species or multiple images for a specimen. In order to share these richer, related records the star scheme is used, whereby an extension consists of multiple records, each linked to a core dwc record. Any number of extension records potentially from different extensions (e.g. images & identification) for a single core record is possible.

The Archive Format

The Darwin Core Archive format provides a means to publish dwc records plus extensions in a relatively simple, text-based format. A Darwin Core Archive consists of a set of text files that are bundled into a common package and then zipped into a single archive file. The format follows the Darwin Core text guidelines. A typical package is illustrated in the diagram below and consists of components described in details here.

dwca