Data Preparation

Things to consider

  • create a local identifier if not existing

  • create full dwc:scientificName including authorship

  • create decimal coordinates & precision
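
A minimal sketch of two of these preparation steps, assuming the name parts and the degree/minute/second values are already held in separate fields. The function and parameter names are hypothetical and not part of any IPT or Darwin Core API. The autonym branch assumes the botanical convention that the authorship follows the species epithet when the infraspecific epithet repeats it.

```python
def full_scientific_name(genus, epithet, authorship="", rank="", infra_epithet=""):
    """Assemble a full scientific name string, including authorship."""
    parts = [genus, epithet]
    if not infra_epithet:
        # Plain species name: the authorship goes at the end.
        parts.append(authorship)
    elif infra_epithet == epithet:
        # Autonym (assumed botanical convention): authorship after the species epithet.
        parts += [authorship, rank, infra_epithet]
    else:
        # Ordinary infraspecific name: authorship after the infraspecific epithet.
        parts += [rank, infra_epithet, authorship]
    return " ".join(p for p in parts if p)


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to decimal degrees (S and W are negative)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


print(full_scientific_name("Poa", "pratensis", "L."))
print(full_scientific_name("Poa", "pratensis", "L.", "subsp.", "pratensis"))
print(dms_to_decimal(52, 30, 0, "N"))
```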

Database Source

  • set up a SQL view to use functions (this can also be done in the IPT SQL source definition)

    • concatenating and splitting strings, e.g. to build the full scientific name (watch out for autonyms)

    • format dates as ISO

    • create year/month/day by parsing native SQL date types

  • use a UNION to merge two or more tables, e.g. accepted taxa and synonyms, or specimens and observations

  • select fixed values (prefer to do this in IPT mapping)
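
The view-based steps above can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for the real source database, and the table and column names (specimen, observation, collected, observed) are hypothetical. SQLite stores dates as text, so date() and strftime() are used here; other databases have their own date functions for native date types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE specimen    (id INTEGER, genus TEXT, epithet TEXT, author TEXT, collected TEXT);
CREATE TABLE observation (id INTEGER, genus TEXT, epithet TEXT, author TEXT, observed TEXT);
INSERT INTO specimen    VALUES (1, 'Abies', 'alba', 'Mill.', '1998-07-04 10:30:00');
INSERT INTO observation VALUES (1, 'Parus', 'major', 'Linnaeus, 1758', '2011-03-15 08:00:00');

-- One view that merges both tables and exposes Darwin Core style column names.
CREATE VIEW dwc_occurrence AS
SELECT 'spec-' || id                               AS occurrenceID,    -- local identifier
       genus || ' ' || epithet || ' ' || author    AS scientificName,  -- concatenation
       date(collected)                             AS eventDate,       -- ISO date
       CAST(strftime('%Y', collected) AS INTEGER)  AS year,
       CAST(strftime('%m', collected) AS INTEGER)  AS month,
       CAST(strftime('%d', collected) AS INTEGER)  AS day,
       'PreservedSpecimen'                         AS basisOfRecord    -- fixed value
FROM specimen
UNION ALL
SELECT 'obs-' || id,
       genus || ' ' || epithet || ' ' || author,
       date(observed),
       CAST(strftime('%Y', observed) AS INTEGER),
       CAST(strftime('%m', observed) AS INTEGER),
       CAST(strftime('%d', observed) AS INTEGER),
       'HumanObservation'
FROM observation;
""")

rows = conn.execute("SELECT * FROM dwc_occurrence ORDER BY occurrenceID").fetchall()
for row in rows:
    print(row)
```

The fixed basisOfRecord value is selected in the view only to keep the sketch complete; as noted above, setting such values in the IPT mapping is usually preferable.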

Text Files Source

  • convert to UTF-8

  • use standard CSV (i.e. comma as delimiter and " as quotation character) or tab-delimited files

  • make sure line breaks inside field values (\r, \n or \r\n) have been replaced, either with a simple space or, if the intention is to preserve them, with the 2-character escape sequence \r (a backslash followed by r)

  • encode nulls as empty fields, i.e. no characters between two delimiters, not as \N or \NULL
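
A minimal sketch of the last two clean-up steps using Python's csv module. The function names and the keep_breaks switch are hypothetical. It assumes the input is already UTF-8 and standard CSV, and that only \N and \NULL are used as null markers.

```python
import csv
import io

NULL_MARKERS = {r"\N", r"\NULL"}


def clean_field(value, keep_breaks=False):
    """Blank out null markers and replace line breaks inside a field value."""
    if value in NULL_MARKERS:
        return ""
    # Either a plain space, or the two characters backslash + r as an escape.
    replacement = "\\r" if keep_breaks else " "
    return (value.replace("\r\n", replacement)
                 .replace("\r", replacement)
                 .replace("\n", replacement))


def clean_csv(src, dst, keep_breaks=False):
    """Copy CSV rows from src to dst, cleaning every field on the way."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([clean_field(v, keep_breaks) for v in row])


# Usage with in-memory files; real files should be opened with newline="".
source = io.StringIO('id,remarks,depth\r\n1,"line one\r\nline two",\\N\r\n', newline="")
target = io.StringIO()
clean_csv(source, target)
print(target.getvalue())
```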

Utility: Character encoding converter - iconv

A simple tool, available for Linux and Windows, for converting the character encoding of files.

Examples:

  • convert a file from Windows-1252 to UTF-8 using iconv

    • iconv -f CP1252 -t utf-8 example.txt > exampleUTF8.txt
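
Where iconv is not at hand, the same conversion can be approximated in Python. This is a sketch rather than a replacement for iconv: the function name and its defaults are hypothetical, and it assumes the source file really is Windows-1252 (undefined bytes raise an error instead of being silently altered).

```python
import shutil


def convert_encoding(src_path, dst_path, src_enc="cp1252", dst_enc="utf-8"):
    """Re-encode a text file in chunks, without reading it all into memory."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_enc, newline="") as src, \
         open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        shutil.copyfileobj(src, dst)
```

A call such as convert_encoding("example.txt", "exampleUTF8.txt") mirrors the iconv command above.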

Utility: Unix Stream Editor, SED

A Unix command-line tool that processes files as streams. This makes it possible to modify very large files without loading them into memory first, which is what almost all editors do (with a few exceptions, e.g. vi).
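
To illustrate the streaming idea, the sketch below applies a regular-expression substitution line by line, so only one line is held in memory at a time. It is roughly analogous to a sed substitute command, not a reimplementation of sed; the function name is hypothetical, and Python's regular-expression syntax differs from sed's in places.

```python
import re


def stream_substitute(src_path, dst_path, pattern, replacement, encoding="utf-8"):
    """Write src to dst, replacing every match of pattern on each line."""
    regex = re.compile(pattern)
    # newline="" preserves the original line endings.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # iterating a file object reads one line at a time
            dst.write(regex.sub(replacement, line))
```

For example, stream_substitute("data.txt", "clean.txt", r"\\N", "") removes every \N null marker from a hypothetical file data.txt.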