Data Preparation :: GBIF IPT User Manual

Things to consider

create a local identifier if not existing
create full dwc:scientificName including authorship
create decimal coordinates & precision

Database Source

setup a SQL view to use functions (can also be done in IPT SQL source definition)
- concatenation, splitting of strings: e.g. build full scientific name (watchout autonyms)
- format dates as ISO
- create year/month/day by parsing native SQL date types
use a UNION to merge 2 or more tables, e.g. accepted taxa and synonyms or specimen and observations
select fixed values (prefer to do this in IPT mapping)

Text Files Source

convert to UTF8
use standard CSV (i.e. delimiter=, quotation=") or tab files
make sure you have replaced line breaks, i.e. \r \n or \r\n with either simple spaces or use 2 characters \r to escape the line break if the intention is to preserve them
encode nulls as empty fields, i.e. no characters between 2 delimiters, not \N or \NULL

Utility: Character encoding converter - iconv

Simple tool for Linux and Windows to convert character encodings of files.

Examples:

convert character encodings from Windows-1252 to UTF-8 using iconv

iconv -f CP1252 -t utf-8 example.txt > exampleUTF8.txt

Utility: Unix Stream Editor, SED

A Unix command line tool to manipulate files as streams, thereby allowing to modify very large files without the need to load them into memory first (this is what pretty much all editors apart from few, e.g. vi, do)

http://www.unixguide.net/unix/sedoneliner.shtml
http://www.brunolinux.com/02-The_Terminal/Find_and%20Replace_with_Sed.html
replace in place and create backup copy
```
sed -i.old "s/\\\\N//g" allNames.txt
```
convert DOS newlines (CR/LF) to Unix format:
```
sed 's/.$//'
```