HBase
This documentation is only visible to GBIF staff.
We use Apache HBase for mapping from occurrence identifiers to the GBIF occurrence key (key lookup), storing the next available GBIF occurrence key (counter), occurrence fragments, the Thumbor image cache and several interpretation caches.
The identifier-to-key (lookup) table is of critical importance, as with this and the original crawl data we can reinterpret and rebuild the occurrence store.
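Conceptually, the lookup table is a map from occurrence identifiers to GBIF keys, and the counter table holds the next key to assign. This Python sketch is purely illustrative (the class and identifiers are invented, not GBIF code) and shows the roles the two tables play:

```python
# Illustrative sketch of the lookup/counter roles; not the real GBIF schema.
# In production both structures live in HBase tables.

class OccurrenceKeyStore:
    def __init__(self):
        self.lookup = {}   # occurrence identifier -> GBIF occurrence key
        self.counter = 0   # next available GBIF occurrence key

    def key_for(self, identifier):
        """Return the existing key for an identifier, or assign the next one."""
        if identifier not in self.lookup:
            self.counter += 1
            self.lookup[identifier] = self.counter
        return self.lookup[identifier]

store = OccurrenceKeyStore()
k1 = store.key_for("dataset-a:occ-1")
k2 = store.key_for("dataset-a:occ-1")  # same identifier resolves to the same key
```

Because the mapping is stable, the occurrence store can be rebuilt by replaying the original crawl data through this lookup.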
HBase backups
Only the lookup and counter tables are backed up: prod_h_occurrence_lookup and prod_h_occurrence_counter.
A cronjob on hdfsbackup@c5gateway-vh.gbif.org (to be replaced with root@backups.gbif.org) uses rclone to mirror the HBase data directory from HDFS to NFS storage. The NFS storage is backed up using Borg to KU's system and to the disaster recovery server in the ZM garage.
00 * * * * rclone sync --exclude '.tmp/**' C5:/hbase/data/default/prod_h_occurrence_counter/ /mnt/auto/cluster_export/ingest/prod_h_occurrence_counter/
01 * * * * rclone sync --exclude '.tmp/**' C5:/hbase/data/default/prod_h_occurrence_lookup/ /mnt/auto/cluster_export/ingest/prod_h_occurrence_lookup/
Restoring from a backup
- Copy the data to HDFS on the cluster using any appropriate method:
rclone sync --progress /mnt/auto/cluster_export/ingest/prod_h_occurrence_lookup/ c3:/tmp/lookup-snapshot/
- Delete the unnecessary files from HDFS and rearrange so the HFiles are all in a subdirectory named o, the column family:

  sudo -u hdfs hdfs dfs -rm -r /tmp/lookup-snapshot/*/recovered.edits /tmp/lookup-snapshot/*/.regioninfo
  sudo -u hdfs hdfs dfs -mkdir -p /tmp/lookup-snapshot/o
  sudo -u hdfs hdfs dfs -mv /tmp/lookup-snapshot/*/o/* /tmp/lookup-snapshot/o
  sudo -u hdfs hdfs dfs -rmdir /tmp/lookup-snapshot/*/o
- Create a new table in HBase:

  create 'prod_X_occurrence_lookup',
    {NAME => 'o', VERSIONS => 1, COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW'},
    {SPLITS => [ '01','02','03','04','05','06','07','08','09','10',
                 '11','12','13','14','15','16','17','18','19','20',
                 '21','22','23','24','25','26','27','28','29','30',
                 '31','32','33','34','35','36','37','38','39','40',
                 '41','42','43','44','45','46','47','48','49','50',
                 '51','52','53','54','55','56','57','58','59','60',
                 '61','62','63','64','65','66','67','68','69','70',
                 '71','72','73','74','75','76','77','78','79','80',
                 '81','82','83','84','85','86','87','88','89','90',
                 '91','92','93','94','95','96','97','98','99' ]}

  create 'prod_X_occurrence_counter', {NAME => 'o', BLOOMFILTER => 'ROW', COMPRESSION => 'SNAPPY'}
- Load the HFiles into HBase:

  sudo -u hdfs hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no hdfs://ha-nn/tmp/lookup-snapshot prod_X_occurrence_lookup
- Check the count is reasonable:

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter -Dhbase.client.scanner.caching=100 -Dmapreduce.map.speculative=false 'prod_X_occurrence_lookup'
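The 99 two-digit split points in the lookup-table create command above can be generated rather than typed out; a small Python sketch (the formatting matches the HBase shell syntax used in this runbook):

```python
# Generate the split points '01'..'99' for the lookup table's pre-split
# regions, formatted as the quoted, comma-separated list the HBase shell
# `create` command expects.
splits = [f"{i:02d}" for i in range(1, 100)]
print(", ".join(f"'{s}'" for s in splits))
```

Pasting the printed list into the SPLITS array avoids transcription errors when recreating the table.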