A New Backbone Taxonomy

This documentation is only visible to GBIF staff.

Producing a new names backbone takes several weeks and involves staff from informatics, data products and any others interested in reviewing the proposed backbone.

Reprocessing datasets requires suspending crawling and ingestion, and is therefore disruptive to data publishers.

This procedure builds a new backbone on backbonebuild-vh and exposes the required webservices from there directly to run occurrence processing. We skip reviewing on UAT but instead rely on the taxonomy review tool.

Building a backbone

We use backbonebuild-vh with its local PostgreSQL database to build a new backbone. We also run the matching and species API from there so we don’t need to copy the database around to a different environment. The configs for backbonebuild-vh are located in the nub CLI environment folder.

1. Copy ChecklistBank database from prod

  • If this is a release (or release candidate) backbone build, stop production CLB CLIs.

    1. Reconfigure the crawl scheduler (maybe directly on the VM) to exclude CHECKLIST type datasets from scheduled crawling, so the queue doesn’t immediately fill up with them. Restart it.

    2. Set a notification in Contentful that checklist crawling is paused. It is easiest to change an old notification. Set the "End" time a month ahead so it doesn’t disappear if this procedure takes longer than expected.

    3. Wait for any running CLB crawls to finish, then stop the CLB CLIs. Use -d '10 days' or similar to avoid Nagios alerts.

  • Dump the prod ChecklistBank DB and import it into PostgreSQL on backbonebuild-vh as the clb database (a sketch follows this list)

    • Restart backbonebuild-vh, as strange memory issues have led to it becoming slow over time.

    • Import the database

    • Run ANALYZE in psql after restoring the DB.
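
A minimal sketch of the dump and restore, assuming the dump is taken with pg_dump and restored with pg_restore; the prod DB host, user and database name used here are assumptions and the real invocation may differ:

    # Hedged sketch: prod host, user and database name are assumptions.
    pg_dump -h pg1.gbif.org -U clb -Fc -f clb.dump prod_checklistbank
    # On backbonebuild-vh, restore into a fresh clb database, then analyze it:
    dropdb --if-exists clb && createdb clb
    pg_restore -j 4 -d clb clb.dump
    psql -d clb -c 'ANALYZE;'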

All the following work is done as the crap user on backbonebuild-vh, mostly in the bin directory:

2. Build Neo4J backbone

3. Rebuild NUB lookup index & mapdb

  • ./stop-nub-ws

  • rm -Rf ~/nubidx

  • rm ~/nublookupDB

  • ./start-nub-ws.sh

  • Wait until the logs indicate the index build has finished (~1 hour).

This exposes the nub matching service on port 9000: http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Abies
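
Once the index build has finished, a quick smoke test of the service can be run with the URL above:

    curl -s 'http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Abies'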

Testing a backbone

Ask Tim to rebuild the taxonomy review tool files with different cutoffs, usually 10,000 for the global test file and lower ones for specific groups of high interest, e.g. 100 for Fabaceae or 1,000 for Fungi. Alternatively, follow the instructions at https://gist.github.com/timrobertson100/7131122a3723b593c344c43a3bd27777 to build them using Spark.

Engage several people in reviewing the backbone candidate, including Joe, Thomas, Andrea, Markus, Tim, Cecilie, Marie, John & Tobias.

Reprocessing production data

Reprocessing checklists

With a new backbone, all checklists must be rematched. The CoL dataset must come first, as the metrics for every other dataset make use of it.

The IUCN checklist is queried to assign a Red List status to occurrences, so occurrence processing must not start before it is rematched.

If they are not already running, start the matcher and analysis CLIs on backbonebuild-vh:

  • ./start-clb-matcher.sh

  • ./start-clb-analysis.sh

  • ./clb-admin MATCH --col (rematch CoL first so that subsequent dataset analysis contains the right CoL coverage percentage)

  • ./clb-admin MATCH --iucn

  • Then rematch all the rest: ./clb-admin REMATCH. This takes 10-20 hours to complete.

Reprocessing occurrences

Processing uses checklistbank-nub-ws on backbonebuild-vh, via the KVS cache. Other use of checklistbank-nub-ws, such as the public API and website, continues to use prodws-vh.

  • Ensure release version of checklistbank-nub-ws is deployed and running on backbonebuild-vh.

  • Activate the new-backbone Varnish configuration by setting newBackboneInterpretation: True and running a deploy-prod job with only "Deploy Varnish Configuration" set. This directs match requests from the machines used for reinterpretation to the backbonebuild-vh webservice. The affected machines are listed in the Varnish configuration in c-deploy.

  • Ensure any occurrence ingestion has completed, then stop the crawler CLIs.

    • ./stop-crawler -d '2 days' on prodcrawler1-vh, or manually stop each crawler process if necessary to allow a very large dataset to complete.

    • Update the Contentful notification to say occurrence processing is suspended.

  • In the HBase shell, run truncate_preserve 'name_usage_kv', or (if preferred) create a new name_usage_YYYYMMDD_kv table and update the configurations to use it:

    create 'name_usage_20210225_kv', {NAME => 'v', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'},{SPLITS => ['1','2','3','4','5','6','7','8']}
  • Use the Pipelines data reinterpretation workflows in Jenkins (sources) to run the reinterpretation, on the appropriate environment, choosing a new index if required.

    • In UAT, we would typically reinterpret in place.

    • In production, we increment the index letter and build a new index, then swap over once it is completed.

    • In either case, it’s only necessary to run Taxonomy reinterpretation.

As shown in the Jenkins pipeline, it's necessary to check that the new index contains the same number of records as the live index. Once this is complete, use the job to build the Hive tables.
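
A hedged sketch of that count check; the ElasticSearch host, the live alias and the new index name are assumptions and should be replaced with the real values for the run:

    # Compare document counts between the live alias and the newly built index.
    curl -s 'http://es1.gbif.org:9200/occurrence/_count'
    curl -s 'http://es1.gbif.org:9200/occurrence_b/_count'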

Backfill Occurrence maps

Normally nothing is required — the scheduler can run its course.

If the Hive database is no longer prod_h, the configuration will need to be updated.

Building the new ElasticSearch CLB index

We build a new ES index for prod using Oozie and the backbonebuild database, but do not yet change the production alias; that is done when we deploy the services. Verify the prod indexer configs are correct and use a different alias from the current live one: https://github.com/gbif/gbif-configuration/blob/master/checklistbank-index-builder/prod.properties

  • Install the workflow from the gateway:

    ssh c5gateway.gbif.org
    cd /home/mdoering/checklistbank/checklistbank-workflows
    git pull
    ./install-workflow.sh prod token
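
To confirm the workflow was submitted and is running, check the Oozie web UI or the Oozie CLI; a minimal sketch, where the Oozie server URL is an assumption:

    # List recent Oozie jobs; point -oozie at the prod Oozie server.
    oozie jobs -oozie http://c5oozie.gbif.org:11000/oozie -len 10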

Checks before making the new backbone live

TODO: What checks?

Going live

Prepare ChecklistBank

TODO: Suggest incrementing prod_checklistbank_2 → prod_checklistbank_3 etc, to avoid any possibility of misconfiguration. Or even prod_checklistbank_2023_01 etc.

Deploy CLB WS

  • Deactivate the new-backbone Varnish configuration by setting newBackboneInterpretation: False

  • Deploy checklistbank-nub-ws to production

  • swap the NUB index within checklistbank-nub-ws:

    systemctl stop checklistbank-nub-ws
    rm -Rf /usr/local/gbif/services/checklistbank-nub-ws/nubidx
    mv /usr/local/gbif/services/checklistbank-nub-ws/nubidxNEW /usr/local/gbif/services/checklistbank-nub-ws/nubidx
    systemctl start checklistbank-nub-ws
  • Deploy checklistbank-ws to production

  • Point the alias to the new Solr collection (TODO: update for ES):

    systemctl stop checklistbank-ws
    curl -s "http://c5n1.gbif.org:8983/solr/admin/collections?action=CREATEALIAS&name=prod_checklistbank&collections=prod_checklistbank_2017_02_22"
    systemctl start checklistbank-ws
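
After the deploys and the alias switch, a quick check that the production species API answers from the new backbone; the queries here are only examples:

    # Match a name against the new backbone via the production API:
    curl -s 'https://api.gbif.org/v1/species/match?verbose=true&name=Abies'
    # Confirm species search (served from the new index) responds:
    curl -s 'https://api.gbif.org/v1/species/search?q=Abies'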

Switch to the new occurrence index and downloads ElasticSearch

Run the 4-pipelines-go-live job.

Resuming crawling

TODO — Markus to verify please

  • This is not urgent, and once crawling is resumed it is difficult to roll back. Verify all looks good on the website, and allow other staff time to check their favourite taxa.

  • Remove the old ES index with the 5-pipelines-clean-previous-attempt job.

  • Verify CLI configurations

  • Restart crawling (see the sketch after this list)

  • Remove the Contentful notification.
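
A hedged sketch of the restart referenced above; the script name is an assumption, by analogy with ./stop-crawler, and the real start procedure may differ:

    # On prodcrawler1-vh, assumed counterpart of ./stop-crawler:
    ./start-crawler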

Exporting the new backbone

Update prod backbone metadata

This updates the prod registry! Only do this once the new backbone is live.

This needs to be done before the DwC-A export though, as that export should include the updated metadata.

  • ./clb-admin UPDATE_NUB_DATASET

Export backbone DwC-A

Export backbone CSV