A New Backbone Taxonomy
This documentation is only visible to GBIF staff.
Producing a new names backbone takes several weeks and involves staff from informatics, data products and any others interested in reviewing the proposed backbone.
Reprocessing datasets requires suspending crawling and ingestion, and is therefore disruptive to data publishers.
This procedure builds a new backbone on backbonebuild-vh and exposes the required webservices from there directly to run occurrence processing. We skip reviewing on UAT and instead rely on the taxonomy review tool.
Building a backbone
We use backbonebuild-vh with its local PostgreSQL database to build a new backbone. We also run the matching and species API from there, so we don’t need to copy the database around to a different environment. The configs for backbonebuild-vh are located in the nub CLI environment folder.
1. Copy ChecklistBank database from prod
- If this is a release (or release candidate) backbone build, stop the production CLB CLIs.
- Reconfigure the crawl scheduler (maybe directly on the VM) to exclude CHECKLIST type datasets from scheduled crawling, so the queue doesn’t immediately fill up with them. Restart it.
- Set a notification in Contentful that checklist crawling is paused. It is easiest to change an old notification. Set the "End" time a month ahead so it doesn’t disappear if this procedure takes longer than expected.
- Wait for any running CLB crawls to finish, then stop the CLB CLIs. Use -d '10 days' or similar to avoid Nagios alerts.
- Dump the prod ChecklistBank DB and import it into PostgreSQL on backbonebuild-vh as DB clb (a command sketch follows this list):
  - Restart backbonebuild-vh, as strange memory issues have led to it becoming slow over time.
  - Import the database.
  - Run ANALYZE in psql after restoring the DB.
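A minimal sketch of the dump and restore, assuming the prod ChecklistBank DB lives on pg1.gbif.org in a database named prod_checklistbank and that the same backup directory is used as in the go-live steps below; the dump filename date is a placeholder and all names and paths should be verified before running:

  # on pg1.gbif.org: custom-format, lightly compressed dump of the prod ChecklistBank DB
  sudo -u postgres pg_dump -Fc -Z1 prod_checklistbank > /var/lib/pgsql/11/backups/clb-prod-YYYY-MM-DD.dump
  # on backbonebuild-vh: fetch the dump, restore it into the clb database and analyze
  scp pg1.gbif.org:/var/lib/pgsql/11/backups/clb-prod-YYYY-MM-DD.dump /var/lib/pgsql/11/backups/
  sudo -u postgres pg_restore --clean --dbname clb --jobs 8 /var/lib/pgsql/11/backups/clb-prod-YYYY-MM-DD.dump
  sudo -u postgres psql clb -c 'ANALYZE'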
All the following work is done as the crap user on backbonebuild-vh, mostly in the bin directory:
2. Build Neo4J backbone
- Review configs at https://github.com/gbif/gbif-configuration/blob/master/cli/nub/config/
- Run the Neo4J NUB build via ./clb-buildnub.sh
- ./start-clb-importer to automatically insert the Neo4J nub into postgres once the build is done
- ./start-clb-analysis
- ./stop-clb once completed
- Archive nub.log in a new directory in https://hosted-datasets.gbif.org/datasets/backbone/ (see the sketch after this list)
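A sketch of the archiving step, assuming nub.log sits in the crap user's home directory and that the hosted-datasets area can be written over SSH; the target host and path are placeholders to replace with wherever https://hosted-datasets.gbif.org/datasets/backbone/ is actually served from:

  # create a dated directory for this build and copy the build log into it
  # (HOSTED_DATASETS_HOST and the path are placeholders)
  DATE=$(date +%Y-%m-%d)
  ssh HOSTED_DATASETS_HOST "mkdir -p /path/to/datasets/backbone/${DATE}"
  scp ~/nub.log HOSTED_DATASETS_HOST:/path/to/datasets/backbone/${DATE}/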
3. Rebuild NUB lookup index & mapdb
- ./stop-nub-ws
- rm -Rf ~/nubidx
- rm ~/nublookupDB
- ./start-nub-ws.sh
- Wait until the logs indicate the index build has finished (~1h).
This exposes the nub matching service on port 9000: http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Abies
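A quick sanity check against the freshly built index, using the match URL above:

  # should return a backbone match for the genus Abies
  curl -Ss "http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Abies"

Pipe the output through jq . or python -m json.tool if you want pretty-printed JSON.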
Testing a backbone
Ask Tim to rebuild the taxonomy review tool files with different cutoffs, usually 10,000 for the global test file and lower ones for specific groups of high interest, e.g. 100 for Fabaceae or 1,000 for Fungi. Or follow the instructions at https://gist.github.com/timrobertson100/7131122a3723b593c344c43a3bd27777 on how to build them using Spark.
Engage several people in reviewing the backbone candidate, including Joe, Thomas, Andrea, Markus, Tim, Cecilie, Marie, John & Tobias.
Reprocessing production data
Reprocessing checklists
With a new backbone, all checklists must be rematched. The CoL dataset must come first, as the metrics make use of it for each other dataset.
The IUCN checklist is queried to assign a Red List status to occurrences, so occurrence processing must not start before it is rematched.
If not already running, start the matcher CLI on backbonebuild-vh (a quick process check is sketched after this list):
- ./start-clb-matcher.sh
- ./start-clb-analysis.sh
- ./clb-admin MATCH --col to rematch CoL first, so subsequent dataset analysis contains the right CoL percentage coverage
- ./clb-admin MATCH --iucn
- Then rematch all the rest with ./clb-admin REMATCH; this takes 10-20h to complete.
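A quick way to confirm the matcher and analysis CLIs are actually running before queuing the match messages; this is just a generic process check, and the exact command lines to grep for depend on how the start scripts launch the CLIs:

  # look for running CLB matcher/analysis processes owned by the crap user
  ps -fu crap | grep -E 'clb-(matcher|analysis)' | grep -v grep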
Reprocessing occurrences
Processing uses checklistbank-nub-ws on backbonebuild-vh, via the KVS cache. Other use of checklistbank-nub-ws, such as the public API and website, continues to use prodws-vh.
- Ensure the release version of checklistbank-nub-ws is deployed and running on backbonebuild-vh.
- Activate the new-backbone Varnish configuration by setting newBackboneInterpretation: True and running a deploy-prod job with only "Deploy Varnish Configuration" set. This directs match requests from the machines used to do reinterpretation to the backbonebuild-vh webservice. The machines affected are listed in the Varnish configuration in c-deploy.
- Ensure any occurrence ingestion has completed, then stop the crawler CLIs: ./stop-crawler -d '2 days' on prodcrawler1-vh, or manually stop each crawler process if necessary to allow a very large dataset to complete.
- Update the Contentful notification to say occurrence processing is suspended.
- truncate_preserve 'name_usage_kv', or (if preferred) create a new name_usage_YYYYMMDD_kv table in HBase and update configurations to use it:
  create 'name_usage_20210225_kv', {NAME => 'v', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}, {SPLITS => ['1','2','3','4','5','6','7','8']}
- Use the Pipelines data reinterpretation workflows in Jenkins (sources) to run the reinterpretation on the appropriate environment, choosing a new index if required.
  - In UAT, we would typically reinterpret in place.
  - In production, we increment the index letter and build a new index, then swap over once it is completed.
  - In either case, it is only necessary to run Taxonomy reinterpretation.
- As shown in the Jenkins pipeline, it is necessary to check that the new index contains the same number of records as the live index (a count sketch follows this list). Once this is complete, use the job to build the Hive tables.
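One way to compare record counts, assuming direct access to the production Elasticsearch cluster; the host and index names below are placeholders for the actual occurrence indices:

  # document counts for all occurrence indices, including the newly built one
  curl -Ss "http://ES_HOST:9200/_cat/indices/occurrence*?v&h=index,docs.count"
  # or compare the live and new indices directly
  curl -Ss "http://ES_HOST:9200/OLD_OCCURRENCE_INDEX/_count"
  curl -Ss "http://ES_HOST:9200/NEW_OCCURRENCE_INDEX/_count"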
Backfill Occurrence maps
Normally nothing is required — the scheduler can run its course.
If the Hive database is no longer prod_h, the configuration will need to be updated.
Building the new ElasticSearch CLB index
We build a new ES index for prod using Oozie and the backbonebuild database, but do not yet change the production alias; that is done when we deploy the services. Verify the prod indexer configs are correct and use a different alias from the current live one: https://github.com/gbif/gbif-configuration/blob/master/checklistbank-index-builder/prod.properties
- ssh c5gateway.gbif.org
  cd /home/mdoering/checklistbank/checklistbank-workflows
  git pull
  ./install-workflow.sh prod token
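Before kicking off the build, it can help to confirm which physical index the live alias currently points to, assuming the CLB Elasticsearch cluster is reachable; the host below is a placeholder:

  # list the indices behind the current prod_checklistbank alias
  curl -Ss "http://CLB_ES_HOST:9200/_cat/aliases/prod_checklistbank*?v"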
Going live
Prepare ChecklistBank
TODO: Suggest incrementing prod_checklistbank_2 → prod_checklistbank_3 etc, to avoid any possibility of misconfiguration. Or even prod_checklistbank_2023_01 etc.
- On backbonebuild-vh:
  sudo -u postgres pg_dump -U postgres -Fc -Z1 clb > /var/lib/pgsql/11/backups/clb-2021-03-03.dump
- Import the dump into prod:
  - scp the file to pg1.gbif.org
  - sudo -u postgres pg_restore --clean --dbname prod_checklistbank2 --jobs 8 /var/lib/pgsql/11/backups/clb-2021-03-08.dump
  - sudo -u postgres psql -U clb prod_checklistbank -c 'VACUUM ANALYZE'
- Copy the NUB index from backbonebuild-vh:/home/crap/nubidx to ws.gbif.org:/usr/local/gbif/services/checklistbank-nub-ws/nubidxNEW (a copy sketch follows this list)
- Copy the nublookupDB index from backbonebuild-vh:/home/crap/nublookupDB to ws.gbif.org:/usr/local/gbif/services/checklistbank-nub-ws/nublookupDBNEW
- Update webservice configs
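A sketch of the index copy, assuming SSH access from backbonebuild-vh to the webservice host and that rsync is available; the destination host is taken from the paths above, and the user performing the copy needs write access to the service directory:

  # copy the freshly built NUB matching index and lookup mapdb under *NEW names;
  # they are swapped into place during the deploy below
  rsync -av /home/crap/nubidx/ ws.gbif.org:/usr/local/gbif/services/checklistbank-nub-ws/nubidxNEW/
  rsync -av /home/crap/nublookupDB ws.gbif.org:/usr/local/gbif/services/checklistbank-nub-ws/nublookupDBNEW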
Deploy CLB WS
- Deactivate the new-backbone Varnish configuration by setting newBackboneInterpretation: False
- Deploy checklistbank-nub-ws to production
- Swap the NUB index within checklistbank-nub-ws:
  systemctl stop checklistbank-nub-ws
  rm -Rf /usr/local/gbif/services/checklistbank-nub-ws/nubidx
  mv /usr/local/gbif/services/checklistbank-nub-ws/nubidxNEW /usr/local/gbif/services/checklistbank-nub-ws/nubidx
  systemctl start checklistbank-nub-ws
- Prod deploy of checklistbank-ws
- Alias to the new solr collection (TODO update for ES):
  systemctl stop checklistbank-ws
  curl -s "http://c5n1.gbif.org:8983/solr/admin/collections?action=CREATEALIAS&name=prod_checklistbank&collections=prod_checklistbank_2017_02_22"
  systemctl start checklistbank-ws
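If the CLB search index is now on Elasticsearch (see the TODO above), the equivalent step would be an atomic alias swap via the ES aliases API; the host, alias and index names below are assumptions to check against the indexer configs:

  # point the prod_checklistbank alias at the newly built index and drop the old one
  curl -Ss -XPOST "http://CLB_ES_HOST:9200/_aliases" -H 'Content-Type: application/json' -d '{
    "actions": [
      { "remove": { "index": "OLD_CLB_INDEX", "alias": "prod_checklistbank" } },
      { "add":    { "index": "NEW_CLB_INDEX", "alias": "prod_checklistbank" } }
    ]
  }'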
Switch to the new occurrence index, downloads ElasticSearch
Run the 4-pipelines-go-live job.
Resuming crawling
TODO — Markus to verify please
- This is not urgent, and once crawling is resumed it is difficult to roll back. Verify all looks good on the website, and allow other staff time to check their favourite taxa.
- Remove the old ES index with the 5-pipelines-clean-previous-attempt job.
- Verify CLI configurations
- Restart crawling
- Remove the Contentful notification.
Exporting the new backbone
Update prod backbone metadata
This updates the prod registry! Only do this once the new backbone is live.
This needs to be done before the DwC-A export though, as the export should include the updated metadata.
- ./clb-admin UPDATE_NUB_DATASET
Export backbone DwC-A
- Import from the backbonebuild-vh clb DB dump
- Export the NUB to DwC-A: ./clb-admin EXPORT --nub
- Move it to https://hosted-datasets.gbif.org/datasets/backbone/yyyy-mm-dd
Export backbone CSV
- Export the CSV from postgres: \copy (select * from v_backbone) to 'simple.txt'
- Gzip and move to https://hosted-datasets.gbif.org/datasets/backbone/yyyy-mm-dd (a sketch follows this list)
- Explain the changes in a document at backbone-builds for use in the release notes.
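A minimal sketch of the export and compression, assuming it runs on the prod database host against prod_checklistbank and from a directory the invoking user can write to; adjust the database name, connection options and paths as needed:

  # export the simple backbone view to a tab-delimited file and compress it
  psql -U clb prod_checklistbank -c "\copy (select * from v_backbone) to 'simple.txt'"
  gzip simple.txt
  # then upload simple.txt.gz to the dated backbone directory on hosted-datasets.gbif.org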