Reading a GBIF download using Hive
This documentation is only visible to GBIF staff.
Sometimes we would like to process an existing, large GBIF download using Hive.
Since the files are zipped, there’s no fast way to achieve this.[1]
How large?
The unzip and zipinfo tools included with CentOS 7 aren't capable of listing such large archives; use Java's jar command instead.
$ jar tvf /mnt/auto/occurrence-download/0015281-230224095556074.zip
1221527907781 Wed Mar 01 07:36:14 UTC 2023 0015281-230224095556074.csv
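The first whitespace-separated field of the jar tvf listing is the uncompressed size in bytes, which is useful later for pv's progress bar. A minimal sketch of extracting it, using the sample line above (awk copes with any leading padding in the listing, unlike cut -d' '):

```shell
# Parse the uncompressed size (bytes) out of a `jar tvf` listing line.
# Sample line copied from the listing above.
line='1221527907781 Wed Mar 01 07:36:14 UTC 2023 0015281-230224095556074.csv'
size=$(echo "$line" | awk '{print $1}')
echo "$size"
```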
Unpack to the local filesystem
This doesn't actually work, due to a bug with libarchive (i.e. bsdtar).
TODO, Matt: Install a compatible streaming decompression tool on the gateway machines.
$ dl=0015281-230224095556074
$ size=$(hdfs dfs -du /occurrence-download/prod-downloads/$dl.zip | cut -d' ' -f 1)
$ mkdir $dl
$ cd $dl
$ hdfs dfs -cat /occurrence-download/prod-downloads/$dl.zip | \
bsdtar -xf - --to-stdout $dl.csv | \
pv -s $size | \ (1)
split --suffix-length=6 --numeric-suffixes --line-bytes=1024M --filter='gzip > $FILE.gz' - $dl-part-
(1) This is the size from above. Skip the whole pv line if you don't want a progress bar, or remove -s $size to see progress without a total.
This streams the file, unzips the data and splits it into 1 GB (uncompressed) chunks. Each chunk is compressed using gzip before being written to the local filesystem. If you won't use Hadoop, faster and smaller compression can be done with --filter='zstd > $FILE.zst'.
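The split --filter mechanism is easy to try locally without HDFS. A self-contained sketch, using dummy input and 3-line chunks instead of 1 GB:

```shell
# Demonstrate split --filter: each chunk is piped through gzip before
# being written, just as in the download pipeline above.
tmp=$(mktemp -d)
seq 1 10 > "$tmp/input.txt"
( cd "$tmp" && split --lines=3 --suffix-length=6 --numeric-suffixes \
    --filter='gzip > $FILE.gz' input.txt part- )
ls "$tmp"
zcat "$tmp/part-000000.gz"
```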
Unpack directly to HDFS
The fastest method is to upload the files without compression and without HDFS replication; this takes about 3 hours for a monthly simple CSV download. The best method overall, however, is the following: stream the file from HDFS, decompress it on the gateway machine, and upload compressed chunks back to HDFS.
The compression (the gzip --fast filter and the .gz suffix) is optional, but saves over TODO% of the disk space on HDFS, and (since reading from disk is much slower than decompressing) makes count queries about 36% faster. However, the initial upload takes about twice as long. See the notes at the end for timings.
Note the -l flag sets the replication factor to 1.
$ dl=0015281-230224095556074
$ size=$(hdfs dfs -du /occurrence-download/prod-downloads/$dl.zip | cut -d' ' -f 1)
$ hdfs dfs -mkdir /user/mblissett/$dl
$ hdfs dfs -cat /occurrence-download/prod-downloads/$dl.zip | \
java -jar /home/mblissett/gbif-common-0.61-SNAPSHOT-jar-with-dependencies.jar $dl.csv | \
pv -s $size | \
split --suffix-length=6 --numeric-suffixes --line-bytes=1024M \
--filter='gzip --fast | hdfs dfs -put -l - /user/mblissett/'$dl'/$FILE.gz; echo Uploaded part /user/mblissett/'$dl'/$FILE.gz' - $dl-part-
For a Darwin Core Archive download, extract and upload occurrence.txt and verbatim.txt separately:
$ dl=0133998-220831081235567
$ occ_size=$(jar -tvf /mnt/auto/occurrence-download/$dl.zip occurrence.txt | awk '{print $1}')
$ verb_size=$(jar -tvf /mnt/auto/occurrence-download/$dl.zip verbatim.txt | awk '{print $1}')
$ hdfs dfs -rm -r /user/mblissett/$dl ## DANGEROUS: removes any previous upload
$ hdfs dfs -mkdir /user/mblissett/$dl /user/mblissett/$dl/occurrence /user/mblissett/$dl/verbatim
$ hdfs dfs -cat /occurrence-download/prod-downloads/$dl.zip | \
java -jar /home/mblissett/gbif-common-0.61-SNAPSHOT-jar-with-dependencies.jar occurrence.txt | \
pv -s $occ_size | \
split --suffix-length=6 --numeric-suffixes --line-bytes=1024M \
--filter='gzip --fast | hdfs dfs -put -l - /user/mblissett/'$dl'/occurrence/$FILE.gz; echo Uploaded part /user/mblissett/'$dl'/occurrence/$FILE.gz' - $dl-occ-
$ hdfs dfs -cat /occurrence-download/prod-downloads/$dl.zip | \
java -jar /home/mblissett/gbif-common-0.61-SNAPSHOT-jar-with-dependencies.jar verbatim.txt | \
pv -s $verb_size | \
split --suffix-length=6 --numeric-suffixes --line-bytes=1024M \
--filter='gzip --fast | hdfs dfs -put -l - /user/mblissett/'$dl'/verbatim/$FILE.gz; echo Uploaded part /user/mblissett/'$dl'/verbatim/$FILE.gz' - $dl-verb-
Create a Hive table
Gzipped or plain-text files can be read as an external table in Hive.
For the column specification, try:
$ curl -Ss https://api.gbif.org/v1/occurrence/download/describe/simpleCsv | \
jq -r '.fields[] | (.name + " STRING,")' | tr '\n' ' '
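The jq filter can be checked against a small hand-written sample of the describe response (a hypothetical two-field sample, purely for illustration; the real endpoint returns the full column list):

```shell
# Hypothetical two-field sample of the describe response, to show
# what the jq filter produces.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"fields":[{"name":"gbifid"},{"name":"datasetkey"}]}
EOF
jq -r '.fields[] | (.name + " STRING,")' "$sample" | tr '\n' ' '
```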
Then create the table in Hive:
CREATE EXTERNAL TABLE matt.dl_0015281_230224095556074 (
gbifid STRING, datasetkey STRING, occurrenceid STRING, kingdom STRING, phylum STRING, class STRING,
order_ STRING, family STRING, genus STRING, species STRING, infraspecificepithet STRING, taxonrank STRING,
scientificname STRING, v_scientificname STRING, v_scientificnameauthorship STRING, countrycode STRING,
locality STRING, stateprovince STRING, occurrencestatus STRING, individualcount STRING, publishingorgkey STRING,
decimallatitude STRING, decimallongitude STRING, coordinateuncertaintyinmeters STRING, coordinateprecision STRING,
elevation STRING, elevationaccuracy STRING, depth STRING, depthaccuracy STRING, eventdate STRING, day STRING,
month STRING, year STRING, taxonkey STRING, specieskey STRING, basisofrecord STRING, institutioncode STRING,
collectioncode STRING, catalognumber STRING, recordnumber STRING, identifiedby STRING, dateidentified STRING,
license STRING, rightsholder STRING, recordedby STRING, typestatus STRING, establishmentmeans STRING,
lastinterpreted STRING, mediatype STRING, issue STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://ha-nn/user/mblissett/0015281-230224095556074'
TBLPROPERTIES ('skip.header.line.count'='1')
;
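Once created, the table can be queried like any other Hive table; for example, a quick sanity check (hypothetical query):

```sql
SELECT basisofrecord, COUNT(*) AS records
FROM matt.dl_0015281_230224095556074
GROUP BY basisofrecord
ORDER BY records DESC;
```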
For a Darwin Core Archive download, the occurrence.txt table has many more columns:
CREATE EXTERNAL TABLE matt.dl_0174958_210914110416597 (
gbifID STRING, abstract STRING, accessRights STRING, accrualMethod STRING, accrualPeriodicity STRING, accrualPolicy STRING,
alternative STRING, audience STRING, available STRING, bibliographicCitation STRING, conformsTo STRING, contributor STRING,
coverage STRING, created STRING, creator STRING, date STRING, dateAccepted STRING, dateCopyrighted STRING, dateSubmitted STRING,
description STRING, educationLevel STRING, extent STRING, format STRING, hasFormat STRING, hasPart STRING, hasVersion STRING,
identifier STRING, instructionalMethod STRING, isFormatOf STRING, isPartOf STRING, isReferencedBy STRING, isReplacedBy STRING,
isRequiredBy STRING, isVersionOf STRING, issued STRING, language STRING, license STRING, mediator STRING, medium STRING,
modified STRING, provenance STRING, publisher STRING, references STRING, relation STRING, replaces STRING, requires STRING,
rights STRING, rightsHolder STRING, source STRING, spatial STRING, subject STRING, tableOfContents STRING, temporal STRING,
title STRING, type STRING, valid STRING, institutionID STRING, collectionID STRING, datasetID STRING, institutionCode STRING,
collectionCode STRING, datasetName STRING, ownerInstitutionCode STRING, basisOfRecord STRING, informationWithheld STRING,
dataGeneralizations STRING, dynamicProperties STRING, occurrenceID STRING, catalogNumber STRING, recordNumber STRING,
recordedBy STRING, recordedByID STRING, individualCount STRING, organismQuantity STRING, organismQuantityType STRING,
sex STRING, lifeStage STRING, reproductiveCondition STRING, behavior STRING, establishmentMeans STRING,
degreeOfEstablishment STRING, pathway STRING, georeferenceVerificationStatus STRING, occurrenceStatus STRING, preparations STRING,
disposition STRING, associatedOccurrences STRING, associatedReferences STRING, associatedSequences STRING, associatedTaxa STRING,
otherCatalogNumbers STRING, occurrenceRemarks STRING, organismID STRING, organismName STRING, organismScope STRING,
associatedOrganisms STRING, previousIdentifications STRING, organismRemarks STRING, materialSampleID STRING, eventID STRING,
parentEventID STRING, fieldNumber STRING, eventDate STRING, eventTime STRING, startDayOfYear STRING, endDayOfYear STRING,
year STRING, month STRING, day STRING, verbatimEventDate STRING, habitat STRING, samplingProtocol STRING, sampleSizeValue STRING,
sampleSizeUnit STRING, samplingEffort STRING, fieldNotes STRING, eventRemarks STRING, locationID STRING, higherGeographyID STRING,
higherGeography STRING, continent STRING, waterBody STRING, islandGroup STRING, island STRING, countryCode STRING,
stateProvince STRING, county STRING, municipality STRING, locality STRING, verbatimLocality STRING, verbatimElevation STRING,
verticalDatum STRING, verbatimDepth STRING, minimumDistanceAboveSurfaceInMeters STRING, maximumDistanceAboveSurfaceInMeters STRING,
locationAccordingTo STRING, locationRemarks STRING, decimalLatitude DOUBLE, decimalLongitude DOUBLE,
coordinateUncertaintyInMeters DOUBLE, coordinatePrecision STRING, pointRadiusSpatialFit STRING, verbatimCoordinateSystem STRING,
verbatimSRS STRING, footprintWKT STRING, footprintSRS STRING, footprintSpatialFit STRING, georeferencedBy STRING,
georeferencedDate STRING, georeferenceProtocol STRING, georeferenceSources STRING, georeferenceRemarks STRING,
geologicalContextID STRING, earliestEonOrLowestEonothem STRING, latestEonOrHighestEonothem STRING, earliestEraOrLowestErathem STRING,
latestEraOrHighestErathem STRING, earliestPeriodOrLowestSystem STRING, latestPeriodOrHighestSystem STRING,
earliestEpochOrLowestSeries STRING, latestEpochOrHighestSeries STRING, earliestAgeOrLowestStage STRING, latestAgeOrHighestStage STRING,
lowestBiostratigraphicZone STRING, highestBiostratigraphicZone STRING, lithostratigraphicTerms STRING, group_ STRING, formation STRING,
member STRING, bed STRING, identificationID STRING, verbatimIdentification STRING, identificationQualifier STRING, typeStatus STRING,
identifiedBy STRING, identifiedByID STRING, dateIdentified STRING, identificationReferences STRING,
identificationVerificationStatus STRING, identificationRemarks STRING, taxonID STRING, scientificNameID STRING,
acceptedNameUsageID STRING, parentNameUsageID STRING, originalNameUsageID STRING, nameAccordingToID STRING, namePublishedInID STRING,
taxonConceptID STRING, scientificName STRING, acceptedNameUsage STRING, parentNameUsage STRING, originalNameUsage STRING,
nameAccordingTo STRING, namePublishedIn STRING, namePublishedInYear STRING, higherClassification STRING, kingdom STRING, phylum STRING,
class STRING, order_ STRING, family STRING, subfamily STRING, genus STRING, genericName STRING, subgenus STRING,
infragenericEpithet STRING, specificEpithet STRING, infraspecificEpithet STRING, cultivarEpithet STRING, taxonRank STRING,
verbatimTaxonRank STRING, vernacularName STRING, nomenclaturalCode STRING, taxonomicStatus STRING, nomenclaturalStatus STRING,
taxonRemarks STRING, datasetKey STRING, publishingCountry STRING, lastInterpreted STRING, elevation STRING, elevationAccuracy STRING,
depth STRING, depthAccuracy STRING, distanceAboveSurface STRING, distanceAboveSurfaceAccuracy STRING, issue STRING, mediaType STRING,
hasCoordinate BOOLEAN, hasGeospatialIssues STRING, taxonKey STRING, acceptedTaxonKey STRING, kingdomKey STRING, phylumKey STRING,
classKey STRING, orderKey STRING, familyKey STRING, genusKey STRING, subgenusKey STRING, speciesKey STRING, species STRING,
acceptedScientificName STRING, verbatimScientificName STRING, typifiedName STRING, protocol STRING, lastParsed STRING, lastCrawled STRING,
repatriated STRING, relativeOrganismQuantity STRING, level0Gid STRING, level0Name STRING, level1Gid STRING, level1Name STRING,
level2Gid STRING, level2Name STRING, level3Gid STRING, level3Name STRING, iucnRedListCategory STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://ha-nn/user/mblissett/0174958-210914110416597'
TBLPROPERTIES ('skip.header.line.count'='1')
;
Notes
Timings for a monthly simple CSV download. For each method: upload wall time, then the wall time and cumulative MapReduce CPU time of a SELECT COUNT(*) query on the resulting table.
gzip --fast: upload 5½ hours; count 102.106 seconds, CPU 7 hours 10 minutes 13 seconds 390 msec
direct (no compression): upload 3 hours; count 158.858 seconds, CPU 18 hours 2 minutes 6 seconds 420 msec
direct with -l (no replication): upload 2¾ hours; count 156.737 seconds, CPU 16 hours 47 minutes 25 seconds 40 msec
zstd: upload 2¾ hours
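From the count timings above, the gzip-compressed table answers a count roughly a third faster than the uncompressed one; the arithmetic:

```shell
# Relative count-query speedup of the gzip --fast upload vs the
# uncompressed upload, using the wall-clock timings noted above.
awk 'BEGIN { printf "%.0f%%\n", (158.858 - 102.106) / 158.858 * 100 }'
```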