Web log archive

This documentation is only visible to GBIF staff.

Web access logs from the production Varnish servers are stored in HDFS to allow for querying with Hive. This can show usage statistics, long-term performance changes and so on. User IP addresses are hashed for anonymity.

Querying the Hive table

The table is weblogs.weblog in Hive on C5.

The requestTime column is a BIGINT, although the underlying Parquet files use a TIMESTAMP. When we are using a later version of Hive, the awkward date conversion functions will no longer be necessary.

User agents are not modified, though this could be done in the future:
SELECT
  server,
  count(*) AS c,
  userAgent
FROM weblogs.weblog
WHERE requestTime < unix_timestamp('2023-01-01', 'yyyy-MM-dd')*1000
GROUP BY server, userAgent
ORDER BY server, c;
...
15503   Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36
38911   node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
57579   check_http/v2.3.3 (nagios-plugins 2.3.3)
Monitor average response duration over time
SELECT
  date_format(from_unixtime(floor(requestTime/1000.0)), 'YYYY-\'W\'ww') AS yearWeek,
  count(*) AS count,
  floor(avg(requestDurationMicroseconds)) AS avgDuration
FROM weblogs.weblog
WHERE requestHostname = 'api.gbif.org'
  AND requestPath LIKE '/v2/map/occurrence/density%mvt%'
GROUP BY date_format(from_unixtime(floor(requestTime/1000.0)), 'YYYY-\'W\'ww')
ORDER BY yearWeek;

2023-W28        516291  82655
2023-W29        1263955 93816
2023-W30        539920  83856
...
A more complex example, checking response time for particular paths:
SELECT
  date_format(from_unixtime(floor(requestTime/1000.0)), 'YYYY-MM-dd:H') AS datetime,
  count(*) AS count,
  floor(avg(requestDurationMicroseconds)) AS avgDuration,
  floor(sum(requestDurationMicroseconds)/1000000.0) AS totalDurationSeconds,
  requestMethod,
  regexp_replace(regexp_replace(regexp_replace(requestPath, '\\d{7}-\\d{15}', '0000000-000000000000000'), '/user/[^/]+', '/user/USER'), '[0-9a-f-]{36}', '00000000-0000-0000-0000-000000000000') AS requestPathNormalized
FROM weblogs.weblog
WHERE server = 'prodapicache-vh.gbif.org'
  AND requestTime >= unix_timestamp('2023-07-01', 'yyyy-MM-dd')*1000
  AND requestPath LIKE '/v1/occurrence/download%'
  AND status = 200
GROUP BY
  date_format(from_unixtime(floor(requestTime/1000.0)), 'YYYY-MM-dd:H'),
  requestMethod,
  regexp_replace(regexp_replace(regexp_replace(requestPath, '\\d{7}-\\d{15}', '0000000-000000000000000'), '/user/[^/]+', '/user/USER'), '[0-9a-f-]{36}', '00000000-0000-0000-0000-000000000000')
ORDER BY requestPathNormalized, requestMethod, datetime;

2023-7-24:21    556     207038  GET     /v1/occurrence/download/0000000-000000000000000/datasets
2023-7-24:22    585     184633  GET     /v1/occurrence/download/0000000-000000000000000/datasets
2023-7-24:23    838     267569  GET     /v1/occurrence/download/0000000-000000000000000/datasets
...
An even more complex example, showing the frequency of search parameters used with the occurrence search endpoint:
SELECT
  searchparam,
  count(1) AS c
FROM weblogs.weblog
LATERAL VIEW explode(map_keys(str_to_map(substring(lower(regexp_replace(requestquery, '_', '')), 2), '&', '='))) searchparamtable AS searchparam
WHERE server = 'prodapicache-vh.gbif.org'
  AND requestquery IS NOT NULL
  AND requestpath LIKE '/v1/occurrence/search%'
GROUP BY searchparam
HAVING count(1) > 1000000
ORDER BY c DESC
LIMIT 100;

limit                   1167973788
taxon_key               448807411
taxonKey                444973495
media_type              433701395
offset                  235623329
has_coordinate          184099638
dataset_key             176426102
has_geospatial_issue    171145584
datasetKey              152112174
...

Available columns

Column name Type Definition

server

String

Hostname of server handling the request.

remoteHost

String

GBIF server hostname or salted, hashed IP address. The salt changes daily, so for the same user the hash will change at midnight.

backend

String

Not yet used.

requestDurationMicroseconds

BigInt

remoteUser

String

May be present for download requests, registry changes etc. Depends on the authentication method.

requestTime

Timestamp

Millisecond precision. date +%s000 is the current time in this format.

requestMethod

String

GET, POST, PUT, DELETE etc

requestHostname

String

Target of request. api.gbif.org, www.gbif.org, tile.gbif.org etc. for typical valid requests.

requestPath

String

Request path, without query string.

requestQuery

String

Request query string, with initial ? changed to &. Length is limited by Varnish, so may be truncated for long request queries

requestProtocol

String

HTTP/1.1, HTTP/2.0 etc

status

SmallInt

Response, 200 etc.

bytesSent

BigInt

NB response is often compressed.

cacheStatus

String

Was the request served from a Varnish cache without going to the backend?

referrer

String

Often just the hostname for privacy.

userAgent

String

Unaltered

country

String

Based on the IP address. ZZ for GBIF servers.

Cache status can take one of these values:

cacheStatus value Meaning

hit_cached

Page was already cached

hit_graced_cached

Page was already cached, had expired, but Varnish sent the expired response anyway

miss

Not sure, maybe these are old?

miss_cached

Page wasn’t cached, and has been added to the cache

miss_uncacheable

Page wasn’t cached, and can’t be cached

pass_uncacheable

Passthrough connection

pipe_uncacheable

Passthrough connection

synth_synth

Synthetic response, e.g. redirect produced by Varnish

Files on HDFS

These are at /data/weblogs in Parquet format. The conversion from text to Parquet is made by the log-converter tool, triggered by the Logrotate hook running on each production Varnish server. The IP address geolocation database is the free one from MaxMind.