# Web log archive
This documentation is only visible to GBIF staff.
Web access logs from the production Varnish servers are stored in HDFS to allow for querying with Hive. This can show usage statistics, long-term performance changes and so on. User IP addresses are hashed for anonymity.
## Querying the Hive table
The table is `weblogs.weblog` in Hive on C5. The `requestTime` column is a `BIGINT`, although the underlying Parquet files use a `TIMESTAMP`. Once we are using a later version of Hive, the awkward date conversion functions will no longer be necessary.
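Until then, the conversion goes in both directions: multiply `unix_timestamp()` by 1000 when filtering on `requestTime`, and divide `requestTime` by 1000 before `from_unixtime()` when displaying it. A minimal sketch (the date and `LIMIT` are arbitrary):

```sql
SELECT from_unixtime(floor(requestTime/1000.0)) AS requestTimestamp
FROM weblogs.weblog
WHERE requestTime >= unix_timestamp('2023-01-01', 'yyyy-MM-dd')*1000
LIMIT 10;
```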
```sql
SELECT
  server,
  count(*) AS c,
  userAgent
FROM weblogs.weblog
WHERE requestTime < unix_timestamp('2023-01-01', 'yyyy-MM-dd')*1000
GROUP BY server, userAgent
ORDER BY server, c;
```
```
...
15503  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36
38911  node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
57579  check_http/v2.3.3 (nagios-plugins 2.3.3)
```
```sql
SELECT
  date_format(from_unixtime(floor(requestTime/1000.0)), 'YYYY-\'W\'ww') AS yearWeek,
  count(*) AS count,
  floor(avg(requestDurationMicroseconds)) AS avgDuration
FROM weblogs.weblog
WHERE requestHostname = 'api.gbif.org'
  AND requestPath LIKE '/v2/map/occurrence/density%mvt%'
GROUP BY date_format(from_unixtime(floor(requestTime/1000.0)), 'YYYY-\'W\'ww')
ORDER BY yearWeek;
```
```
2023-W28  516291   82655
2023-W29  1263955  93816
2023-W30  539920   83856
...
```
```sql
SELECT
  date_format(from_unixtime(floor(requestTime/1000.0)), 'yyyy-MM-dd:H') AS datetime,
  count(*) AS count,
  floor(avg(requestDurationMicroseconds)) AS avgDuration,
  floor(sum(requestDurationMicroseconds)/1000000.0) AS totalDurationSeconds,
  requestMethod,
  regexp_replace(regexp_replace(regexp_replace(requestPath, '\\d{7}-\\d{15}', '0000000-000000000000000'), '/user/[^/]+', '/user/USER'), '[0-9a-f-]{36}', '00000000-0000-0000-0000-000000000000') AS requestPathNormalized
FROM weblogs.weblog
WHERE server = 'prodapicache-vh.gbif.org'
  AND requestTime >= unix_timestamp('2023-07-01', 'yyyy-MM-dd')*1000
  AND requestPath LIKE '/v1/occurrence/download%'
  AND status = 200
GROUP BY
  date_format(from_unixtime(floor(requestTime/1000.0)), 'yyyy-MM-dd:H'),
  requestMethod,
  regexp_replace(regexp_replace(regexp_replace(requestPath, '\\d{7}-\\d{15}', '0000000-000000000000000'), '/user/[^/]+', '/user/USER'), '[0-9a-f-]{36}', '00000000-0000-0000-0000-000000000000')
ORDER BY requestPathNormalized, requestMethod, datetime;
```
```
2023-7-24:21  556  207038  GET  /v1/occurrence/download/0000000-000000000000000/datasets
2023-7-24:22  585  184633  GET  /v1/occurrence/download/0000000-000000000000000/datasets
2023-7-24:23  838  267569  GET  /v1/occurrence/download/0000000-000000000000000/datasets
...
```
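The nested `regexp_replace` calls collapse download keys, usernames and UUIDs into fixed placeholders, so that otherwise-identical requests group together. Applied to a literal path (the download key here is invented for illustration):

```sql
SELECT regexp_replace(
         regexp_replace(
           regexp_replace('/v1/occurrence/download/0012345-230810091245214/datasets',
             '\\d{7}-\\d{15}', '0000000-000000000000000'),
           '/user/[^/]+', '/user/USER'),
         '[0-9a-f-]{36}', '00000000-0000-0000-0000-000000000000');
-- /v1/occurrence/download/0000000-000000000000000/datasets
```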
```sql
SELECT
  searchParam,
  count(1) AS c
FROM weblogs.weblog
LATERAL VIEW explode(map_keys(str_to_map(substring(requestQuery, 2), '&', '='))) searchParamTable AS searchParam
WHERE server = 'prodapicache-vh.gbif.org'
  AND requestQuery IS NOT NULL
  AND requestPath LIKE '/v1/occurrence/search%'
GROUP BY searchParam
HAVING count(1) > 1000000
ORDER BY c DESC
LIMIT 100;
```
```
limit                 1167973788
taxon_key             448807411
taxonKey              444973495
media_type            433701395
offset                235623329
has_coordinate        184099638
dataset_key           176426102
has_geospatial_issue  171145584
...
```
## Available columns
| Column name | Type | Definition |
|---|---|---|
| `server` | String | Hostname of the server handling the request. |
| | String | GBIF server hostname or salted, hashed IP address. The salt changes daily, so for the same user the hash will change at midnight. |
| | String | Not yet used. |
| | BigInt | |
| | String | May be present for download requests, registry changes etc. Depends on the authentication method. |
| `requestTime` | Timestamp | Millisecond precision. |
| `requestMethod` | String | GET, POST, PUT, DELETE etc. |
| `requestHostname` | String | Target of the request: api.gbif.org, www.gbif.org, tile.gbif.org etc. for typical valid requests. |
| `requestPath` | String | Request path, without the query string. |
| `requestQuery` | String | Request query string, with the initial `?`. |
| | String | HTTP/1.1, HTTP/2.0 etc. |
| `status` | SmallInt | Response status code, 200 etc. |
| | BigInt | NB the response is often compressed. |
| `cacheStatus` | String | Was the request served from a Varnish cache without going to the backend? |
| | String | Often just the hostname, for privacy. |
| `userAgent` | String | Unaltered. |
| | String | Based on the IP address. |
The `cacheStatus` column can take one of these values:
| cacheStatus value | Meaning |
|---|---|
| `hit_cached` | Page was already cached. |
| `hit_graced_cached` | Page was already cached and had expired, but Varnish sent the expired response anyway. |
| `miss` | Not sure, maybe these are old? |
| `miss_cached` | Page wasn't cached, and has been added to the cache. |
| `miss_uncacheable` | Page wasn't cached, and can't be cached. |
| `pass_uncacheable` | Passthrough connection. |
| `pipe_uncacheable` | Passthrough connection. |
| `synth_synth` | Synthetic response, e.g. a redirect produced by Varnish. |
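To see how these statuses are distributed in practice, the column can be aggregated like any other (the date below is arbitrary):

```sql
SELECT
  cacheStatus,
  count(*) AS c
FROM weblogs.weblog
WHERE requestTime >= unix_timestamp('2023-07-01', 'yyyy-MM-dd')*1000
GROUP BY cacheStatus
ORDER BY c DESC;
```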
## Files on HDFS
These are at `/data/weblogs` in Parquet format. The conversion from plain text to Parquet is done by the log-converter tool, triggered by the Logrotate hook running on each production Varnish server. The IP address geolocation database is the free one from MaxMind.