Occurrence download statistics and maintenance
This documentation is only visible to GBIF staff. |
Statistics
Generally the comms team ask about these.
Users with large numbers of downloads:
Some people are surprisingly patient, and wait months for thousands of tiny downloads to complete instead of making a single, large download or asking for advice.
SELECT u.username, u.settings->'country' AS country, COUNT(*) AS c, ROUND(SUM(size) / 1e9) AS size_gb
FROM occurrence_download od, public.user u
WHERE od.created_by = u.username AND od.created >= '2024-01-01' AND od.created < '2024-11-01' AND od.status IN ('SUCCEEDED', 'FILE_ERASED')
GROUP BY u.username, u.settings->'country'
HAVING COUNT(*) >= 500
ORDER BY c;
Number of records downloaded monthly:
Other than seeing usage increas, this number isn’t much use. People can easily create a download containing millions of records and never use it, or only use a few records.
It doesn’t know about users downloading the monthly downloads (created by download.gbif.org
), whether that’s accessing the cloud snapshots directly or downloading the file from us.
SELECT
TO_CHAR(created, 'YYYY-mm') AS yearMonth,
SUM(total_records) AS total_records,
COUNT(*) AS number_downloads
FROM occurrence_download
GROUP BY yearMonth
ORDER BY yearMonth;
Number of registered user accounts:
Includes some spam accounts and so on.
SELECT COUNT(*) AS c FROM public.user u WHERE deleted IS NULL;
Maintenance
Users with duplicate large downloads
The occurrence-ws is supposed to prevent this, but some duplicate downloads are still made (somehow).
SELECT COUNT(*), created_by, format, ROUND(MAX(size) / 1e9) AS size_gb, MIN(created), MAX(created), filter
FROM occurrence_download
WHERE status = 'SUCCEEDED' AND created > '2022-07-01' AND size > 50e9
GROUP BY created_by, format, filter, DATE_PART('year', created), DATE_PART('doy', created)
HAVING COUNT(*) > 1
ORDER BY size_gb DESC;
Users with many very large downloads
Often these appear to be mistakes, such as the same download being created several times (impatience?).
The same would apply for smaller downloads, but that isn’t a disc space problem.
SELECT
key, created, created_by, ROUND(size / 1.0e9) AS size_gb,
SUM(ROUND(size / 1.0e9)) OVER (PARTITION BY created_by) AS total_user_size_gb,
format, MD5(filter), filter
FROM occurrence_download
WHERE created > '2022-01-01' AND created_by != 'download.gbif.org'
AND size > 100e9 AND status = 'SUCCEEDED'
ORDER BY total_user_size_gb DESC, created_by, filter, created;
Downloads without dataset metadata
This is a bug in the download workflow.
SELECT *
FROM (SELECT od.key, od.created, od.size, od.total_records, COUNT(dod.download_key)
FROM occurrence_download od LEFT JOIN dataset_occurrence_download dod ON od.key = dod.download_key
WHERE od.created > '2024-06-01 00:00' AND od.size > 20000 AND od.status IN ('SUCCEEDED', 'FILE_ERASED')
GROUP BY od.key, od.created, od.size, od.total_records) t
WHERE t.count = 0
ORDER BY created DESC;