Occurrence download statistics and maintenance

This documentation is only visible to GBIF staff.

Statistics

Generally the comms team ask about these.

Users with large numbers of downloads:

Some people are surprisingly patient, and wait months for thousands of tiny downloads to complete instead of making a single, large download or asking for advice.

SELECT u.username, u.settings->'country' AS country, COUNT(*) AS c, ROUND(SUM(size) / 1e9) AS size_gb
FROM occurrence_download od, public.user u
WHERE od.created_by = u.username AND od.created >= '2024-01-01' AND od.created < '2024-11-01' AND od.status IN ('SUCCEEDED', 'FILE_ERASED')
GROUP BY u.username, u.settings->'country'
HAVING COUNT(*) >= 500
ORDER BY c;

Number of records downloaded monthly:

Other than seeing usage increas, this number isn’t much use. People can easily create a download containing millions of records and never use it, or only use a few records.

It doesn’t know about users downloading the monthly downloads (created by download.gbif.org), whether that’s accessing the cloud snapshots directly or downloading the file from us.

SELECT
  TO_CHAR(created, 'YYYY-mm') AS yearMonth,
  SUM(total_records) AS total_records,
  COUNT(*) AS number_downloads
FROM occurrence_download
GROUP BY yearMonth
ORDER BY yearMonth;

Number of registered user accounts:

Includes some spam accounts and so on.

SELECT COUNT(*) AS c FROM public.user u WHERE deleted IS NULL;

Number of registered accounts that have logged in within the last year:

Includes recent spam accounts.

SELECT count(*) AS c FROM public.user u WHERE deleted IS NULL AND last_login > '2023-10-18';

Number of registered user accounts that have made at least one download:

SELECT
  u.settings->'country' AS country,
  count(DISTINCT u.username) AS c
FROM public.user u
JOIN occurrence_download od ON u.username = od.created_by
GROUP BY u.settings->'country'
ORDER BY c;

Maintenance

Users with duplicate large downloads

The occurrence-ws is supposed to prevent this, but some duplicate downloads are still made (somehow).

SELECT COUNT(*), created_by, format, ROUND(MAX(size) / 1e9) AS size_gb, MIN(created), MAX(created), filter
FROM occurrence_download
WHERE status = 'SUCCEEDED' AND created > '2022-07-01' AND size > 50e9
GROUP BY created_by, format, filter, DATE_PART('year', created), DATE_PART('doy', created)
HAVING COUNT(*) > 1
ORDER BY size_gb DESC;

Users with many very large downloads

Often these appear to be mistakes, such as the same download being created several times (impatience?).

The same would apply for smaller downloads, but that isn’t a disc space problem.

SELECT
  key, created, created_by, ROUND(size / 1.0e9) AS size_gb,
  SUM(ROUND(size / 1.0e9)) OVER (PARTITION BY created_by) AS total_user_size_gb,
  format, MD5(filter), filter
FROM occurrence_download
WHERE created > '2022-01-01' AND created_by != 'download.gbif.org'
  AND size > 100e9 AND status = 'SUCCEEDED'
ORDER BY total_user_size_gb DESC, created_by, filter, created;

Downloads without dataset metadata

This is a bug in the download workflow.

SELECT *
FROM (SELECT od.key, od.created, od.size, od.total_records, COUNT(dod.download_key)
      FROM occurrence_download od LEFT JOIN dataset_occurrence_download dod ON od.key = dod.download_key
      WHERE od.created > '2024-06-01 00:00' AND od.size > 20000 AND od.status IN ('SUCCEEDED', 'FILE_ERASED')
      GROUP BY od.key, od.created, od.size, od.total_records) t
WHERE t.count = 0
ORDER BY created DESC;