Data cubes
This is an experimental feature, and the implementation may change throughout 2024. The feature is currently available for preview on the GBIF test environment, GBIF-UAT.org. |
Data cubes group species occurrence data along spatial, temporal and/or taxonomic dimensions.
They can be produced using a prototype GBIF API, or using the GBIF occurrence data deposited on external cloud services. This page covers generating a cube using GBIF’s API. See Generate a species occurrence cube using Microsoft Azure Databricks to generate a cube using a public cloud service.
Demonstration cube specification
The guides below will produce a data cube for occurrences of animals from Poland, recorded during or after 2000.
The dimensions of the cube will be
-
Species
-
We must therefore exclude occurrences not identified to species level
-
We will also exclude occurrences with known uncertain identifications
-
-
Year-Month
-
We must exclude occurrences with only year accuracy, or with an uncertain date spanning multiple months
-
-
EEA Reference grid
-
We will randomize the point using the coordinateUncertaintyInMeters, defaulting to 1000m
-
We will exclude occurrences without coordinates
-
-
Sex
-
This is included to demonstrate a basic dimension
-
-
Life stage
-
This is included to demonstrate a more complex field (a vocabulary field) used for a dimension
-
The measures of the cube will be
-
Occurrence count
-
Minimum coordinate uncertainty
-
Minimum temporal uncertainty
Generating a data cube using the GBIF API
This is an experimental feature, and the implementation may change throughout 2024. The feature is currently available for preview on the GBIF test environment, GBIF-UAT.org. |
Create an initial GBIF search filter
Perhaps using the GBIF.org website, find an approximate search filter, for example for a particular taxon, country or region, year range, etc. This will need to be expressed in SQL for submission to the API:
WHERE occurrenceStatus = 'PRESENT'
AND countryCode = 'PL' (1)
AND year >= 2000
AND kingdomKey = 1 (2)
AND hasCoordinate = TRUE (3)
1 | Poland |
2 | Animalia |
3 | Include only occurrences with coordinates |
Exclude unwanted data
The GBIF website and search APIs do not allow excluding data from searches, but this is often required for data cubes. We add additional expressions to our WHERE clause:
AND speciesKey IS NOT NULL (1)
AND NOT array_contains(issue, 'COUNTRY_COORDINATE_MISMATCH') (2)
AND month IS NOT NULL (3)
1 | Exclude occurrences not identified to species level |
2 | Exclude occurrences with this georeferencing issue |
3 | Exclude occurrences without a month, or spanning multiple months |
Write full query
The WHERE clause is the filter, but the dimensions and measurements must be defined. These are used for the SELECT and GROUP BY clauses.
Typical data cubes will use an SQL query similar to this one:
SELECT (1)
-- Dimensions: (2)
"year", (3)
GBIF_EEARGCode(
1000,
decimalLatitude,
decimalLongitude,
COALESCE(coordinateUncertaintyInMeters, 1000) (4)
) AS eeaCellCode,
speciesKey,
COALESCE(sex, 'NOT_SUPPLIED') AS sex,
COALESCE(occurrence.lifestage.concept, 'NOT_SUPPLIED') AS lifestage,
-- Measurements (5)
COUNT(*) AS occurrences, (6)
MIN(COALESCE(coordinateUncertaintyInMeters, 1000)) AS minCoordinateUncertaintyInMeters
FROM
occurrence
WHERE occurrenceStatus = 'PRESENT'
AND countryCode = 'PL'
AND "year" >= 2000
AND kingdomKey = 1
AND hasCoordinate = TRUE
AND speciesKey IS NOT NULL
AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
AND "month" IS NOT NULL
GROUP BY
"year",
eeaCellCode,
speciesKey,
sex,
lifestage
ORDER BY (7)
"year" DESC,
eeaCellCode ASC,
speciesKey ASC;
1 | These are the columns in the resulting cube, i.e. the dimensions and measurements for the cube. |
2 | The dimensions must also appear in the GROUP BY section. They can include functions. |
3 | YEAR (or year ) is an SQL function, so we must quote the name of the column as "year" . The same for "month" . |
4 | COALESCE sets a default value (1000) if the first value is absent. |
5 | The measurements must be SQL aggregate functions, like COUNT , MIN , MAX , AVERAGE , SUM etc |
6 | AS gives a name to the column, used as the header in the result file |
7 | The ORDER BY section is optional. |
See API SQL Downloads – Functions for descriptions and arguments for the functions, including the GBIF_EEARGCode
(EEA reference grid) function.
Submit the query to GBIF
See API SQL Downloads for instructions.
A cube with test data from gbif-uat.org using this query is available here.
year | eeacellcode | specieskey | sex | lifestage | occurrences | mincoordinateuncertaintyinmeters |
---|---|---|---|---|---|---|
2020 |
1kmE5156N3446 |
1310622 |
MALE |
Adult |
1 |
1000.0 |
2020 |
1kmE5156N3447 |
1310622 |
MALE |
Adult |
1 |
1000.0 |
2020 |
1kmE5157N3192 |
1631718 |
NOT_SUPPLIED |
Adult |
1 |
1000.0 |
2020 |
1kmE5157N3192 |
1631721 |
NOT_SUPPLIED |
Adult |
2 |
1000.0 |
2020 |
1kmE5157N3192 |
1631728 |
NOT_SUPPLIED |
Adult |
2 |
1000.0 |
2020 |
1kmE5157N3192 |
1632179 |
NOT_SUPPLIED |
Adult |
3 |
1000.0 |