STATISTICAL AREAS¶
Problem¶
We harvest administrative areas from the geocoding. They are already useful, but they have some limitations. In particular, they are not fully satisfying regarding the following features:
- between country comparability. We would like to be able to compare countries based on comparable objects (e.g. NUTS)
- common usage. We want to be able to interoperate our data with external statistics (e.g. demographic data, economic data, structural data)
Approach¶
We define three levels of "statistical areas". Level 1 is more aggregated than level 2 and so on. The below table details the statistical areas for each country.
Country code | Statistical area 1 | Statistical area 2 | Statistical area 3 |
---|---|---|---|
DD | - | - | - |
DE | NUTS1 | NUTS2 | NUTS3 |
FR | NUTS1 | NUTS2 | NUTS3 |
GB | NUTS1 | NUTS2 | NUTS3 |
US | State | Commuting Zone (1990) | County |
Mapping key?
Except for the US, we can use the postal code as the primary key for statistical areas. For the US, we us the combination of the state and county (or state and city if county is null).
WITH tmp AS (
SELECT
country_code,
CAST(patentee.loc_state IS NOT NULL AS INT64) AS has_state,
CAST(patentee.loc_county IS NOT NULL AS INT64) AS has_county,
CAST(patentee.loc_postalCode IS NOT NULL AS INT64) AS has_postalCode,
FROM
`patentcity.patentcity.v100rc3`,
UNNEST(patentee) AS patentee
#WHERE
# publication_date<=19800000
)
SELECT
country_code,
SUM(has_state) as has_state,
SUM(has_county) as has_county,
SUM(has_postalCode) as has_postalCode
FROM
tmp
GROUP BY
country_code
country_code | has_state | has_county | has_postalCode |
---|---|---|---|
DD | 495479 | 495479 | 495479 |
DE | 7284246 | 6555370 | 7366705 |
FR | 1881112 | 1872960 | 1864284 |
GB | 1759783 | 1709290 | 1750767 |
US | 39675118 | 29133272 | 12382827 |
Related issue¶
Administrative areas obtained directly from the geocoding services (loc_state
and loc_counties
) also exhibit intrinsic limitations:
- spelling inconsistencies between geocoding services (e.g. "Constance" vs "Konstanz")
- semantic inconsistencies between geocoding services (e.g. districts vs kreis as level 3 administrative area for DE)
Although we recommend the use of statistical areas, we also propose a solution using hand-made crossover tables. We do not implement the harmonization directly in the database as this is partly destructive. The solution (and its history) is described by issue #7.