DATA
Download data
The patentCity database is available as an open-access dataset (CC BY 4.0).
Citation
If you use the data, please cite Bergeaud and Verluise (2021) and De Rassenfosse, Kozak and Seliger (2019):
@techreport{bergeaudVerluise2021,
title={One Century of Innovation in Europe and the US},
author={Bergeaud, Antonin and Verluise, Cyril},
year={2021}
}
@article{deRassenfosse2019,
title={Geocoding of worldwide patent data},
author={De Rassenfosse, Ga{\'e}tan and Kozak, Jan and Seliger, Florian},
journal={Scientific data},
volume={6},
number={1},
pages={1--15},
year={2019},
publisher={Nature Publishing Group}
}
Bergeaud, Antonin, and Cyril Verluise. "One Century of Innovation in Europe and the US." 2021.
De Rassenfosse, Gaetan, Jan Kozak, and Florian Seliger. "Geocoding of worldwide patent data." Scientific data 6, no. 1 (2019): 1-15.
Database schema
The database is available in two formats: csv and jsonl. Both contain the same data and are compatible with most SQL database engines and cloud data warehouses, such as BigQuery (GCP) and Athena (AWS), which should be your preferred way to work with the full patentcity database.
jsonl or csv
The jsonl format allows nested data and is therefore more compact.
The csv format can be read off the shelf by any library supporting tabular/structured data (e.g. pandas for Python, dplyr for R).
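As a rough illustration of how the flat csv layout fits a SQL engine, the sketch below loads a few rows into an in-memory SQLite database using only the Python standard library, then counts inventor rows per location country. The sample rows, the table name `patentcity` and the helper `load_csv` are invented for illustration; only the column names come from the schema in this section. It also assumes that booleans are serialized as the text `True`/`False` in the csv file, which should be checked against the actual download.

```python
import csv
import io
import sqlite3

# Invented sample in the csv layout (a small subset of the documented columns).
# Assumption: booleans appear as the text "True" / "False".
SAMPLE_CSV = """publication_number,country_code,is_inv,loc_country
XX-1-A,XX,True,FRA
XX-2-A,XX,True,FRA
XX-3-A,XX,False,DEU
"""


def load_csv(conn, text, table="patentcity"):
    """Load csv text into a SQLite table whose columns mirror the csv header."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_list})')
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([row[c] for c in columns] for row in reader),
    )


conn = sqlite3.connect(":memory:")
load_csv(conn, SAMPLE_CSV)

# Count inventor rows per location country (values are stored as text).
inventors_by_country = conn.execute(
    """
    SELECT loc_country, COUNT(*)
    FROM patentcity
    WHERE is_inv = 'True'
    GROUP BY loc_country
    ORDER BY loc_country
    """
).fetchall()
print(inventors_by_country)
```

SQLite is used here only because it ships with Python. For the full database, a cloud data warehouse remains the more practical option.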
Nested fields
- Dotted variables (e.g. patentee.is_inv) indicate nested fields.
- In the csv flavour of the data, the prefix (e.g. patentee.) is not reported. The database has already been flattened.
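The relationship between the two flavours can be sketched as follows. This is a minimal illustration, not the project's actual export code: it assumes that `patentee` is stored in the jsonl file as a list of objects (one per patentee) and that each csv row corresponds to one patent-patentee pair. The sample record and the helper name `flatten_record` are hypothetical.

```python
import json

# Invented jsonl line; field names follow the schema, values are made up.
LINE = json.dumps({
    "publication_number": "XX-1-A",
    "country_code": "XX",
    "patentee": [
        {"is_inv": True, "name_text": "Jane Doe", "loc_city": "Lyon"},
        {"is_asg": True, "name_text": "Acme Ltd", "loc_city": "Paris"},
    ],
})


def flatten_record(record, nested_key="patentee"):
    """Yield one flat dict per nested entry, dropping the 'patentee.' prefix."""
    top_level = {k: v for k, v in record.items() if k != nested_key}
    nested = record.get(nested_key) or []
    if isinstance(nested, dict):  # tolerate a single, non-repeated object
        nested = [nested]
    for entry in nested:
        # Nested keys are merged in as-is, so 'patentee.name_text' in the
        # jsonl schema becomes plain 'name_text' in the flat row.
        yield {**top_level, **entry}


rows = list(flatten_record(json.loads(LINE)))
for row in rows:
    print(row)
```

Each resulting row repeats the patent-level fields (such as `publication_number`) next to the fields of one patentee, which is why the nested jsonl form is the more compact of the two.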
Schema of the jsonl format:
name | description | mode | type |
---|---|---|---|
publication_number | Publication number. | NULLABLE | STRING |
publication_date | Publication date (yyyymmdd). | NULLABLE | INTEGER |
country_code | Country code of the patent office. | NULLABLE | STRING |
pubnum | Publication number. | NULLABLE | STRING |
kind_code | Kind code. | NULLABLE | STRING |
family_id | Family ID (DOCDB). | NULLABLE | STRING |
cpc_code | Comma-separated list of cpc-codes. | NULLABLE | STRING |
origin | Indicates the origin of the patentee data (PC: patentcity, WGP25: Worldwide Geocoding of Patent - slot 25, WGP45: Worldwide Geocoding of Patent - slot 45, EXP: expansion). | NULLABLE | STRING |
patentee.is_inv | True if the patentee is an inventor, else False. | NULLABLE | BOOLEAN |
patentee.is_asg | True if the patentee is an assignee, else False. | NULLABLE | BOOLEAN |
patentee.name_text | Name. | NULLABLE | STRING |
patentee.person_id | Person ID (PATSTAT). | NULLABLE | INTEGER |
patentee.name_start | Name start. | NULLABLE | INTEGER |
patentee.name_end | Name end. | NULLABLE | INTEGER |
patentee.occ_text | Occupation text. | NULLABLE | STRING |
patentee.occ_start | Occupation start. | NULLABLE | INTEGER |
patentee.occ_end | Occupation end. | NULLABLE | INTEGER |
patentee.cit_text | Citizenship text. | NULLABLE | STRING |
patentee.cit_code | Citizenship code. | NULLABLE | STRING |
patentee.cit_start | Citizenship start. | NULLABLE | INTEGER |
patentee.cit_end | Citizenship end. | NULLABLE | INTEGER |
patentee.loc_text | Location text. | NULLABLE | STRING |
patentee.loc_start | Location start. | NULLABLE | INTEGER |
patentee.loc_end | Location end. | NULLABLE | INTEGER |
patentee.loc_addressLines | Formatted address lines built out of the parsed address components. | NULLABLE | STRING |
patentee.loc_locationLabel | Assembled address value for displaying purposes. | NULLABLE | STRING |
patentee.loc_country | ISO 3166-alpha-3 country code. | NULLABLE | STRING |
patentee.loc_state | First subdivision level(s) below the country. Where commonly used, this is a state code (for instance, CA for California). | NULLABLE | STRING |
patentee.loc_county | Second subdivision level(s) below the country. Use of this field is optional if a second subdivision level is not available. | NULLABLE | STRING |
patentee.loc_city | Locality of the address. | NULLABLE | STRING |
patentee.loc_district | Subdivision level below the city. Use of this field is optional if a second subdivision level is not available. | NULLABLE | STRING |
patentee.loc_subdistrict | Subdivision level below the district. Used only for India. | NULLABLE | STRING |
patentee.loc_postalCode | Postal code. | NULLABLE | STRING |
patentee.loc_street | Street name. | NULLABLE | STRING |
patentee.loc_building | Building name. | NULLABLE | STRING |
patentee.loc_houseNumber | House number. | NULLABLE | STRING |
patentee.loc_longitude | Longitude. | NULLABLE | FLOAT |
patentee.loc_latitude | Latitude. | NULLABLE | FLOAT |
patentee.loc_relevance | Indicates the relevance of the results found; the higher the score the more relevant the alternative. The score is a normalized value between 0 and 1. | NULLABLE | FLOAT |
patentee.loc_matchType | Quality of the location match. pointAddress: Location matches exactly as point address. interpolated: Location was interpolated. | NULLABLE | STRING |
patentee.loc_matchCode | Code indicating how well the result matches the request. Enumeration [exact, ambiguous, upHierarchy, ambiguousUpHierarchy]. | NULLABLE | STRING |
patentee.loc_matchLevel | The most detailed address field that matched the input record. | NULLABLE | STRING |
patentee.loc_matchQualityCountry | MatchQuality provides detailed information about the match quality of a result at attribute level. Match quality is a value between 0.0 and 1.0. 1.0 represents a 100% match. Here, matchQuality is defined at country level. | NULLABLE | FLOAT |
patentee.loc_matchQualityState | Same at state level. | NULLABLE | FLOAT |
patentee.loc_matchQualityCounty | Same at county level. | NULLABLE | FLOAT |
patentee.loc_matchQualityCity | Same at city level. | NULLABLE | FLOAT |
patentee.loc_matchQualityDistrict | Same at district level. | NULLABLE | FLOAT |
patentee.loc_matchQualityPostalCode | Same at postalCode level. | NULLABLE | FLOAT |
patentee.loc_matchQualityStreet | Same at street level. | NULLABLE | FLOAT |
patentee.loc_matchQualityHouseNumber | Same at houseNumber level. | NULLABLE | FLOAT |
patentee.loc_matchQualityBuilding | Same at building level. | NULLABLE | FLOAT |
patentee.loc_key | Key used for statistical area mapping (internal use). | NULLABLE | STRING |
patentee.loc_statisticalArea1 | Name of the high level Statistical Area. | NULLABLE | STRING |
patentee.loc_statisticalArea1Code | Code of the high level Statistical Area. | NULLABLE | STRING |
patentee.loc_statisticalArea2 | Name of the mid level Statistical Area. | NULLABLE | STRING |
patentee.loc_statisticalArea2Code | Code of the mid level Statistical Area. | NULLABLE | STRING |
patentee.loc_statisticalArea3 | Name of the low level Statistical Area. | NULLABLE | STRING |
patentee.loc_statisticalArea3Code | Code of the low level Statistical Area. | NULLABLE | STRING |
patentee.loc_recId | Identifier of the input address in the response. | NULLABLE | STRING |
patentee.loc_seqLength | Number of results for the corresponding input record. | NULLABLE | INTEGER |
patentee.loc_seqNumber | Consecutively numbers the different results for the corresponding input record starting with 1. | NULLABLE | INTEGER |
patentee.loc_source | Geocoding source (in [HERE, GMAPS, MANUAL]). | NULLABLE | STRING |
patentee.is_duplicate | True if a patentee with the 'same' name has been detected in the same patent. Only one of the two is marked as duplicate. | NULLABLE | BOOLEAN |
has_a | Whether the patent's family features an A kind code in the database [DOCDB family level]. | NULLABLE | BOOLEAN |
has_b | Whether the patent's family features a B kind code in the database [DOCDB family level]. | NULLABLE | BOOLEAN |
N | Number of patents in the family [DOCDB family level]. | NULLABLE | INTEGER |
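To show how these fields might be combined in practice, here is a small sketch that extracts the geocoded inventors of one jsonl record while skipping entries flagged by `patentee.is_duplicate`. It assumes, as a guess about the file layout, that `patentee` is a list of objects; the sample record and the function name `inventor_points` are made up for the example.

```python
import json

# Invented jsonl line; field names follow the schema, values are made up.
LINE = json.dumps({
    "publication_number": "XX-1-A",
    "patentee": [
        {"is_inv": True, "name_text": "Jane Doe",
         "loc_latitude": 45.76, "loc_longitude": 4.84, "is_duplicate": False},
        {"is_inv": True, "name_text": "J. Doe",
         "loc_latitude": 45.76, "loc_longitude": 4.84, "is_duplicate": True},
        {"is_asg": True, "name_text": "Acme Ltd",
         "loc_latitude": None, "loc_longitude": None, "is_duplicate": False},
    ],
})


def inventor_points(record):
    """Return (name, latitude, longitude) for geocoded, non-duplicate inventors."""
    points = []
    for p in record.get("patentee") or []:
        # Keep inventors only and drop entries marked as duplicates.
        if not p.get("is_inv") or p.get("is_duplicate"):
            continue
        lat, lon = p.get("loc_latitude"), p.get("loc_longitude")
        if lat is None or lon is None:  # every field is NULLABLE
            continue
        points.append((p.get("name_text"), lat, lon))
    return points


points = inventor_points(json.loads(LINE))
print(points)
```

Because every field is NULLABLE, the sketch reads values with `dict.get` and treats a missing coordinate as "not geocoded" rather than raising an error.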
Schema of the csv format:
name | description | mode | type |
---|---|---|---|
publication_number | Publication number. | NULLABLE | STRING |
publication_date | Publication date (yyyymmdd). | NULLABLE | INTEGER |
country_code | Country code of the patent office. | NULLABLE | STRING |
pubnum | Publication number. | NULLABLE | STRING |
kind_code | Kind code. | NULLABLE | STRING |
family_id | Family ID (DOCDB). | NULLABLE | STRING |
cpc_code | Comma-separated list of cpc-codes. | NULLABLE | STRING |
origin | Indicates the origin of the patentee data (PC: patentcity, WGP25: Worldwide Geocoding of Patent - slot 25, WGP45: Worldwide Geocoding of Patent - slot 45, EXP: expansion). | NULLABLE | STRING |
is_inv | True if the patentee is an inventor, else False. | NULLABLE | BOOLEAN |
is_asg | True if the patentee is an assignee, else False. | NULLABLE | BOOLEAN |
name_text | Name. | NULLABLE | STRING |
person_id | Person ID (PATSTAT). | NULLABLE | INTEGER |
name_start | Name start. | NULLABLE | INTEGER |
name_end | Name end. | NULLABLE | INTEGER |
occ_text | Occupation text. | NULLABLE | STRING |
occ_start | Occupation start. | NULLABLE | INTEGER |
occ_end | Occupation end. | NULLABLE | INTEGER |
cit_text | Citizenship text. | NULLABLE | STRING |
cit_code | Citizenship code. | NULLABLE | STRING |
cit_start | Citizenship start. | NULLABLE | INTEGER |
cit_end | Citizenship end. | NULLABLE | INTEGER |
loc_text | Location text. | NULLABLE | STRING |
loc_start | Location start. | NULLABLE | INTEGER |
loc_end | Location end. | NULLABLE | INTEGER |
loc_addressLines | Formatted address lines built out of the parsed address components. | NULLABLE | STRING |
loc_locationLabel | Assembled address value for displaying purposes. | NULLABLE | STRING |
loc_country | ISO 3166-alpha-3 country code. | NULLABLE | STRING |
loc_state | First subdivision level(s) below the country. Where commonly used, this is a state code (for instance, CA for California). | NULLABLE | STRING |
loc_county | Second subdivision level(s) below the country. Use of this field is optional if a second subdivision level is not available. | NULLABLE | STRING |
loc_city | Locality of the address. | NULLABLE | STRING |
loc_district | Subdivision level below the city. Use of this field is optional if a second subdivision level is not available. | NULLABLE | STRING |
loc_subdistrict | Subdivision level below the district. Used only for India. | NULLABLE | STRING |
loc_postalCode | Postal code. | NULLABLE | STRING |
loc_street | Street name. | NULLABLE | STRING |
loc_building | Building name. | NULLABLE | STRING |
loc_houseNumber | House number. | NULLABLE | STRING |
loc_longitude | Longitude. | NULLABLE | FLOAT |
loc_latitude | Latitude. | NULLABLE | FLOAT |
loc_relevance | Indicates the relevance of the results found; the higher the score the more relevant the alternative. The score is a normalized value between 0 and 1. | NULLABLE | FLOAT |
loc_matchType | Quality of the location match. pointAddress: Location matches exactly as point address. interpolated: Location was interpolated. | NULLABLE | STRING |
loc_matchCode | Code indicating how well the result matches the request. Enumeration [exact, ambiguous, upHierarchy, ambiguousUpHierarchy]. | NULLABLE | STRING |
loc_matchLevel | The most detailed address field that matched the input record. | NULLABLE | STRING |
loc_matchQualityCountry | MatchQuality provides detailed information about the match quality of a result at attribute level. Match quality is a value between 0.0 and 1.0. 1.0 represents a 100% match. Here, matchQuality is defined at country level. | NULLABLE | FLOAT |
loc_matchQualityState | Same at state level. | NULLABLE | FLOAT |
loc_matchQualityCounty | Same at county level. | NULLABLE | FLOAT |
loc_matchQualityCity | Same at city level. | NULLABLE | FLOAT |
loc_matchQualityDistrict | Same at district level. | NULLABLE | FLOAT |
loc_matchQualityPostalCode | Same at postalCode level. | NULLABLE | FLOAT |
loc_matchQualityStreet | Same at street level. | NULLABLE | FLOAT |
loc_matchQualityHouseNumber | Same at houseNumber level. | NULLABLE | FLOAT |
loc_matchQualityBuilding | Same at building level. | NULLABLE | FLOAT |
loc_key | Key used for statistical area mapping (internal use). | NULLABLE | STRING |
loc_statisticalArea1 | Name of the high level Statistical Area. | NULLABLE | STRING |
loc_statisticalArea1Code | Code of the high level Statistical Area. | NULLABLE | STRING |
loc_statisticalArea2 | Name of the mid level Statistical Area. | NULLABLE | STRING |
loc_statisticalArea2Code | Code of the mid level Statistical Area. | NULLABLE | STRING |
loc_statisticalArea3 | Name of the low level Statistical Area. | NULLABLE | STRING |
loc_statisticalArea3Code | Code of the low level Statistical Area. | NULLABLE | STRING |
loc_recId | Identifier of the input address in the response. | NULLABLE | STRING |
loc_seqLength | Number of results for the corresponding input record. | NULLABLE | INTEGER |
loc_seqNumber | Consecutively numbers the different results for the corresponding input record starting with 1. | NULLABLE | INTEGER |
loc_source | Geocoding source (in [HERE, GMAPS, MANUAL]). | NULLABLE | STRING |
is_duplicate | True if a patentee with the 'same' name has been detected in the same patent. Only one of the two is marked as duplicate. | NULLABLE | BOOLEAN |
has_a | Whether the patent's family features an A kind code in the database [DOCDB family level]. | NULLABLE | BOOLEAN |
has_b | Whether the patent's family features a B kind code in the database [DOCDB family level]. | NULLABLE | BOOLEAN |
N | Number of patents in the family [DOCDB family level]. | NULLABLE | INTEGER |
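A few columns need light conversion after reading the csv file: `publication_date` is an integer in yyyymmdd form, `cpc_code` is a comma-separated list, and any column may be empty. The sketch below shows one possible way to handle this with the Python standard library. The sample rows and the helper names `parse_date` and `parse_row` are invented, and it assumes that NULL values appear as empty cells in the csv file.

```python
import csv
import io
from datetime import datetime

# Invented sample rows; the second row has only empty (NULL) cells.
SAMPLE_CSV = """publication_number,publication_date,cpc_code,loc_relevance
XX-1-A,19050321,"A01B1/00,B60K6/20",0.93
XX-2-A,,,
"""


def parse_date(value):
    """Convert a yyyymmdd value to a date, or None when the field is empty."""
    return datetime.strptime(value, "%Y%m%d").date() if value else None


def parse_row(row):
    """Return a copy of the row with a few columns converted to Python types."""
    return {
        **row,
        "publication_date": parse_date(row["publication_date"]),
        # Split the comma-separated cpc codes into a list (empty if missing).
        "cpc_code": [c.strip() for c in row["cpc_code"].split(",") if c.strip()],
        "loc_relevance": float(row["loc_relevance"]) if row["loc_relevance"] else None,
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for row in rows:
    print(row)
```

The same conversions can be expressed in pandas or in SQL; the point is only that dates and cpc codes are not directly usable in their stored form.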