Serialize data¶
EPO full-text tsv data limitations¶
The EP full text data are served in a tab-separated value (tsv) format. E.g.
EP 0600083 A1 1994-06-08 de TITLE VORRICHTUNG ZUM FÖRDERN UND ORIENTIEREN VON PAPIERBOGEN EP 0600083 A1 1994-06-08 en TITLE DEVICE FOR CONVEYING AND ARRANGING PAPER SHEET EP 0600083 A1 1994-06-08 fr TITLE DISPOSITIF D'ACHEMINEMENT ET DE POSITIONNEMENT DE FEUILLES DE PAPIER EP 0600083 A1 1994-06-08 en ABSTR <p id="pa01" num="0001">A device for conveying ... EP 0600083 A1 1994-06-08 en DESCR <heading id="h0001">Field of Technology</heading><p id="p0001" num="0001">This ... EP 0600083 A1 1994-06-08 en CLAIM <claim id="c-en-0001" num="0001"><claim-text>A paper sheet conveying ... EP 0600083 A1 1994-06-08 en PDFEP https://data.epo.org/publication-server/pdf-document?cc=EP&pn=0600083&ki=A1&pd=1994-06-08 ...
This comes with two drawbacks:
- Redundant information (e.g.
EP 0600083 A1
) - Not supported by major big data tools (e.g. BigQuery)
ParseEPO
: EPO full-text for humans¶
We overcome these drawbacks by serializing the data. Each patent is represented by a single json
object with
nested fields. E.g.
{"publication_number": "EP-0600083-A1" , "publication_date": "1994-06-08" , "country_code":["de","en","fr"], "title": { "language":["de","en","fr"] , "text":["VORRICHTUNG ZUM FÖRDERN UND ORIENTIEREN VON PAPIERBOGEN\n" , "DEVICE FOR CONVEYING AND ARRANGING PAPER SHEET\n" , "DISPOSITIF D'ACHEMINEMENT ET DE POSITIONNEMENT DE FEUILLES DE PAPIER\n"] } , "abstract": { "text": "<p id=\"pa01\" num=\"0001\">A device for ...", "language": "en" } , "claims": { "language":["en"] , "text":["<claim id=\"c-en-0001\" num=\"0001\"><claim-text>A paper ..."] } , "description": { "text": "<heading id=\"h0001\">Field of Technology</heading><p ..." , "language": "en" } , "url": { "text": "https://data.epo.org/publication-server/pdf-document?cc=EP&pn=0600083&ki=A1&pd=1994-06-08\n" , "language": "en" } , }
Note
In this example, variable names have been slightly modified (e.g. ABSTR
becomes abstract
). This is meant to align
variable names with BigQuery patents data standards. You can avoid this behavior by setting --no-prepare-names
, see tip:SerializeEPO.py below.
In practice¶
SerializeEPO.py
(python CLI) turns the EP tsv files into json newline delimited files.
python bin/serialize-epo.py \ --max-workers 2 \ --verbose \ --prepare-names \ "your/folder/EP*.txt"
SerializeEPO.py
Each file is serialized and the output is saved in your/folder/
as
<epo-file-name>.jsonl( .<suffix>)
. Nb: if the original file was compressed (.gz
), the serialized file will be compressed as well.
--max-workers
: Maximum number of threads allowed--verbose
/--no-verbose
: Display info on-going process--prepare-names
/--no-prepare-names
: Prepare names in line with BigQuery patents data standards--handle-html
/--no-handle-html
: Handle html--help
: Show this message and exit.