CITIZENSHIP
Problem
The CIT text extracted from the text is just a span of natural language (e.g. "a citizen of the United States"). This cannot be used as such.
Approach
We use a Finite State Transducer. The task of the Finite State Transducer is to map these spans to a well-defined set of codes, here the ISO-3 code of the country of citizenship (e.g. USA). Below we evaluate the FST on US and GB data. Note: since there is no "learning" in the FST, overfitting is not really an issue and we do not distinguish between the training and the test set.
Results
| Data |
Accuracy with fuzzy-match |
Accuracy w/o no fuzzy-match |
gold_cit_gbpatentocr01.csv |
98.50% |
98.25% |
gold_cit_uspatentocr01.csv |
98.94% |
95.21% |
gold_cit_uspatentocr02.csv |
93.40% |
84.49% |
gold_cit_uspatentocr03.csv |
92.31% |
90.60% |
In all cases, the fuzzy-match improves the overall FST accuracy.
Snippet
# Eval on gold_cit_uspatentocr03.csv
python patentcity/eval.py cit-fst data/gold_cit_uspatentocr03.csv --fst-file lib/cit_fst.json --verbose
Error analysis
gold_cit_gbpatentocr01.csv
|
publication_number |
text |
gold |
pred |
res |
| 54 |
GB-1107922-A |
Corporation organized. |
|
NIU |
False |
| 66 |
GB-1124610-A |
citizens of the Federal -Republic of Germany |
DDR |
DEU |
False |
| 67 |
GB-1127891-A |
limited liability Company |
|
OMN |
False |
| 72 |
GB-1138246-A |
Corporation organised and existing under the laws of the S |
USA |
NIU |
False |
| 242 |
GB-1474266-A |
British subject and New Zealand citizen |
NZL |
GBR |
False |
| 320 |
GB-204013-A |
company incorporated under the laws of ureat Britain and Ireland |
GBR |
IRL |
False |
| 422 |
GB-362896-A |
Limited Liability Company |
|
OMN |
False |
| 442 |
GB-391456-A |
corporation organized and In |
|
NIU |
False |
| 471 |
GB-429108-A |
subject of the King of Great |
GBR |
|
False |
| 521 |
GB-508540-A |
Corporation organized deep achi under the laws of the State of West |
USA |
NIU |
False |
| 726 |
GB-859666-A |
corporation organized and existing under the laws of the |
|
NIU |
False |
| 774 |
GB-950313-A |
corporation organized the operative magnification ratio. |
|
NIU |
False |
gold_cit_uspatentocr01.csv
|
publication_number |
text |
gold |
pred |
res |
| 80 |
US-00832896-A1 |
CORPORATION OF, NEV |
USA |
|
False |
| 103 |
US-01047532-A1 |
CORPORATION OF NEW |
USA |
|
False |
| 173 |
US-01485740-A1 |
CORPORATION OF NEW |
USA |
|
False |
| 193 |
US-01208544-A1 |
citizen of the United |
USA |
NIU |
False |
| 386 |
US-00330257-A1 |
citizen of the Dominion of Can-ada |
CAN |
OMN |
False |
| 431 |
US-01249770-A1 |
CORPORATION OF PENNSYL-VANTA |
USA |
|
False |
gold_cit_uspatentocr02.csv
|
publication_number |
text |
gold |
pred |
res |
| 10 |
US-01731832-A1 |
CORPORATION CF. OTLIO |
USA |
LAO |
False |
| 14 |
US-01757421-A1 |
CORPORATION OF CONNECTI- |
USA |
|
False |
| 64 |
US-01740886-A1 |
COR-PORATION OF GEORGIA |
USA |
GEO |
False |
| 67 |
US-01659670-A1 |
CORPORATION OF ILLI-" NOIS |
USA |
|
False |
| 112 |
US-01838948-A1 |
CORPORATION OF MICHI |
USA |
|
False |
| 125 |
US-01677149-A1 |
CORPORATION OF NEW |
USA |
|
False |
| 126 |
US-01879349-A1 |
CORPORATION OF NEW 7 |
USA |
|
False |
| 177 |
US-01777067-A1 |
CORPORATION OF NEW" JERSEY |
USA |
JEY |
False |
| 178 |
US-01911978-A1 |
CORPORATION OF NEW. JERSEY |
USA |
JEY |
False |
| 179 |
US-01630895-A1 |
CORPORATION OF NEW. JERSEY |
USA |
JEY |
False |
| 180 |
US-01859075-A1 |
CORPORATION OF NEW. YORE |
USA |
|
False |
| 185 |
US-01756906-A1 |
CORPORATION OF NEWYORE |
USA |
|
False |
| 188 |
US-01717493-A1 |
CORPORATION OF OFTO |
USA |
|
False |
| 223 |
US-01598039-A1 |
CORPORATION OF PENN- |
USA |
|
False |
| 237 |
US-01666523-A1 |
CORPORATION OF PENNSYL-VANTA |
USA |
|
False |
| 238 |
US-01914412-A1 |
CORPORATION OF PENNSZL-VANTA |
USA |
|
False |
| 239 |
US-01704180-A1 |
CORPORATION OF RHODEISLAND |
USA |
LAO |
False |
| 243 |
US-01608767-A1 |
CORPORATION OF THAAS |
USA |
THA |
False |
| 249 |
US-01717172-A1 |
CORPORATION OF WIs-~CONSIN |
USA |
|
False |
| 284 |
US-01694877-A1 |
CORPORATIONYORK |
USA |
|
False |
gold_cit_uspatentocr03.csv
|
publication_number |
text |
gold |
pred |
res |
| 7 |
US-02344331-A1 |
corporation of Cali- |
USA |
MLI |
False |
| 42 |
US-02437791-A1 |
corporation of Hlinois |
USA |
|
False |
| 59 |
US-02905903-A1 |
corporation of Mlinois |
USA |
MLI |
False |
| 60 |
US-03012554-A1 |
corporation of New |
USA |
|
False |
| 78 |
US-02226153-A1 |
corporation of New. Jersey |
USA |
JEY |
False |
| 80 |
US-02169128-A1 |
corporation of Penn- |
USA |
|
False |
| 81 |
US-02853443-A1 |
corporation of Pennsyivania |
USA |
IRN |
False |
| 85 |
US-02992650-A1 |
corporation of Ulinois |
USA |
|
False |
| 98 |
US-02870832-A1 |
corporation ofIilinois |
USA |
FJI |
False |