DATE IMPUTATION¶
Problem¶
The publication date of DD patents is missing for every patent published up to and including 1972. The boundary between consecutive years is fuzzy: patent publication numbers are nearly, but not exactly, chronological. This makes it hard to manually find the last publication number of each vintage.
Note
The figure above plots the patent number (x-axis) against the hand-labeled publication year (Ausgabedatum) for a random sample of 1k+ DD patents with a missing date.
Approach¶
The idea is to go through the publication years one at a time and, for each year, find the threshold on the publication number that best separates that year from the next one (year + 1), as measured by the F1-score. This yields a stepwise prediction function in which each threshold carries an F1-score indicating how "good" that threshold is.
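The threshold search for a single year can be sketched as follows. This is a minimal illustration, not necessarily the exact procedure used here. It assumes that only samples labeled `year` or `year + 1` are compared, that `year` is the positive class, and that a patent is predicted as `year` when its number is strictly below the threshold. The names `f1_score` and `best_threshold` are hypothetical.

```python
def f1_score(truth, pred):
    """F1-score for two equally long lists of booleans."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(numbers, years, year):
    """Return (threshold, f1) that best separates `year` from `year + 1`.

    Assumption: a patent is predicted as `year` if number < threshold.
    """
    # Keep only the hand-labeled samples from the two adjacent years.
    pairs = [(n, y) for n, y in zip(numbers, years) if y in (year, year + 1)]
    if not pairs:
        return None, 0.0
    truth = [y == year for _, y in pairs]

    # Candidate thresholds: every observed number, plus one past the largest.
    candidates = sorted({n for n, _ in pairs})
    candidates.append(candidates[-1] + 1)

    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = [n < t for n, _ in pairs]
        score = f1_score(truth, pred)
        if score > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = t, score
    return best_t, best_f1
```

Calling `best_threshold` for each year in turn would give one threshold and one F1-score per year, which together define the stepwise prediction function.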
Results¶
The overall accuracy of the prediction function is 93% on the training set (for the sake of simplicity, there is no held-out test set, so this is an in-sample figure).
Date imputation¶
Reading
Patents with a number below 3 are imputed publication year 1951, those with a number between 3 and 1723 are imputed publication year 1952, and so on.
year | threshold |
---|---|
1951 | DD-3 |
1952 | DD-1723 |
1953 | DD-6164 |
1954 | DD-8769 |
1955 | DD-10939 |
1956 | DD-12386 |
1957 | DD-14208 |
1958 | DD-16107 |
1959 | DD-18028 |
1960 | DD-20460 |
1961 | DD-22493 |
1962 | DD-24412 |
1963 | DD-26646 |
1964 | DD-34886 |
1965 | DD-44171 |
1966 | DD-53027 |
1967 | DD-59516 |
1968 | DD-65066 |
1969 | DD-70534 |
1970 | DD-78709 |
1971 | DD-86784 |
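
The table can be applied as a simple lookup. The sketch below is illustrative only: `impute_year` is a hypothetical helper, publication numbers are assumed to have the form `DD-<number>`, and a number exactly equal to a threshold is assumed to belong to the following year. Numbers at or above the last threshold are not covered by the table, so the function returns `None` for them.

```python
from bisect import bisect_right

# Thresholds from the table above, for the years 1951 to 1971.
THRESHOLDS = [
    3, 1723, 6164, 8769, 10939, 12386, 14208, 16107, 18028, 20460,
    22493, 24412, 26646, 34886, 44171, 53027, 59516, 65066, 70534,
    78709, 86784,
]
FIRST_YEAR = 1951


def impute_year(publication_number):
    """Impute the publication year of a DD patent such as 'DD-1500'."""
    number = int(publication_number.split("-")[1])
    # Count the thresholds that are <= number (boundary handling is an assumption).
    idx = bisect_right(THRESHOLDS, number)
    if idx == len(THRESHOLDS):
        return None  # beyond the last threshold in the table
    return FIRST_YEAR + idx
```

For example, `impute_year("DD-2")` gives 1951 and `impute_year("DD-1500")` gives 1952, matching the reading above.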