DATE IMPUTATION

Problem

The publication date of DD patents is missing for patents published up to and including 1972. The boundary between publication years is fuzzy: publication numbers are nearly, but not exactly, chronological. This makes it hard to manually find the latest publication number for each vintage.

Note

The above figure reports the patent number (x-axis) and the hand-labeled publication year (Ausgabedatum) for a random sample of more than 1,000 DD patents with a missing date.

Approach

The idea is to go through the publication years one at a time and, for each year, find the patent-number threshold that best separates that year from the next one (year + 1), as measured by the F1 score. The result is a stepwise prediction function in which each threshold carries an F1 score indicating how well it separates the two years.
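The threshold search described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names `f1_score` and `best_threshold`, the input layout (a list of `(patent_number, labeled_year)` pairs), and the choice to treat "belongs to `year`" as the positive class are all assumptions made for the example.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(samples, year):
    """Find the number t that best separates `year` from `year + 1`.

    samples: list of (patent_number, labeled_year) pairs (hypothetical layout).
    A patent is predicted as `year` when its number is strictly below t.
    Returns (threshold, f1); threshold is None if no sample is relevant.
    """
    # Only patents labeled with one of the two adjacent years matter here.
    pairs = [(n, y) for n, y in samples if y in (year, year + 1)]
    candidates = sorted({n for n, _ in pairs})
    if candidates:
        # Extra candidate so that "everything is `year`" can also be tested.
        candidates.append(candidates[-1] + 1)

    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for n, y in pairs if n < t and y == year)
        fp = sum(1 for n, y in pairs if n < t and y != year)
        fn = sum(1 for n, y in pairs if n >= t and y == year)
        score = f1_score(tp, fp, fn)
        if score > best_f1:  # ties keep the smallest threshold
            best_t, best_f1 = t, score
    return best_t, best_f1
```

Repeating `best_threshold` for each year from 1951 to 1971 would give one threshold and one F1 score per year, which together form the stepwise prediction function. The exhaustive scan is quadratic in the sample size, which is fine for a sample of about a thousand labeled patents.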

Results

The overall accuracy of the prediction function is 93% on the training set. For the sake of simplicity, there is no separate test set, so this is an in-sample figure.

Date imputation

Reading

Patents with a number below 3 are imputed publication year 1951, patents with a number between 3 and 1723 are imputed publication year 1952, and so on.

year threshold
1951 DD-3
1952 DD-1723
1953 DD-6164
1954 DD-8769
1955 DD-10939
1956 DD-12386
1957 DD-14208
1958 DD-16107
1959 DD-18028
1960 DD-20460
1961 DD-22493
1962 DD-24412
1963 DD-26646
1964 DD-34886
1965 DD-44171
1966 DD-53027
1967 DD-59516
1968 DD-65066
1969 DD-70534
1970 DD-78709
1971 DD-86784
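
As an illustration of how the table can be applied, the sketch below looks up the imputed year for a given patent number using a binary search. The names `THRESHOLDS` and `impute_year` are hypothetical. Two points are assumptions rather than something the table states: a number equal to a threshold is assigned to the following year, and numbers at or above the last threshold (DD-86784) return `None` because the table gives no year for them.

```python
from bisect import bisect_right

# (year, threshold) pairs copied from the table above.
THRESHOLDS = [
    (1951, 3), (1952, 1723), (1953, 6164), (1954, 8769),
    (1955, 10939), (1956, 12386), (1957, 14208), (1958, 16107),
    (1959, 18028), (1960, 20460), (1961, 22493), (1962, 24412),
    (1963, 26646), (1964, 34886), (1965, 44171), (1966, 53027),
    (1967, 59516), (1968, 65066), (1969, 70534), (1970, 78709),
    (1971, 86784),
]
_BOUNDS = [threshold for _, threshold in THRESHOLDS]


def impute_year(number):
    """Return the imputed publication year for a DD patent number.

    Assumption: a number equal to a threshold belongs to the next year.
    Returns None for numbers at or above the last threshold.
    """
    # Index of the first threshold strictly greater than `number`.
    i = bisect_right(_BOUNDS, number)
    if i == len(THRESHOLDS):
        return None
    return THRESHOLDS[i][0]
```

For example, `impute_year(2)` gives 1951 and `impute_year(1000)` gives 1952, matching the reading above.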