Before starting

Compress data

Both the EPO .txt files and the serialized .jsonl files are uncompressed, so compressing them saves a significant amount of space.

Snippet
gzip your/folder/EP*.*

Warning

Although bq load can handle compressed files, it cannot read them in parallel. This means that loading the full database from compressed EP*.jsonl.gz files will take much longer (around 15 minutes). In short, make sure that the database is properly loaded before compressing the EP*.jsonl files.

Check data

Validate schema

You can validate the JSON schema of each patent in the .jsonl files using the validate-schema.py CLI.

If there is an error, it emits a warning with the file name, the line number and the JSON object that raised the error.

Validate schema
python bin/validate-schema.py "your/folder/EP*.jsonl"
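
To illustrate the kind of per-line check described above, here is a minimal sketch using only the Python standard library. It is not the implementation of validate-schema.py: the function name check_jsonl is hypothetical, and the expected field names are an assumption taken from the queries further down this page, so the real schema may contain more fields or stricter types.

```python
import json

# Assumption: field names borrowed from the BigQuery queries on this page.
# The actual schema enforced by validate-schema.py may differ.
EXPECTED_FIELDS = {
    "publication_number",
    "publication_date",
    "title",
    "abstract",
    "description",
}


def check_jsonl(path):
    """Return a list of (line_number, message) for lines that fail the check."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # The line is not valid JSON at all.
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                problems.append((line_number, f"missing fields: {sorted(missing)}"))
    return problems
```

Calling check_jsonl("your/folder/EP0000000.jsonl") (a hypothetical file name) returns an empty list when every line parses and carries the expected fields. Otherwise each entry gives the line number to inspect.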

Number of lines per EP0*.jsonl file

You can count the number of lines in the .jsonl files to make sure that the serialization job did not fail silently.

Expected results here.

Count lines
    for file in your/folder/EP*.jsonl; do
      echo "$file"
      wc -l "$file"
    done
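
If the files have already been gzipped, wc -l no longer gives the line count directly. The sketch below shows one way to count lines in both plain and compressed files. The helper names count_lines and count_all are hypothetical, and it assumes compressed files end in .gz.

```python
import glob
import gzip


def count_lines(path):
    """Count newline characters (like wc -l), reading .gz files transparently."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        # Read in 1 MiB chunks so large files are never fully held in memory.
        return sum(chunk.count(b"\n") for chunk in iter(lambda: fh.read(1 << 20), b""))


def count_all(pattern):
    """Return {path: line_count} for every file matching the glob pattern."""
    return {path: count_lines(path) for path in sorted(glob.glob(pattern))}
```

For example, count_all("your/folder/EP*.jsonl*") covers both EP*.jsonl and EP*.jsonl.gz files, so the counts can be compared before and after compression.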

Duplicates

There are 954 publication_number duplicates. Manual inspection shows that they differ by their publication dates.

Duplicates
Number of duplicates
WITH tmp AS (
    SELECT
      publication_number,
      COUNT(publication_number) AS count_
    FROM
      `project.dataset.epo_fulltext`
    GROUP BY
      publication_number)
  SELECT
    COUNT(DISTINCT(publication_number)) AS nb_pubnum_duplicates
  FROM
    tmp
  WHERE
    count_>1
Insights on duplicates
  WITH
    tmp AS (
    SELECT
      publication_number,
      COUNT(publication_number) AS count_
    FROM
      `project.dataset.epo_fulltext`
    GROUP BY
      publication_number),
    dup AS (
    SELECT
      *
    FROM
      tmp
    WHERE
      count_>1)
  SELECT
    dup.publication_number,
    epo.publication_date,
    epo.abstract,
    epo.description,
    epo.title
  FROM
    dup,
    `project.dataset.epo_fulltext` AS epo
  WHERE
    dup.publication_number = epo.publication_number
  ORDER BY
    dup.publication_number
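
The duplicate check above can also be approximated locally on the uncompressed .jsonl files, before anything is loaded into BigQuery. This is a rough sketch and not part of the project's tooling: the function name find_duplicates is hypothetical, and it assumes every record has a publication_number field and usually a publication_date field, as the queries above suggest.

```python
import json
from collections import defaultdict


def find_duplicates(paths):
    """Map each repeated publication_number to the publication dates seen for it."""
    counts = defaultdict(int)
    dates = defaultdict(set)
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                number = record["publication_number"]
                counts[number] += 1
                # publication_date may be absent; None is kept as a marker.
                dates[number].add(record.get("publication_date"))
    # Keep only numbers seen more than once; key=str lets None sort with strings.
    return {n: sorted(dates[n], key=str) for n, c in counts.items() if c > 1}
```

A publication number mapped to more than one date matches the pattern described above, where duplicates differ by their publication dates. The sketch keeps every publication number in memory, which may be a constraint on the full database.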