Before starting
Compress data
Both the EPO .txt files and the serialized .jsonl files are uncompressed, so you can save a significant amount of space by compressing them.
Snippet
gzip your/folder/EP*.*
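If you need to spot-check a compressed file without decompressing it on disk, zcat streams the contents to stdout. A minimal example, reusing the placeholder folder from above:
Inspect compressed files
# Print the first serialized patent without decompressing on disk.
zcat your/folder/EP*.jsonl.gz | head -n 1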
Warning
Although bq load can handle compressed files, it cannot read them in parallel. As a result, loading the full database from compressed EP*.jsonl.gz files takes much longer (around 15 minutes). In short, make sure that the database is properly loaded before compressing the EP*.jsonl files.
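For reference, a load command might look like the sketch below. The destination table, bucket, and schema file are placeholders, not values from this project, and the files are assumed to have been staged to Cloud Storage first (bq load only supports wildcards for gs:// URIs):
Load files (sketch)
# Placeholder destination table, bucket, and schema file.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    project:dataset.epo_fulltext \
    "gs://your-bucket/EP*.jsonl" \
    schema.json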
Check data
Validate schema
You can validate the JSON schema of each patent in the .jsonl files using the validate-schema.py CLI.
If there is an error, it emits a warning with the file name, the line number, and the JSON object that raised the error.
Validate schema
python bin/validate-schema.py "your/folder/EP*.jsonl"
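Assuming the script reports errors through Python's warnings module, which writes to stderr by default, you can capture the report in a log file (the log file name is just an example):
Capture warnings
# Warnings go to stderr; redirect them to a log file for later inspection.
python bin/validate-schema.py "your/folder/EP*.jsonl" 2> validation-warnings.log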
Number of lines per EP*.jsonl file
You can count the number of lines in the .jsonl files to make sure that the serialization job did not fail silently. The expected results are listed here.
Count lines
for file in your/folder/EP*.jsonl; do
    echo "$file"
    wc -l "$file"
done
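Alternatively, since wc -l prints one count per file (plus a grand total) when given several files, the loop can be replaced by a single call:
Count lines (one-liner)
wc -l your/folder/EP*.jsonl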
Duplicates
There are 954 duplicated publication_number values. Manual inspection shows that the duplicates differ by their publication dates.
Number of duplicates
WITH tmp AS (
    SELECT
        publication_number,
        COUNT(publication_number) AS count_
    FROM `project.dataset.epo_fulltext`
    GROUP BY publication_number
)
SELECT COUNT(DISTINCT publication_number) AS nb_pubnum_duplicates
FROM tmp
WHERE count_ > 1
Insights on duplicates
WITH tmp AS (
    SELECT
        publication_number,
        COUNT(publication_number) AS count_
    FROM `project.dataset.epo_fulltext`
    GROUP BY publication_number
),
dup AS (
    SELECT * FROM tmp WHERE count_ > 1
)
SELECT
    dup.publication_number,
    epo.publication_date,
    epo.abstract,
    epo.description,
    epo.title
FROM dup
JOIN `project.dataset.epo_fulltext` AS epo
    ON dup.publication_number = epo.publication_number
ORDER BY publication_number
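If you want to work with a deduplicated table, one option is to keep, for each publication_number, the row with the most recent publication_date. The sketch below illustrates this with bq query; the table name is the same placeholder as above, and whether the latest row is the right one to keep is an assumption you should verify against your use case:
Deduplicate (sketch)
# Keep one row per publication_number: the most recent publication_date.
# `project.dataset.epo_fulltext` is a placeholder, as above.
bq query --nouse_legacy_sql '
SELECT * EXCEPT(rn)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY publication_number
            ORDER BY publication_date DESC
        ) AS rn
    FROM `project.dataset.epo_fulltext`
)
WHERE rn = 1'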