CHEATSHEET¶
Compressing prodigy gold .jsonl (from prodigy db-out)¶
Info
When exporting ENT-annotated data using prodigy db-out <dataset>, annotations belonging to the same text are not merged. This causes issues when you want to reuse the data for another task (e.g. correction or REL annotation). The recipe below performs the merge.
FILE="" # should be a .jsonl
# Drop records with empty spans, then merge the spans of records sharing a publication_number
mv "${FILE}" "${FILE}_tmp" && grep -v '"spans":\[\]' "${FILE}_tmp" | grep spans | jq -s -c 'group_by(.publication_number)[] | {publication_number: .[0].publication_number, text: .[0].text, tokens: .[0].tokens, spans: [.[].spans[]]}' >> "${FILE}"
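To see what the merge does, here is a minimal sketch that runs the same filter on a toy input. The `merge_spans` function name, the publication numbers `X-1`/`X-2` and the span values are made up for illustration. It assumes `jq` is installed and that the export writes empty span lists as `"spans":[]` with no spaces.

```shell
# Hypothetical wrapper around the recipe's filter: reads JSONL on stdin,
# writes one merged record per publication_number on stdout.
merge_spans() {
  grep -v '"spans":\[\]' | grep spans | jq -s -c 'group_by(.publication_number)[] | {publication_number: .[0].publication_number, text: .[0].text, tokens: .[0].tokens, spans: [.[].spans[]]}'
}

# Toy data (not real annotations): two records for X-1, one empty record for X-2.
sample='{"publication_number":"X-1","text":"a b","tokens":[],"spans":[{"label":"ASG","start":0,"end":1}]}
{"publication_number":"X-1","text":"a b","tokens":[],"spans":[{"label":"LOC","start":2,"end":3}]}
{"publication_number":"X-2","text":"c","tokens":[],"spans":[]}'

printf '%s\n' "$sample" | merge_spans
```

Both spans of `X-1` end up in a single record, and the `X-2` record is dropped because its span list is empty. Note that `text` and `tokens` are taken from the first record of each group, so the sketch (like the recipe) assumes they are identical across records of the same publication.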
Head-child switch¶
Info
The original REL arc labelling convention went from attributes (e.g. LOC, CIT, OCC) to patentees (ASG, INV), which was counter-intuitive in terms of head/child and performance metrics (although tractable). The recipe below swaps head and child to make downstream REL handling easier.
# __TMP__ is a literal placeholder used to swap the two keys; writes to <file>_corr
for file in gold_rel_*.jsonl; do sed 's/"child/__TMP__/g;s/"head/"child/g;s/__TMP__/"head/g' "${file}" > "${file}_corr"; done
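The swap is a three-step substitution through a temporary placeholder. The sketch below applies it to a single toy line. The function name `swap_head_child`, the placeholder `__TMP__` and the sample record are illustrative choices, not part of the project.

```shell
# Hypothetical helper: swap every "head... key with the matching "child... key.
# Step 1 parks "child under a placeholder, step 2 renames "head to "child,
# step 3 turns the placeholder into "head.
swap_head_child() {
  sed 's/"child/__TMP__/g;s/"head/"child/g;s/__TMP__/"head/g'
}

# Toy relation record (made-up values)
echo '{"head":3,"child":7,"label":"LOCATION"}' | swap_head_child
# prints {"child":3,"head":7,"label":"LOCATION"}
```

Because the patterns match on the key prefix, keys such as `"head_span"` and `"child_span"` are swapped as well. The placeholder must not occur anywhere in the data.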
Parallel models training¶
Info
Training models takes time, so it is better to train them in parallel. Don't start too many jobs at once: on a Mac mini, each job takes up to 2 CPUs.
LANG=de # support for en fr
OFFICE=de # support dd fr gb us
grep "${OFFICE}" lib/format.txt | parallel -j 2 --eta 'spacy train configs/${LANG}_t2vner.cfg --paths.train data/train_ent_{}.spacy --paths.dev data/train_ent_{}.spacy --output models/${LANG}_ent_{}'
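One way to pick the value passed to `-j` is to derive it from the number of available CPUs, using the rule of thumb above of roughly 2 CPUs per job. This is only a sketch: `jobs_for_cpus` is a hypothetical helper, and the 2-CPUs-per-job figure is the Mac mini observation, which may not hold on other machines.

```shell
# Hypothetical helper: number of parallel jobs for a given CPU count,
# assuming each training job uses up to 2 CPUs. Never returns less than 1.
jobs_for_cpus() {
  cpus=$1
  jobs=$(( cpus / 2 ))
  if [ "$jobs" -lt 1 ]; then
    jobs=1
  fi
  echo "$jobs"
}

# Number of online CPUs as reported by the system
NCPU=$(getconf _NPROCESSORS_ONLN)
echo "suggested -j value: $(jobs_for_cpus "$NCPU")"
```

For example, `jobs_for_cpus 8` gives 4 and `jobs_for_cpus 1` gives 1.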
Extract sample for kepler¶
OFFICE="" # e.g. DD, DE, etc
RATIO= # e.g. .2, .015
patentcity io extract-sample-kepler patentcity.patentcity.pc_v100rc1 data_tmp/sample_${OFFICE}.csv --sample-ratio ${RATIO} --office ${OFFICE} --key-file credentials-patentcity.json
Extract data¶
RELEASE="v100rc5"
bq extract --destination_format NEWLINE_DELIMITED_JSON --compression GZIP patentcity:patentcity.${RELEASE} "gs://patentcity_dev/beta/${RELEASE}_*.jsonl.gz"
patentcity io prep-csv-extract patentcity.patentcity.${RELEASE} patentcity.stage.${RELEASE} credentials-patentcity.json
bq extract --destination_format CSV --compression GZIP patentcity:stage.${RELEASE} "gs://patentcity_dev/beta/${RELEASE}_*.csv.gz"