MODEL ARCHITECTURES

A model is made of data, a tok2vec (vectorization) stage, a model architecture, and training settings. This page documents the built-in models used for the pruning step and how they can be used and parameterized through human-readable config files.

Info

Under the hood, we use keras. The model and training components can easily be extended with other keras parameters.

data

The data component refers to the data used for training/testing the model.

| Name | Description | Type |
| --- | --- | --- |
| train | Training data file path. | Path |
| test | Test data file path. | Path |
Example
data:
  train: data/train.jsonl
  test: data/test.jsonl

Data format

train/test are expected to be JSONL files where each row contains a field "text" (the abstract) and a field "is_seed" (the manual annotation), whose value is 1 if the candidate was classified in the seed and 0 otherwise.

# Example
{"text": "fluctuat nec mergitur", "is_seed": 1}

tok2vec

The tok2vec component refers to the text vectorization stage, that is, the process going from raw text to data that can be ingested by the model.

In the mlp case, we vectorize the text data using the top_k features of a TfidfVectorizer fitted on the training data. The top_k features are selected according to their f_classif (ANOVA F-value) scores.

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| ngram_range | The lower and upper boundary of the range of n-values for the n-grams to be extracted. | [1,2], [1,3] | List[int] |
| dtype | Type of the matrix returned. | float32 | numeric |
| strip_accents | Remove accents and perform other character normalization during preprocessing. "unicode" is slightly slower but works on any character. | "unicode" | {"ascii", "unicode"} |
| decode_error | What to do if the byte sequence to analyze contains characters not of the given encoding. | "replace" | {"strict", "ignore", "replace"} |
| analyzer | Whether the features should be made of word or character n-grams. | "word" | {"word", "char", "char_wb"} |
| min_df | When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. | 1, 2, 4 | int |
| top_k | Select the features with the k highest f_classif scores. | 5000 | int |
Example
tok2vec:
  ngram_range: [1,2]
  dtype: float32
  strip_accents: unicode
  decode_error: replace
  analyzer: word
  min_df: 2
  top_k: 5000
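
As a rough scikit-learn equivalent, this config plausibly maps onto a TfidfVectorizer followed by SelectKBest (a sketch, not the package's exact code; `texts`/`labels` come from the loading sketch above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Mirror the example tok2vec config above.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    dtype=np.float32,
    strip_accents="unicode",
    decode_error="replace",
    analyzer="word",
    min_df=2,
)
x_train = vectorizer.fit_transform(texts)  # fitted on the training data only

# Keep the top_k features with the highest f_classif scores
# (capped in case the vocabulary is smaller than top_k).
selector = SelectKBest(f_classif, k=min(5000, x_train.shape[1]))
x_train = selector.fit_transform(x_train, labels)
```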

In the cnn case, we tokenize the text data using keras.preprocessing.text.Tokenizer and truncate/pad the sequences to max_length. Only the top_k tokens (by frequency) are kept.

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| max_length | Maximum length of a tokenized text sequence. Longer sequences are truncated; shorter ones are padded. | 250 | int |
| top_k | Only the most common num_words-1 words will be kept. | 5000 | int |
Example
tok2vec:
  max_length: 250
  top_k: 5000
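
A sketch of the corresponding keras calls (defaults such as the padding mode are assumptions):

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=5000)  # top_k: keep only the most frequent tokens
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_train = pad_sequences(sequences, maxlen=250)  # truncate/pad to max_length
```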

model

The model component defines the model's architecture, hyper-parameters and optimizer.

For the mlp architecture:

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| architecture | High-level model architecture family name. | "mlp" | "mlp" |
| layers | Number of hidden layers. NB: with 1 layer, the model reduces to a logistic regression. | 1, 2, 4 | int |
| units | Number of units per hidden layer. | 16, 32, 64 | int |
| dropout_rate | Fraction of the input units to drop. | 0, 0.2 | float |

Note

Dropout is applied to hidden layers only.

Example
model:
  architecture: mlp
  layers: 1
  units: 64
  dropout_rate: .2
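
In keras terms, this plausibly translates to something like the sketch below (an illustration under assumptions, not the package's exact code; build_mlp is a hypothetical helper):

```python
from keras import models, layers

def build_mlp(input_dim, n_layers=1, units=64, dropout_rate=0.2):
    # With n_layers == 1 there is no hidden layer and the model
    # reduces to a logistic regression, as noted above.
    model = models.Sequential()
    model.add(layers.Input(shape=(input_dim,)))
    for _ in range(n_layers - 1):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout_rate))  # hidden layers only
    model.add(layers.Dense(1, activation="sigmoid"))  # binary output
    return model
```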
For the cnn architecture:

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| architecture | High-level model architecture family name. | "cnn" | "cnn" |
| blocks | Number of Convolution-Pooling pairs. | 1, 2, 4 | int |
| filters | Dimensionality of the output space (i.e. the number of output filters in the convolution). | 16, 32, 64 | int |
| dropout_rate | Fraction of the input units to drop at the Dropout layer. | 0, 0.2 | float |
| embedding_dim | Dimension of the embedding vectors (recommended: between 50 and 200). | 50, 100, 200 | int |
| kernel_size | Length of the 1D convolution window. | 2, 4, 8 | int |
| pool_size | Factor by which to downscale the input at the MaxPooling layer. | 2, 3, 4 | int |
| use_pretrained_embedding | True to use a pre-trained embedding, False otherwise. | False | bool |
| is_embedding_trainable | Used only if use_pretrained_embedding is True. True if the pre-trained embedding should be trainable, False otherwise. | False | bool |
Example
model:
  architecture: cnn
  blocks: 2
  filters: 64
  dropout_rate: .2
  embedding_dim: 100
  kernel_size: 5
  pool_size: 3
  use_pretrained_embedding: False
  is_embedding_trainable: False 
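
Again for illustration, a hedged keras sketch of this architecture (details such as the final pooling and output layers are assumptions; build_cnn is a hypothetical helper):

```python
from keras import models, layers

def build_cnn(num_tokens=5000, max_length=250, blocks=2, filters=64,
              kernel_size=5, pool_size=3, embedding_dim=100,
              dropout_rate=0.2):
    model = models.Sequential()
    model.add(layers.Input(shape=(max_length,)))
    model.add(layers.Embedding(num_tokens, embedding_dim))
    for _ in range(blocks):  # Convolution-Pooling pairs
        model.add(layers.Conv1D(filters, kernel_size,
                                activation="relu", padding="same"))
        model.add(layers.MaxPooling1D(pool_size))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(1, activation="sigmoid"))  # binary output
    return model
```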

Optimizer

The optimizer component is nested in the model section. The same parameters are available for both architectures.

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| learning_rate | Learning rate. | 1e-3 | float |
| loss | Loss of the output layer. "binary_crossentropy" is strongly recommended since this is a binary classification setting. See the keras doc for available losses. | "binary_crossentropy" | str |
| metrics | Metrics to record in model.history (also used by the early-stopping callback). See the keras doc for available metrics. | ["accuracy"] | List[str] |
Example
model:
  ...
  optimizer:
    learning_rate: 1e-3
    loss: binary_crossentropy
    metrics: [ "accuracy" ]
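
A sketch of how this plausibly maps onto model.compile (the Adam optimizer class is an assumption here; the config only fixes the learning rate):

```python
from keras import optimizers

model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```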

Training

The training component sets the training parameters.

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| epochs | Number of training epochs (full passes over the training dataset). | 20, 50, 100 | int |
| batch_size | Number of samples processed between two model updates. | 16, 32, 64 | int |
Example
training:
  epochs: 100
  batch_size: 64
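
These values are plausibly forwarded to model.fit, along the lines of the sketch below (x_train/y_train and x_test/y_test are assumed to come from the train and test splits of the data section):

```python
history = model.fit(
    x_train, y_train,
    epochs=100,
    batch_size=64,
    validation_data=(x_test, y_test),
)
```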

Callbacks

The callbacks component is nested in training. It sets the parameters used for early stopping. See the keras doc for more.

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| monitor | Quantity to be monitored. | "val_loss" | "val_loss" |
| patience | Number of epochs with no improvement after which training will be stopped. | 2, 5 | int |
Example
training:
  ...
  callbacks:
    monitor: val_loss
    patience: 2
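
This corresponds to a keras EarlyStopping callback, roughly:

```python
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor="val_loss", patience=2)
# passed to model.fit via callbacks=[early_stopping]
```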

Logger

The logger component sets the verbosity level of the model training.

| Name | Description | Recommended values | Type |
| --- | --- | --- | --- |
| verbose | Verbosity mode of the model training: 'auto', 0, 1, or 2. 0 = silent, 1 = progress bar, 2 = one line per epoch. | 2 | int |
Example
logger:
  verbose: 2
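
Presumably, this flag is simply forwarded to model.fit, e.g.:

```python
model.fit(x_train, y_train, epochs=100, batch_size=64,
          callbacks=[early_stopping], verbose=2)  # one line per epoch
```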