Zhang et al. (2020)#
Publication#
Repositories#
https://github.com/Tiiiger/bert_score
Available Models#
This implementation wraps BERTScore.
Each backend model can be accessed by passing its name as an argument to the class constructor.
A model can be pre-cached by passing its name to the setup command (see below).
Unless another model is specified, the default English model is used.
BERTScore
Description: A text generation evaluation metric based on BERT.
Name:
zhang2020-bertscore
Usage:
from repro.models.zhang2020 import BERTScore

model = BERTScore()
inputs = [
    {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
]
macro, micro = model.predict_batch(inputs)
The macro results are the BERTScore scores averaged over the inputs. The micro results are the BERTScore scores for each item in inputs.
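The relationship between the two return values can be illustrated with a small sketch. It assumes, purely for illustration, that each micro result is a flat dictionary of metric names to floats; the actual structure returned by predict_batch may be nested differently. The helper macro_average and the numbers below are hypothetical placeholders, not real BERTScore outputs.

```python
from statistics import mean


def macro_average(micro):
    """Average each metric over a list of per-input result dicts.

    Assumption: every item in `micro` is a flat dict with the same keys,
    e.g. {"precision": ..., "recall": ..., "f1": ...}.
    """
    if not micro:
        return {}
    return {key: mean(item[key] for item in micro) for key in micro[0]}


# Placeholder values chosen for easy arithmetic, not real scores.
micro = [
    {"precision": 0.90, "recall": 0.80, "f1": 0.85},
    {"precision": 0.70, "recall": 0.60, "f1": 0.65},
]
macro = macro_average(micro)
```

Under these assumptions, macro holds one averaged value per metric, while micro keeps one entry per item in inputs, in the same order.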
Implementation Notes#
If you are going to use a model repeatedly, it is better to run the setup command and pass that model’s name. Otherwise, the model will be downloaded again every time you run the metric.
This implementation returns a score of 0 for precision, recall, and F1 if the input is empty. The original code behaves differently: it returns a non-zero recall for empty input.
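The empty-input behavior described above can be sketched as a guard around a scoring function. This is not the wrapper's actual code: score_with_empty_guard and fake_score_fn are hypothetical names, and the sketch assumes that "empty" means a candidate that is blank after stripping whitespace.

```python
ZERO_SCORES = {"precision": 0.0, "recall": 0.0, "f1": 0.0}


def score_with_empty_guard(candidate, references, score_fn):
    """Return all-zero scores for an empty candidate, else defer to score_fn.

    Assumption: "empty" means the candidate has no non-whitespace characters.
    """
    if not candidate.strip():
        return dict(ZERO_SCORES)  # copy so callers cannot mutate the constant
    return score_fn(candidate, references)


def fake_score_fn(candidate, references):
    # Stand-in for a real scorer; always returns perfect placeholder scores.
    return {"precision": 1.0, "recall": 1.0, "f1": 1.0}


empty_result = score_with_empty_guard("   ", ["The first reference"], fake_score_fn)
nonempty_result = score_with_empty_guard("A summary", ["The first reference"], fake_score_fn)
```

Handling the empty case before calling the scorer keeps all three metrics consistent at 0, rather than depending on how the underlying scorer treats empty text.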
Docker Information#
Image name:
zhang2020
Build command:
repro setup zhang2020 \
    [--models <model-name>+] \
    [--silent]
The --models argument specifies which BERTScore backend models should be cached in the Docker image. By default, this includes roberta-large, the default model for English.
Requires network: Yes. It still tries to request a file from the web even if the model is cached locally.
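To make the build command's grammar concrete, here is a minimal sketch that assembles the argument list from Python. The helper build_setup_command is hypothetical and not part of repro; it only uses the flags shown in the build command, and it assumes that --models accepts one or more space-separated names, as the `<model-name>+` notation suggests.

```python
import shlex


def build_setup_command(models=None, silent=False):
    """Assemble the argv list for `repro setup zhang2020`.

    Assumption: model names follow --models as separate arguments.
    """
    command = ["repro", "setup", "zhang2020"]
    if models:
        command.append("--models")
        command.extend(models)
    if silent:
        command.append("--silent")
    return command


command = build_setup_command(models=["roberta-large"], silent=True)
print(shlex.join(command))
# prints: repro setup zhang2020 --models roberta-large --silent
```

The resulting list could be passed to subprocess.run to execute the build, which requires repro and Docker to be installed.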
Testing#
repro setup zhang2020
pytest models/zhang2020/tests
Status#
[x] Regression unit tests pass
[x] Correctness unit tests pass
See here. The tests check for the same scores as in the original repo’s unit tests.
[ ] Model runs on full test dataset
Not tested
[ ] Predictions approximately replicate results reported in the paper
Not tested
[ ] Predictions exactly replicate results reported in the paper
Not tested