Zhang et al. (2020)#

Publication#

BERTScore: Evaluating Text Generation with BERT

Repositories#

https://github.com/Tiiiger/bert_score

Available Models#

This implementation wraps BERTScore. Each backend model can be selected by passing its name as an argument to the class constructor (see the sketch after the model list below). Models can be pre-cached by passing their names to the setup command (see below). If no model name is given, the default English model, roberta-large, is used.

  • BERTScore

    • Description: A text generation evaluation metric based on BERT.

    • Name: zhang2020-bertscore

    • Usage:

      from repro.models.zhang2020 import BERTScore

      # Loads the default English model
      model = BERTScore()

      # Each input pairs a candidate text with one or more references
      inputs = [
          {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
      ]
      macro, micro = model.predict_batch(inputs)
      

      The macro results are the BERTScore precision, recall, and F1 averaged over all of the inputs. The micro results are the per-input BERTScore results, one entry for each item in inputs.
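
A minimal sketch of selecting a non-default backend model and consuming the two outputs follows. The constructor-argument behavior and the macro/micro distinction are as documented above; the specific model identifier and the layout of the returned results are assumptions for illustration.

from repro.models.zhang2020 import BERTScore

# Hypothetical non-default backend; the docs above state the backend
# model name is passed as a constructor argument (identifier assumed)
model = BERTScore("bert-base-multilingual-cased")

inputs = [
    {"candidate": "The candidate summary", "references": ["The first reference"]},
    {"candidate": "Another candidate", "references": ["Another reference"]},
]
macro, micro = model.predict_batch(inputs)

# macro: scores averaged over all inputs; micro: one result per input
print(macro)
for item_scores in micro:
    print(item_scores)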

Implementation Notes#

  • If you are going to use a model repeatedly, it is better to run the setup command with that model’s name so the model is cached in the Docker image. Otherwise, the model will be re-downloaded every time you run the metric.

  • This implementation returns a score of 0 for precision, recall, and F1 if the input is empty. The original code behaves differently: it returns a non-zero recall for empty inputs (see the sketch after this list).
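
A minimal sketch illustrating the empty-input behavior from the second note (the structure of the printed result is an assumption for illustration):

from repro.models.zhang2020 import BERTScore

model = BERTScore()

# An empty candidate triggers the special case described above
inputs = [{"candidate": "", "references": ["The reference"]}]
macro, micro = model.predict_batch(inputs)

# This wrapper is documented to return 0 for precision, recall, and
# F1 here; the original BERTScore code would return a non-zero recall
print(micro[0])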

Docker Information#

  • Image name: zhang2020

  • Build command:

    repro setup zhang2020 \
      [--models <model-name>+] \
      [--silent]
    

    The --models argument specifies which BERTScore backend models should be cached in the Docker image. If it is omitted, roberta-large, the default English model, is cached.
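
    For example, to also cache a multilingual backend alongside the default (the second model name is an assumed example):

    repro setup zhang2020 --models roberta-large bert-base-multilingual-cased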

  • Requires network: Yes. The library still requests a file from the web even when the model is cached locally.

Testing#

repro setup zhang2020
pytest models/zhang2020/tests

Status#

  • [x] Regression unit tests pass

  • [x] Correctness unit tests pass
    The tests check for the same scores as in the original repo’s unit tests.

  • [ ] Model runs on full test dataset
    Not tested

  • [ ] Predictions approximately replicate results reported in the paper
    Not tested

  • [ ] Predictions exactly replicate results reported in the paper
    Not tested