Sellam et al. (2020)#

Publication#

BLEURT: Learning Robust Metrics for Text Generation

Repositories#

https://github.com/google-research/bleurt

Available Models#

The BLEURT class can be instantiated with the checkpoints provided by the original repository. See here for the list. The corresponding model names are "BLEURT-20", "BLEURT-20-{D12,D6,D3}" or "bleurt-{tiny,base,large}-{128,512}" and should be passed to the constructor of the class.

BLEURT

Description: A learned evaluation metric for natural language generation
Name: sellam2020-bleurt

Usage:

from repro.models.sellam2020 import BLEURT
model = BLEURT(model="BLEURT-20")
inputs = [
    {"candidate": "The candidate text", "references": ["The reference", "The other reference"]}
]
scores = model.predict_batch(inputs)

Implementation Notes#

The original BLEURT code only supports single references. Our implementation return both the mean and the max BLEURT score over the references (they will be equal if there is only 1 reference).

Docker Information#

Image name: sellam2020

Build command:

repro setup sellam2020 \
  [--not-tiny-128] \
  [--not-base-128] \
  [--not-bleurt-20] \
  [--tiny-512] \
  [--base-512] \
  [--large-128] \
  [--large-512] \
  [--bleurt-20-d12] \
  [--bleurt-20-d6] \
  [--bleurt-20-d3] \
  [--silent]

The arguments specify which BLEURT models should be downloaded. BLEURT-20, bleurt-tiny-128, and bleurt-base-128 are downloaded by default.

Requires network: No

Testing#

Explain how to run the unittests for this model

repro setup sellam2020
pytest models/sellam2020/tests

Status#

[x] Regression unit tests pass
[x] Correctness unit tests pass
The unit tests are based on examples in the official repository. See here.
[ ] Model runs on full test dataset
Not tested
[ ] Predictions approximately replicate results reported in the paper
Not tested
[ ] Predictions exactly replicate results reported in the paper
Not tested

Changelog#

v1.1#

Upgraded to set BLEURT-20 as the default model and use the faster length-batched implementation