Papineni et al. (2002)#

Publication#

BLEU: A Method for Automatic Evaluation of Machine Translation

Repositories#

https://github.com/mjpost/sacrebleu

Available Models#

Our implementation wraps BLEU and the sentence-level version, SentBLEU.

  • BLEU

    • Description: The BLEU metric

    • Name: papineni2002-bleu

    • Usage:

      from repro.models.papineni2002 import BLEU
      model = BLEU()
      inputs = [
          {"candidate": "The candidate", "references": ["Reference one", "The second"]},
          ...
      ]
      macro, _ = model.predict_batch(inputs)
      

      macro will contain the BLEU score. Since BLEU is a corpus-level metric, there is no input-level score, so we ignore the second return value, which is an empty list.

  • SentBLEU

    • Description: The sentence-level BLEU metric

    • Name: papineni2002-sentbleu

    • Usage:

      from repro.models.papineni2002 import SentBLEU
      model = SentBLEU()
      inputs = [
          {"candidate": "The candidate", "references": ["Reference one", "The second"]},
          ...
      ]
      macro, micro = model.predict_batch(inputs)
      

      macro contains the average SentBLEU score over the inputs, and micro contains the SentBLEU score for each input. A short sketch of the underlying SacreBLEU calls follows this list.
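
Since the wrapper is built on SacreBLEU (the repository listed above), the sketch below shows roughly how the two scores can be reproduced outside the Docker image. The direct sacrebleu calls, the reference transposition, and the toy data are illustrative assumptions; the image may configure SacreBLEU differently (tokenizer, smoothing, etc.).

    # A rough sketch of the computation the two models wrap, using SacreBLEU
    # directly. The calls and defaults here are assumptions for illustration.
    import sacrebleu

    # Toy inputs in the same shape as the "inputs" lists above.
    candidates = ["The candidate"]
    references = [["Reference one", "The second"]]  # one reference list per candidate

    # Corpus-level BLEU (papineni2002-bleu): SacreBLEU expects one stream per
    # reference position, so the per-candidate lists are transposed.
    ref_streams = [list(refs) for refs in zip(*references)]
    corpus = sacrebleu.corpus_bleu(candidates, ref_streams)
    print(corpus.score)  # the single corpus-level score returned in "macro"

    # Sentence-level BLEU (papineni2002-sentbleu): one score per input ("micro"),
    # averaged to produce "macro".
    sent_scores = [
        sacrebleu.sentence_bleu(cand, refs).score
        for cand, refs in zip(candidates, references)
    ]
    macro_avg = sum(sent_scores) / len(sent_scores)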

Implementation Notes#

Docker Information#

  • Image name: papineni2002

  • Build command:

    repro setup papineni2002 [--silent]
    
  • Requires network: No

Testing#

repro setup papineni2002
pytest models/papineni2002/tests

Status#

  • [x] Regression unit tests pass

  • [x] Correctness unit tests pass
    See the tests under models/papineni2002/tests. We check a corpus-level example from the SacreBLEU unit tests. We did not test SentBLEU’s correctness, nor BLEU with a different number of references per hypothesis.

  • [ ] Model runs on full test dataset
    Not tested

  • [ ] Predictions approximately replicate results reported in the paper
    n/a

  • [ ] Predictions exactly replicate results reported in the paper
    n/a