Papineni et al. (2002)#

Publication#

BLEU: A Method for Automatic Evaluation of Machine Translation

Repositories#

https://github.com/mjpost/sacrebleu

Available Models#

Our implementation wraps BLEU and the sentence-level version, SentBLEU.

  • BLEU

    • Description: The BLEU metric

    • Name: papineni2002-bleu

    • Usage:

      from repro.models.papineni2002 import BLEU
      model = BLEU()
      inputs = [
          {"candidate": "The candidate", "references": ["Reference one", "The second"]},
          ...
      ]
      macro, _ = model.predict_batch(inputs)
      

      macro will contain the BLEU score. Since BLEU is a corpus-level metric, there is no input-level score, so we ignore the second return value, which is an empty list.

  • SentBLEU

    • Description: The sentence-level BLEU metric

    • Name: papineni2002-sentbleu

    • Usage:

      from repro.models.papineni2002 import SentBLEU
      model = SentBLEU()
      inputs = [
          {"candidate": "The candidate", "references": ["Reference one", "The second"]},
          ...
      ]
      macro, micro = model.predict_batch(inputs)
      

      macro contains the average SentBLEU score over the inputs, and micro contains the SentBLEU score for each input. A short sketch of the underlying SacreBLEU calls follows this list.
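
Since the wrapper is built on SacreBLEU (the repository listed above), the sketch below shows roughly how the two scores can be reproduced outside the Docker image. The direct sacrebleu calls, the reference transposition, and the toy data are illustrative assumptions; the image may configure SacreBLEU differently (tokenizer, smoothing, etc.).

    # A rough sketch of the computation the two models wrap, using SacreBLEU
    # directly. The calls and defaults here are assumptions for illustration.
    import sacrebleu

    # Toy inputs in the same shape as the "inputs" lists above.
    candidates = ["The candidate"]
    references = [["Reference one", "The second"]]  # one reference list per candidate

    # Corpus-level BLEU (papineni2002-bleu): SacreBLEU expects one stream per
    # reference position, so the per-candidate lists are transposed.
    ref_streams = [list(refs) for refs in zip(*references)]
    corpus = sacrebleu.corpus_bleu(candidates, ref_streams)
    print(corpus.score)  # the single corpus-level score returned in "macro"

    # Sentence-level BLEU (papineni2002-sentbleu): one score per input ("micro"),
    # averaged to produce "macro".
    sent_scores = [
        sacrebleu.sentence_bleu(cand, refs).score
        for cand, refs in zip(candidates, references)
    ]
    macro_avg = sum(sent_scores) / len(sent_scores)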

Implementation Notes#

Docker Information#

  • Image name: papineni2002

  • Build command:

    repro setup papineni2002 [--silent]
    
  • Requires network: No

Testing#

repro setup papineni2002
pytest models/papineni2002/tests

Status#

  • [x] Regression unit tests pass

  • [x] Correctness unit tests pass
    See the tests under models/papineni2002/tests. We check a corpus-level example from the SacreBLEU unit tests. We did not test SentBLEU’s correctness, nor BLEU with a different number of references per hypothesis.

  • [ ] Model runs on full test dataset
    Not tested

  • [ ] Predictions approximately replicate results reported in the paper
    n/a

  • [ ] Predictions exactly replicate results reported in the paper
    n/a