Zhao et al. (2019)

Publication

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

Repositories

https://github.com/AIPHES/emnlp19-moverscore

Available Models

This implementation wraps MoverScore from the original repository. The MoverScore model leaves all optional parameters at their defaults. The MoverScoreForSummarization model uses a stopword file, following the original repository's example code.

MoverScore
- Description: A text generation evaluation metric.
- Name: zhao2019-moverscore
- Usage:
```
from repro.models.zhao2019 import MoverScore
model = MoverScore()
inputs = [
    {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
]
macro, micro = model.predict_batch(inputs)
```
  The macro results are the MoverScore scores averaged over all inputs. The micro results are the individual MoverScore results, one per item in inputs.
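The relationship between the two return values can be sketched without running the model. The snippet below is a minimal illustration only: the `"moverscore"` key, the list-of-dicts layout, and the `average_scores` helper are all assumptions made for the example and are not confirmed to match what `predict_batch` actually returns.

```python
from statistics import mean

# Hypothetical per-input results ("micro"). The key name "moverscore" and the
# dict layout are assumptions for illustration, not the confirmed output format.
micro = [
    {"moverscore": 0.5},
    {"moverscore": 0.75},
]


def average_scores(per_input):
    """Average every metric key across the per-input result dicts."""
    keys = per_input[0].keys()
    return {key: mean(item[key] for item in per_input) for key in keys}


# The "macro" result is the per-key average over all inputs.
macro = average_scores(micro)
print(macro)
```

Under these assumptions, macro contains one averaged value per metric key, while micro keeps one entry per item in inputs, in input order.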

MoverScoreForSummarization
- Description: A variant of MoverScore that uses a stopword file by default, following the original repository's example code.
- Name: zhao2019-moverscore-summarization
- Usage:
```
from repro.models.zhao2019 import MoverScoreForSummarization
model = MoverScoreForSummarization()
inputs = [
    {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
]
macro, micro = model.predict_batch(inputs)
```
  The macro results are the MoverScore scores averaged over all inputs. The micro results are the individual MoverScore results, one per item in inputs.

Implementation Notes

The current moverscore Python code does not appear to support GPUs, so even if you pass a GPU device via the device parameter, it will still run on the CPU.

Docker Information

Image name: zhao2019
Build command:
```
repro setup zhao2019 [--silent]
```
Requires network: Yes. The library queries for a file even if that file is already in the local cache.

Testing

```
repro setup zhao2019
pytest models/zhao2019/tests
```

Status

- [x] Regression unit tests pass
  See here
- [ ] Correctness unit tests pass
  None provided in the original repo.
- [ ] Model runs on full test dataset
  Not tested
- [ ] Predictions approximately replicate results reported in the paper
  Not tested
- [ ] Predictions exactly replicate results reported in the paper
  Not tested

Changelog

v1.2

Added GPU support

v1.1

Switched the implementation to use the sentence_score function instead of calling the word_mover_score function directly. As a result, the IDF functionality is no longer implemented. The change was made for two reasons: when only one candidate was passed for scoring, the IDF dict caused it to always receive a score of 1.0, and the score of a sentence depended on which other sentences were being scored at the same time.
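
The batch dependence described above can be illustrated with a small, self-contained sketch. It assumes a smoothed IDF of the form log((N + 1) / (df + 1)) computed over the batch being scored; that formula and the `smoothed_idf` helper are assumptions made for the example, and the original code may compute its IDF dict differently.

```python
import math
from collections import Counter


def smoothed_idf(docs):
    """Return a token -> IDF weight dict computed over a batch of tokenized docs.

    Assumed formula (for illustration only): log((N + 1) / (df + 1)),
    where N is the batch size and df is the token's document frequency.
    """
    num_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {tok: math.log((num_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}


# A batch with a single sentence: every token has df == N == 1,
# so every weight is log(2 / 2) = 0.
single = smoothed_idf([["the", "cat", "sat"]])

# The same sentence scored alongside another one: its weights change.
batch = smoothed_idf([["the", "cat", "sat"], ["the", "dog", "ran"]])

print(single)
print(batch["cat"], batch["the"])
```

Under this assumed formula, a batch of one sentence gives every token the same weight, so the weighting carries no information, and the weight of a token such as "cat" changes as soon as a second sentence joins the batch. Both effects are consistent with the problems that motivated dropping IDF.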