Zhao et al. (2019)#
Publication#
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Repositories#
https://github.com/AIPHES/emnlp19-moverscore
Available Models#
This implementation wraps MoverScore from the original repository.
The MoverScore model leaves all of the optional parameters at their defaults.
The MoverScoreForSummarization model uses a stopword file, following their example code.
MoverScore
Description: A text generation evaluation metric
Name: zhao2019-moverscore
Usage:
    from repro.models.zhao2019 import MoverScore

    model = MoverScore()
    inputs = [
        {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
    ]
    macro, micro = model.predict_batch(inputs)
The macro results are the MoverScore scores averaged over the inputs. The micro results are the MoverScore results for each item in inputs.
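The relationship between the two return values can be sketched without running the model. This is a minimal illustration only: the `moverscore` key, the score values, and the `average_scores` helper are assumptions made for the example, not output or code from the wrapped library.

```python
from statistics import mean

# Hypothetical per-input results; the "moverscore" key and the values are
# illustrative assumptions, not output from the real model.
micro = [{"moverscore": 0.25}, {"moverscore": 0.75}]


def average_scores(per_input):
    """Average each metric over all inputs (hypothetical helper)."""
    keys = per_input[0].keys()
    return {key: mean(item[key] for item in per_input) for key in keys}


# The macro result is one value per metric, averaged over the micro results.
macro = average_scores(micro)
print(macro)
```

Here `micro` holds one entry per item in `inputs`, and `macro` collapses those entries into a single averaged entry.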
MoverScoreForSummarization
Description: A variant of MoverScore which uses stopwords by default, based on the example code.
Name: zhao2019-moverscore-summarization
Usage:
    from repro.models.zhao2019 import MoverScoreForSummarization

    model = MoverScoreForSummarization()
    inputs = [
        {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
    ]
    macro, micro = model.predict_batch(inputs)
The macro results are the MoverScore scores averaged over the inputs. The micro results are the MoverScore results for each item in inputs.
Implementation Notes#
The current moverscore Python code does not appear to support GPUs, so even if you pass a GPU device to device, it will still run on CPU.
Docker Information#
Image name: zhao2019
Build command: repro setup zhao2019 [--silent]
Requires network: Yes, the library queries for a file even if the file is locally in the cache.
Testing#
repro setup zhao2019
pytest models/zhao2019/tests
Status#
[x] Regression unit tests pass
    See here
[ ] Correctness unit tests pass
    None provided in the original repo.
[ ] Model runs on full test dataset
    Not tested
[ ] Predictions approximately replicate results reported in the paper
    Not tested
[ ] Predictions exactly replicate results reported in the paper
    Not tested
Changelog#
v1.2#
Added GPU support
v1.1#
Switched the implementation to use the sentence_score function instead of calling the word_mover_score function directly. As a result, the IDF functionality is no longer implemented. The change was made for two reasons: when only 1 candidate was passed to be scored, the IDF dict caused it to always receive a score of 1.0, and the score of one sentence depended on which other sentences were being scored at the same time.
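The batch dependence described above can be illustrated with a small, self-contained sketch. It assumes a smoothed IDF of the form log((N + 1) / (df + 1)) computed over whitespace tokens of the texts being scored; the formula, the tokenization, and the `idf_weights` helper are assumptions made for illustration and may not match the original code exactly.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed IDF over a batch of texts (assumed form, for illustration)."""
    num_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document.
        doc_freq.update(set(doc.lower().split()))
    return {
        token: math.log((num_docs + 1) / (count + 1))
        for token, count in doc_freq.items()
    }


# A single text scored alone: every token appears in the only document,
# so every weight collapses to log(2 / 2) = 0.
single = idf_weights(["the cat sat"])

# The same text scored alongside another: the weight of "cat" changes.
batch = idf_weights(["the cat sat", "the dog ran"])

print(single["cat"], batch["cat"])
```

Under these assumptions, a lone input yields weights that are all zero, and the weight of a token changes as soon as a second text joins the batch, which is consistent with the two problems the changelog entry describes.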