Durmus et al. (2020)#
Publication#
Repositories#
https://github.com/esdurmus/feqa
Available Models#
FEQA
Description: A document-based faithfulness evaluation metric
Name:
durmus2020-feqa
Usage:
from repro.models.durmus2020 import FEQA model = FEQA() inputs = [ {"candidate": "The candidate text", "sources": ["The source"]} ] macro, micro = model.predict_batch(inputs)
The output
macro
is the FEQA score averaged over the inputs.micro
is the invidiaul FEQA scores per input.
Implementation Notes#
FEQA only supports single source documents, so the length of
sources
must be 1The example shows that
en_core_web_sm==2.1.0
. However, when we ran the code with that version, we encountered the following error:ValueError: [E167] Unknown morphological feature: ‘ConjType’ (9141427322507498425). This can happen if the tagger was trained with a different set of morphological features. If you’re using a pretrained model, make sure that your models are up to date
We instead use
en_core_web_sm==2.3.1
, which did not cause this error.We are unable to test this implementation on a GPU because the process consumes more than 12gb of memory. The error comes during the question-answering step when the SQuAD model is called.
Docker Information#
Image name:
durmus2020
Build command:
repro setup durmus2020 [--silent]
Requires network: No
Testing#
repro setup durmus2020
pytest models/durmus2020/tests
Status#
[x] Regression unit tests pass
We were unable to run tests on the MultiLing2011 data that we have used to regression testing other metrics because the process runs out of memory on the GPU (ours is 12gb) and the CPU version is too slow. However, we do run a test on some toy examples.[x] Correctness unit tests pass
See here. The example input and output ran successfully when we useden_core_web_sm==2.1.0
, but after we changed toen_core_web_sm==2.3.1
, one of the examples changed its score slightly. However, it is close enough that we consider it to be ok especially since it was correct with the original Spacy model.[ ] Model runs on full test dataset
Not tested[ ] Predictions approximately replicate results reported in the paper
Not tested[ ] Predictions exactly replicate results reported in the paper
Not tested