Deutsch et al. (2021)#

Publication#

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

Repositories#

https://github.com/danieldeutsch/qaeval

Available Models#

We have implemented the QAEval metric as well as its underlying question-generation and question-answering models; a sketch of how the two component models combine into the metric follows the list below.

  • QAEval:

    • Description: A reference-based summarization evaluation metric based on question answering.

    • Name: deutsch2021-qaeval

    • Usage:

      from repro.models.deutsch2021 import QAEval
      model = QAEval()
      inputs = [
          {"candidate": "The candidate summary", "references": ["The first reference", "The second"]}
      ]
      macro, micro = model.predict_batch(inputs)
      

      The macro results are the QAEval scores averaged over all of the inputs. The micro results are the per-item scores, one entry for each item in inputs. (See the sketch after this list for one way to inspect both.)

      You can also return the QA pairs for each input with the return_qa_pairs=True flag:

      macro, micro, qa_pairs = model.predict_batch(inputs, return_qa_pairs=True)
      
  • Question Generation:

    • Description: A question-generation model.

    • Name: deutsch2021-question-generation

    • Usage:

      from repro.models.deutsch2021 import QAEvalQuestionGenerationModel
      model = QAEvalQuestionGenerationModel()
      context = "My name is Dan."
      start, end = 11, 14  # character span of "Dan"; end index is exclusive
      question = model.predict(context, start, end)
      
  • Question Answering:

    • Description: A question-answering model trained on SQuAD 2.0.

    • Name: deutsch2021-question-answering

    • Usage:

      from repro.models.deutsch2021 import QAEvalQuestionAnsweringModel
      model = QAEvalQuestionAnsweringModel()
      context = "My name is Dan."
      question = "What is my name?"
      # If the answer is `None`, that means the model predicted that
      # the question is not answerable
      answer = model.predict(context, question)
      # Add `return_dicts=True` to get more metadata about the answer.
      # The "prediction" key will be the model's best non-null choice
      answer_dict = model.predict(context, question, return_dicts=True)
      
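
Continuing from the QAEval usage above (reusing model and inputs), here is one way to make the macro/micro distinction concrete. The key names inside the returned score dictionaries are not documented here and may vary with the qaeval/repro version, so this hypothetical snippet relies only on what is stated above: micro has one entry per item in inputs, and qa_pairs holds the QA pairs for each input.

# A minimal inspection sketch; the internal structure of the score
# dictionaries is an assumption, so we only print the returned objects.
macro, micro, qa_pairs = model.predict_batch(inputs, return_qa_pairs=True)

# `macro`: scores averaged over all of the inputs
print("corpus-level scores:", macro)

# `micro`: one score entry per item in `inputs`
for i, item_scores in enumerate(micro):
    print(f"scores for input {i}:", item_scores)

# `qa_pairs`: generated questions and predicted answers for each input;
# printing the first entry is the simplest way to see its exact structure
print(qa_pairs[0])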

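For intuition about how the pieces fit together, the two component models can be chained by hand to mirror, roughly, what the QAEval metric does internally: generate a question from an answer span in a reference summary, then try to answer it from the candidate summary. This is a minimal sketch that uses only the predict signatures shown above; the real metric additionally selects the answer spans itself and scores the predicted answers, which is not shown here.

from repro.models.deutsch2021 import (
    QAEvalQuestionAnsweringModel,
    QAEvalQuestionGenerationModel,
)

reference = "My name is Dan."
candidate = "The author is called Dan."

# Hard-code an answer span in the reference ("Dan"; end index is exclusive).
# The metric chooses these spans itself; we pick one here for illustration.
start, end = 11, 14

# Generate a question whose answer is the chosen span of the reference
question = QAEvalQuestionGenerationModel().predict(reference, start, end)

# Try to answer the generated question using the candidate summary;
# `None` means the QA model judged the question unanswerable from the candidate
answer = QAEvalQuestionAnsweringModel().predict(candidate, question)

print(question, "->", answer)
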
Implementation Notes#

Docker Information#

  • Image name: deutsch2021

  • Docker Hub: https://hub.docker.com/repository/docker/danieldeutsch/deutsch2021

  • Build command:

    repro setup deutsch2021 [--silent]
    
  • Requires network: No

Testing#

repro setup deutsch2021
pytest models/deutsch2021/tests

Status#

  • [x] Regression unit tests pass

  • [x] Correctness unit tests pass
    The tests were taken from the qaeval and sacrerouge repositories; they do not include the QAEval metric test, which likely takes too long to run on a CPU.

  • [ ] Model runs on full test dataset
    Not tested

  • [x] Predictions approximately replicate results reported in the paper
    The question-answering model replicates the expected results. The question-generation model was not quantitatively evaluated in the paper. We did not test QAEval on the full dataset, but its scores match the example outputs from the qaeval and sacrerouge repositories.

  • [ ] Predictions exactly replicate results reported in the paper
    Not tested