Evaluation Metrics
Manually checked F3 measure
Based on essential/acceptable answer nuggets
NR (nugget recall) – proportion of essential answer nuggets returned
NP (nugget precision) – penalizes longer answers
NR weighted three times as heavily as NP (β = 3)
Subject to inconsistent scoring among assessors
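As a rough illustration of how NR and NP might combine into a single score, the sketch below uses the standard F-beta formula with beta = 3. The function name `f_beta` and the sample values are made up for illustration; how NR and NP themselves are computed from assessor judgments is not shown here.

```python
def f_beta(nr: float, np_: float, beta: float = 3.0) -> float:
    """Combine nugget recall (nr) and nugget precision (np_) into F_beta.

    beta = 3 weights recall three times as heavily as precision.
    The name f_beta and its signature are hypothetical, for illustration only.
    """
    denom = beta ** 2 * np_ + nr
    if denom == 0:
        return 0.0  # both recall and precision are zero
    return (beta ** 2 + 1) * np_ * nr / denom


# Illustrative values only, not results from any evaluation.
print(round(f_beta(0.5, 1.0), 3))  # low recall, perfect precision -> 0.526
print(round(f_beta(1.0, 0.5), 3))  # perfect recall, low precision -> 0.909
```

With beta = 3 the score tracks recall much more closely than precision, which is why the two calls above give such different values for mirrored inputs.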
Automatic ROUGE score
Gold standard: sentences containing answer nuggets
Counting the trigrams shared by the gold standard and system answers
ROUGE-3-ALL (R3A) and ROUGE-3-ESSENTIAL (R3E)
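The trigram counting can be sketched as a recall-style overlap, which is how ROUGE-N is usually defined. This is a minimal sketch, not the official ROUGE toolkit: the lowercase whitespace tokenization, the function names `ngrams` and `rouge_n_recall`, and the example sentences are all assumptions. Presumably R3A and R3E would differ only in which gold sentences are passed in (those containing any nugget versus only essential nuggets).

```python
from collections import Counter


def ngrams(text, n=3):
    """Count the n-grams of a text (assumed lowercase whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(gold_sentences, system_answers, n=3):
    """Fraction of gold-standard n-grams that also occur in the system answers.

    Hypothetical helper: matches are clipped so an n-gram is credited at most
    as many times as it appears in the system output.
    """
    system_counts = Counter()
    for answer in system_answers:
        system_counts += ngrams(answer, n)

    matched = total = 0
    for sentence in gold_sentences:
        gold_counts = ngrams(sentence, n)
        total += sum(gold_counts.values())
        matched += sum(min(c, system_counts[g]) for g, c in gold_counts.items())
    return matched / total if total else 0.0


# Made-up example: 2 of the 4 gold trigrams appear in the system answer.
gold = ["the cat sat on the mat"]
system = ["the cat sat on a mat"]
print(rouge_n_recall(gold, system))  # 0.5
```

Because the score is a ratio over gold-standard trigrams, it needs no assessor in the loop once the gold sentences are fixed, which is what makes it usable as an automatic counterpart to the manual F3 measure.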