Participants
graded on a 9-point Likert scale
We also simplified
scale to a binary class
(1-2 → yes; 3-9 → no)
Lets look at two examples:
Practical digital libraries
Practical digital archiving
Query judgments are subjective, may depend on subject familiarity. Thus, we calculate inter-judge agreement
to:
establish whether the tasks are well-defined
establish performance upper bound