Target Classification
Main Content
Site Navigation
Search
Supporting content
Links supporting content
Image supporting content
Sub headers
Site image
Advertisements*
Links to related articles*
Newsletter / alert links*
Date or Time of article*
Source Station (country of report)*
Reporter Name*
News stories
Domain-specific fine-grained classes (denoted by *)
Needs XHTML / CSS support
Blocks can have multiple classes
Multi-class forced to single
Assessor picks most prominent class
Resulting corpus has skewed distribution
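
A minimal sketch of the single-label step and the resulting class distribution. In the slides a human assessor picks the most prominent class; here that judgment is approximated by a fixed prominence ordering, which is an assumption. The label names, the ordering, and the sample blocks are all hypothetical.

```python
from collections import Counter

# Hypothetical prominence order, most prominent first.
# The slides do not specify one; the assessor decides case by case.
PROMINENCE = [
    "main_content",
    "site_navigation",
    "search",
    "supporting_content",
    "advertisement",
]


def collapse_to_single(labels, prominence=PROMINENCE):
    """Return the most prominent label among those assigned to one block."""
    rank = {name: i for i, name in enumerate(prominence)}
    # Labels missing from the ordering are treated as least prominent.
    return min(labels, key=lambda name: rank.get(name, len(rank)))


def class_distribution(single_labels):
    """Return the relative frequency of each class in the labelled corpus."""
    counts = Counter(single_labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Made-up multi-class annotations for four page blocks.
blocks = [
    {"main_content", "supporting_content"},
    {"main_content"},
    {"site_navigation", "search"},
    {"advertisement"},
]

single = [collapse_to_single(b) for b in blocks]
distribution = class_distribution(single)
```

Inspecting `distribution` on a real corpus is one way to see how skewed the classes are before training a classifier on them.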
50 sites from Google News
Not well-formed: Tidy first
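
A minimal sketch of the clean-up step, assuming only the Python standard library. It is not HTML Tidy, the tool the slide names: it only closes unclosed tags and drops stray closing tags, as a simplified stand-in. The names `Balancer` and `balance` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Balancer(HTMLParser):
    """Re-emit markup with every opened element explicitly closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open, non-void elements

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f' {key}="{escape(value or "", quote=True)}"' for key, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered} />")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching start tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        # Close whatever is still open at end of input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def balance(html):
    """Return a tag-balanced version of a possibly malformed HTML string."""
    parser = Balancer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


fixed = balance("<div><p>Hello<b>world</div>")
```

This sketch drops comments and doctype declarations and does not apply HTML's implied-end-tag rules (for example, a second `<li>` does not close the first), so a real pipeline would more likely run HTML Tidy itself before block extraction.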