•News stories
–Domain-specific fine-grained classes (denoted by *)
–Needs XHTML / CSS support
•Blocks can have multiple classes
–Multi-class blocks are forced to a single class
–Assessor picks most prominent class
•Resulting corpus has skewed distribution
–50 sites from Google News
–Pages are not well-formed: run Tidy first
•Main Content
•Site Navigation
•Search
•Supporting content
•Links supporting content
•Image supporting content
•Sub headers
•Site image
•Advertisements*
•Links to related articles*
•Newsletter / alert links*
•Date or Time of article*
•Source Station (country of report)*
•Reporter Name*
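
The labeling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used to build the corpus: the names `GENERIC_CLASSES`, `NEWS_CLASSES`, `to_single_label` and `class_distribution` are hypothetical, and the sample labels in the usage note are made up to show the shape of the output.

```python
from collections import Counter

# Generic block classes from the list above.
GENERIC_CLASSES = [
    "Main Content",
    "Site Navigation",
    "Search",
    "Supporting content",
    "Links supporting content",
    "Image supporting content",
    "Sub headers",
    "Site image",
]

# Domain-specific (news) classes, the ones marked with * in the list above.
NEWS_CLASSES = [
    "Advertisements",
    "Links to related articles",
    "Newsletter / alert links",
    "Date or Time of article",
    "Source Station (country of report)",
    "Reporter Name",
]

ALL_CLASSES = GENERIC_CLASSES + NEWS_CLASSES


def to_single_label(candidates, most_prominent):
    """Collapse a multi-class block to the single class the assessor picked.

    `candidates` holds every class that applies to the block;
    `most_prominent` is the assessor's choice and must be one of them.
    """
    if most_prominent not in candidates:
        raise ValueError("assessor's choice must be one of the candidate classes")
    if most_prominent not in ALL_CLASSES:
        raise ValueError("unknown class: " + most_prominent)
    return most_prominent


def class_distribution(labels):
    """Return the fraction of blocks per class, to make skew visible."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in ALL_CLASSES if counts[cls]}
```

For example, `class_distribution(["Main Content"] * 3 + ["Advertisements"])` gives a share per class, so a dominant class shows up immediately. Keeping the starred classes in a separate list makes it easy to drop them when labeling pages outside the news domain.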