Lexical Features
• For each block:
  – POS tag distribution in text
  – Stemmed tokens weighted by TF×IDF
    • IDF from Stanford's WebBase
  – Number of words
  – Alt text of images
  – Hyperlink type (e.g., embedded image, text, mailto)
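
A minimal sketch of how three of these per-block features (TF×IDF-weighted stems, word count, hyperlink type) might be computed. Everything named here is an assumption for illustration: `crude_stem` is a toy stand-in for a real stemmer, `idf_table` takes hypothetical document frequencies in place of counts from a large background crawl, and `hyperlink_type` uses only the three example categories listed above.

```python
import math
import re
from collections import Counter


def crude_stem(token):
    """Toy suffix stripper (assumption); a real system would use e.g. a Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def idf_table(doc_freq, n_docs):
    """IDF = log(N / df), from document frequencies in a background corpus."""
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def block_features(text, idf, default_idf=0.0):
    """Word count and TF x IDF weight per stem for one block of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    term_freq = Counter(crude_stem(t) for t in tokens)
    # Stems missing from the background table fall back to default_idf.
    weights = {s: n * idf.get(s, default_idf) for s, n in term_freq.items()}
    return {"num_words": len(tokens), "tfidf": weights}


def hyperlink_type(href, has_img_child):
    """Classify a link as mailto, embedded image, or plain text."""
    if href.lower().startswith("mailto:"):
        return "mailto"
    return "embedded image" if has_img_child else "text"


# Hypothetical background statistics: 1000 documents in total.
idf = idf_table({"price": 10, "the": 1000}, n_docs=1000)
features = block_features("The prices: the price listed", idf)
```

With these made-up counts, "the" occurs in every background document and so gets weight zero, while the two occurrences of "price" (after stemming "prices") share one weighted entry.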