Lexical Features
For each block:
POS tag distribution in text
Stemmed tokens weighted by TF×IDF
IDF estimated from Stanford’s WebBase
Number of words
Alt text of images
Hyperlink type (e.g., embedded image, text, mailto)
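
A few of the per-block features above can be sketched in Python. This is a minimal illustration under stated assumptions, not the original system's code: `crude_stem` is a hypothetical stand-in for a real stemmer, `doc_freq` and `n_docs` are hypothetical document-frequency statistics from some background corpus, the smoothed IDF formula is one common choice, and the `link_type` labels simply mirror the examples in the list.

```python
import math
import re
from collections import Counter


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf(term, doc_freq, n_docs):
    """Smoothed IDF (an assumed variant); rarer terms get larger weights."""
    return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0


def block_features(text, doc_freq, n_docs):
    """Return the word count and TF x IDF weights of stemmed tokens for one block."""
    tokens = tokenize(text)
    term_freq = Counter(crude_stem(t) for t in tokens)
    tfidf = {
        stem: count * idf(stem, doc_freq, n_docs)
        for stem, count in term_freq.items()
    }
    return {"num_words": len(tokens), "tfidf": tfidf}


def link_type(href, contains_image):
    """Classify a hyperlink into one of the example types from the list."""
    if href.lower().startswith("mailto:"):
        return "mailto"
    if contains_image:
        return "embedded image"
    return "text"


if __name__ == "__main__":
    # Hypothetical background statistics: 100 documents in total.
    background = {"pars": 9, "pag": 99}
    print(block_features("Parsing parsed pages", background, 100))
    print(link_type("mailto:someone@example.org", contains_image=False))
```

Keeping the IDF table separate from the block text reflects the idea that term rarity comes from a large external collection rather than from the page being analyzed.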