13 Nov 2004
WIDM 04: Lee et al. Co-training Web Block Classification
Lexical Features
• For each block:
  – POS tag distribution in text
  – Stemmed tokens weighted by TF × IDF
    • IDF from Stanford’s web base
  – Number of words
  – Alt text of images
  – Hyperlink type (e.g., embedded image, text, mailto)
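
The TF × IDF weighting of a block's stemmed tokens can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `tfidf_weights`, the smoothed IDF formula, and the document-frequency numbers are all assumptions, and the tokens are taken to be stemmed already. The `doc_freq` table stands in for statistics drawn from an external corpus such as the web base the slide mentions.

```python
import math
from collections import Counter


def tfidf_weights(stemmed_tokens, doc_freq, num_docs):
    """Weight each stemmed token of one block by TF x IDF.

    doc_freq maps a token to the number of corpus documents containing it;
    num_docs is the corpus size. Both are hypothetical stand-ins for
    statistics taken from an external web corpus.
    """
    term_freq = Counter(stemmed_tokens)  # raw counts within the block
    weights = {}
    for token, count in term_freq.items():
        df = doc_freq.get(token, 0)
        # Smoothed IDF (an assumption): avoids division by zero for
        # tokens the external corpus has never seen.
        idf = math.log((1 + num_docs) / (1 + df))
        weights[token] = count * idf
    return weights


# Hypothetical block tokens and corpus statistics.
block = ["price", "ship", "price", "the"]
doc_freq = {"price": 100, "ship": 50, "the": 999}
weights = tfidf_weights(block, doc_freq, num_docs=999)
```

With these made-up numbers, a token that appears in every corpus document ("the") gets weight zero, while the repeated, rarer token "price" gets the largest weight in the block.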