Stylistic and lexical co-training for webpage block classification

Evaluations

• Adapted co-training:
  – Sample balancing: preserve ratio of noisily labeled examples, poor performance without it
  – Replace unlabeled data at each round
• Use BoosTexter: handles word features easily
• Five-fold cross-validation
• General performance?
• Specific performance on:
  – Fine-grained classification?
  – XHTML / DIV pages?
  – Others’ tasks?
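
The adapted co-training loop listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A nearest-centroid classifier stands in for BoosTexter. The view names `style` and `lexical`, all function names, and the parameters `pool_size` and `per_round` are hypothetical. "Sample balancing" is read here as keeping the class ratio of the seed labeled set among the examples added each round, and "replace unlabeled data" is read as drawing a fresh pool of unlabeled examples at every round; the slide does not spell out either detail.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(examples, view):
    """Nearest-centroid stand-in classifier on one view (assumption: not BoosTexter)."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex["label"], []).append(ex[view])
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, vec):
    """Return (label, confidence); confidence is the gap between the two nearest centroids."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, c)), label) for label, c in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def co_train(labeled, unlabeled, rounds=3, pool_size=20, per_round=4, seed=0):
    """Co-training over two views with sample balancing and a fresh pool each round."""
    rng = random.Random(seed)
    labeled = list(labeled)
    remaining = list(unlabeled)

    # Sample balancing (assumed reading): per-class quota follows the seed set's class ratio.
    counts = {}
    for ex in labeled:
        counts[ex["label"]] = counts.get(ex["label"], 0) + 1
    total = sum(counts.values())
    quota = {lab: max(1, round(per_round * c / total)) for lab, c in counts.items()}

    for _ in range(rounds):
        if not remaining:
            break
        # Replace the unlabeled pool at each round (assumed reading): draw a new random pool.
        rng.shuffle(remaining)
        pool, rest = remaining[:pool_size], remaining[pool_size:]

        newly = {}  # pool index -> label assigned by one of the two view classifiers
        for view in ("style", "lexical"):
            model = train(labeled, view)
            scored = []
            for i, ex in enumerate(pool):
                lab, conf = predict(model, ex[view])
                scored.append((conf, i, lab))
            scored.sort(reverse=True)  # most confident predictions first
            taken = {lab: 0 for lab in quota}
            for conf, i, lab in scored:
                if i in newly or taken[lab] >= quota[lab]:
                    continue
                newly[i] = lab
                taken[lab] += 1

        # Move the newly labeled examples into the training set; return the rest to the pool source.
        labeled += [dict(pool[i], label=lab) for i, lab in newly.items()]
        remaining = rest + [ex for i, ex in enumerate(pool) if i not in newly]
    return labeled

def make_block(label, rng):
    """Synthetic block: 'content' features near 0, 'nav' features near 5 (toy data only)."""
    base = 0.0 if label == "content" else 5.0
    return {
        "style": [base + rng.random(), base + rng.random()],
        "lexical": [base + rng.random(), base + rng.random()],
        "truth": label,
    }

demo_rng = random.Random(1)
seed_set = [dict(make_block(lab, demo_rng), label=lab)
            for lab in ("content", "content", "content", "nav")]
unlabeled_blocks = [make_block(demo_rng.choice(["content", "nav"]), demo_rng) for _ in range(40)]
result = co_train(seed_set, unlabeled_blocks)
print(len(seed_set), "seed examples grew to", len(result))
```

With a 3:1 seed ratio and `per_round=4`, each view classifier may add at most three `content` blocks and one `nav` block per round, so the added examples cannot drift far from the seed distribution. Swapping in a real learner only requires replacing `train` and `predict`.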