Stylistic and lexical co-training for webpage block classification
Chee How Lee, Min-Yen Kan and Sandra Lai
National University of Singapore
kanmy@comp.nus.edu.sg
What’s a web page block?
Parts of a web page with different functions
e.g., main content, site navigation, advertisements
Making this distinction is important for many tasks
Different names for the same thing:
fragments, elements, blocks, tiles
Extracting content for mobile devices
“Just the facts, ma’am”
Summarization and better (whole-)page classification
Advertisement blocking
Fragment versioning
Distinguishing navigation from content
→ better link-based ranking
The earliest systems used hard-coded wrappers
Content-focused (e.g., largest table cell)
These didn’t scale
Now, multi-class classification using mixed features: lexical, structural and spatial
HTML path structure
(Yang et al. 03, Shih and Karger 04)
Spatial random walk over the browser-exposed DOM tree
Allows precise layout information (Xin and Lee 04)
An obvious approach is to build a supervised classifier (sketched below)
Train on labeled examples (f1, f2, …, fi, …, fn, C)
Test by extracting features (f1, f2, …, fi, …, fn) and predicting C
Labeled training data is costly, so we need to use unlabeled data
The two feature sets are largely orthogonal
= Try co-training!
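As a point of reference, here is a minimal sketch of the single-view supervised baseline. The feature values and class names are hypothetical, and LogisticRegression is just a stand-in learner (the talk's actual learner is BoosTexter).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative only: each row is a block's feature vector (f1, ..., fn),
# each label its class C. The real features are listed later in the talk.
X_train = np.array([[0.9, 0.1, 12], [0.2, 0.8, 3]])   # hypothetical features
y_train = np.array(["main_content", "site_navigation"])

clf = LogisticRegression(max_iter=1000)               # stand-in for BoosTexter
clf.fit(X_train, y_train)
print(clf.predict(np.array([[0.8, 0.2, 10]])))        # -> predicted class C
```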
Co-training (Blum and Mitchell)
Two learners with separate views of the same problem
Their motivating example: classifying web pages using
Link structure
Text on the page
Use one classifier to help the other (loop sketched below)
e.g., pages that the link classifier is confident about are passed as correct answers to the text-based classifier
Assumes that the individual classifiers are reasonably accurate to start with
Otherwise the noise level will escalate
B&M co-training handles only binary classification
Our adaptation must also handle a skewed class distribution
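A minimal sketch of the basic co-training loop under those assumptions. The classifier objects, pool handling, and growth rate are illustrative: any scikit-learn-style learner with fit/predict_proba works here, whereas the talk uses BoosTexter, and this plain version ignores the skew problem addressed later.

```python
import numpy as np

def co_train(clf_a, clf_b, Xa, Xb, y, Ua, Ub, rounds=10, grow=5):
    """Grow a shared labeled pool: each round, each view labels the
    unlabeled blocks it is most confident about for the other view.
    Xa/Ua hold view-A feature vectors, Xb/Ub the parallel view-B ones."""
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    pool = list(range(len(Ua)))              # indices of still-unlabeled blocks
    for _ in range(rounds):
        if not pool:
            break
        clf_a.fit(np.array(Xa), np.array(y))
        clf_b.fit(np.array(Xb), np.array(y))
        chosen = set()
        for clf, U in ((clf_a, Ua), (clf_b, Ub)):
            probs = clf.predict_proba(np.array([U[i] for i in pool]))
            conf = probs.max(axis=1)         # confidence = top class probability
            for j in np.argsort(-conf)[:grow]:   # most confident examples first
                i = pool[j]
                if i not in chosen:          # don't add the same block twice
                    chosen.add(i)
                    Xa.append(Ua[i]); Xb.append(Ub[i])
                    y.append(clf.classes_[probs[j].argmax()])
        pool = [i for i in pool if i not in chosen]
    return clf_a, clf_b
```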
PARCELS: PARser for Content Extraction & Layout Structure
Goals:
Coarse-grained classification
Fine-grained information extraction
Works on a variety of sources
Open-source, reference implementation
News stories
Domain-specific fine-grained classes (denoted by *)
Needs XHTML / CSS support
Blocks can have multiple classes
Multi-class forced to single:
Assessor picks the most prominent class
Resulting corpus has a skewed distribution
50 sites from Google News
Source HTML not well-formed: run Tidy first
Block classes:
Main Content
Site Navigation
Search
Supporting content
Links supporting content
Image supporting content
Sub headers
Site image
Advertisements*
Links to related articles*
Newsletter / alert links*
Date or Time of article*
Source Station (country of report)*
Reporter Name*
Lexical and Stylistic Co-training
Split the document into blocks using the DOM tree (sketched after this list)
Nontrivial (blocks overlap; visual segments differ from the DOM structure)
Co-train:
Learner 1 – Stylistic learner
Spatial and structural relationships
External: relationships to other blocks
Learner 2 – Lexical learner
POS and link-related features
Internal: classifies a block irrespective of other blocks
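A minimal sketch of the splitting step, assuming BeautifulSoup and an illustrative set of block-level tags. It treats every block tag with no nested block tag as a leaf block, glossing over the overlap problems the talk flags as nontrivial.

```python
from bs4 import BeautifulSoup

BLOCK_TAGS = ["table", "tr", "td", "div", "p", "ul", "form"]  # illustrative set

def split_blocks(html):
    """Return leaf-level blocks: block-tag nodes with no block tag inside.
    A deliberate simplification of the nontrivial real splitting step."""
    soup = BeautifulSoup(html, "html.parser")
    return [node for node in soup.find_all(BLOCK_TAGS)
            if node.find(BLOCK_TAGS) is None]     # keep only leaf blocks
```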
Layout: guessed from first-level DOM nodes
Linear
<Table>: use reading order, cell type propagation
XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth (sketched below)
Font (CSS too): relative features
Image size
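A minimal sketch of two of these stylistic signals, assuming BeautifulSoup-style nodes with inline styles. The pixel-summing heuristic and helper names are illustrative stand-ins for real browser-quality layout computation.

```python
import re

def css_px(style, prop):
    """Pull 'prop: Npx' out of an inline style string; 0 if absent."""
    m = re.search(rf"{prop}\s*:\s*(-?\d+)px", style or "")
    return int(m.group(1)) if m else 0

def absolute_position(node):
    """Approximate a block's absolute (x, y) by summing the CSS left/top
    offsets of the node and its ancestors (relative -> absolute)."""
    x = y = 0
    while node is not None:
        style = node.get("style", "") if hasattr(node, "get") else ""
        x += css_px(style, "left")
        y += css_px(style, "top")
        node = getattr(node, "parent", None)
    return x, y

def depth(node):
    """DOM depth of a block, one simple way to 'model depth'."""
    d = 0
    while getattr(node, "parent", None) is not None:
        node, d = node.parent, d + 1
    return d
```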
For each block (sketch below):
POS tag distribution of the text
Stemmed tokens weighted by TF×IDF
IDF from Stanford’s WebBase
Number of words
Alt text of images
Hyperlink type (e.g., embedded image, text, mailto)
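A minimal sketch of the first three per-block lexical features, assuming NLTK (with its tokenizer and tagger models installed) and a precomputed IDF table; the feature-naming scheme is illustrative, the alt-text and hyperlink-type features are omitted for brevity, and the talk derives IDF from Stanford’s WebBase.

```python
from collections import Counter
from nltk import pos_tag, word_tokenize     # assumes NLTK models are installed
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

def lexical_features(text, idf):
    """Internal features of one block: POS tag distribution, stemmed
    tokens weighted by TF x IDF, and word count. `idf` maps stems to
    inverse document frequencies precomputed from a large web corpus."""
    tokens = word_tokenize(text)
    n = max(len(tokens), 1)
    pos_dist = {f"pos={t}": c / n
                for t, c in Counter(tag for _, tag in pos_tag(tokens)).items()}
    tf = Counter(stem(w.lower()) for w in tokens if w.isalpha())
    tfidf = {f"tok={w}": c * idf.get(w, 0.0) for w, c in tf.items()}
    return {**pos_dist, **tfidf, "num_words": n}
```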
Adapted co-training:
Sample balancing: preserve the class ratio when adding noisily labeled examples; poor performance without it (sketched below)
Replace the unlabeled data at each round
Use BoosTexter: handles word features easily
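A minimal sketch of the sample-balancing step, with illustrative names: when harvesting confidently labeled examples each round, admit them per class so the added batch mirrors the labeled set’s class ratio instead of letting majority classes flood in.

```python
from collections import Counter

def balanced_pick(candidates, labeled_y, budget):
    """candidates: (confidence, example, predicted_label) triples.
    Returns up to `budget` examples whose class mix mirrors labeled_y."""
    ratio = Counter(labeled_y)
    total = sum(ratio.values())
    quota = {c: max(1, round(budget * n / total)) for c, n in ratio.items()}
    picked = []
    for conf, ex, label in sorted(candidates, key=lambda t: -t[0]):
        if quota.get(label, 0) > 0:          # class still under its quota?
            picked.append((ex, label))
            quota[label] -= 1
    return picked
```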
Five-fold cross-validation
General performance?
Specific performance on:
Fine-grained classification?
XHTML / DIV pages?
Others’ tasks?
Statistically significant improvement
Improvement on large classes at the expense of minority classes
Despite sample balancing
No fine-grained classes detected
Smaller dataset
1/5 the size, with a limited sample of sites
Both the annotated and unannotated data sets were smaller
As a result, fewer co-training iterations
The single-view model still seems to do better
Slightly different block-splitting model than the earlier work
Fewer training examples
No significant gain from co-training, but comparable to other work (19.5% error vs. 14–18% error)
Co-training model for web block classification
Achieves a 28.5% reduction in error on the main task
However, it fails at:
Detecting fine-grained classes
→ Exploit templates, IE methods, path similarity and context
Likely needs enough unlabeled data
→ Re-run using more experimental data
Dependent on the learning model
→ Looking to change the learning package
Any questions?
http://parcels.sourceforge.net/
Available in late November 2004
Annotator, evaluation tools provided
Handles XHTML and DIV / CSS
Open source, GPL’ed code