Stylistic and lexical co-training for webpage block classification
Chee How Lee, Min-Yen Kan and Sandra Lai
National University of Singapore
kanmy@comp.nus.edu.sg
What’s a web page block?
- Parts of a web page with different functions
  - e.g., main content, site navigation, advertisements
- Making this distinction is important for many tasks
- Different names for the same thing: fragments, elements, blocks, tiles
- Extract content for mobile devices
  - “Just the facts, ma’am”
- Summarization = better (whole-)page classification
- Advertisement blocking
- Fragment versioning
- Distinguish navigation from content
  - Better link-based ranking
- Earliest systems used hard-coded wrappers
  - Content-focused (e.g., largest table cell)
  - Didn’t scale
- Now, multi-class classification using mixed features: lexical, structural and spatial
  - HTML path structure (Yang et al. 03, Shih and Karger 04)
  - Spatial random walk using the browser-exposed DOM tree, which allows precise layout information (Xin and Lee 04)
- An obvious approach is to build a supervised classifier (see the sketch after this list)
  - Train on labeled examples (f1, f2, …, fi, …, fn, C)
  - Test by distilling features (f1, f2, …, fi, …, fn) = ?
- Training data is costly, so we need to use unlabeled data
- The feature sets are largely orthogonal → try co-training!
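As a concrete picture of this baseline, the toy sketch below trains a classifier on labeled block feature vectors (f1, …, fn, C) and predicts the class of an unseen block from the same features. The scikit-learn decision tree and the three toy features are stand-ins, not the features or learner actually used in the system.

```python
# Minimal sketch of the supervised baseline: blocks are feature vectors,
# classes are block labels. scikit-learn and the toy features are stand-ins.
from sklearn.tree import DecisionTreeClassifier

# (f1, ..., fn) per block, e.g. [num_words, num_links, relative_font_size]
X_train = [[250, 2, 1.0],   # main content
           [15, 12, 0.8],   # site navigation
           [40, 1, 0.7]]    # advertisement
y_train = ["content", "navigation", "advertisement"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Test by distilling the same features from an unseen block.
print(clf.predict([[30, 10, 0.8]]))   # -> likely "navigation"
```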
Co-training (Blum and Mitchell 98)
- Two learners with separate views of the same problem
  - B&M characterize this with the example of classifying web pages: link structure vs. text on the page
- Use one classifier to help the other (see the loop sketched below)
  - e.g., pages that the link classifier is confident about are passed as correct answers to the text-based classifier
- Assumes the individual classifiers are reasonably accurate to start with
  - Otherwise the noise level escalates
- Handles only binary classification
- Handles distribution skewing (new examples are added in proportion to the underlying class ratio)
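The loop below is a minimal sketch of such a co-training procedure, written for the two block views used later in this talk (stylistic and lexical features over the same blocks). The decision-tree classifier, the 0.9 confidence cutoff, and the single shared labeled pool are illustrative assumptions; B&M’s binary-only setting, their fixed per-round growth rates, and the paper’s BoosTexter learner are not reproduced here.

```python
# Sketch of a Blum & Mitchell-style co-training loop over two views of the
# same blocks. DecisionTreeClassifier, the 0.9 confidence cutoff and the
# shared labeled pool are illustrative choices (the paper uses BoosTexter).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def co_train(X_sty, X_lex, y, U_sty, U_lex, rounds=10, threshold=0.9):
    """X_*: labeled views, y: integer class ids, U_*: unlabeled views."""
    for _ in range(rounds):
        sty_clf = DecisionTreeClassifier().fit(X_sty, y)
        lex_clf = DecisionTreeClassifier().fit(X_lex, y)
        if len(U_sty) == 0:
            break

        promote = np.zeros(len(U_sty), dtype=bool)
        guessed = np.zeros(len(U_sty), dtype=y.dtype)
        # Each view nominates the unlabeled blocks it is most confident about.
        for clf, U in ((sty_clf, U_sty), (lex_clf, U_lex)):
            probs = clf.predict_proba(U)
            confident = (probs.max(axis=1) >= threshold) & ~promote
            guessed[confident] = clf.classes_[probs.argmax(axis=1)][confident]
            promote |= confident

        # Confident guesses from one view become training data for both views.
        X_sty = np.vstack([X_sty, U_sty[promote]])
        X_lex = np.vstack([X_lex, U_lex[promote]])
        y = np.concatenate([y, guessed[promote]])
        U_sty, U_lex = U_sty[~promote], U_lex[~promote]
    return sty_clf, lex_clf
```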
PARCELS: PARser for Content Extraction & Layout Structure
- Goals:
  - Coarse-grained classification
  - Fine-grained information extraction
  - Work on a variety of sources
  - Open-source, reference implementation
- News stories
- Domain-specific fine-grained classes (denoted by *)
- Needs XHTML / CSS support
- Blocks can have multiple classes
  - Multi-class forced to single: the assessor picks the most prominent class
  - Resulting corpus has a skewed class distribution
- 50 sites from Google News
- Pages are not well-formed: run Tidy first
- Block classes:
  - Main content
  - Site navigation
  - Search
  - Supporting content
  - Links supporting content
  - Image supporting content
  - Sub headers
  - Site image
  - Advertisements*
  - Links to related articles*
  - Newsletter / alert links*
  - Date or time of article*
  - Source station (country of report)*
  - Reporter name*
Lexical and Stylistic Co-training
- Split the document into blocks using the DOM tree (a naive splitter is sketched below)
  - Nontrivial (overlapping blocks; visual segments differ)
- Co-train two learners:
  - Learner 1 – stylistic learner
    - Spatial and structural relationships
    - External relationship to other blocks
  - Learner 2 – lexical learner
    - POS and link-related features
    - Internal classification, irrespective of other blocks
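The splitter below is a deliberately naive approximation of the DOM-based splitting step, using BeautifulSoup as an assumed HTML library and a hand-picked set of container tags; the real splitter has to cope with overlapping blocks and visual segments that differ from the DOM structure.

```python
# Naive sketch of DOM-based block splitting: descend from <body>, treating
# table cells and <div>s as candidate blocks, recursing through layout tags.
# BeautifulSoup and the tag heuristics are assumptions.
from bs4 import BeautifulSoup

CONTAINER_TAGS = {"html", "body", "table", "tbody", "tr", "center", "form"}


def split_into_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []

    def walk(node):
        for child in node.find_all(recursive=False):
            if child.name in CONTAINER_TAGS:
                walk(child)                 # keep descending through layout tags
            elif child.get_text(strip=True) or child.find("img"):
                blocks.append(child)        # a leaf-ish node with visible content

    walk(soup.body or soup)
    return blocks


# Example: one navigation-like cell and one content-like cell.
html = ("<body><table><tr><td><a href='/'>Home</a></td>"
        "<td><p>Story text here.</p></td></tr></table></body>")
for block in split_into_blocks(html):
    print(block.name, block.get_text(strip=True))
```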
- Layout: guessed from first-level DOM nodes (a feature-extraction sketch follows this list)
  - Linear
  - <Table>: use reading order, cell-type propagation
  - XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth
- Font (CSS too): relative features
- Image size
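A rough sketch of the kind of per-block stylistic evidence listed above, assuming blocks are DOM nodes as produced by the splitter sketch earlier. The attribute lookups, the 12px default font size, and the feature names are assumptions; deriving true absolute positions would require translating the page’s CSS, which is only hinted at here.

```python
# Rough sketch of per-block stylistic evidence: position in reading order,
# DOM depth, tag type, a crude relative font size from inline CSS, and image
# size. Attribute names and defaults are assumptions, not the actual features.
import re


def stylistic_features(block, reading_order_index, total_blocks):
    depth = len(list(block.parents))                 # structural depth in the DOM
    style = block.get("style", "")
    m = re.search(r"font-size:\s*(\d+)", style)
    font_size = int(m.group(1)) if m else 12         # assume 12px default

    img = block.find("img")
    img_area = 0
    if img is not None:
        try:
            img_area = int(img.get("width", 0)) * int(img.get("height", 0))
        except ValueError:
            pass                                     # non-numeric sizes ignored

    return {
        "reading_order": reading_order_index / max(total_blocks - 1, 1),
        "dom_depth": depth,
        "is_table_cell": block.name in ("td", "th"),
        "is_div": block.name == "div",
        "relative_font_size": font_size / 12.0,
        "image_area": img_area,
    }
```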
- For each block (lexical feature sketch below):
  - POS tag distribution in the text
  - Stemmed tokens weighted by TF×IDF
    - IDF from Stanford’s WebBase
  - Number of words
  - Alt text of images
  - Hyperlink type (e.g., embedded image, text, mailto)
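A sketch of the lexical evidence listed above, again treating a block as a DOM node. NLTK (with its downloadable tokenizer and tagger models), the Porter stemmer, and the tiny inline IDF table standing in for IDF statistics from Stanford’s WebBase crawl are all assumptions.

```python
# Sketch of per-block lexical evidence: POS tag distribution, TF x IDF
# weighted stems, word count, image alt text, and hyperlink types.
# Requires nltk.download("punkt") and the POS tagger models; the IDF table
# is a stand-in for WebBase-derived statistics.
from collections import Counter

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
IDF = {"home": 0.3, "stori": 2.1, "copyright": 0.5}   # stand-in IDF table


def lexical_features(block):
    text = block.get_text(" ", strip=True)
    tokens = nltk.word_tokenize(text)

    pos_counts = Counter(tag for _, tag in nltk.pos_tag(tokens))
    stems = [stemmer.stem(t.lower()) for t in tokens if t.isalpha()]
    tf = Counter(stems)
    tfidf = {s: tf[s] * IDF.get(s, 1.0) for s in tf}

    return {
        "num_words": len(tokens),
        "pos_distribution": {t: c / max(len(tokens), 1)
                             for t, c in pos_counts.items()},
        "tfidf": tfidf,
        "alt_text": [img.get("alt", "") for img in block.find_all("img")],
        "link_types": [
            "mailto" if a.get("href", "").startswith("mailto:")
            else "image" if a.find("img") else "text"
            for a in block.find_all("a")
        ],
    }
```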
- Adapted co-training:
  - Sample balancing: preserve the class ratio among newly added, noisily labeled examples (one reading is sketched below); performance is poor without it
  - Replace unlabeled data at each round
  - Use BoosTexter: it handles word features easily
- Five-fold cross-validation
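One simple reading of the sample-balancing step: cap how many newly, noisily labeled examples of each class enter the training pool per round, in proportion to the class ratio seen in the seed labeled data. The per-round budget and the quota rule below are assumptions, not the paper’s exact procedure.

```python
# One reading of sample balancing: per co-training round, admit new noisily
# labeled examples class by class, in proportion to the seed class ratio.
# The budget and quota rule are assumptions.
from collections import Counter
import math


def balance_new_examples(candidates, seed_labels, budget=20):
    """candidates: list of (example, guessed_label), most confident first."""
    ratio = Counter(seed_labels)
    total = sum(ratio.values())
    quota = {c: max(1, math.floor(budget * n / total)) for c, n in ratio.items()}

    taken, kept = Counter(), []
    for example, label in candidates:
        if taken[label] < quota.get(label, 0):
            kept.append((example, label))
            taken[label] += 1
    return kept
```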
- General performance?
- Specific performance on:
  - Fine-grained classification?
  - XHTML / DIV pages?
  - Others’ tasks?
- Statistically significant improvement
- Improvement on the large classes comes at the expense of the minority classes
  - Despite sample balancing
- No fine-grained classes detected
- Smaller dataset
  - 1/5 the size; limited sites to sample from
  - Both the annotated and unannotated data sets were smaller
  - As a result, fewer co-training iterations
- The single-view model still seems to do better
- Slightly different model of splitting than earlier work
- Fewer training examples
- No significant gain from co-training, but comparable to other work (19.5% error vs. 14–18% error)
- Co-training model for web block classification
  - Achieves a 28.5% reduction in error on the main task
- However, it fails in:
  - Detecting fine-grained classes
    → Exploit templates, IE methods, path similarity and context
  - Likely needs enough unlabeled data
    → Re-run using more experimental data
  - Dependence on the learning model
    → Looking to change the learning package
Any questions?
- http://parcels.sourceforge.net/
- Available in late November 2004
- Annotator and evaluation tools provided
- Handles XHTML and DIV / CSS
- Open source, GPL’ed code