1
|
- Chee How Lee, Min-Yen Kan and Sandra Lai
- National University of Singapore
- kanmy@comp.nus.edu.sg
|
2
|
- What’s a web page block?
- Parts of a web page with different functions
- e.g., main content, site navigation, advertisements
- This distinction matters for many applications (next slide)
- Different names for the same thing
- fragments, elements, blocks, tiles
|
3
|
- Extracting content for mobile devices
- “Just the facts, ma’am”
- summarization → better (whole-)page classification
- Advertisement blocking
- Fragment versioning
- Distinguishing navigation from content
- → better link-based ranking
|
4
|
- Earliest systems used hard-coded wrappers
- Content-focused (e.g., largest table cell)
- Didn’t scale
- Now: multi-class classification using mixed lexical, structural and spatial features
- HTML path structure (Yang et al. 03, Shih and Karger 04)
- Spatial random walk over the browser-exposed DOM tree
- Allows precise layout information (Xin and Lee 04)
|
5
|
- An obvious approach: build a supervised classifier (toy sketch below)
- Train on labeled examples (f1, f2, …, fi, …, fn, C)
- Test by distilling features (f1, f2, …, fi, …, fn) = ?
- Labeled training data is costly → need to exploit unlabeled data
- The two feature sets are largely orthogonal
- → Try co-training!
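A minimal sketch of that supervised baseline, assuming scikit-learn; the three features (word count, link density, DOM depth) are invented for illustration and are not the paper's feature set:

```python
# Toy supervised baseline: each block is a feature vector
# (f1, ..., fn) paired with a class label C.
from sklearn.linear_model import LogisticRegression

X_train = [[120, 0.05, 2],   # main content: many words, few links
           [  8, 0.90, 1],   # site navigation: mostly links
           [ 15, 0.75, 3]]   # links to related articles
y_train = ["content", "navigation", "related"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)    # train on labeled (f1, ..., fn, C)

X_test = [[95, 0.10, 2]]     # unseen block: (f1, ..., fn) = ?
print(clf.predict(X_test))   # -> predicted class C
```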
|
6
|
- Two learners with separate views of the same problem
- Canonical example (Blum & Mitchell): classifying web pages using two views
- Link structure
- Text on the page
|
7
|
- Use one classifier to help the other (loop sketched below)
- e.g., pages that the link classifier is confident about are passed as correct answers to the text-based classifier
- Assumes the individual classifiers are reasonably accurate to start with
- Otherwise the noise level escalates
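A simplified sketch of the loop just described, assuming scikit-learn classifiers with predict_proba; in this variant both views feed one shared pool of noisily labeled examples, and the threshold and round count are illustrative, not the paper's settings:

```python
# Simplified co-training: two classifiers over disjoint views of
# the same blocks; each round, confident predictions on unlabeled
# blocks are promoted to (noisy) labeled examples for both views.
from sklearn.linear_model import LogisticRegression

def cotrain(Xa, Xb, y, Ua, Ub, rounds=5, threshold=0.9):
    """Xa/Xb: two feature views of the labeled blocks; y: labels.
    Ua/Ub: the same two views of the unlabeled blocks."""
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    clf_a = LogisticRegression(max_iter=1000)
    clf_b = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_a.fit(Xa, y)
        clf_b.fit(Xb, y)
        keep = []
        for i in range(len(Ua)):
            pa = clf_a.predict_proba([Ua[i]])[0]
            pb = clf_b.predict_proba([Ub[i]])[0]
            if pa.max() >= threshold:      # view A is confident
                label = clf_a.classes_[pa.argmax()]
            elif pb.max() >= threshold:    # else ask view B
                label = clf_b.classes_[pb.argmax()]
            else:
                keep.append(i)             # still unlabeled
                continue
            Xa.append(Ua[i]); Xb.append(Ub[i]); y.append(label)
        Ua = [Ua[i] for i in keep]         # shrink unlabeled pool
        Ub = [Ub[i] for i in keep]
    return clf_a, clf_b
```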
|
8
|
- B&M co-training handles only binary classification
- Our adaptation handles multi-class labels and distribution skewing
|
9
|
- PARser for Content Extraction & Layout Structure
- Goals:
- Coarse-grained classification
- Fine-grained information extraction
- Work on a variety of sources
- Open-source, reference implementation
|
10
|
- News stories: 50 sites sampled from Google News
- Pages often not well-formed: run Tidy first
- Needs XHTML / CSS support
- Blocks can have multiple classes
- Multi-class forced to single: assessor picks the most prominent class
- Resulting corpus has a skewed class distribution
- Classes (domain-specific fine-grained classes denoted by *):
- Main Content
- Site Navigation
- Search
- Supporting content
- Links supporting content
- Image supporting content
- Sub headers
- Site image
- Advertisements*
- Links to related articles*
- Newsletter / alert links*
- Date or Time of article*
- Source Station (country of report)*
- Reporter Name*
|
11
|
- Split the document into blocks using the DOM tree (rough sketch below)
- Nontrivial (blocks overlap; visual segments differ from DOM segments)
- Co-train two learners:
- Learner 1 – stylistic learner
- Spatial and structural relationships
- External: relates each block to other blocks
- Learner 2 – lexical learner
- POS and link-related features
- Internal: classifies each block irrespective of other blocks
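A rough sketch of the DOM-based splitting step, using BeautifulSoup as a stand-in parser; the tag list is illustrative, and the real splitter must handle the overlapping-block and visual-segment issues noted above:

```python
# Rough DOM-based block splitting: treat block-level elements that
# directly contain text as candidate blocks.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["div", "td", "p", "ul", "h1", "h2", "h3"]

def split_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for node in soup.find_all(BLOCK_TAGS):
        # Keep only the text this node contributes directly, so a
        # <div> wrapping other blocks does not swallow them.
        own_text = " ".join(
            s.strip() for s in node.find_all(string=True, recursive=False)
            if s.strip())
        if own_text:
            blocks.append((node.name, own_text))
    return blocks

html = """<table><tr><td>Home | News | Sports</td>
<td><p>Storm hits coast, thousands evacuated...</p></td></tr></table>"""
for tag, text in split_blocks(html):
    print(tag, "->", text[:40])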
|
12
|
- Layout: inferred from first-level DOM nodes
- Linear
- <TABLE>: use reading order, cell-type propagation
- XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth (sketch below)
- Font (CSS too): relative features
- Image size
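A toy sketch of the relative-to-absolute translation for nested <DIV>s; the Node layout data is a hypothetical stand-in for what a CSS-aware parser would supply:

```python
# Each node's offset is relative to its parent (CSS-style), so
# absolute coordinates come from summing offsets down the tree.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    dx: int = 0          # offset relative to parent
    dy: int = 0
    children: list = field(default_factory=list)

def absolutize(node, x=0, y=0, depth=0, out=None):
    """Walk the DOM, turning relative offsets into absolute (x, y)
    and recording tree depth: two of the spatial features."""
    out = [] if out is None else out
    x, y = x + node.dx, y + node.dy
    out.append((node.name, x, y, depth))
    for child in node.children:
        absolutize(child, x, y, depth + 1, out)
    return out

page = Node("body", children=[
    Node("div#nav", dx=0, dy=0),
    Node("div#main", dx=200, dy=0, children=[
        Node("div#story", dx=10, dy=40)])])
for name, x, y, depth in absolutize(page):
    print(f"{name}: abs=({x},{y}) depth={depth}")
```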
|
13
|
- For each block:
- POS tag distribution in text
- Stemmed tokens weighted by TF×IDF (toy sketch below)
- IDF statistics from Stanford’s WebBase
- Number of words
- Alt text of images
- Hyperlink type (e.g., embedded image, text, mailto)
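A toy version of the TF×IDF weighting for one block's tokens; the IDF table is invented, standing in for statistics drawn from Stanford's WebBase:

```python
# TFxIDF weights over a block's stemmed tokens, plus word count.
# (The POS-distribution feature would be computed separately.)
import math
from collections import Counter

IDF = {"storm": 4.2, "coast": 3.8, "home": 1.1, "news": 0.9}  # toy values

def lexical_features(tokens, default_idf=5.0):
    tf = Counter(tokens)
    weights = {t: (1 + math.log(c)) * IDF.get(t, default_idf)
               for t, c in tf.items()}
    return {"num_words": len(tokens), "tfidf": weights}

print(lexical_features(["storm", "storm", "coast", "evacu"]))
```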
|
14
|
- Adapted co-training:
- Sample balancing: preserve the class ratio among noisily labeled examples; performance is poor without it (sketch after this list)
- Replace unlabeled data at each round
- Use BoosTexter: handles word features easily
- Five-fold cross validation
- General performance?
- Specific performance on:
- Fine-grained classification?
- XHTML / DIV pages?
- Others’ tasks?
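One way the sample-balancing step could look, assuming per-class quotas proportional to the labeled class distribution; names, counts, and the quota rule are illustrative, not the paper's exact procedure:

```python
# When promoting noisily labeled examples, allocate per-class slots
# in proportion to the labeled class distribution, so majority
# classes cannot flood the pool.
from collections import Counter

def balanced_take(candidates, labeled_y, budget=20):
    """candidates: (example, predicted_label, confidence) triples,
    e.g. the confident predictions from one co-training round."""
    share = Counter(labeled_y)
    total = sum(share.values())
    quota = {c: max(1, round(budget * n / total)) for c, n in share.items()}
    taken = []
    for ex, label, conf in sorted(candidates, key=lambda t: -t[2]):
        if quota.get(label, 0) > 0 and len(taken) < budget:
            taken.append((ex, label))
            quota[label] -= 1
    return taken

y_labeled = ["content"] * 6 + ["nav"] * 3 + ["ads"]
cands = [("b1", "content", .99), ("b2", "content", .97), ("b3", "nav", .95)]
print(balanced_take(cands, y_labeled, budget=2))
```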
|
15
|
- Statistically significant improvement
- Improvement on large classes comes at the expense of minority classes
- No fine-grained classes detected
|
16
|
- Smaller dataset: 1/5 the size, from a limited sample of sites
- Both annotated and unannotated data sets were smaller
- As a result, fewer co-training iterations
- The single-view model still seems to do better
|
17
|
- Slightly different block-splitting model than earlier work
- Fewer training examples
- No significant gain from co-training, but comparable to other work (19.5% error vs. 14-18% error)
|
18
|
- Co-training model for web block classification
- Achieves a 28.5% reduction in error on the main task
- However, it fails at:
- Detecting fine-grained classes
- → Exploit templates, IE methods, path similarity and context
- Likely needs more unlabeled data
- → Re-run with more experimental data
- Dependent on the learning model
- → Looking to change the learning package
|
19
|
- Any questions?
- http://parcels.sourceforge.net/
- Available in late November 2004
- Annotator, evaluation tools provided
- Handles XHTML and DIV / CSS
- Open source, GPL’ed code
|