Stylistic and lexical co-training for webpage block classification
Chee How Lee, Min-Yen Kan and Sandra Lai
National University of Singapore
kanmy@comp.nus.edu.sg
What’s a web page block?
- Parts of a web page with different functions
  - e.g., main content, site navigation, advertisements
- Making this distinction is important for many tasks
- Different names for the same thing: fragments, elements, blocks, tiles
- Extract content for mobile devices
  - “Just the facts, ma’am”
- Summarization = better (whole-)page classification
- Advertisement blocking
- Fragment versioning
- Distinguish navigation from content
  - Better link-based ranking
- Earliest systems used hard-coded wrappers
  - Content-focused (e.g., largest table cell)
  - Didn’t scale
- Now, multi-class classification using mixed features: lexical, structural and spatial
  - HTML path structure (Yang et al. 03, Shih and Karger 04)
  - Spatial random walk using the browser-exposed DOM tree, which allows precise layout information (Xin and Lee 04)
- An obvious approach is to build a supervised classifier (see the sketch after this list)
  - Train on labeled examples (f1, f2, …, fi, …, fn, C)
  - Test by distilling features (f1, f2, …, fi, …, fn) = ?
- Training data is costly, so we need to use unlabeled data
- The feature sets are largely orthogonal → try co-training!
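As a concrete picture of this baseline, the toy sketch below trains a classifier on labeled block feature vectors (f1, …, fn, C) and predicts the class of an unseen block from the same features. The scikit-learn decision tree and the three toy features are stand-ins, not the features or learner actually used in the system.

```python
# Minimal sketch of the supervised baseline: blocks are feature vectors,
# classes are block labels. scikit-learn and the toy features are stand-ins.
from sklearn.tree import DecisionTreeClassifier

# (f1, ..., fn) per block, e.g. [num_words, num_links, relative_font_size]
X_train = [[250, 2, 1.0],   # main content
           [15, 12, 0.8],   # site navigation
           [40, 1, 0.7]]    # advertisement
y_train = ["content", "navigation", "advertisement"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Test by distilling the same features from an unseen block.
print(clf.predict([[30, 10, 0.8]]))   # -> likely "navigation"
```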
Co-training (Blum and Mitchell 98)
- Two learners with separate views of the same problem
  - B&M characterize this with the example of classifying web pages: link structure vs. text on the page
- Use one classifier to help the other (see the loop sketched below)
  - e.g., pages that the link classifier is confident about are passed as correct answers to the text-based classifier
- Assumes the individual classifiers are reasonably accurate to start with
  - Otherwise the noise level escalates
- Handles only binary classification
- Handles distribution skewing (new examples are added in proportion to the underlying class ratio)
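The loop below is a minimal sketch of such a co-training procedure, written for the two block views used later in this talk (stylistic and lexical features over the same blocks). The decision-tree classifier, the 0.9 confidence cutoff, and the single shared labeled pool are illustrative assumptions; B&M’s binary-only setting, their fixed per-round growth rates, and the paper’s BoosTexter learner are not reproduced here.

```python
# Sketch of a Blum & Mitchell-style co-training loop over two views of the
# same blocks. DecisionTreeClassifier, the 0.9 confidence cutoff and the
# shared labeled pool are illustrative choices (the paper uses BoosTexter).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def co_train(X_sty, X_lex, y, U_sty, U_lex, rounds=10, threshold=0.9):
    """X_*: labeled views, y: integer class ids, U_*: unlabeled views."""
    for _ in range(rounds):
        sty_clf = DecisionTreeClassifier().fit(X_sty, y)
        lex_clf = DecisionTreeClassifier().fit(X_lex, y)
        if len(U_sty) == 0:
            break

        promote = np.zeros(len(U_sty), dtype=bool)
        guessed = np.zeros(len(U_sty), dtype=y.dtype)
        # Each view nominates the unlabeled blocks it is most confident about.
        for clf, U in ((sty_clf, U_sty), (lex_clf, U_lex)):
            probs = clf.predict_proba(U)
            confident = (probs.max(axis=1) >= threshold) & ~promote
            guessed[confident] = clf.classes_[probs.argmax(axis=1)][confident]
            promote |= confident

        # Confident guesses from one view become training data for both views.
        X_sty = np.vstack([X_sty, U_sty[promote]])
        X_lex = np.vstack([X_lex, U_lex[promote]])
        y = np.concatenate([y, guessed[promote]])
        U_sty, U_lex = U_sty[~promote], U_lex[~promote]
    return sty_clf, lex_clf
```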
PARCELS: PARser for Content Extraction & Layout Structure
- Goals:
  - Coarse-grained classification
  - Fine-grained information extraction
  - Work on a variety of sources
  - Open-source, reference implementation
- News stories
- Domain-specific fine-grained classes (denoted by *)
- Needs XHTML / CSS support
- Blocks can have multiple classes
  - Multi-class forced to single: the assessor picks the most prominent class
  - Resulting corpus has a skewed class distribution
- 50 sites from Google News
- Pages are not well-formed: run Tidy first
- Block classes:
  - Main content
  - Site navigation
  - Search
  - Supporting content
  - Links supporting content
  - Image supporting content
  - Sub headers
  - Site image
  - Advertisements*
  - Links to related articles*
  - Newsletter / alert links*
  - Date or time of article*
  - Source station (country of report)*
  - Reporter name*
Lexical and Stylistic Co-training
- Split the document into blocks using the DOM tree (a naive splitter is sketched below)
  - Nontrivial (overlapping blocks; visual segments differ)
- Co-train two learners:
  - Learner 1 – stylistic learner
    - Spatial and structural relationships
    - External relationship to other blocks
  - Learner 2 – lexical learner
    - POS and link-related features
    - Internal classification, irrespective of other blocks
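The splitter below is a deliberately naive approximation of the DOM-based splitting step, using BeautifulSoup as an assumed HTML library and a hand-picked set of container tags; the real splitter has to cope with overlapping blocks and visual segments that differ from the DOM structure.

```python
# Naive sketch of DOM-based block splitting: descend from <body>, treating
# table cells and <div>s as candidate blocks, recursing through layout tags.
# BeautifulSoup and the tag heuristics are assumptions.
from bs4 import BeautifulSoup

CONTAINER_TAGS = {"html", "body", "table", "tbody", "tr", "center", "form"}


def split_into_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []

    def walk(node):
        for child in node.find_all(recursive=False):
            if child.name in CONTAINER_TAGS:
                walk(child)                 # keep descending through layout tags
            elif child.get_text(strip=True) or child.find("img"):
                blocks.append(child)        # a leaf-ish node with visible content

    walk(soup.body or soup)
    return blocks


# Example: one navigation-like cell and one content-like cell.
html = ("<body><table><tr><td><a href='/'>Home</a></td>"
        "<td><p>Story text here.</p></td></tr></table></body>")
for block in split_into_blocks(html):
    print(block.name, block.get_text(strip=True))
```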
- Layout: guessed from first-level DOM nodes (a feature-extraction sketch follows this list)
  - Linear
  - <Table>: use reading order, cell-type propagation
  - XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth
- Font (CSS too): relative features
- Image size
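A rough sketch of the kind of per-block stylistic evidence listed above, assuming blocks are DOM nodes as produced by the splitter sketch earlier. The attribute lookups, the 12px default font size, and the feature names are assumptions; deriving true absolute positions would require translating the page’s CSS, which is only hinted at here.

```python
# Rough sketch of per-block stylistic evidence: position in reading order,
# DOM depth, tag type, a crude relative font size from inline CSS, and image
# size. Attribute names and defaults are assumptions, not the actual features.
import re


def stylistic_features(block, reading_order_index, total_blocks):
    depth = len(list(block.parents))                 # structural depth in the DOM
    style = block.get("style", "")
    m = re.search(r"font-size:\s*(\d+)", style)
    font_size = int(m.group(1)) if m else 12         # assume 12px default

    img = block.find("img")
    img_area = 0
    if img is not None:
        try:
            img_area = int(img.get("width", 0)) * int(img.get("height", 0))
        except ValueError:
            pass                                     # non-numeric sizes ignored

    return {
        "reading_order": reading_order_index / max(total_blocks - 1, 1),
        "dom_depth": depth,
        "is_table_cell": block.name in ("td", "th"),
        "is_div": block.name == "div",
        "relative_font_size": font_size / 12.0,
        "image_area": img_area,
    }
```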
- For each block (lexical feature sketch below):
  - POS tag distribution in the text
  - Stemmed tokens weighted by TF×IDF
    - IDF from Stanford’s WebBase
  - Number of words
  - Alt text of images
  - Hyperlink type (e.g., embedded image, text, mailto)
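A sketch of the lexical evidence listed above, again treating a block as a DOM node. NLTK (with its downloadable tokenizer and tagger models), the Porter stemmer, and the tiny inline IDF table standing in for IDF statistics from Stanford’s WebBase crawl are all assumptions.

```python
# Sketch of per-block lexical evidence: POS tag distribution, TF x IDF
# weighted stems, word count, image alt text, and hyperlink types.
# Requires nltk.download("punkt") and the POS tagger models; the IDF table
# is a stand-in for WebBase-derived statistics.
from collections import Counter

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
IDF = {"home": 0.3, "stori": 2.1, "copyright": 0.5}   # stand-in IDF table


def lexical_features(block):
    text = block.get_text(" ", strip=True)
    tokens = nltk.word_tokenize(text)

    pos_counts = Counter(tag for _, tag in nltk.pos_tag(tokens))
    stems = [stemmer.stem(t.lower()) for t in tokens if t.isalpha()]
    tf = Counter(stems)
    tfidf = {s: tf[s] * IDF.get(s, 1.0) for s in tf}

    return {
        "num_words": len(tokens),
        "pos_distribution": {t: c / max(len(tokens), 1)
                             for t, c in pos_counts.items()},
        "tfidf": tfidf,
        "alt_text": [img.get("alt", "") for img in block.find_all("img")],
        "link_types": [
            "mailto" if a.get("href", "").startswith("mailto:")
            else "image" if a.find("img") else "text"
            for a in block.find_all("a")
        ],
    }
```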
- Adapted co-training:
  - Sample balancing: preserve the class ratio among newly added, noisily labeled examples (one reading is sketched below); performance is poor without it
  - Replace unlabeled data at each round
  - Use BoosTexter: it handles word features easily
- Five-fold cross-validation
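One simple reading of the sample-balancing step: cap how many newly, noisily labeled examples of each class enter the training pool per round, in proportion to the class ratio seen in the seed labeled data. The per-round budget and the quota rule below are assumptions, not the paper’s exact procedure.

```python
# One reading of sample balancing: per co-training round, admit new noisily
# labeled examples class by class, in proportion to the seed class ratio.
# The budget and quota rule are assumptions.
from collections import Counter
import math


def balance_new_examples(candidates, seed_labels, budget=20):
    """candidates: list of (example, guessed_label), most confident first."""
    ratio = Counter(seed_labels)
    total = sum(ratio.values())
    quota = {c: max(1, math.floor(budget * n / total)) for c, n in ratio.items()}

    taken, kept = Counter(), []
    for example, label in candidates:
        if taken[label] < quota.get(label, 0):
            kept.append((example, label))
            taken[label] += 1
    return kept
```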
- General performance?
- Specific performance on:
  - Fine-grained classification?
  - XHTML / DIV pages?
  - Others’ tasks?
- Statistically significant improvement
- Improvement on the large classes comes at the expense of the minority classes
  - Despite sample balancing
- No fine-grained classes detected
- Smaller dataset
  - 1/5 the size; limited sites to sample from
  - Both the annotated and unannotated data sets were smaller
  - As a result, fewer co-training iterations
- The single-view model still seems to do better
- Slightly different model of splitting than earlier work
- Fewer training examples
- No significant gain from co-training, but comparable to other work (19.5% error vs. 14–18% error)
- Co-training model for web block classification
  - Achieves a 28.5% reduction in error on the main task
- However, it fails in:
  - Detecting fine-grained classes
    → Exploit templates, IE methods, path similarity and context
  - Likely needs enough unlabeled data
    → Re-run using more experimental data
  - Dependence on the learning model
    → Looking to change the learning package
Any questions?
- http://parcels.sourceforge.net/
- Available in late November 2004
- Annotator and evaluation tools provided
- Handles XHTML and DIV / CSS
- Open source, GPL’ed code