Patterns Are Everywhere

Two Methods of Pattern Matching
Hard Matching
Rule induction
Generalizing training instances into rules represented as regular expressions
Performing slot-by-slot matching
Soft Matching
Hidden Markov Models (HMM)
Soft pattern matching for definitional QA (Cui et al., 2004)

Hard Matching
Lack of flexibility in matching
Cannot handle gaps between rules and test instances
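A minimal sketch (using a hypothetical induced rule, not one from the paper) of why hard, regular-expression matching breaks when a test instance contains a gap the rule never saw in training:

    import re

    # Hypothetical hard rule induced from training sentences of the form
    # "<TERM> is a <X> which ...": every slot must match exactly.
    rule = re.compile(r"(?P<term>\w+) is a (?P<x>\w+) which")

    print(bool(rule.search("A battery is a cell which can provide electricity.")))
    # True: the test instance matches the rule slot by slot.

    print(bool(rule.search("A battery is, essentially, a cell which can provide electricity.")))
    # False: the inserted gap ("essentially") breaks the rigid slot sequence,
    # even though the sentence is still a perfectly good definition.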

Soft Matching (Cui et al., 2004)
…… The channel Iqra is owned by the …
…… severance packages, known as golden parachutes, included ……
A battery is a cell which can provide electricity.

Weakness of Current Soft Matching Models
Ad hoc model parameter estimation
Cui et al., 2004: Lack of formalization
Not generic
Task-specific HMM topology
Difficult to port to other applications

In This Paper …
Propose two generic soft pattern models
Bigram model
Formalization of our previous model
Profile Hidden Markov Model (PHMM)
More complex model that handles gaps better
Parameter estimation by EM algorithm
Evaluations on definitional question answering
Can be applied to other pattern matching applications

Outline
Overview of Definitional QA
Bigram Soft Pattern Model
PHMM Soft Pattern Model
Evaluations
Conclusions

Definitional QA
To answer questions like “Who is Gunter Blobel” or “What is Wicca”.
Why evaluate on definition sentence retrieval?
Diverse patterns
Definitional QA is one of the least explored areas in QA

Pattern Matching for Definitional QA

Outline
Overview of Definitional QA
Bigram Soft Pattern Model
PHMM Soft Pattern Model
Evaluations
Conclusions

Bigram Soft Pattern Model
The interpolation mixture weight λ is estimated with the Expectation Maximization (EM) algorithm
Words and general tags are counted separately
Avoids the overwhelming frequency counts of general tags
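A minimal sketch of the scoring and λ-estimation ideas, assuming the score interpolates a position-specific (slot) token probability with a bigram probability through the mixture weight λ; the EM step shown is the standard update for a two-component interpolation weight, not necessarily the paper's exact formulation:

    import math

    def soft_pattern_score(tokens, slot_prob, bigram_prob, lam, smooth=1e-4):
        """Score a token window against the soft pattern.
        slot_prob[i]      : dict of P(token | slot i); assumes len(tokens) <= number of slots
        bigram_prob[(a,b)]: P(b | a)
        lam               : interpolation mixture weight (lambda)
        Unseen tokens fall back to a small smoothing probability."""
        log_score, prev = 0.0, "<s>"
        for i, tok in enumerate(tokens):
            p_slot = slot_prob[i].get(tok, smooth)
            p_bi = bigram_prob.get((prev, tok), smooth)
            log_score += math.log(lam * p_slot + (1.0 - lam) * p_bi)
            prev = tok
        return log_score

    def em_update_lambda(training_windows, slot_prob, bigram_prob, lam, smooth=1e-4):
        """One EM iteration for lambda: the E-step computes, per token, the posterior
        responsibility of the slot component; the M-step averages it."""
        resp, n = 0.0, 0
        for tokens in training_windows:
            prev = "<s>"
            for i, tok in enumerate(tokens):
                p_slot = lam * slot_prob[i].get(tok, smooth)
                p_bi = (1.0 - lam) * bigram_prob.get((prev, tok), smooth)
                resp += p_slot / (p_slot + p_bi)   # E-step
                n += 1
                prev = tok
        return resp / n                            # M-step: new lambda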

Bigram Model in Dealing with Gaps
The bigram model can handle gaps
Unseen tokens receive small smoothed probabilities at specific slot positions
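A short usage sketch of the scoring function above, with hypothetical distributions, showing that a window containing an unseen gap token still receives a (lower) nonzero score because the unseen token falls back to the smoothing probability:

    # Hypothetical slot and bigram distributions learned from training windows.
    slot_prob = [{"is": 0.6}, {"a": 0.7}, {"cell": 0.3}]
    bigram_prob = {("<s>", "is"): 0.5, ("is", "a"): 0.6, ("a", "cell"): 0.2}

    print(soft_pattern_score(["is", "a", "cell"], slot_prob, bigram_prob, lam=0.7))
    # The gap token "essentially" was never seen in training, so it gets the
    # small smoothed probability instead of zeroing out the whole match.
    print(soft_pattern_score(["is", "essentially", "a"], slot_prob, bigram_prob, lam=0.7))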

Outline
Overview of Definitional QA
Bigram Soft Pattern Model
PHMM Soft Pattern Model
Evaluations
Conclusions

PHMM Soft Pattern Model
A better solution for dealing with gaps
A left-to-right Hidden Markov Model with match, insertion, and deletion states

How PHMM Deals with Gaps
Calculate the generative probability of a test instance
Find the most probable state path with the Viterbi algorithm
Compute probabilities efficiently with the forward-backward algorithm
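A minimal, generic Viterbi sketch in log space (over abstract HMM states rather than the full profile topology with match, insertion, and deletion states), to illustrate how the most probable state path for a test instance is recovered:

    NEG_INF = float("-inf")

    def viterbi(obs, states, log_start, log_trans, log_emit, floor=-10.0):
        """Most probable state path for an observation (token) sequence.
        log_start[s]    : log P(first state is s)
        log_trans[s][t] : log P(s -> t)
        log_emit[s][o]  : log P(o | s); unseen tokens get a small log floor"""
        V = [{s: log_start[s] + log_emit[s].get(obs[0], floor) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                prev_best, best = None, NEG_INF
                for p in states:
                    score = V[t - 1][p] + log_trans[p][s]
                    if score > best:
                        prev_best, best = p, score
                V[t][s] = best + log_emit[s].get(obs[t], floor)
                back[t][s] = prev_best
        # Trace back the highest-scoring path.
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path)), V[-1][last]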

Estimation of PHMM
Parameters are estimated with the Baum-Welch algorithm, using the most probable path during training
Random or uniform initialization may lead to an unsatisfactory model
Due to the extreme diversity of definition patterns and insufficient training data
Assume the path should favor match states over other states
P(token | Match) > P(token | Insertion)
Initialize with smoothed maximum likelihood (ML) estimates
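A minimal sketch of this initialization bias, using hypothetical token counts: emission distributions are initialized from smoothed (add-α) ML estimates, so tokens typical of match positions start with P(token | Match) > P(token | Insertion):

    from collections import Counter

    def smoothed_ml_emissions(counts, vocab, alpha=0.1):
        """Add-alpha smoothed ML estimate of P(token | state) from raw counts."""
        total = sum(counts.values()) + alpha * len(vocab)
        return {tok: (counts.get(tok, 0) + alpha) / total for tok in vocab}

    # Hypothetical counts collected from the most probable paths on training data.
    vocab = {"is", "a", "known", "as", "the"}
    match_counts = Counter({"is": 40, "a": 35, "known": 10})
    insert_counts = Counter({"the": 20, "as": 15})

    p_match = smoothed_ml_emissions(match_counts, vocab)
    p_insert = smoothed_ml_emissions(insert_counts, vocab)

    # Initialization bias: P(token | Match) > P(token | Insertion) for match tokens.
    assert p_match["is"] > p_insert["is"]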

Outline
Overview of Definitional QA
Bigram Soft Pattern Model
PHMM Soft Pattern Model
Evaluations
Overall performance evaluation
Sensitivity to model length
Sensitivity to size of training data
Conclusions

Evaluation Setup
Data set
Test data: TREC-13 question answering task data
AQUAINT corpus and 64 definition questions with answers
Training data
761 manually labeled definition sentences from TREC-12 question answering task data
Comparison systems
State-of-the-art manually constructed patterns
The most comprehensive manually constructed patterns to our knowledge
The previously proposed soft pattern model (Cui et al., 2004)

Evaluation Metrics
Manually checked F3 measure
Based on essential/acceptable answer nuggets
NR – nugget recall: the proportion of essential answer nuggets returned
NP – nugget precision: a length-based penalty on longer answers
NR is weighted three times as heavily as NP (β = 3)
Subject to inconsistent scoring among assessors
Automatic ROUGE score
Gold standard: sentences containing answer nuggets
Counts the trigrams shared between the gold standard and system answers
ROUGE-3-ALL (R3A) and ROUGE-3-ESSENTIAL (R3E)
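For reference, the nugget F-measure with β = 3, as used in the TREC definition task, combines NP and NR as:

    F(\beta = 3) = \frac{(\beta^{2} + 1)\,\mathrm{NP} \cdot \mathrm{NR}}{\beta^{2}\,\mathrm{NP} + \mathrm{NR}} = \frac{10\,\mathrm{NP} \cdot \mathrm{NR}}{9\,\mathrm{NP} + \mathrm{NR}}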

Performance Evaluation
Soft pattern matching outperforms hard matching
Bigram and PHMM models perform better than the previously proposed soft pattern method
Previous soft pattern method is not optimized
Manual F3 scores correlate well with automatic R3 scores

Sensitivity to Model Length
PHMM is less sensitive to model length
PHMM may handle longer sequences

Sensitivity to the Amount of Training Data
PHMM requires more training data to improve

Discussions on Both Models
Both models capture the same information
The importance of a token’s position in the context of the search term
The sequential order of tokens
Different in complexity
Bigram model
Simplified Markov model with each token as a state
Captures token sequential information by bigram probabilities
PHMM model
More complex: aggregates token sequential information via hidden state transition probabilities
Experimental results show
PHMM is less sensitive to model length
PHMM may benefit more from additional training data

Outline
Overview of Definitional QA
Bigram Soft Pattern Model
PHMM Soft Pattern Model
Evaluations
Conclusions

Conclusions
Proposed Bigram model and PHMM model
Generic in form
Systematic parameter estimation by EM algorithm
These two models can be applied to other applications using surface text patterns
Soft patterns have been applied to information extraction (Xiao et al., 2004)
PHMM is more flexible in dealing with gaps, but requires more training data to converge

Q & A