Welcome to TiddlyWiki created by Jeremy Ruston, Copyright © 2007 UnaMesa Association
Kui-Lam Kwok
Web IR
* response time
* adversarial, commercially tied
* aol debacle 2006. ids removed, but easy to trace back user info.
* deep web queries as TREC style research
** brightplanet.com DQM/P and Deepwebtech.com
* Garfield's Algorithm Deviation Indexing for Emma
* Kwok 95 PIRCS Network
3243 midterm again
ZhaoJinEmail (5 Oct)
LRECemailReminder - done
citeseer copy - done
HYP/UROP
* parallel corpus collection / MorphoMT (done)
* lightweight nlp - (done)
* citation typing (urop) - done
* FirefoxExtension (impl) (done)
* scientific ir ** (postponed)
* keyphrase and tagging (hyp) - done ** (removed)
TCS
* visit / slides
* gautam/anantaram follow up
Bought new server
* cte down again!!?!
* ilo slidealign - done, yay
* tomm editing?
* more a0001 stuff
* Finished google queries of all of citeseer. yay!
* fixing mailman. seems to be a host (wing vs aye) config problem.
* done copying out of db1, now on papers.
* kenny lew meeting, ask to contact yves dassas. do ip reports.
* finished shiren editing. whew!
* basic version of cache crypt function
zz grp
hanoi pics
son rec
melvin verif x2
caslyn2: icadl hotel, ijcnlp request
sophia recs to ie
icadl day email followups
icadl presentation
send ed parcels toolkit
email kristine cfp for ijdl
send pictures to group
ask yf to read kazunari sugiyama's work, email yf reply to jcdl lf-sf
fix emma webpage
phys bill catchup
msra cfp
emailed ijdl reminders
crazy limsoon request
email kristine about paula proctor
* Sent Springer email.
* Renewed domain names.
* IM meeting for journal paper.
* Dongwon's visit application packet prep.
* lyrics text file results uploaded.
Back from Seoul, Korea
* Chimetext announcements
* battery exchange
* long arm stapler borrow
* seoul expenses
* print cover of program
* dAnth email
* edited program stuff
clearing emails
fax starhub giro
gm's presentation stuff
do header parsing - partial, need to install more libs -> push to admins
mail sembcorp parking giro
lrec paper edits from bonnie
sfr re-report x3
cm stuff
1101
* student cm problem - done
* post summ - done
* fraction representation - done.
* testing and debugging lecture notes
premia board meeting
3243 mt - prelim - finalized > print
ntu lectures > followup
resched dental
do yeefan prop
* key for EC for group meeting
* reboot cte (sigh)
* volunteer for AIRS booklet
* reply to drago
''todo''
* get speakers back
* write chee how
* moderation
* editing
** shiren's paper
** bang's paper
* reading
** yee fan's query document
* xs
* meetings
** franz och
** dongwon lee
* fo's invited talk at iscslp
* thang urop meeting
* huangzy lunch
* citeseer cleanup
** back online again
** all logs moved to rp and cte
* tyl ntu meeting with jin
* nyberg meeting and lunch
* hari letter
* acl reg figure out
* mail letter / do date change for oap
* practice sessions with hendra (weps & acl)
* submit oap simone
* slides from jesse for jcdl
* mileage claims
* slideseer conversion
* income tax interest payment / nol 2004/2005
* do slides
* meetings
* 5246 initial website created
* more pdfbox
* airs program finalization and prep
* dell monitor
* netflixprize download
* lib meeting
* clearlyunderstood stuff
* tomm meeting
* tomm editing
no coverage but optional to student
* assertions ch 8
* file i/o ch 12
* no maps in array chapter 10
* recursion but no data recursion, just procedural recursion. (for recursion ch15)
* testing and debugging
aaron's slides are in CS1101X. ~cs1101x
1st 4 wks, must cover up to iteration
* first lab wk 3.
* will set PE with leeml
have 18 UDLs + 3 TAs
6 UDL + 1 TA per group
Grading
15% each for {PE, TT1, TT2}, 15% Labs, 40% Exam
* 5246 tut 2, lect 6, tab mt feedback, web interface revamp, update gcal
* pdfbox conversion restarted
** mapReduce.rb updated with new machines
** problem converting with new machines
* got 2xside tape
* chimetext graeme
* wangshuo rec
* editing
** bang's paper
** hendra cv
* lin rec
* reviews
** jcdl printouts
* submit www07textgraphs
* submit hiring forms
* got the disk sent down to helpdesk
* 5246
** tutorial
** lecture
** hw 1 grading, demo
* renew domain aam
* ask for fare quotes
* pt work claims for staff
* duc fax to nist; speaker proposal
* editing ecdl: yeefan and jin
* hiring pt forms
* buy light for bathroom
* ergin recs
file blog06 reimbursements, file mda grant hiring - jesse
dac2008 x1, figures
3243: nb and knn posting
1101: grade TT2, dl meeting, signing,
ijdl special issue tracking down.
malindo workshop
mit vidconf: position paper, vc, followup 1 para
su nam interview
dac discussion
3243: grade 2T
* 5244 lecture
* tomm editing
* email replies
* dAnth work for drago?
* indian buffet reading
* airs scheduling
first classes: late to 3243 first lect :-P
completed icadl reviews -yay
218 to cao
review yf's proposal
1101 site updates
ivle updates for 3243
3243 poll
3243 ta tanyeefa
request robots 3243
did tech report
zaw lin meeting
ask for switch
udl meeting sched
minh thang seats
scholarship for qiul and hendrase
489.44 ziheng
vldl reviewer assignments
updated sps system
police report to clementi branch
meeting with ramesh
request rack shelf / aye / ecp,pie movement
npic - connecting to boostexter.
creating boostexter class
* read wang dong's grp / wang dong's grp
* noi reading
* reminded alicia about bookings
* catering by gourmet (barry 6275-4058), emailed menu, pr blitz.
* xs
* noi meeting / problem prep / programming soln 1 / programming soln 2
* bbq 1,2 (thanks hugh) pit and func room booking
* googled linkage paper editing
* dad's gallery install / firstcycle update
* hw 2 grading
* intrusion problem fixed for now
* bug ppl about gcpps
* get cardboard from storage closet
* fix wing webserver/mailinglist
** survey ides for workshop
* graphreading
* special group meeting
* more airs stuff
* dAnth sample conversion
* 5244 lecture discussion
* dAnth mailing list set up, invitations sent
* IJDL proposal editing
* cacm editing done
* fix up joomla pages
* prager, hang li, emma slides on chimetext, meetings
* check liangzhu's june pay
* ivle 5244 site fixes
* ask unixsp to remount disks, done
* Kenny Lew IDF for npic
* initial try to integrate db to citeseer
* DavidChiangSynchronousGrammars seminar talk
must finish:
* alexia's article -- finished, yeah!
* npic cgi - rats, the multipart handling is not standard as per normal cgi.rb. -- ''done'' ok a really crappy version is up now. Got lots of bugs to go fix later.
* birthday migration to calendar -- ''done''
* start npic poster and handout
* eTochi stuff
* acl practice session 1 / snack prep / practice session 2 announcements / premia / qe marks meeting
* simone email / cv / citeseer sending / hr about sabbatical
* ask cuntai about hari
* omnipage scp hookup / try acl anth conversion
* book lab for RoR / RoR meeting / RoR mailing list set-up
* vldl follow ups / rebroadcasts / wyma vldl
* mohanan meeting / follow ups
* kokkoon email
* slideseer prelim slide view
* group meeting
* retrained xuan's model
* sent out wing news
* revise slides for plag / moss
* fix meetings page
* tomm edits
* prep for grading 5244 - imms2
* hyp for wangye
* MS apps finished.
* update WING
* cs fixed from isaac
* cs copying to lacie started
* reading thesis proposal
* cs 5244 class #1
* stevenha's tp
* student eval
Zhao Jin: 4:07 - 4:17
Qiu Long: 4:25 - 4:35
Yee Fan: 4:39 - 4:49
Jesse: 4:51 - 5:01
Ziheng: 5:10 - 5:20
Bang: 5:29 - 5:39
Emma: 5:43 - 5:53
Hendra: 6:00 - 6:10
* 5246
** wk2 lect notes
** wk2 lect
** figure out tutorial rooms
* car
** transfer monies
* editing
** linziheng
** yeefan
** ss
** jesse
** syan
* hari oap app
* fax duc thing under "Timestamped Graph: a Graph Model for Text Summarization"
* ss dev
** cvs'ed the whole darn thing
** lucene primitive support / highlighting mess
** fix metadata import
** fix screenshots
* paper review meeting
got lyrics done for ieee tmm - finally!
connecting text extractor and slide gif extractor -- still working, humph!
also printed out the alignment papers - finally!
qiul ijnlp 2008 x2, done
vldl x2
blog corpus: faxed
visa reimbursement: submitted
cs1101:peq x2
ijit
widm07 pdf link
cs3243: loa links, grades, scoreboard
* CS 5244 grading
* airs brochure stuff
* write load balancer. fix MR bugs
* send preview of group meeting
* start us tax
* bang airtix
* finish final exam duties
* qe questions
* chimetext sem
* 5246 setup hw2 grading
* read hyp theses
* grade hyp theses 2/6
* op15 install on kpe
* op15 figure out
* book car
* hyp evaluations
* still copying to mnt/usb
* lots of 5244 posts
* ILO stuff
* updated tw with old blog entries. see tag oldBlog
* special prog stuff
* decompress acl anth mirror to sf3
* son's rec letter
* yee fan's cover letter
* partially complete - scanned picture edits
* sent isaac copy of bin and lib of cs
* sent danth url for good/bad conversions
Jing Jiang's talk
v2 of annot guide
packing for as6
forecite refactor
citeseer check
yeefan's / ergin's widm
isaac parsCit send
hari prep / dinner
jin's survey
danth hw email
galv4 restore
grad apps
velardi email
Editing IPM, Shiren's DUC notebook and learning more about CRFs from Sutton and McCallum's book chapter. After talking with Yee Whye, we said:
* read Semi Markov CRF by Sarawagi and Cohen
* read Maximizing Log Probability by Wainwright
* deconstruct CRF packages
* done with 3rd round on cacm.
* hsbc fax
* phd app review
* chuats meeting
* cs metadata checking
* answered sherman.
* send out papers for group review
* more old picture edits
6789 8188 11-13 Jan (MI 368 18:35-20:00 - MI 367 20:40-22:15) Langkawi - 15K miles, 135 taxes KDNNID
ticket bangalore flight
az pass x 3
atap eval
simone talk
simone cheque
ergin recs + email recs
son recs
siva interview
group meeting
book raffles buffet
mcomp reviews - done
3243 midterm administrator
vet jesse's ityouth
chimetext scheduling
ror xtremeapps team
tcs trip
* flight / accom done by tcs
* title and abstract - written
* biodata fetched
cm coord
* mapping
icadl
* flight - done
* copyright release - done
* sher - confirm 524008204
* registration - bank transfer - fax - done
partial premia
* newsletter - out part
* forum - installed, forget bridge
initial acl08 reviewer list
* converting acl anth via pdftohtml
** acl2004, coling2004, hlt-naacl2004, muc7
** X,T,A,I,M,N
** in progress: J, C, W, E, eacl2003
** need to do: P, H
* hang's thesis?
* read over patent stuff in acl anth
* more phd apps - now all done
* dongwon meeting
* sv, dv debugging
* xs
* serc
* wing
** portal fixing
** administrator's update
* 5246 prep
** updated syllabus page - links to chapters and slides from 2 textbooks
** added s52 forms to course pack
* chimetext errands
* editing
** bang's paper
** yeefan's joint paper
** jin's paper
* victor's recommendation done
* Working on getting the appropriate pdf files, connecting them into the pdf995edit utility via RDC.
* the alignment baseline
* replaced p2da-1.0 tgz with 1.1 tgz
* 5246
** homework grading, enter grades to xls
** lecture notes, disc q
** book tutorial room
* hlt trip prep, hk trip prep
* hyp prep for interviews, interviews
* wangye acmmm
* edit neoshiyo, zhaojin, nghongi
* citi payment
* ask off quote banff
3243: sent out grades, reported means, emails, immsnet
1101: immsnet
DAC 2008: done
do hyp grading - started
1101 queries and collections.
qiul/hendrase research fee waiver
seating for hanoi
do ijcnlp hotel booking - started
grader.pl fix by tyf
Reading/Editing hendrase: sect 2-4
"A component model for internet-scale applications"
* do start mailing to airs authors
* imms testing acceptance
* uist editing
* 5246 lecture notes wk 12
* file sg tax
* cjx defense
* jcdl final revision round 2
* acl flight book, hotel inq
* check opac is up
* print hang's papers
* emnlp printouts
* basic qe set
* basic exam set
* hyp disc and decs
** kalpana ch 1 2 and 6
** bang ch 1 and 6 - done
** emma all - done
** yue ch 1 2 3 4
** ziheng all - done
** jesse chap 6
* omnipage 15 auct
* added self to all wing sudo files
* bring spinelli cards in
* RoR
** rar 1.2.3 install on aye
** installed fcgi-2.4.0.tar.gz on aye
** installed httpd-devel via yum on aye
** installed mod_fcgid2.0
** updated forum with sysadm posts
** got cookbook example to work on /cookbook, with local path /var/www/cookbook
* Hang's thesis editing
* Dongwon IJDL first draft done
* HKUST emails done
* Hang's SMS collection set up
* editing cacm article with Yee Fan
* incampus
* mmies review prep
* tw canonicalization
* taslp prep
* opac / danth networking
* textbook comp
* zawlin meeting / annotators hi / kp email
* cec accredit work
* late late icadl sub
* hang visit
* law visit
* cancel newspaper
* chase airs student reg - done
* ask about grant
* grading done
* leave app
report for feedback - redo
payment zaw lin
linked anthology email draft
submit qiul paperwork
rpnlpir zone account appn
arrange kathy time
give / get monitor back
did 1101 lecture notes
resub reviewers ijdl
get breakthrough driver working on sunfire again.
redo rdale email
refactor gcal events to public private
dive options
sigir emails and sched
* Coded jing's basic HMM algorithm
* Coded bigram jaccard
* tomm to taslp conversion
* xuan practice
* grp report sub
* hs' scholarship renewal
* got check back from gh office
* ijdl invite to google
* cert prep and distribution
* hep shot
* xs
* franz och's invite
* dell replacement
* ts proposal for cu
* eepeng/dongwon special issue
* serc init prep
* airs
** post springer receipt to av consultants
** airs 2006 makes it into dblp
* 5244
** archive 5244 project presentation files
** prep 5244 project grading
** revision lecture
* poster session
** archive poster session work
* hlt-naacl 07 reviews
* AIRS booklet
* denny practice session
* student meetings
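A minimal sketch of the bigram Jaccard measure mentioned in the list above. The notes do not say whether the bigrams were over characters or words, so character bigrams are an assumption here, and the method names `bigrams` and `bigram_jaccard` are made up for illustration.

```ruby
require 'set'

# Character bigrams of a string, as a set (case-folded).
# Assumption: character-level bigrams; the original code may have used words.
def bigrams(text)
  text.downcase.chars.each_cons(2).map(&:join).to_set
end

# Jaccard similarity |A & B| / |A | B| over the two bigram sets.
# Two strings with no bigrams at all are treated as identical (1.0).
def bigram_jaccard(a, b)
  x = bigrams(a)
  y = bigrams(b)
  union = x | y
  return 1.0 if union.empty?
  (x & y).size.to_f / union.size
end
```

For example, "night" and "nacht" share only the bigram "ht" out of seven distinct bigrams, so the score is 1/7.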
center title
2:53 start - 5 min
good example of diff btw voice and singing
histogram notes not clear b.2
too much nav
need very clear explanation of exper syl vs word - perhaps disclaim, good comeback on single syl
Don't need A and B subscripts (ok) prop to experimental results table.
* finished cs copying for hidetsugu
* emails out
* patent stuff for chuats
* ijcai reviews half done
* short week (CNY holidays)
* chimetext scheduling
* dAnth conversion (0-200, finished 700s, 600-650, posted on danth Wiki)
* dAnth garbled conversion investigation
* HLT-NAACL review
* 5246 tuts 3 and 4 released, emails
* ganglia installed by admins
* did cny mailout
* JCDL review
* updated my schedule. Finally!
* airs proceedings paper collection
* acl practice session 2 and reimbs
* RoR connect to ParsCit
* mit csail stuff / ppt prep
* jair review yay
* mohanan project comment cat, job desc
* annual interview kalpana
* skype ijdl meeting
* prep 5244 survey grading / printouts
* fixed merlion image
* check dblp+ spidering
* finish materials for prelim cd
* 5244 discussion
* cd design
* embassy stuff and personal record keeping done
* check on acl anth conversion:
** done: A, C, E, H, I, J, M, N, P, T, X, acl2004, coling2004, hlt-naacl2004, eacl2003,
** in progress: W
* 5244 wk2 stuff
* 5244 class
* reading hang's tois edits r1-3 ok.
* hang's thesis re-edit. not yet there...
* 5246
** lecture 3 and 4
** prep hw1 - corpus, web interface, indexing, qrel judging, qrel assignment, qrel sample files
** tut 0 and 1
** room switch to SR 2
* car
** pick up
** stickers
* editing
** qiul (x3)
** hendra (x3)
** yeefan - and fax cprght form to acm
** ziheng (x2)
* syan paper
* bang's survey
* meetings
** cu
** donny
* hlt short review printout
* hsbc cc waive and redeem
* rg 7 for 245
* did flights booking
* aligner inspection done
* answer yves and begona's email
* answered yee fan's email
need to do today
* prep or do lyrics
* prep or do RG 5
* npic poster
setting exams: 1101 3243
3243: demo scheduling, ivle and mails
wingnews out
CS 1101: exam question setting, make up lecture, poll offline, plag
derry's grp: done comments
yeefan's prop
pick up pottery
* AIRS
* meetings
* grade remaining hw1
todo
* 5244 hw2 source finding
* ijdl form
* kp mohanan meeting
* ror install on macos
* 5246 grading
* svm training over crf++ for citation parsing
* ''hlt naacl / text graphs 2 / duc 2007''
* email sendouts / wingnews invites
* update my pub list
* ask hendra update wing pub list
* rec letters for ziheng, vu
* crp work
* trip expenses
* done with ACL anthology conversion by pdftohtml
* SERC HFE workshop
* hang's phd thesis edits (again)
* citeseer/singapore_copy/papers/
* emic versus etic (think phonemic vs phonetic): what are the differences in units for semantic
tuition waiver
pic continue backup
move to as6
read shanhengs prop
mmies
hari's 1st talk
grad apps
* dAnth stuff / links / new slice prep for umich
* setup new computer at home
* launchy install for office pc
* noi 2007 initial problem setting
* short paper IUI
* http://www.nus.edu.sg/comcen/acctman/
* chimetext chia tee kiah
* wk 8 discussion questions
* list for ijdl
* tomm revision / discussion to retarget to TASLP
* started new grant proposal
* dongwon visit stuff
* do ppt/pdf upload to AIRS 2006
* do metadata for AIRS 2006 for DBLP
* borrow drill
* TASLP email retarget
* fixup geoip usage data on citeseer
* w10 lecture revision and distribution
* final exam writing and origin report
* course pack artwork for 5246
* w11 lecture reading
* ijdl cover and cfp
* start printing cds
* stuck with dvd reading problems / udf problem with citeseer
* serc proposal time
do ijdl assignments
acl logins assigned
qiul thesis chapters: applications, sim step 1, sim step 2
az wrap up: context tags
install printer for psn518
print and read [PDF] Citation Analysis and Discourse Analysis Revisited -HD WHITE
Review/Non-review classification web snippet classification
ijclclp review
Returned from COLING/ACL/EMNLP. Whew.
* Hang's defense
* Claim forms
* Dongwon's visit application
* iras stuff
feedback x 2
phd review app
ieee taslp
premia newsletter
header reparse x2
bring fork
do jesse it to naomi to sign
yf proposal scheduling
qiul ijcnlp paper
acl08 reviewer list x2
final reports to students.
final data for archiving
3243 - midterm grading - tutorial updates - hw2 poll - qa - tut sol 8
for hw1 - midterm answers
sigir email warn
cl lab hyp/urop/admin students
* get back dvds and drive
* bang to copy sigmod anthology to cte,citeseer
* usage and crontab job on citeseer, http://citeseer.comp.nus.edu.sg/usage/ (don't forget trailing slash)
* a001 / cv
https://aces01.nus.edu.sg/sop/WebPageHandler
* graphreading
* airs
** tiddlywiki - done
** invited talks
* cs5244
** hw2 prep done
** makeup lecture logistics
** project page updating
* xs
* fix up ssSpider.rb class instances
* at jobs for ss
* cd printing
* altw reviews
* short week
* course pack submission
* dongwon
** final talks with Dongwon on jcdl sub
** atung discussion
** final report
* scan, pgped review
* special programme
** review wiki thang's email.
** thang's access
* wing
** email group meeting sched
** update website
* serc hfe
** set up phone meeting with tyl
** download template
* ijdl
** ia re-invites
* ss
** did print and full slide view
** fix css for hrefs
** fix alignment bugs
** redid url munging for data sources
* editing
** jin's short jcdl
** bit of shiren's paper
** jinxiu's thesis reporting
* got reimb from GH for bbq
* edit sigir poster
* resurrected (partially) parsCit
* 5246
** slides: qa, intro sum
** qrels for hw1
* jcdl reviews
* faxes
** acm tois page charges
** steves ip to cu
* chimetext
** ad
** updated schedule / room booking
* wing
** sending out reminder
** updated project descripts, finally
** work for google page
* editing
** ieee mm
** textgraphs2 final
* AIRS stuff
* added marshaling to aligner.rb
* coded mmStats.rb
Not yet done with align jump; gotta fix that tomorrow.
* 5246
** tut, lect notes
* jcdl cam ready
* hyp interviews
* editing
** emma hyp, ziheng hyp, qiul emnlp, jesse uist, bang qlw final, ziheng duc
* simone email
* 4/12 to amex plat us fidel
* reserve with airserve
* chimetext
* cv editing for rita
* sf3/rpnlpir mig details
* opac virthost
* omnipage disc, ebay bid
* 4247 mod
* emnlp review prefs
* get shot
* bring back hair spray
* send out new land's end to us
computer fixed
1101: final grading, immsnet
3243: final, grading, immsnet, make up mt and final grading
2305: prop grading
Anthology: bug fixes started, NLG waiting on busemann
car: paint bought
slides for chuats
taslp fix again
cancel rg1 extension
vldl review for hvds
do icadl slides
cm bachelors coord: questions and clarifications
http://www.textfrompdf.com/tfpspeed.htm
http://jabref.sourceforge.net/
Citeseer: think about parsehed alignment
do tenure list
sum blog sum to kathy
Shroff: SusanFeldman / Autonomy
* altw reviews
* sigmod anth reimbursements
* ask for support from sanjay - got it
* reading survey papers for 5244 / about half done.
do and driver to webpage
muime's card
do xuan's exercise
sigir edit
icadl jin problem resolved
did nutch exercises - 1/2
did 3243 tut #1
mohan ramesh mtg
annual review docs x2
paula procter mtg
batts
cliqa sub review
call raffles do 2x confirm
sub icadl final
do nutch 2/2
* 5244
** particip grades
** project pres grades
** grade projects
** grade final
* marks moderation
* noi 2007 trace debugging
* bbq
** CS: david & diane, huangzy (4), ooiwt, ben & waiping, abhik & tulika, haifeng, kok lim, mun choon, samarjit, chee yong,
** IS: calvin
** WING: bang, ziheng, yue & shuo, emma and hoang oanh, jin, long, hendra, jesse and ailin, yee fan, liangzhu and friend
** guests: dongwon
* xs
* check funds for hendra to go to iscslp
* avik sarkar training data parcels
* editing
** denny x3, now done
** taslp text results compilation
* meetings
* mcomp meeting
* mapreduce.rb debug(ging)
* graphreading
* airs
** booklet update
** schedule re-export
** cd printing finished
** final number pushed to springer
* did student claims
* a0001 forms turned in
* sigmod anthology copying done? got to ask acm for permission to host
* pics done and uploaded
* tois fourth round edits finished
* cache code
* got pdf995edit pipeline working
* airs 2006 publication stuff
* fixed schedule with google calendar embed
* reimbursements for software
* csail-mit workshop / demo prep
* fd ocbc
* annual reviews
* citeseer maintenance / log rotation
* danth email for project
* semeval sub email
* pubs updating
* emma email
* jcdl slides from csail pres
* us taxes, turbotax
* phys mail bounce followup: tatsuya, merry
* check collaborators tyltheng in wing
* ipm review
* talk to laizs about loose machines.
* talk to sanjay about sabbatical/leave
* airs
** blank cd and sleeves distributed
** student reg pushed to AVC
* figure out the thread bug in the mapreduce.rb (hopefully)
* reply to drago, dAnth
* mcomp apps
* citeseer df/mount to cte only/log gz
* 5244 class preparation
* finish ijcai reviews
* look over idm stuff (too high level for dl?)
* query cache revamp
* finished building caches for old pdfs, ppts, cache
* 5246
** got folder
** answer discussion questions
** tut 1
** new renamed corpus
* upali def and slides
* register cny lunch
* meetings
** ziheng on duc/textgraphs
* photos uploading
* chimetext
** upali
** xinyi announce
** booking
* wing-news
** prep, hendra update
** add subscribers
** sent update
* more phd apps
* gp review: linlin
* filming for research
* danth postings
* editing
** ss paper
* ngo rec
* phd app review
* jcdl bids
* found that pdf995edit is really just running standard pdftohtml with option -c turned on.
** pdftohtml with -c invokes gs, which sometimes creates a .ps file that grows without bound. solution: thread it in a ruby call and terminate the pdftohtml process if it doesn't terminate within 30 seconds.
** still have problems with pdftohtml -c creating files with garbage symbols. Not sure how to deal with this.
* pdf995edit pipeline working - but no longer needed with pdftohtml pipeline in cte
* PPTExtractor pipeline working but still dies on some files.
* work on iptables in citeseer.comp
** iptables -I INPUT 1 -i eth0 -p tcp -s 137.132.81.27 -j ACCEPT
** iptables -L INPUT
* don't forget to save iptables to file. use session save (google this)
* fixed up citeseer iptables and sshd_config
* got the rp 5 to 0305 grant finished. yay!
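The 30-second cutoff for runaway pdftohtml conversions noted above could be sketched roughly as follows. This is one way to do it, using `Process.spawn` plus the standard `Timeout` module, and not necessarily how the actual pipeline does it; the method name `run_with_timeout` and its return symbols are made up for illustration.

```ruby
require 'timeout'

# Run a command (given as an argv array) and kill it if it has not exited
# within `limit` seconds.
# Returns :ok on exit status 0, :failed on any other exit,
# and :timed_out when the process had to be killed.
def run_with_timeout(cmd, limit = 30)
  pid = Process.spawn(*cmd)
  begin
    Timeout.timeout(limit) do
      _, status = Process.wait2(pid)
      return status.success? ? :ok : :failed
    end
  rescue Timeout::Error
    Process.kill('KILL', pid)  # stop the runaway converter
    Process.wait(pid)          # reap it so no zombie is left behind
    :timed_out
  end
end
```

Usage would look something like run_with_timeout(['pdftohtml', '-c', 'paper.pdf'], 30), where paper.pdf is a placeholder file name. Killing only the parent may leave the child gs process running; spawning with pgroup: true and signalling the whole process group is one option if that turns out to be a problem.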
sans paper comments
lrecARC x1
lrecParsCit x1, subbed
1101: lecture notes, pe, pe done, emails
bank transfer blog 06
3243 sub final
hiranmay meeting
bartneck reimburse
dongxiang grp
booking hanoi
nanba: waiting on eepeng
vietnam form
lmthang urop: partial
dac08: partial
* citeseer
** mounted getRange.pl mod
** get range for 710-714, 720-729.
** finish slices 710-719 and put on danth@rp
** fix wiki pointers
* 5244
** uploaded lecture notes
** grading - continues
* mapreduce.rb
** mod to pick random free machine
** include other processing, ps2pdf, slow and fast pdfbox conversion
* tomm editing - di first edits, tl edit, di's second edits.
* url segmentation
** get - data from webbase
** get + data from citeseer-metadata060816
* wing people needs zhaojin - fixed
* cacm work.
* bing liu's talk
* continues web spider of www.comp.nus.edu.sg
interviews, shortlisting
1101: grading, exam moderations
simone prep
chime text prep
i2r seminars
icpc task
lr: l-bfgs integration
3243: grading, exam moderation
* wing
** group meeting room established
* reimbursements for cluster
* ss
** fixed fsv with tooltips, floating nav
* editing
** hendra's acl
** long's acl
** yee fan query probe
* writing
** hci proposal
** ss paper
* meetings
** yl ntu
** ben 3243
** vu exit
** tung hiring
* finish collecting all pdf forms for AIRS. Missing one CRF. Requested volume number from LNCS
* CHIME text seminar web page updating.
* 5244
** finished grading survey papers
** put up discussion questions
* airs cd and program booklet to burn
* url retraining for pub within domain
* ijdl
3243 tutorials - still missing trees in NLP tut 8, and questions 1 and 2.
installing missing software
mohan xiangyu meeting
do axs
do loudspeaker in lecture room
did 1st ver of lrec
1101
* post summ - done
* student cm problem - still not done.
* xtra problems
* fraction representation
* gcd problem
* give ht proceedings
* upload proceedings to wing and send admin request to update rpnlpir
* sent out recommendations
* import cds
* reimbursements: bang visa, bang www, hlt-naacl
* crp to nght, comments on it
* tyf renewal
* chuats slides for mit
* qe grading
* 5246 grading - final exam, hw2 regrades
* labor day holiday
* acm s045 copyright fax
* fedex question
* do q to ore about housing ownership
* jcdl reg / hotel book
* last meetings: ziheng, jesse
* bang practice and slide edits
* hendra acl edits round 5
* weps edits round 2
* do acl preview emails
* xuan's editing
* tois editing
* running diff between old and new cs metadata
* kw.rb script for emma
Hari seminar 2
grad apps finished
shanheng proposal
acl/jcdl reimbursements
3243 prelim lecture note upload
picked up potteries
10 digit iu
updated wing address, typos, pubs
wing group picture from phone
reimbursement claims
hari hmm
taslp sent, yay
* rewired tppt2pdfListing to be getCiteSeerPDFs.rb - fetches files directly from citeseer.cs now using citeseerMetadata.tsv as bridge.
* lunch boxes for coling acl
* fixed citation search URL in citeseer
* lan man's presubmission thesis
* grant writing
* w11 media lecture upload
* turned in grant proposal
* turned in moderators report
* updating cache with new cs entries
* creating acl metadata
* feeding diff cs metadata to ssSpider.rb
* feeding acl metadata to ssSpider.rb, done
letters jie yang, chris yang
jiang jing chimetext seminar
do ijdl reviews assigned to self, done!
claims in progress
write philip about registration
print van de sompel paper
correlations to minh
hendra's paper revision
lta/samsung runaround
jie yang invite
gordon mohr wapi, kristine ijdl, brent ho's shipping request emails
set up fin08
lta license renewal
jin annotation: started
sigir reg meetings
do final report for nlp web q
prep/pack ijcnlp
qiul related work chapter
write isaac councill
pick up monitor
build raw text preprocessor
restart emma project
ACL09 or AAAI-spring 09
check on acl anthology fixing request
do personal message
az, cfc: tf*idf other features approximated
http://www.informatics.sussex.ac.uk/research/groups/nlp/rasp/index.html
moves jien-chen wu et al. computational analysis of move structures in academic abstracts
NetDraw www.analytictech.com/download.htm
Ask Gordon about web services architecture.
http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_3.html
* finished with course pack stuff, more or less
* updated firstcycle
* emma related work chapter proofreading
* citeseer progress up to 637/730 = 87%
* filed student claims
* dell pie.ddns service to helpdesk
* denny paper proofreading done
* SERC HFE abstract done. Done with slides, too. submitted
* printed ijcai papers for review
* robot pics for cs3243
* wangxuan's lor
* anubhav's lor
Fredo Durand
easier to author
3d model -> line drawings
coded aperture to get better model on blurriness
digital photograph reprocessing
* PREMIA meeting at NTU
* HP grant stuff with tancl, dpoo
* finally done with CACM article draft (I think)
* reset cte server (again!! :-( :-( )
* more sms collection mods
* looked over hidetsugu's data. Looks fine and well formatted. Gotta go convert them now.
* Aug2006PremiaMeeting
* editing
** taslp (edit, send out)
** shiren
* xs
* moderation
* proposals
** cu
* bbq prep (see 27 nov)
** CS profs = 15, IS profs = 1, guests = 1, WING = 10, family = 3.
** sent emails
* dongwon visit prep
* franz visit prep
* yi sok goong and sam sok goong visit prep
* almost finished with airs publication stuff. whew
* slideseer stuff also working out
* did npic poster, finally
* emma icadl paper
* bpowley acl init
* hari invite
* server movement
* group meeting prep
* admin hrs claim
* qiul ext abs jair
* finish basic icadl emma sub
* transfer danth files to rpnlpir
* edit radev's wiki at umich.
* writing getRange.pl for slice extraction.
* canteen stall ~500
* acl anthology
* tomm editing at night
* xinyi defense and slides
* 3243 sent mt
* 5246 lect 5, tut2 send out
* group meeting
* premia newsletter
* editing
** ziheng poster
** aaai web paper (ergin)
** bang's paper
* reviewing
** karthik
** lan man
What am I supposed to do today? Hmm. NPIC has to be brought online, and the poster must be printed out.
Would be really nice to get getSummaryStats working to generate some sample pages for inspection.
Wow it's a month already. Gotta get moving! Yikes!
The new monitors are now working just fine. Yay.
''Lucene updates''
Got lucene to work with webapps
Modified basic script to index titles and search on titles
Modified basic script to show date and path info
Got the highlighter module to work with the contents of the document. Note that the field being highlighted must be stored (Field.Store.YES or Field.Store.COMPRESS)
Got highlighter to work with webapp, sort of.
POI PowerPointExtractor, ridiculously easy to run. But can we do better? Need to check out the rest of it.
* ivor tsang seminar CVM and meeting
* send out updates
* luzheng rec
* 5246 prep, demos, lecture 8 and 9, discussion questions, more on homework #2
* update services
* update chimetext
* group meeting and archiving
* draft hyps
* update hyp.html
* spc ann
* dbs ann fee
* simone los, edited
* duc dung urop call
* ecdl initial editing meeting arrangement.
* parsCit
** parsCit back up: pdf partially working
** crf++ download
** installed locally to cte:~/crf++/example/parsCit
** cvs into kanmy/parsCit
** crf adaptation
** xsl viewing
** template and length output
Trying out a journal. What is it anyways? A blog of sorts?
lmthang x1, ch 5 and abst, title
dac2008 x2
1101 mu lects
3243 demos
book ijcnlp flight bangalore - started
simone book hostel - started
simone 12/13 schedule - started
pmp email
vet rogerz exam: photocopy solutions
* vu updated the rpnlpir page.
* trung fixed the lacie disk, got back disk.
* im meeting with Denny
* cl lab meeting with Wee Sun, Hwee Tou and students
* lecture notes for tomorrow's class done. yay!
* put thumbnail of coursepack.
* one run czppt2txt
* lecture notes should contain MORE EXAMPLES to illustrate for each technique or concept
* homework requirements are NOT VERY CLEAR
* math not explained well enough
* assignment load TOO HIGH
* references need to be prioritized
* should cover some cutting edge methods
* explain terminology better (WordNet, MiniPar)
* should ask questions more clearly
* lecture should finish on time
* need application to real search engine
* want tutorial answers before final
* want tutorial questions and notes earlier
* logistics for course pack can be better (in one spot, with correct materials)
* proofread slides and tutorials
* team assignments
* math a prerequisite compared to hypermedia
textbook adoption form
fix stalled cte: another hdd failure and iptables problem
turn in hari's keys
find graphics key
mtjoseph data slices 397 398 399
udl interviews
sub wangdong
turn in forms for trung and tung
sub hari report
wing group meeting and prep
kymn email
update WING project page
get robots from graphics lab
sub irj
dale seminar
icadl
ijit review
yee fan
lrec
tech report jeprab
tech report isaac
admin meeting
* slideseer buildCoordinatedMedia done
* OAP application for Dongwon to visit in Winter?
* finishing NPIC webpage / demo stuff
* back from hk
* www debrief / chimetext conf preview sched
* jair review printout
* feedback analysis
* vldl
* asiamiles claims
* away on personal leave for past week
* expenses done
* airs2006 archive
* exam moderation and submission
* email and physical mail handling
* ijdl invitations out
* correct proposal
* IP for NPIC
* bill brody meeting (x2)
* 5244 emails / bill brody email / graphreading manual mail / personal emails to alan, tatsuya
* bang proofreading / yee fan reply / xuan's thesis 1-6
* revised related work acm / update
* set up hp printer
* room reservation for graphreading / group meeting / dongwon seminar
* deal with dongwon lee visit wrt hosf
* check timing on the graphreading
* erik recommendation
* prep grad course joint poster session
* d/led hw2 to grade for later
* AIRS 2006 post conference archive ok and email out.
* brought check for hsbc
* some xs
* updating pubs page / new ppt files d/l and converted
* deal with graph reading new location in 15th S16 (04-33) / 22nd in MR 3 (SoC 1 05-28) send out mail
* ask cath about india interns
* emailed lecturers for poster session
* more airs 2006 stuff
* run pdfbox
* finally wrote back to Lee
* updated cvs copies.
* premia stuff
* copying data out of 300 GB transport disk (takes hours/ usb 1 too slow bleack !!)
* mirror acl.ldc.upenn.edu
* met with faezeh
* met with tancl, dpoo wrt hp grant
* first pass at tsv'ing the dblp xml data
* ordered sigmod silver anthology
* thioachie
* MOE oct large-scale / A*star
* 5246
** try making rbr collection
** slides for wk 1
* library visits - reservations, return reserves
* ng hong i def
* buy vga connectors
* simone conv
** friday
* yl conv
** monday, thursday
** updated stuff for pams
** proposal 1 uploaded to pams
** 1pg cv
* ss
** tomcat redev started
** lucene basic text index
* meetings
** hari meetings
** hyp meeting
** group meeting email
** group meeting latex tutorial
all UROP/HYPs out
3243: released Hw2
contact susan silva of ece
Citeseer: finish copying, back up
bleong baby 50
3243 grade corrections to gradebook
traffic fine erp
sent scanned version to simone
Anthology: EACL sftp setup
vldl late notification and reviewer reminders
Reading/Editing
* vldl reviews: x4
* zhaojin grp
* tanyeefa's prop slides
* qiul ijcnlp
send mailed version to simone
acl reviewers
did ora young researcher award
taslp again
dhl reimburse
SIGIR: set papers
tcs: send reports by week end
tsinghua: wingnews
CS 1101:
* tutorial post
* PE
* grade consol, plag
PREMIA: forum post
* finished runs of pdfbox on acl anthology
* danth errands
* citeseer 710 slice pdfbox conversion
* WING admin and group meeting
* 5246 last lecture, demo setup, exam answers, hw2 setup
* RoR book printing
* chime text sched
* claim forms for admins
* emnlp reviews
* ng hong i thesis rev
* cuihang best thesis nom
* mcomp apps: round 1 (28/28) round 2 (9/29)
* read kalpana's chapters (2/4)
* wing / chimetext / wingnews reminders announce
* tancl/chuats area report
* redo duc copyrights commits
* bang / zhiqiang letters
* group meeting
* hendra funding and tutorial and workshops
* hendra's cam ready
ror
udl r2
tata
dyhsu meeting
hang acm diss award
hsbc
basic hari sched
go see doctor
new annote guide for kzl et al x2
icadl printouts prep
wx hyp prop
hari accom round 2
taslp editing round 1
ijhcs review
firstcycle.org renew
chris gil wedding return
widm review
setup 3243 website basic
mmies review
irj review
Rashid M. Abdalla and Simone Teufel
* semi-fixed cue phrases: semi fixed : find syntactic variants
* don't use thesaurus as not strict synonyms
* toss out negation, and usage done by others, previous mention. ''Q'' but why?
do you learn these. ''Q'' how to handle multiple word expressions verbs "narrow down" Is this based on RASP?
* eval on precision at 1.
similar to relation finding. agichtein and gravano, ravichandran and hovy => IE
problem: cannot easily find negative examples
Bollegala et al.
- segment = defined a la text tiling,
- svm to combine data points
- similar to hac in binary bottom-up agglomerative but recovers an ordering as well.
- automatic evaluation is a version of bleu
Masaaki Nagata, Kuniko Saito, Kazuhide Yamamoto and Kazuteru Ohashi
prev work: tillman-zhang 05
their new model: {monotone,reverse}{adjacent|gap}
gapping wrt the source language.
gapping appears often in p SMT in japanese/english pairs (verb final)
''hendra'' read this. especially in terms of training sentence coverage and reordering and gap histograms
Cheng-Zen Yang, Che-Min Chen and Ing-Xiang Chen
* sarawagi's 03 cross trained svm
* q:
Christoph Tillman and Tong Zhang
Che Wanxiang (Min Zhang)
Has SRL Demo
dictionary from PropBank and VerbNet
like other SRLs doesn't handle/tag copula as no predicate
use CoNLL dataset (post processed from PropBank)
vector rep of parse tree <# of subtrees of config 1, # of subtrees of config 2... config n>
exponential number of features
use kernel function to solve dot product (after collins / duffy, moschitti 2005?)
the idea: split path info and constituent portion into two feature spaces then linearly combine
problem/observation noted: constituent too big
validate using only wsj sections 2-5
do soft margin classification by tuning C.
my observation: not all subtrees in constituent are useful. they use rule 1 in preprocessing to remove most of constituent tree.
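The "split into two feature spaces then linearly combine" idea can be sketched as a weighted sum of two kernels. This is only a toy illustration: real tree kernels count shared subtrees implicitly, whereas here each feature space is an explicit feature-to-count map, and the names {{{dotKernel}}}, {{{compositeKernel}}} and the weight {{{lambda}}} are hypothetical, not taken from the paper.

```javascript
// Toy kernel: dot product over explicit feature -> count maps.
// (A real tree kernel would compute this without enumerating subtrees.)
function dotKernel(x, y) {
  let sum = 0;
  for (const key of Object.keys(x)) {
    if (Object.prototype.hasOwnProperty.call(y, key)) {
      sum += x[key] * y[key];
    }
  }
  return sum;
}

// Linear combination of the path kernel and the constituent kernel.
// lambda in [0, 1] is an assumed mixing weight.
function compositeKernel(a, b, lambda) {
  return lambda * dotKernel(a.path, b.path) +
         (1 - lambda) * dotKernel(a.constituent, b.constituent);
}

// Hypothetical feature counts for two candidate arguments.
const argA = { path: { p1: 1, p2: 2 }, constituent: { c1: 3 } };
const argB = { path: { p2: 1 }, constituent: { c1: 1, c2: 5 } };
console.log(compositeKernel(argA, argB, 0.5));
```

Since a non-negative weighted sum of valid kernels is again a valid kernel, the combined function can be handed to the SVM unchanged; the weight would be tuned alongside C.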
Hung-Ming Yu, Wei-Ho Tsai and Hsin-Min Wang
* background music reduction
* observation: query should match from line start not middle (using BIC)
Yupeng Fu, Rongjing Xiang, Min Zhang, Yiqun Liu and Shaoping Ma
PDD = person description document
Idea: build description of person and then do retrieval on these documents
describe person using keywords
* web listing pages of people form context for each description
* word pair is basically a density based metric?
* based on bm25
q: how about blogs? resume? they say gen web docs not clean.
''got best result in trec enterprise 2005'' on expert finding task.
Chi-Ho Li, Minghui Li, Dongdong Zhang, Mu Li, Ming Zhou and Yi Guan
handling long distance reordering
use syntax to do this
key idea: generate n-best reordering to be used at decoding time, rather than 1-best
incorporate prob of reordering as another feature in the decoding log linear model
if they use only 1-best, shows negative effect. Needs to use multiple n-best in order to capture
q (dekai wu): trinary productions can actually be explained as binary productions
Toshiyuki Shimizu and Masatoshi Yoshikawa
Benefit: benefit is geq child elements
Effort: independent of query, less or equal to sum of reading effort of child
similar to set cover alg of shiren in summarization
Dmitry V. Khmelev and William J Teahan
- In SIGIR '03
- highlighted, printed and filed
- related to plagiarism detection, webpage similarity, corpus verification, PARCELS.
Simple repetition of text substrings for plagiarism and duplicate detection. The formula involves computing a concatenated suffix array for an entire set of documents. The idea is to use not only the single longest common substring but a sum of the longest common substrings across all prefixes of a target document.
The R measure is apparently good not just for duplicate detection but also for authorship detection in the test corpora demonstrated in their paper.
To think about: how to adapt this measure to have an effective (and speedy) tool for web page fragment classification.
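A naive sketch of the R measure as described above, to make the "sum of longest common substrings" concrete. It reads the description as: for every starting position in the target, take the longest prefix of the remaining text that occurs somewhere in the collection, then sum those lengths. The brute-force substring search stands in for the concatenated suffix array, and the normalization by l(l+1)/2 (so that an exact copy scores 1) is an assumption here. Function names are hypothetical.

```javascript
// Length of the longest prefix of `s` that occurs as a substring of any document.
function longestPrefixMatch(s, docs) {
  let best = 0;
  for (const doc of docs) {
    // Only try to beat the best length found so far.
    let len = best;
    while (len < s.length && doc.includes(s.slice(0, len + 1))) {
      len++;
    }
    if (len > best) best = len;
  }
  return best;
}

// Sum the match lengths over every starting position of the target,
// normalized (assumed) by the maximum possible sum l(l+1)/2.
function rMeasure(target, docs) {
  const l = target.length;
  if (l === 0) return 0;
  let sum = 0;
  for (let i = 0; i < l; i++) {
    sum += longestPrefixMatch(target.slice(i), docs);
  }
  return (2 * sum) / (l * (l + 1));
}

console.log(rMeasure("abc", ["xxabcxx"]));
console.log(rMeasure("abcd", ["ab", "cd"]));
```

This version is quadratic or worse in the target length, so it is only good for checking intuitions on short fragments; the suffix array is what would make it speedy enough for web page fragments.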
too many online teaching resources
* instructional architect (IA)
* lets people wrap text around nsdl resources
small grain resource can fit in, require teachers to use/wrap text around, improvise
large grain resources don't fit as well.
Guoping Hu, Jingjing Liu, Hang Li, Yunbo Cao, Jian-Yun Nie and Jianfeng Gao
different features across different entity search types.
4 features in their approach:
1) word features (tf*idf); 2) position features; 3) title; 4) structure (tree / section processing).
Jin-Kyu Park, Eenjun Hwang and Yunyoung Nam
Do CBIR for (tree) leaf images
Other ways:
* leaf contour (perimeter)
* center contour distance (distance from center to edge)
Instead, use leaf vein shape and contour to do CBIR. Extract vein contour shape.
do corner detection to detect intersection/branching and ending point.
li haizhou 1:50-2:15
min zhang 2:30-2:55
mstislav maslennikov 2:55-3:20
Min Zhao, Hang Li, Adwait Ratnaparkhi, Hsiao-Wuen Hon and Jue Wang
for metasearching ranking of different search results
* use standard bm25
* also use click through distribution to rank, learn by NB.
* probably not too helpful as-is on web queries where noise is a larger concern.
Zhaoqi Chen, Dmitri Kalashnikov, Sharad Mehrotra
-use entity relationship graph as well as intrinsic sim
-do consolidation
-handles robustness
www.ics.uci.edu/~dvk
S Kriewel and N Fuhr
good list of toolbox for doing acad search, need to think about these for auto methods
- what suggestions didn't work?
- explaining instead?
- q: sparseness, strategy for ff extension for google
- firefox extension
It's a reading day!! Hooray!
printed out most related work on alignment.
going to read them now -- wow mostly are from Japan on the PRESRI system, it seems.
It's still surprisingly hard to find relevant papers, even for this project. Gotta think about how to find appropriate venues for searching.
Let's see (I should really dump these to citeulike but I'm not going to bother):
* Automatic Slide Generation Based on Discourse Structure Analysis (IJCNLP 2005) - Shibata, Kurohashi: deep nlp on raw text (not necessarily scholarly texts) - discourse analysis (intra, inter sentence). It's really summarization since they just simply split the resulting text into multiple slides after every 12th line.
* Automatic Slide Presentation from Semantically Annotated Documents: Utiyama, Hasida (ACL Coref workshop) - uses Global Document Annotation (GDA) tags. GDA approximates today's task of semantic role labeling. Uses topic detection by non-stopword bigrams and a frequency threshold of 2. Then applies spreading activation to the network (syntactic stuff represents links). Slide generation is a bit more interesting. Namely, they use redundancy removal, coref pronominalization and other editing to make the slide more fluent.
Note that both of these approaches don't explicitly use corpus information which we are.
Genre detection:
* Automatic Detection of Survey Articles (ECDL 05)- Nanba, Okumura: use pHITS plus text features such as title word, cue phrases, and their own citation types. Best non pHITS features are cue and title words.
Hmm, how about alignment itself? Let's start with slide/paper. Note that Hayama et al. reference quite a bit of other work but all of it is in Japanese. Help!
* ''gotta read this one again!'' Alignment between a Technical Paper and Presentation Sheets Using a Hidden Markov Model - Hayama, Nanba and Kunifuji (AMT 05): Improves Jing's model by using content analysis. They observe three problems with Jing's model: 1&2) deleted and added words cause problems for the HMM (this was observed by Jing herself; see summary below), 3) similar word sequences happen very often in slides, causing problems with prob estimation using Jing's heuristic rules. They improve the approach by:
** degree of alignment: using the slide as a whole bag of words rather than as a word sequence. this is a good idea, similar to doing simple vsm (only on the slide side though).
** considering match sequence length: a longer sentence match should match better. I think this is better encoded as an alignment gap penalty, something like affine gaps.
** alignment using position gaps: their position gap constraint is a bit like our diagonal constraint. I wonder whether it helped much.
** using heuristic rules for titles: title words get a bonus.
However in the end, they only get a minute improvement, over 49 presentation/document pairs.
* Detection and Resolution of References to Meeting Documents - Popescu-Belis and Lalanne (MLMI 05): using anaphora resolution to do alignment. not much said here that really links their technique to performance. Gotta go read their ICDAR 2005 paper and their CIKM 2004 paper.
* Using a Bi-modal Alignment and Clustering Techniques for Documents and Speech Thematic Segmentation: Mekhaldi, Lalanne and Ingold (CIKM 04): they are actually considering the opposite problem - ''improving'' theme segmentation (a la TextTiling) using bi-modal alignment information. They do single-modality thematic segmentation first, then do similarity calculation across //nm// units. Note that their segmentation is done simultaneously across both media. Then clustering via k-means and reprojection to each single medium.
And for abstract to paper?
* Using Hidden Markov Modeling to Decompose Human-Written Summaries - Jing (CL 02): (according to Hayama et al): generates an HMM from the word sequence in the summary to predict the position / occurrence of the corresponding word in the source document. Only considers lexically identical words. Considers 6 possible alignments, giving higher probability to heuristically more plausible alignments. 1) same sentence, adjacent word, 2&3) same sentence, but next or back words, 4) within window of next sentence, 5) within window of back sentence, 6) otherwise. Problems with inserted words are corrected by a postediting phase (section 3.4) that finds isolated words that have been misaligned.
* SpectralClustering - one problem here is that it assumes no direction, doesn't (naturally) model the fact that the paper is the source of the presentation.
* JumpAndRewrite - PBHMM phrase based HMM by Daume and Marcu; Jing
Implementations so far:
* text sim methods implemented: jaccard, bigram + unigram jaccard, bigram only jaccard, cosine with only TF, unigram
* align methods: jing with majority rounding, max.
* mmStats: records relative jump probabilities including na probabilities as in daume + marcu's work
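For reference, a minimal sketch of the text similarity measures listed above (unigram Jaccard, bigram Jaccard, bigram + unigram Jaccard, and TF-only cosine). It assumes inputs are already tokenized into arrays of strings; the function names are illustrative and not the ones in the actual implementation.

```javascript
// Contiguous n-grams of a token array, joined with a space.
function ngrams(tokens, n) {
  const out = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// Jaccard coefficient over the sets of items in a and b.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let inter = 0;
  for (const x of setA) if (setB.has(x)) inter++;
  return inter / (setA.size + setB.size - inter);
}

// Cosine similarity using raw term frequency only (no idf weighting).
function cosineTf(a, b) {
  const tf = (tokens) => {
    const m = new Map();
    for (const t of tokens) m.set(t, (m.get(t) || 0) + 1);
    return m;
  };
  const ta = tf(a);
  const tb = tf(b);
  let dot = 0, normA = 0, normB = 0;
  for (const [t, c] of ta) {
    normA += c * c;
    if (tb.has(t)) dot += c * tb.get(t);
  }
  for (const c of tb.values()) normB += c * c;
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

const slide = ["hidden", "markov", "model"];
const sentence = ["a", "hidden", "markov", "model", "aligns"];
console.log(jaccard(slide, sentence));                                   // unigram
console.log(jaccard(ngrams(slide, 2), ngrams(sentence, 2)));             // bigram only
console.log(jaccard(slide.concat(ngrams(slide, 2)),
                    sentence.concat(ngrams(sentence, 2))));              // bigram + unigram
console.log(cosineTf(slide, sentence));
```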
Implementation must also handle non-aligned slides.
* Example
* Slides at the end of a presentation, backups
* Outline slides
* Conclusion and question slides.
Frederick G. Kilgour
JASIST 51(1):74-80, 2004
- Relevant to: Known Item queries, query rewriting
- Printed and Filed
- Available at: LINC
This paper goes into historical detail on past retrieval studies of known-item queries, reaching back to the era of card catalogs. Some notable results distilled from this survey of earlier work are useful for our current study of known-item queries: Tagliacozzo et al. (1970) note that users were more likely to have correct title information than correct author information. Title searches are also more common in today's OPAC than in the older card catalog systems, although I concur with Kilgour that this is largely an artifact of only having limited title entries in the card catalog system.
Torsten Zesch and Iryna Gurevych
wikipedia category graph wcg
conclude small world, scale free graphs
wikipedia categories "mostly" organized by hierarchy - how to distinguish?
Jovan Popovic
give higher-level primitives for creation of shapes
edit at higher-level
capture shapes and 3d at higher level
Penelope Sanderson, Queensland
regularization to prevent overfitting in language model for unseen data.
regularization by lasso methods = optimize versus a loss function
Boosting is a greedy algorithm that is prone to overfitting. Employ shrinkage to minimize this problem.
Address overfitting in Boosting by proposing Boosting Lasso (BLasso): forward and backward steps, where backward steps allow model simplification while continuing to minimize the Lasso loss.
Agenda:
Look up Koh Hian Chye
Techsource partnership?
copy premia website to laptop to bring to agm
wording for certificate for student ICPR awards
nominate people
jenny lim from STB
6:30 arrive
Elena Filatova, Vasileios Hatzivassiloglou and Kathleen McKeown
contributions
* create an evaluation measure for domain template extraction
1. verb centric: starting point is to identify most important verbs using frequency based methodologies. Verb instance frequency (VIF, something like tf.idf for verbs wrt classes). how about RF for text categorization or log odds?
2. IR find all sent with top k (= 50) sentences and parse for syntax
3. mine out trees, looking for syntactic agent/patient. ok, trying to recover semantic roles w/o semantic role labeler. a bit odd.
4. generalize tags (didn't do this for verbs so it's also a bit strange). But do it for subject, object and other roles in a two step procedure => 1. sub NE for general tags 2. merge frequent subtrees
5. union all tuples to form the scenario template. ax out any that are not specific to the domain.
the template extracted have ordering constraints (because they didn't use a role labeler) and seem to consist of 2 tuples only. they call these ''slot structures''.
* can we get their training and testing data sets for comparison?
Lee et al. (UCLA group)
* info vs nav only
* used cs queries originating from UCLA only (as group is most knowledgeable about own queries)
* used click distribution and anchor text distribution
** click distribution with thresh \tau = 1.5 gives 80% accuracy. (Note most misclassification occurs with info queries here.)
** anchor text with thresh \tau = 1.0 gives 75% accuracy. (Note most misclassification occurs with nav queries here.)
* also has summary of kang and kim's work (anchor usage rate, query term distribution and term dependence)
Simone Teufel, Advaith Siddharthan and Dan Tidhar
* citation function + ''short summary''
* annotation guidelines done w/o rt domain knowledge
* 12 cats
* sims not frequent, as contrast is then expected
''BUG'' - how about higher level support only ?
* anaphora? sentence discourse effects?
* bonnie: domain specific
* john prage: information digitalization/visualization
Ben Wellner and James Pustejovsky
prob: find head of two args of a discourse connective
uses penn discourse tree bank (PDTB)
heuristic pruning for cand selection to keep complexity down
log linear rerank model integrating simple model for both arguments independently then re-ranking
diff error types for arg1 arg2.
SN Koh
robust speech recognition, denoising
AURORA corpora
future work: bilingual speech recognition
seems to be work joint with li haizhou
This directory contains scripts to start and stop the cs system as well as invoke the various daemons that run facilities for cs.
* start-citeseer
* stop-citeseer: the following three services do not stop when cs is stopped through this script. I adjusted this script to change it to kill off queryd, rankd and dlLimiterd as well.
* queryd
* rankd
* dlLimiterd
The speedy cgi caches the perl scripts and does not (seem to) notice changes unless all of cs is shut down. You have to stop-citeseer and start-citeseer for changes in BinDirectory or LibDirectory to be reflected.
Wei Wang, Kevin Knight and Daniel Marcu
GHKM Galley et al. 2004
problem with ptb trees being n trees and arguments that don't exactly match
not rule binarization, but actually binarization of source lang trees
have to decide which way (l/r) to do binarization, or since syntax based, try head based binarization.
their sol'n: do all binarization, save in a forest
yuck: they have forest based alignment to adapt GHKM.
better idea: find best using adapted EM
seems a bit like compensation artifact of needing to use CKY binarization constraint
Class Association Rules (CAR) but with min support and min confidence pruning turned off.
reason: people want to see the context of the rules
helpful in motorola case study
showed engineers want to see actionable rules = short, two to three attribute rules with trend analysis
Makes me think of limits of human perception. Bing asked us about the expressability of the hypothesis space given his rules only give about 2-3 attribute rules. I think classic ML theory deals with hypothesis space quite well. Am I mistaken?
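To make the support and confidence notions concrete, here is a small sketch that scores one short class association rule (attribute conditions => class value) against a table of records. The records, attribute names and the {{{ruleStats}}} helper are made up for illustration; they are not from the talk or the Motorola case study.

```javascript
// Support and confidence of the rule: conditions => (classAttr = classValue).
// support    = fraction of all records matching both conditions and class
// confidence = fraction of condition-matching records that also match the class
function ruleStats(records, conditions, classAttr, classValue) {
  let matchCond = 0;
  let matchBoth = 0;
  for (const r of records) {
    const condOk = Object.entries(conditions).every(([k, v]) => r[k] === v);
    if (condOk) {
      matchCond++;
      if (r[classAttr] === classValue) matchBoth++;
    }
  }
  return {
    support: records.length ? matchBoth / records.length : 0,
    confidence: matchCond ? matchBoth / matchCond : 0,
  };
}

// Hypothetical toy data.
const records = [
  { model: "A", time: "am", outcome: "drop" },
  { model: "A", time: "pm", outcome: "drop" },
  { model: "A", time: "am", outcome: "ok" },
  { model: "B", time: "am", outcome: "ok" },
];
console.log(ruleStats(records, { model: "A" }, "outcome", "drop"));
```

With pruning turned off, every such rule is kept regardless of these two numbers, which is what lets engineers see the neighbouring rules as context.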
Naveen Sivadasan
large corpus, 15M abstracts, .5M citations added/yr
given corpus, extract interaction of prot, molecule, enzyme
has workflow
search for experimental conditions, methods
relevant patterns
search / extract / ask q / find essence / ''correlate''
over increasing vol of data
over diff source of data
under time pressure
domain knowledge presentations in 2nd day
Franz Och (Google)
* auto scoring helps community / using standard corpora
** best mt system achieve near- or beyond-human bleu scores (google arabic english translation 110% of human, but human translation actually nice).
** bleu score favors statistical systems - only large improvements are statistically significant - 25k word test corpora.
* standard model architecture - 1) stat word alignment 2) phrase based smt 3) log linear feature model 4) discriminative training to optimize against eval metric
* to date: nlp systems not used - hurts performances, really hard to integrate (wu&carpuat05 - > doesn't help to use WSD)
* current problems (!!)
** named entities
** dictionaries/data - OOV
** morphology
** syntax - wrong dependencies
** word alignment
* translation models: 100 m to 1 b, language model (monolingual) 5 b > 1 t words
* data scaling
** 2x monolingual + .5 BLEU, 2x parallel + 2.5 BLEU (% BLEU)
* google cluster infrastructure - barroso et al IEEE Micro volume 23, issue 2 march 2003.
* ''language model requires only few number of bits?? (down to 4 bits)''
* target for rare features
* sentence specific LMs - project LM or phrase table to target sentence.
MapReduce apps:
* lm training
* EM training for word alignment
* phrase extraction
Diversity versus Universality (Tsuji): sentence not as significant in Asian languages
wa, koso, sura, dake, mo (case and topic/theme marking together in one particle) => topic packing different?
Asian anaphoric ref and other discourse refs don't map well to English. semantics of "sentence" conditional on context (more naturally occurring in CJK)
Linguistic Differences for NLP (Tsou): higher entropy. genitive construction -> ambiguity.
Corpora (Bhattacharaya): morphological mining: syncretism (exhibiting same surface form for different cases). syncretism dealing with ambiguity / WSD. homology. Idea -> deal with less entropic languages first.
(Bird): digital divide in asian languages. interesting students in native language topics. publication quality for applications in different languages.
(Calzolari): Basic Language Resource Kit.
(Maxwell): minor languages deserve special session.
* children want entirely different way to read books -> scrollable interface
* survey 12 children and supporting adults
* have available 152 transcripts and tech report
* many languages make it difficult for an interface
* preferred reading physical books but use ICDL for searching (especially since device too expensive to use, operate and insure)
transition to use the book or to use the device?
Chao Wang, Michael Collins and Philipp Koehn
syntactic reordering. rule based system with weights?
local reordering not very effective in terms of improvements in BLEU, perhaps due to phrase table capturing it.
did a study of which rules are applied.
Stanford - Stat MT, NER, Speech Reco (Bayesian Inference Models)
New project: looking at collaboration networks and analyzing work to see whether interdisciplinary centres are really worth $$$
- coauthorship
- textual analysis
Pascal RTE participation - slide 3
* not just textual overlap
s4:
~tilde operator (semantically similar)
* can we build such tools for sentences, paragraphs?
parse tree to dependency parse relations
Current work:
* predicate argument structure
* natural logic => very cool (broadening ok, narrowing ok in negative contexts)
* downward monotonicity comes from negation in implicit contexts too.
s45 give to qiul natural logic
--------------
evidence based retrieval
complexity of analysis
data structures for indexing
how to increase coverage
CiteSeer (hereafter, cs) is divided into several directories. The main ones are bin/, db1/, lib/, papers/. Several of the directories in our distribution are empty.
cs is written entirely in perl, and has a lot of legacy code. It has been refactored into a number of perl modules, which are contained in the lib/ subdirectory.
We can look at each of these directories separately:
- BinDirectory: scripts for starting, shutting down and running maintenance on the cs system.
- DB1Directory: databases for keeping the forward and backward links used by the system. About 50 gbs or so.
- LogDirectory: logs of the searches and other things that happen (index, spidering events) in citeseer.
- LibDirectory: common perl libraries for running things in bin/.
- PapersDirectory: the bulk of cs, cached copies of all the papers it indexes that it kept. About 1.8 tbs.
To get an idea of what happens when a query is issued, you might also want to look at the tiddlers tagged {{{walkthrough}}}. Each walks through one screen and focuses on the procedure calls that fetch the information output to that screen.
Rainer Lienhart and Alexander Hartmann
J Electronic Imaging 11 (4), 1-0 (Oct 2002)
Available: in paper only, LINC has some problems getting this through OpenURL
Relevant to: image classification, non-photographic image classification
Printed, highlighted, and filed.
The authors examine web image classification over a database of 300,000 images. They divide the non-image categories into presentation slides, comics/cartoons and other. Our classification for Fei's project is a bit more comprehensive but not motivated by corpus study.
The work is a machine learning feature oriented work, achieving high accuracy using simple image only (raster based) features. Colormap and proportion of the picture wrt to the colormap seems to be some of the most salient features.
For the top-level photo/non-photo classification, AdaBoost was used (similar to our work), and feature pruning is inherently done through decision stump feature selection. The highlighted features show that the four features for classification are: 1) total colors, 2) the prevalent color, 3) the fraction of pixels with distance > 0 (f1), and 4) the ratio f1/f2, where f2 is similar to 3) but uses a high threshold rather than zero. Surprisingly, edge detection (an expensive feature) doesn't appear to be too useful. All selected features were based on the colormap and not on the locality / placement of the pixels in the image. Dimension features were not used.
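A rough sketch of the two simplest colormap features, assuming the image is already decoded into an array of [r, g, b] triples. "Prevalent color" is read here as the fraction of pixels carrying the single most frequent color, which is one possible interpretation rather than the paper's exact definition; the distance-based features f1 and f2 are left out because their definition is not recorded in these notes. The {{{colorFeatures}}} name is hypothetical.

```javascript
// Colormap-only features from raster data: no edges, no pixel positions.
function colorFeatures(pixels) {
  const counts = new Map();
  for (const [r, g, b] of pixels) {
    const key = (r << 16) | (g << 8) | b; // pack 8-bit channels into one integer
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let max = 0;
  for (const c of counts.values()) if (c > max) max = c;
  return {
    totalColors: counts.size,
    prevalentFraction: pixels.length ? max / pixels.length : 0,
  };
}

// A tiny "slide-like" image: mostly one background color.
const tiny = [[255, 255, 255], [255, 255, 255], [0, 0, 0], [255, 255, 255]];
console.log(colorFeatures(tiny));
```

Graphics such as slides and cartoons would be expected to show few distinct colors and a large prevalent fraction, while photographs show the opposite, which is why decision stumps on these values can already separate much of the data.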
For the non-photo classification, text proves to be an important feature, and they capitalize on their group's previous research to detect text. Here, edge detection proves to be the second most useful feature after aspect ratio, which according to Table 3, accounts for over 95% accuracy. This leads me to believe that an optimized Hough transform for only vertical lines may be able to be used, to lower the complexity of the feature extraction. Also, presentation slides exported from powerpoint and others might be detectable by their embedded metadata rather than raster data properties.
Strengths:
- demonstrates that colormap features are a very strong key for non-photograph image classification.
- also does some error analysis that illustrates some borderline cases.
- uses only jpeg compressed images for their study.
Weaknesses:
- no information about the kappa or percentage agreement between assessors. It is presumed that the task is easy and 100% doable.
- 95% + accuracy in non-photograph classification only subdivides into two classes: comics vs. presentation slides.
Katsuro Inoue (Osaka Univ.)
clone can be motivated by efficiency (copy code rather than proc call which is more expensive)
* type 1 - identical
* type 2 - as given example (clones can have different identifiers but largely are structurally identical)
* type 3 - semantically sim, but syntax very diff
* Program Dependency Graph (PDG) vs. AST.
** sub-graphs are identified as code clones: R Komondoor and S Horwitz, Using slicing to identify duplication in source code, ISSA
* AST ref: ID Baxter, A Yahin et al. Clone Detection Using Abstract Syntax Trees, ICSM 98. - made commercial tool: CloneDR.
** Shortcoming: syntax of two progs need to be syntactically comparable
* Metrics
** by binning - J Mayrand, C Leblanc and EM Merlo, Experiment on the automatic detection of function clones in a software system using metrics
* Token-based
** T Kamiya et al. see ToSE CCFinder - canonicalize identifiers then index using suffix array to find repeated. finds lcs subsequences. http://ccfinder.net/ccfinderx.html
** Libra - IR system for source code fragments - also SparsJ
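The token-based idea (canonicalize identifiers, then look for repeated token runs) can be sketched as below. This is not CCFinder's algorithm: the tokenizer is a crude regex, the keyword list is a made-up subset, and a quadratic dynamic program for the longest common contiguous token run stands in for the suffix array index. It is only meant to show why type 2 clones become identical after canonicalization.

```javascript
// Assumed, incomplete keyword list; anything else that looks like a name is an identifier.
const KEYWORDS = new Set(["if", "else", "for", "while", "return", "var", "function"]);

// Split code into tokens and replace identifiers and numbers with placeholders.
function canonicalize(code) {
  const tokens = code.match(/[A-Za-z_]\w*|\d+|[^\s\w]/g) || [];
  return tokens.map((t) => {
    if (/^[A-Za-z_]/.test(t)) return KEYWORDS.has(t) ? t : "$id";
    if (/^\d/.test(t)) return "$num";
    return t; // punctuation and operators are kept as-is
  });
}

// Length of the longest contiguous run of tokens shared by a and b.
function longestCommonRun(a, b) {
  let best = 0;
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) best = cur[j];
      }
    }
    prev = cur;
  }
  return best;
}

const frag1 = canonicalize("x = y + 1;");
const frag2 = canonicalize("total = count + 42;");
console.log(frag1.join(" "));
console.log(longestCommonRun(frag1, frag2));
```

A clone detector would report runs above some minimum token length; the suffix array makes finding all such runs across a whole code base feasible, which the pairwise comparison here does not.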
Mike Goffley - Maintenance
* also see cp-miner ToSE in march 2006 and MOSS Aiken's UCBerkeley system
Background: #ffc
Foreground: #000
PrimaryPale: #fc8
PrimaryLight: #f81
PrimaryMid: #b40
PrimaryDark: #410
SecondaryPale: #ffc
SecondaryLight: #fe8
SecondaryMid: #db4
SecondaryDark: #841
TertiaryPale: #e88
TertiaryLight: #c66
TertiaryMid: #944
TertiaryDark: #633
Vasudeva Varma
With Demo by Prasad Pingale
Encoding ingest to Unicode8
(2nd byte preserves phonetic similarity) artifact of Unicode
Search in CLIR
Indian Lang have larger alphabet (all phonetically)
Lots of spelling variations
/***
|''Name:''|CryptoFunctionsPlugin|
|''Description:''|Support for cryptographic functions|
***/
//{{{
if(!version.extensions.CryptoFunctionsPlugin) {
version.extensions.CryptoFunctionsPlugin = {installed:true};
//--
//-- Crypto functions and associated conversion routines
//--
// Crypto "namespace"
function Crypto() {}
// Convert a string to an array of big-endian 32-bit words
Crypto.strToBe32s = function(str)
{
var be = Array();
var len = Math.floor(str.length/4);
var i, j;
for(i=0, j=0; i<len; i++, j+=4) {
be[i] = ((str.charCodeAt(j)&0xff) << 24)|((str.charCodeAt(j+1)&0xff) << 16)|((str.charCodeAt(j+2)&0xff) << 8)|(str.charCodeAt(j+3)&0xff);
}
while (j<str.length) {
be[j>>2] |= (str.charCodeAt(j)&0xff)<<(24-(j*8)%32);
j++;
}
return be;
};
// Convert an array of big-endian 32-bit words to a string
Crypto.be32sToStr = function(be)
{
var str = "";
for(var i=0;i<be.length*32;i+=8)
str += String.fromCharCode((be[i>>5]>>>(24-i%32)) & 0xff);
return str;
};
// Convert an array of big-endian 32-bit words to a hex string
Crypto.be32sToHex = function(be)
{
var hex = "0123456789ABCDEF";
var str = "";
for(var i=0;i<be.length*4;i++)
str += hex.charAt((be[i>>2]>>((3-i%4)*8+4))&0xF) + hex.charAt((be[i>>2]>>((3-i%4)*8))&0xF);
return str;
};
// Return, in hex, the SHA-1 hash of a string
Crypto.hexSha1Str = function(str)
{
return Crypto.be32sToHex(Crypto.sha1Str(str));
};
// Return the SHA-1 hash of a string
Crypto.sha1Str = function(str)
{
return Crypto.sha1(Crypto.strToBe32s(str),str.length);
};
// Calculate the SHA-1 hash of an array of blen bytes of big-endian 32-bit words
Crypto.sha1 = function(x,blen)
{
// Add 32-bit integers, wrapping at 32 bits
var add32 = function(a,b)
{
var lsw = (a&0xFFFF)+(b&0xFFFF);
var msw = (a>>16)+(b>>16)+(lsw>>16);
return (msw<<16)|(lsw&0xFFFF);
};
// Add five 32-bit integers, wrapping at 32 bits
var add32x5 = function(a,b,c,d,e)
{
var lsw = (a&0xFFFF)+(b&0xFFFF)+(c&0xFFFF)+(d&0xFFFF)+(e&0xFFFF);
var msw = (a>>16)+(b>>16)+(c>>16)+(d>>16)+(e>>16)+(lsw>>16);
return (msw<<16)|(lsw&0xFFFF);
};
// Bitwise rotate left a 32-bit integer by 1 bit
var rol32 = function(n)
{
return (n>>>31)|(n<<1);
};
var len = blen*8;
// Append padding so length in bits is 448 mod 512
x[len>>5] |= 0x80 << (24-len%32);
// Append length
x[((len+64>>9)<<4)+15] = len;
var w = Array(80);
var k1 = 0x5A827999;
var k2 = 0x6ED9EBA1;
var k3 = 0x8F1BBCDC;
var k4 = 0xCA62C1D6;
var h0 = 0x67452301;
var h1 = 0xEFCDAB89;
var h2 = 0x98BADCFE;
var h3 = 0x10325476;
var h4 = 0xC3D2E1F0;
for(var i=0;i<x.length;i+=16) {
var j,t;
var a = h0;
var b = h1;
var c = h2;
var d = h3;
var e = h4;
for(j = 0;j<16;j++) {
w[j] = x[i+j];
t = add32x5(e,(a>>>27)|(a<<5),d^(b&(c^d)),w[j],k1);
e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
}
for(j=16;j<20;j++) {
w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
t = add32x5(e,(a>>>27)|(a<<5),d^(b&(c^d)),w[j],k1);
e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
}
for(j=20;j<40;j++) {
w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
t = add32x5(e,(a>>>27)|(a<<5),b^c^d,w[j],k2);
e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
}
for(j=40;j<60;j++) {
w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
t = add32x5(e,(a>>>27)|(a<<5),(b&c)|(d&(b|c)),w[j],k3);
e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
}
for(j=60;j<80;j++) {
w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
t = add32x5(e,(a>>>27)|(a<<5),b^c^d,w[j],k4);
e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
}
h0 = add32(h0,a);
h1 = add32(h1,b);
h2 = add32(h2,c);
h3 = add32(h3,d);
h4 = add32(h4,e);
}
return Array(h0,h1,h2,h3,h4);
};
}
//}}}
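A quick way to sanity-check the plugin above is to compare its output against an independent SHA-1 implementation. The sketch below is not part of TiddlyWiki: it assumes a Node.js environment and uses Node's built-in `crypto` module, and `referenceSha1Hex` is a hypothetical helper name. It upper-cases the digest because `Crypto.be32sToHex` emits upper-case hex, and it encodes the string as latin1 to mirror the `&0xff` masking in `Crypto.strToBe32s`.

```javascript
const crypto = require("crypto");

// Hypothetical helper: reference SHA-1 digest as upper-case hex,
// matching the letter case produced by Crypto.be32sToHex.
function referenceSha1Hex(str) {
  // latin1 maps each character up to U+00FF to exactly one byte
  const bytes = Buffer.from(str, "latin1");
  return crypto.createHash("sha1").update(bytes).digest("hex").toUpperCase();
}

console.log(referenceSha1Hex("abc")); // A9993E364706816ABA3E25717850C26C9CD0D89D
```

With the plugin loaded in a TiddlyWiki page, `Crypto.hexSha1Str("abc")` should give the same string if the implementation is behaving.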
update summary task focus
lm background model building by waterloo
umd
multiple alternative sentences compressions - like fergus
using grammar as restrictor
mmr version accounts for synonyms - use e-f-e bridge for paraphrase generation with frequency
eval
pyramids - what is included what is not good for error checking
still quite big gap between top pyramid and gaps?
some way to see which summaries really hard
d0739 - eg. really hard to id
2008: held every November, co-located with and preceding TREC (Monday and Tuesday)
- submissions due in summer
- held at Gaithersburg
- collab with IR researchers
- send hoa email about any related duc work just citation not pdf
- biggest change - qa comes to duc from trec
- entailment / inferencing track?
- main task dropped
- update task to become primary task, to include noisy documents in 2009
- longer sequence is really difficult because event evolution is generally short
- each cluster would still have same amount (eg 10) articles per cluster
- nist assessors to do nugget based / pyramid based annotation and scoring
- majority cost is in making 4 manual summaries
pilot task
- blogs opinions = doc is a post plus comments on it
- univ of glasgow
- clustering opinions and summarizing them
- maybe a classification task? IE task rather than summarization generation task?
- assign wangxuan to do this task?
- how to create a summary of a blog from nist? lucy asked about this.
- paul jones ibiblio
john conroy did linear fit with responsiveness with rouge (automatic). if we improve rouge do we improve content responsiveness.
lucy: reflection on duc / post duc analysis
what differentiates work in duc vs trec?
- we do assessment without assessment pooling
- we do fluency evaluation
Transformation based dependency parsing
transform trees for the purpose of learning to better help replication.
* change prague style annotation to melcuk style annotations. melcuk style annotation gives more tree dependency encoding, which might help in learning trees.
* focus on coordination and verb groups
originated in prog language - aho 1969
2006 had some papers for nl to lf
- pruning n best list by entropy? nah - agreement, systemic syntax problems. David says introduce syntax.
- rank3 trick from both sides?
- get cites for synch cfg for pls.
Chaomei Chen et al. (Drexel)
information overload - find who's at the forefront of the research
accounting for the timeliness of information is the key aspect of this work
especially when dealing with large datasets as in astronomy (gray&szalay 2004) sloan digital sky survey (sdss)
ackerman study why key areas explode:
* increase in multi author paper
* presence of one or small group of seminal papers
Use '''H index''' to classify, but modify to include H_c H_t
* where S_t weights relatively recent citations more heavily than earlier ones, to measure recent impact
* where S_c is adjusted for publication age
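For reference, a minimal sketch of the plain (unmodified) H index that the variants above start from. The recency and age adjustments are not spelled out in these notes, so they are left out; `hIndex` is a hypothetical name.

```javascript
// Plain h-index: the largest h such that h papers have at least h citations each.
function hIndex(citationCounts) {
  const sorted = [...citationCounts].sort((a, b) => b - a); // descending order
  let h = 0;
  while (h < sorted.length && sorted[h] >= h + 1) {
    h++;
  }
  return h;
}

console.log(hIndex([10, 8, 5, 4, 3])); // 4
```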
use burst detection, only looking at surface text
*
/***
|''Name:''|DeprecatedFunctionsPlugin|
|''Description:''|Support for deprecated functions removed from core|
***/
//{{{
if(!version.extensions.DeprecatedFunctionsPlugin) {
version.extensions.DeprecatedFunctionsPlugin = {installed:true};
//--
//-- Deprecated code
//--
// @Deprecated: Use createElementAndWikify and this.termRegExp instead
config.formatterHelpers.charFormatHelper = function(w)
{
w.subWikify(createTiddlyElement(w.output,this.element),this.terminator);
};
// @Deprecated: Use enclosedTextHelper and this.lookaheadRegExp instead
config.formatterHelpers.monospacedByLineHelper = function(w)
{
var lookaheadRegExp = new RegExp(this.lookahead,"mg");
lookaheadRegExp.lastIndex = w.matchStart;
var lookaheadMatch = lookaheadRegExp.exec(w.source);
if(lookaheadMatch && lookaheadMatch.index == w.matchStart) {
var text = lookaheadMatch[1];
if(config.browser.isIE)
text = text.replace(/\n/g,"\r");
createTiddlyElement(w.output,"pre",null,null,text);
w.nextMatch = lookaheadRegExp.lastIndex;
}
};
// @Deprecated: Use <br> or <br /> instead of <<br>>
config.macros.br = {};
config.macros.br.handler = function(place)
{
createTiddlyElement(place,"br");
};
// Find an entry in an array. Returns the array index or null
// @Deprecated: Use indexOf instead
Array.prototype.find = function(item)
{
var i = this.indexOf(item);
return i == -1 ? null : i;
};
// Load a tiddler from an HTML DIV. The caller should make sure to later call Tiddler.changed()
// @Deprecated: Use store.getLoader().internalizeTiddler instead
Tiddler.prototype.loadFromDiv = function(divRef,title)
{
return store.getLoader().internalizeTiddler(store,this,title,divRef);
};
// Format the text for storage in an HTML DIV
// @Deprecated Use store.getSaver().externalizeTiddler instead.
Tiddler.prototype.saveToDiv = function()
{
return store.getSaver().externalizeTiddler(store,this);
};
// @Deprecated: Use store.allTiddlersAsHtml() instead
function allTiddlersAsHtml()
{
return store.allTiddlersAsHtml();
}
// @Deprecated: Use refreshPageTemplate instead
function applyPageTemplate(title)
{
refreshPageTemplate(title);
}
// @Deprecated: Use story.displayTiddlers instead
function displayTiddlers(srcElement,titles,template,unused1,unused2,animate,unused3)
{
story.displayTiddlers(srcElement,titles,template,animate);
}
// @Deprecated: Use story.displayTiddler instead
function displayTiddler(srcElement,title,template,unused1,unused2,animate,unused3)
{
story.displayTiddler(srcElement,title,template,animate);
}
// @Deprecated: Use functions on right hand side directly instead
var createTiddlerPopup = Popup.create;
var scrollToTiddlerPopup = Popup.show;
var hideTiddlerPopup = Popup.remove;
// @Deprecated: Use right hand side directly instead
var regexpBackSlashEn = new RegExp("\\\\n","mg");
var regexpBackSlash = new RegExp("\\\\","mg");
var regexpBackSlashEss = new RegExp("\\\\s","mg");
var regexpNewLine = new RegExp("\n","mg");
var regexpCarriageReturn = new RegExp("\r","mg");
}
//}}}
Alexander Yates, Stefan Schoenmackers and Oren Etzioni
* QA parsing does implausible parsing, phrase attachment
* ''BUG'': look at the TextRunner paper (2006). Is any of their tuple data available?
* chris manning: differentiate between fishy parsing and value of using the web data.
* have four filters to try to prove/correct parse
Jeffrey Pomerantz, Sanghee Oh, Barbara M. Wildemuth, Seungwon Yang, Edward A Fox
ask jeff for modules
http://curric.dlib.vt.edu/
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka (ECAI 06)
find key terms for each namesake in a collection
assume one namesake per article (one sense per discourse)
c-value / nc-value: need to read K. T. Frantzi and S. Ananiadou
used group averaged HAC to do clustering
compared against baseline TF*IDF. hard to say whether it actually does well; no analysis.
Saif Mohammad and Graeme Hirst
* only look at distributional similarity
* use coarse sense to limit complexity (1000^^2^^, instead of 100000^^2^^)
* first pass build base WCCM: category and a word -> just get the primary sense?
* is their human data available? papers on edge weights?
* eval: do word correction in context; doing hirst and st-onge correction ratio
* Q: what about antonyms?
Bryan Chee and Bruce Schatz
use lexical network
then use graph clustering algorithm to find config with best modularity
cluto clustering software C++
http://beespace.uiuc.edu/
PMI as an indicator of what? -> collocation?
use only mid frequency terms then do PMI calc
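A rough sketch of the "mid frequency filter, then PMI" step, assuming raw co-occurrence counts are available. The function names and the count thresholds are hypothetical; the talk's actual cutoffs are not in these notes.

```javascript
// PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts.
function pmi(countXY, countX, countY, total) {
  if (countXY === 0) return -Infinity; // the pair never co-occurs
  const pXY = countXY / total;
  const pX = countX / total;
  const pY = countY / total;
  return Math.log2(pXY / (pX * pY));
}

// Keep only terms whose count lies in [minCount, maxCount] (thresholds are assumptions).
function midFrequencyTerms(termCounts, minCount, maxCount) {
  return Object.keys(termCounts).filter(
    (t) => termCounts[t] >= minCount && termCounts[t] <= maxCount
  );
}

console.log(midFrequencyTerms({ the: 1000, gene: 20, zyx: 1 }, 5, 100)); // [ 'gene' ]
console.log(pmi(2, 2, 2, 8)); // 2
```

High PMI means two terms co-occur more often than independence would predict, which is why it reads as a collocation signal; very rare terms inflate it, hence the frequency filter.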
Xiaojun Wan, Jianwu Yang and Jianguo Xiao
Not actually ranking, but re-ranking initially relevant documents
* texttile all documents first
* manifold ranking structure
* tiling done on query "doc" as well, queries are from LDC TDT-3
** compared doc + MR and doc + TextTiling
** only seems to work for few (<75) initial documents.
George Buchanan (Swansea) (with fernando)
- information literacy in ugrads changes all the time
- info needs (as serendipity may push new goals from focus goal)
judging first glance relevance
needed to give relevance ratings rather than 1/0, unlimited time
result list method different when present
how many documents? 20 docs
paper: 23
result: 17 (1/5th time use just result list)
electronic folder: 28
headers and captions more useful than title - "titles can be misleading"
conclusion works for paper but not for electronic mode, why?
figures and tables more looked at.
34% docs never scroll and looking only above the fold (1/3 of page)
64% only first page
search using ctrl-f not much
scrolling more painful with electronic mode
doc length - more negative effect on paper
"first glance <10 sec" - relevance made - other time to only confirm
q: unlimited time really correct exp setup?
q: interference in context to first glance relevance?
q: image navigation?
All cs applications must first initialize and read the database, after which the $did variable can be altered (e.g., set to any value from 1 to MAXDOCS) to retrieve the hash of information for the document.
Go to 1.html of the server. It should show the document "36 problems for semantic interpretation" by Scheler. There are several areas of the results page:
The light blue title box, containing:
* a left hand panel:
** title
** author
** hyperlinks
* a right hand panel
** view or download
** cache links
** from (source) information
The body, containing:
* summary
* grey rating box
* abstract
* similar documents
* bibtex
* citations
* samesite
* hyperlinks about online articles
* cs credits: these are generated by the $footnote variable.
Aside from the cs credits, these are generated from the call to DocumentToHTML
DocumentToHTML calls DocumentDownloadBar as
&DocumentDownloadBar ($hit, $param, \%SourceHTML, $sArticleLink);
then the code below is from DDB:
my $sURL = $$hit{'url'};
if (defined $$hit{'homepages'} && $$hit{'homepages'} =~ m/\b(?:url|file)\s*=\s*(\S+)/si) {
my $url = $1; if ($url =~ /^(http|ftp):\/\// && $url !~ /\?/) { $sURL = $url; }
}
$sPSHREF is the URL that actually gets written to the page. It is derived from $sURL by RedirectHREF, so following the link goes through the logging facility of cs.
my $sPSHREF = &RedirectHREF ($sURL, $param, $profile::nDownloadValue, 'Download'); # my $sPSHREF = &RedirectHREF ($sURL, \%param, 0, 'Download');
if ($$hit{'locatedon'} !~ m/^0 /) {
print $sPSHREF . &URLShortLinkMax ($sURL, 36) . "</a><br>\n";
}
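To make the URL selection in the DDB excerpt easier to follow, here is the same decision restated as a small JavaScript sketch. It is an illustration only, not cs code: `chooseDownloadUrl` and the plain-object shape of `hit` are assumptions, and the redirect/logging step (RedirectHREF) is left out.

```javascript
// Mirrors the Perl logic above: prefer a url=/file= entry from the homepages
// field when it is an http/ftp URL without a query string; otherwise keep hit.url.
function chooseDownloadUrl(hit) {
  let url = hit.url;
  const m =
    typeof hit.homepages === "string"
      ? hit.homepages.match(/\b(?:url|file)\s*=\s*(\S+)/i)
      : null;
  if (m && /^(http|ftp):\/\//.test(m[1]) && !m[1].includes("?")) {
    url = m[1];
  }
  return url;
}

console.log(
  chooseDownloadUrl({ url: "http://a.example/x.ps", homepages: "url = ftp://b.example/y.ps" })
); // ftp://b.example/y.ps
```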
Yee Seng Chan and Hwee Tou Ng
1:37 - 1:55 - 2:02
EM predictor
sense priors - saerens et al 2002 em based method
count merging - active learning to assign diff weights for diff examples
- em part not even worth explanation too brief
- count merging need to be tuned, how tuned?
- results given by % but not with respect to training data or abs counts, how are they similar?
- - sense prior part quite confusing
- purple triangles not the same number
- nb based wsd - what about for other methods?
- count merge helps more for less data, why?
[[EditSf0|file:///M:/public_html/knmnynWiki.html]]
[[EditMacStick|file:///Volumes/IMATION%202G/knmnynWiki.html]]
Robert Capra, Gary Marchionini, Jung Sun Oh, Fred Stutzman, Yan Zhang
topic, genre, region, format facets for bureau labor stats data
three interface style: orig web site, relation browser, simple facet browser + breadcrumb
no breadcrumb trail
three task types: simple lookup (one facet w/ time + place), complex lookup, exploratory
bookmarklet to capture answer and page and highlighted span
lookup tasks are as easy using simple stuff as well as complex lookup
explore seems better on relation browser
two way anova
no significant difference on almost all facets of the evaluation
qualitative use was better
* users recognize / value good organization
* ''need'' keyword search (maybe first before browse?)
Steven Garcia and Andrew Turpin
d gap analogy for document retrieval
cluster related documents together in index
brought up time sensitivity messing with results - actually seems quite stable.
50% to 300% speedup by access-ordering
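Assuming "d gap" here refers to the usual gap coding of postings lists (store differences between consecutive document IDs rather than the IDs themselves), a minimal sketch looks like this. `toGaps` and `fromGaps` are hypothetical names. Documents that end up with neighbouring IDs after reordering give small gaps.

```javascript
// Convert an ascending list of document IDs into d-gaps.
function toGaps(docIds) {
  return docIds.map((id, i) => (i === 0 ? id : id - docIds[i - 1]));
}

// Recover the original document IDs by running sum.
function fromGaps(gaps) {
  let running = 0;
  return gaps.map((g) => (running += g));
}

console.log(toGaps([3, 7, 8, 15])); // [ 3, 4, 1, 7 ]
console.log(fromGaps([3, 4, 1, 7])); // [ 3, 7, 8, 15 ]
```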
Wisam Dakka and Luis Gravano
search results to generate multi-doc summarizations
desiderata
*informative snippets: highlight essence
*browsing ability: navigate, to related stories
*speed: fast, online
baseline
* offline summary and match (irrelevant)
* or online summaries and clustering (slow)
hybrid
* reuse old clusters, merge w/ newly generated clusters
* online clusters and offline clusters together
* generate summaries for top k clusters
how to grade clusters? decide which to use for a query?
idea: use cluster-level and query level features - to build boolean classification
M Pasquier NTU
- avatars both in graphics as well as animatronic
- intelligent tutoring system - affect based feedback (fielded in primary schools)
Type the text for 'EnableAnimations'
Debra J. Slone
- Available from JASIS 51 (8):757-773, 2000
- Printed and Filed (with known-item project bibliography)
- Relevant to: Known Item searching, OPAC, query strategies, user studies.
This paper examines the searching strategies of library users by performing a study and questionnaire of 32 library patrons. Slone examines three major types of library queries: area searches, known-item searches, and unknown-item searches.
My summary only concentrates on the known-item searches, since that is the focus of our current research. In the abstract, Slone writes that known-item searches experience "the most disappointment" and are characterised by "simplicity".
1. Finding results in the OPAC doesn't mean that the known-item query is satisfied. Slone shows that even in cases where a known-item search finds a resource, it may not be the desired one. A question then is "how do we figure out what the proportion of correct answers is?"
2. Not satisfied with other material. The patron wants this resource specifically and not others. Also dissatisfaction when the book is unavailable. This leads to critical and negative opinion of the library OPAC. My opinion: this suggests that OPACS that can infer that the current query is a known-item one should present circulation information right away (and suggest alternatives for finding another copy if one is not available, e.g. hold, ILL, substitution for an area search, alternative titles).
3. Known items may not be retrieved because of spelling errors. Children in public libraries may be most affected by this. Also spellings of author names.
4. Confidence and frustration levels differ for known-item vs unknown-item search. Confidence levels are higher and frustration lower for known-item searching.
CMU
LDA, HLDA, HMLDA
parametric models:
* event model (latent stationary)
* motion model (latent, transition)
* sensor model (observation model)
hardy-weinberg equilibrium
dp equivalent to polya urn process - draw balls / replace with 2 or draw from base distribution
exchangeability property
coalescence tree - ewens sampling formula
multiple clustering in hlda
hierarchical lda - replace draw for new color by picking from upper level lda.
* ancestor mutation rate: ancestor specific? what actual mutation rate?
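The Polya urn view of the DP noted above can be sketched as a small simulation. This is a generic textbook-style sketch, not code from the talk: `polyaUrn`, the concentration parameter `alpha` (must be > 0), and the injectable `rng` are assumptions made for illustration.

```javascript
// Draw n values from a Polya urn with concentration alpha (> 0).
// baseDraw() samples a fresh value from the base distribution;
// rng() returns a uniform number in [0, 1).
function polyaUrn(n, alpha, baseDraw, rng = Math.random) {
  const balls = [];
  for (let i = 0; i < n; i++) {
    // With probability alpha / (alpha + i), draw a new value from the base distribution.
    if (rng() < alpha / (alpha + i)) {
      balls.push(baseDraw());
    } else {
      // Otherwise copy a uniformly chosen earlier ball
      // (the "draw a ball, put back two of that colour" step).
      balls.push(balls[Math.floor(rng() * i)]);
    }
  }
  return balls;
}

// Example: colours are integers handed out by a counter standing in for the base distribution.
let nextColour = 0;
console.log(polyaUrn(10, 1.0, () => nextColour++));
```

Because old values are copied in proportion to how often they already appear, a few values dominate (rich get richer), which is what produces the clustering behaviour.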
Daniel S Leite, Lucia H M Rino, Thiago A S Pardo, Maria das Gracas V Nunes
brazilian portuguese summarization = using thes + stop + stem and texttiling
Wenjie Li (HK) et al.
* use Event Terms (ET) and Named Entities (NE) in graph structure; links are undirected
* extract ET and NEs
* tuple based: NE connected by ETs. (ETs are verb phrases?)
* ask ''Linzihen'' to look into this.
* why R(ne,ne) are considered inter-event relevance?
* anaphora resolution used? Canonicalization
Philipp Koehn and Hieu Hoang
motivated by morphological analysis.
factor could be any syntactic or semantic information at the token level
factor can then get rid of inconsistent hypotheses
do lemma then generate surface form from there
q: bilingual frequency based morphology for phrase based features?
Radu Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni
* mention detection - NER plus chaining with pronoun, NP resolution
* multi label classification in NER.
* adding in partial data
* tried all in one (AIO), joint (hybrid) model and cascade model
* cascade model does better, able to model features better and mitigates search space label explosion that happens in the AIO.
Edward Loper
applicable to all models doing viterbi search
idea: search the search space using dp
pieces not limited to input and output basic elements but could apply to sets?
idea: pre and post process output (observation) sequence before feed to learner
post process back to get original model prediction
use FSTs for the (pre/post) transformation process
easy mod to make but why does this make sense?
q (jason eisner): you can do backoff in model selection but what about in your model?
q (hal daume): how about for trees?
Matching: knowledge base:
Binding: associate unmatched blocks
p-delimiter
Joining: feature: average number of terms.
http://www.dcc.ufam.edu.br/~eccv/flux-cim
Yang Liu, Yun Huang, Qun Liu and Shouxun Lin
add forest to tree rules
q: speed of translation?
q (kevin knight): what are default rules?
Pei Hsia, Jayarajan Samuel, Jerry Gao, David Kung, Yasufumi Toyoshima and Cris Chen
IEEE Software
- Available through LINC
- Printed and filed
- Relevant to: scenario analysis.
Discusses scenario engineering for system design. Gives an example of a PBX switch. An okay introduction to scenario analysis for an outsider. Uses FSM notation to encode a scenario (that is the "formal" side of the paper).
I had to read this for an MSc defense to get an introductory view of the area.
not core research
overview of demand space
anyone can search? ala ratatouille
driven by search success, people asking for enterprise search
char:
web relies on pop not correct
none or little explicit linkage
security btw docs
mixed content, diverse source, different access methods
3 apps: alert, qa, searchRel
domains now: insurance, fin/business/banking, tele:ringtoneSearch
how to browse fin models?
how to detect true/fake emails/data
domains future: publish/new media, legal, intranet
advertising, customizing text books
case management and discovery
BusinessIntelligence BI user-driven / too hard to use, secretaries do interface job
To get started with this blank TiddlyWiki, you'll need to modify the following tiddlers:
* SiteTitle & SiteSubtitle: The title and subtitle of the site, as shown above (after saving, they will also appear in the browser title bar)
* MainMenu: The menu (usually on the left)
* DefaultTiddlers: Contains the names of the tiddlers that you want to appear when the TiddlyWiki is opened
You'll also need to enter your username for signing your edits: <<option txtUserName>>
Daniel Gopher, Technion
* information technology
* knowledge engineering
example: cognitive trainer for basketball players (applied cognitive engineering => ACE)
* bird's eye view - coach view
* make high pressure decisions in fluid manner by training via video game
trends
* towards ''networked model'' - many users connected by many servers from anywhere (from imada and associates)
* richest people in the world do knowledge based jobs
* towards individualized tv / narrowcast mediums
hfe questions
* good theory is necessary to explain: perception, memory, problem solving, attention limits, acquisition of skills and knowledge, and metacognitive processes
application areas
* cell phone
* MIS
* distance based learn
* medical system
* VR
How hazardous is health care? very. from some journal?
Martin Helander, NTU
perhaps better titled as HFE in NTU?
* team goals and efficacy
* design process understanding / modeling
* process control : things are complex / lots of data in different units
* cognitive demands
Piotr Indyk
Randomized Dimensionality Reduction
flattening lemma - Johnson-Lindenstrauss lemma 84
push high dim data points to random hyperplane
can embed n points in O(log n / epsilon^2) dimensions; depends not on the original dimensionality, but on the number of points
related to chernoff bounds but using normally distributed RVs instead of boolean ones
these bounds are not specific to any "good" randomized projection, though
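A minimal sketch of a Gaussian random projection in the spirit of the lemma above. Names are hypothetical, and the target dimension `k` is simply chosen by the caller here rather than derived from n and epsilon.

```javascript
// Standard normal sample via the Box-Muller transform.
function gaussian(rng = Math.random) {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Build a k x d matrix with independent N(0, 1/k) entries.
function randomProjectionMatrix(k, d, rng = Math.random) {
  const scale = 1 / Math.sqrt(k);
  return Array.from({ length: k }, () =>
    Array.from({ length: d }, () => gaussian(rng) * scale)
  );
}

// Project a d-dimensional point down to k dimensions.
function project(matrix, x) {
  return matrix.map((row) => row.reduce((sum, r, j) => sum + r * x[j], 0));
}

const M = randomProjectionMatrix(4, 100);
const point = Array.from({ length: 100 }, (_, i) => i / 100);
console.log(project(M, point).length); // 4
```

The 1/sqrt(k) scaling keeps squared lengths unchanged in expectation; the same matrix must be reused for every point so that pairwise distances stay comparable.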
sketching/streaming algos
to construct offline note best algo does it in O(dd') time
what about online?
update d wrt to only new stuff
* streaming can accommodate "jaccard coefficient" (semantic similarity) see broder 97
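For reference, the exact Jaccard coefficient that the streaming sketches approximate, as a minimal sketch. `jaccard` is a hypothetical name, and returning 1 for two empty sets is a convention chosen here. The sketching step itself is not shown.

```javascript
// Jaccard coefficient |A ∩ B| / |A ∪ B| over two collections of items.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1; // convention (assumption)
  let intersection = 0;
  for (const item of setA) {
    if (setB.has(item)) intersection++;
  }
  return intersection / (setA.size + setB.size - intersection);
}

console.log(jaccard(["a", "b", "c"], ["b", "c", "d"])); // 0.5
```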
how about knn?
voronoi works fine for 2-d but theoretically doesn't work for high dim
in practice use kd-trees
todo: look this up: locality-sensitive hashing (LSH)
code: crc handbook 03
code: on piotr's webpage
refs:
s. muthukrishnan / web (madalgo)
s. vempala
q: relation to kernel
q: relation to lsi/pca? anything in the middle?
q: sg-itg really a gap
q: wu li: lexical words
Dekai's leads:
marker words: patrick jules, andy white
HinD (Hindi Deconverter)
Deconversion -> transfer then generation (not 'standard' generation task, more like text to text translation)
has lexical choice problem
parent child ordering preferences as well as sibling order
UNL mostly for translation of english to other languages rather than for into english
TiddlyWiki supports all kinds of formatting options. Note this documentation is taken from [[TiddyWikiTutorial|http://www.blogjones.com/TiddlyWikiTutorial.html]].
*You can create ''Bold'' text by enclosing it in pairs of single quotes:
{{{
''bold text''
}}}
*You can create ==Strikethrough== text by enclosing it in pairs of equal signs:
{{{
==strikethrough text==
}}}
*You can __Underline__ text by enclosing it in pairs of underscores:
{{{
__underlined text__
}}}
*You can create //Italic// text by enclosing it in pairs of forward slashes:
{{{
//italic text//
}}}
*You can create ^^superscript^^ text by enclosing it in pairs of carets:
{{{
^^superscript text^^
}}}
*You can create ~~subscript~~ text by enclosing it in pairs of tildes:
{{{
~~subscript text~~
}}}
*You can change the text's @@color(green):color@@ by enclosing it in pairs of at-signs (@@@@) and specifying a text color:
{{{
@@color(yourcolorhere):colored text@@
}}}
*You can change the text's @@bgcolor(red):background color@@ by enclosing it in pairs of at-signs (@@@@) and specifying a background text color:
{{{
@@bgcolor(yourcolorhere):your text here@@
}}}
{{{
< ScriptAlias /cs "/export/bulk/citeseer/cs/bin/cs"
< ScriptAlias /compress "/export/bulk/citeseer/cs/bin/compress"
}}}
RewriteEngine on
{{{
< RewriteRule ^\/(details\/)?([a-zA-Z\-]+)([0-9]+)([a-zA-Z\-]+)(\.html)?$ /perl/cs?q=dbnum\%3D1\%2Cqtype\%3Ddetails:$2^$3^$4 [T=application/x-httpd-cgi]
< RewriteRule ^\/(details\/)?([a-zA-Z]+)\-([a-zA-Z\-]+)(\.html)?$ /perl/cs?q=dbnum\%3D1\%2Cqtype\%3Ddetails:$2^$3 [T=application/x-httpd-cgi]
< RewriteRule ^\/did\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Ddocument: [T=application/x-httpd-cgi]
< RewriteRule ^\/([0-9]+)(\.html)?$ /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Ddocument: [T=application/x-httpd-cgi]
< RewriteRule ^\/article\/([a-zA-Z\-]+)([0-9]+)([a-zA-Z\-]+)(\.html)?$ /perl/cs?q=dbnum\%3D1\%2Cqtype\%3Darticle:$1^$2^$3 [T=application/x-httpd-cgi]
< RewriteRule ^\/article\/([a-zA-Z]+)\-([a-zA-Z\-]+)(\.html)?$ /perl/cs?q=dbnum\%3D1\%2Cqtype\%3Darticle:$1^$2 [T=application/x-httpd-cgi]
< RewriteRule ^\/citations\/(.+)(\.html)?$ /perl/cs?q=$1\&cs=1\&submit=Search+Citations [T=application/x-httpd-cgi]
< RewriteRule ^\/documents\/(.+)(\.html)?$ /perl/cs?q=$1\&cs=1\&submit=Search+Indexed+Articles [T=application/x-httpd-cgi]
< RewriteRule ^\/verify\/([0-9]+) /perl/cs?account=activate&verify=$1 [T=application/x-httpd-cgi]
< RewriteRule ^\/context\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2CDID\%3D$2\%2Ccluster\%3Dnone\%2Cqtype\%3Dcontext: [T=application/x-httpd-cgi]
< RewriteRule ^\/contextsummary\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2CDID\%3D$2\%2Ccluster\%3Dnone\%2Csummary\%3Dyes\%2Cqtype\%3Dcontext: [T=application/x-httpd-cgi]
< RewriteRule ^\/cidcontext\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CCID\%3D$1\%2Ccluster\%3Dnone\%2Cqtype\%3Dsamecite: [T=application/x-httpd-cgi]
< RewriteRule ^\/track\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2CCID\%3D$2\%2Ccluster\%3Dnone\%2Cqtype\%3Dtrackgid: [T=application/x-httpd-cgi]
< RewriteRule ^\/cachedpage\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cpage\%3D$2\%2Cqtype\%3Dcachedpage: [T=application/x-httpd-cgi]
< RewriteRule ^\/cached\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dcached: [T=application/x-httpd-cgi]
< RewriteRule ^\/pdf\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dpdf: [T=application/x-httpd-cgi]
< RewriteRule ^\/djvu\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Ddjvu: [T=application/x-httpd-cgi]
< RewriteRule ^\/ps\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dps: [T=application/x-httpd-cgi]
< RewriteRule ^\/site\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dsamesite: [T=application/x-httpd-cgi]
< RewriteRule ^\/update\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dcorrectabstract: [T=application/x-httpd-cgi]
< RewriteRule ^\/updatecache\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dupdatecache: [T=application/x-httpd-cgi]
< RewriteRule ^\/dcache\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Ddcache: [T=application/x-httpd-cgi]
< RewriteRule ^\/addcomment\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Daddcomment: [T=application/x-httpd-cgi]
< RewriteRule ^\/comments\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dcomment: [T=application/x-httpd-cgi]
< RewriteRule ^\/editcomment\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Ccn\%3D$2\%2Cqtype\%3Daddcomment: [T=application/x-httpd-cgi]
< RewriteRule ^\/trackdid\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dtrackdid: [T=application/x-httpd-cgi]
< RewriteRule ^\/n?related\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2CDID\%3D$2\%2Ccluster\%3Dnone\%2Cqtype\%3Drelatedgid: [T=application/x-httpd-cgi]
< RewriteRule ^\/n?relatedgid\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2Cqtype\%3Drelatedgid: [T=application/x-httpd-cgi]
< RewriteRule ^\/check\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2Cqtype\%3Dcheck: [T=application/x-httpd-cgi]
< RewriteRule ^\/correct\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dcorrectdid: [T=application/x-httpd-cgi]
< RewriteRule ^\/correctgid\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2Cqtype\%3Dcorrectgid: [T=application/x-httpd-cgi]
< RewriteRule ^\/articles\/([0-9]+) /perl/cs?q=$1\&cs=1\&submit=Search+Indexed+Articles [T=application/x-httpd-cgi]
< RewriteRule ^\/surveys\/([0-9]+) /perl/cs?q=$1\&cs=1\&submit=Search+Indexed+Articles\&ao=Hubs [T=application/x-httpd-cgi]
< RewriteRule ^\/rawtext\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Drawtext: [T=application/x-httpd-cgi]
< RewriteRule ^\/citetext\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dcitetext: [T=application/x-httpd-cgi]
< RewriteRule ^\/n?bib\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CCID\%3D$1\%2Cqtype\%3Dbib: [T=application/x-httpd-cgi]
< RewriteRule ^\/n?gbib\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CGID\%3D$1\%2Cqtype\%3Dbib: [T=application/x-httpd-cgi]
< RewriteRule ^\/n?dbib\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CDID\%3D$1\%2Cqtype\%3Dbib: [T=application/x-httpd-cgi]
< RewriteRule ^\/n?abib\/([0-9]+)\/([0-9]+)\/([0-9]+) /perl/cs?q=dbnum\%3D1\%2CCID\%3D$1\%2CGID\%3D$2\%2CDID\%3D$3\%2Cqtype\%3Dbib: [T=application/x-httpd-cgi]
< RewriteRule ^\/recommend$ /perl/cs?recommend=Yes [T=application/x-httpd-cgi]
< RewriteRule ^\/rd\/([^\/]*)\/(.*)$ /perl/cs?profile=$1&rd=$2 [T=application/x-httpd-cgi]
< RewriteRule ^\/compress\/([^\/]*)\/(.*)\/([^\/]*)$ /perl/compress?e=$1&f=$2 [T=application/x-httpd-cgi]
< RewriteRule ^\/cache(\/.*)\/([^\/]*)$ $1
< RewriteRule \/r(s?[0-9])\.gif$ /icon/r$1.gif
}}}
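To see what one of these rules does to a request path, here is a small JavaScript illustration of two of them (`/did/N` and `/pdf/N`). It is only a sketch: `rules` and `rewrite` are hypothetical names, and it ignores mod_rewrite details such as rule chaining and the `[T=...]` flag.

```javascript
// Hypothetical re-implementation of two of the rules above, for illustration only.
const rules = [
  { pattern: /^\/did\/([0-9]+)/, target: "/perl/cs?q=dbnum%3D1%2CDID%3D$1%2Cqtype%3Ddocument:" },
  { pattern: /^\/pdf\/([0-9]+)/, target: "/perl/cs?q=dbnum%3D1%2CDID%3D$1%2Cqtype%3Dpdf:" },
];

// Return the rewritten path for the first matching rule, or the path unchanged.
function rewrite(path) {
  for (const { pattern, target } of rules) {
    const m = path.match(pattern);
    if (m) return target.replace("$1", () => m[1]);
  }
  return path;
}

const rewritten = rewrite("/did/42");
console.log(rewritten); // /perl/cs?q=dbnum%3D1%2CDID%3D42%2Cqtype%3Ddocument:
console.log(decodeURIComponent(rewritten.split("?q=")[1])); // dbnum=1,DID=42,qtype=document:
```

Decoding the query shows the pattern shared by most rules: a comma-separated list of key=value pairs ending in a `qtype` that selects the cs action.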
John Carroll, PSU
1980 - breaks off from hfe as separate discipline (hfe on control and command systems; engineering)
things to look up: activity theory, activity domain, affinity task, activity awareness
* phenomenon of usability, measurement and management methods, populations studied
* user experience (not just performance; but emotion and presence; funology)
* creativity
* community / cscw - collective efficacy
* scaling up - multilevel simulation models
* elderly population and workforce - the world is ageing; intergenerational teaming;
ui techniques and technologies
* web 2.0 / semantic web / collaborative web - look at bridgetools.sourceforge.net
* multi-display workstations / also interactive displays - allocation of display area to different displays
* speech and gesture rec / nlp and vision for gesture recognition
* multiscale GIS - discrete change in resolution -> rough search at the global level and accurate search at the local level
* recommender systems / especially in scholarly dl tasks
hci has moved beyond the office
* games / tools for making games -> visual and audio effects
* knowledge management / knowledge is unshared by organizational changes / dealing with information overload / just in time knowledge
* collaboratories - gary olson olsen? univ of michigan?
* community informatics - internet augmented proximal communities or communities of interest
* health informatics - temporal information is not explicitly used; team collaboration
* ubicomp - personal devices / always on & connected and spatially aware
* abowd and mynatt ubiquitous computing in tochi
** gestural methods
** no explicit tasks
the issue of contexts
* ui/interaction focus vs studies of work in situated areas. (SOUPS in privacy/security)
activity awareness
* sharing activities (mid level) versus status information (low level)
* regulating shared praxis
* applications include project management, emergency project management
* shared editing in
closing remarks:
* "easy to use" changed to fun, accessible, provocative UIs
* one person changed to group, organization
* workstation changed to ubicomp devices
Kenneth Kwok (DSO)
metacognition of systems (awareness)
- reinforcement learning
- robust to surprises
cyc knowledge base
- awareness on multiplatform multi machines?
bayesian framework for long-term knowledge "fragments"
dynamic reasoning demo - adaptive machine learning port to other scenarios
SPADE for discourse
Minipar for dependency relationships
fusion of both for srl
JHU - Tim DiLauro -> Alex
SDSU - Reagan Moore
UCSD
Scripps Oceanography Center - Steve Miller
Lamont-Doherty (Columbia University)
National Climatic Data Center (Bruce Baxter)
National Geophysical Data Center (Ted Habermann)
World Meteorological Organization (WMO/UN)
leads World * Data Centers
Sloan Digital Sky Survey
UNAVCO (GPS; Chuck Meertens)
IRIS (Seismology; Tim Ahern)
NCAR Community Data Portal (Dan Middleton)
Virtual Solar Observatory Network (Peter Fox)
Mandar Mitra
Which piece of information to use?
* R Precision correlated with MAP
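To make the two measures concrete, here is a minimal sketch of R-precision and (per-query) average precision, the quantity that MAP averages over queries. The function names and the toy ranking are my own illustration, not from the talk.

```python
def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy data (hypothetical): five retrieved documents, three relevant ones.
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

rp = r_precision(ranking, relevant)        # 1 hit in the top 3 -> 1/3
ap = average_precision(ranking, relevant)  # (1/2 + 2/4 + 0) / 3 -> 1/3
```

MAP is then just the mean of `average_precision` over all queries in the test collection.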
Cranfield (Cleverdon et al 60s) collection methodology
Extending CLEF outwards
GeoCLEF - localized search
ImageCLEF - image search
Spoken document retrieval
FIRE - Forum for Information Retrieval Evaluation
- for Indian Languages
Training data out 2008 May
sub Sep 2008
workshop in Dec 2008
www.isical.ac.in/~clia
chuats group meeting 22 Mar (Shi Rui)
Automatic Image Annotation (AIA)
problem: missing (incomplete) annotations by users
use mixture model, BHMMM.
tried visual features (works well), what about adding text features as well?
Robert Moore
* on word alignment, two stage approach: cheap first stage, refined second stage with additional features
* stage 1 model: added features for backward jumps (count and magnitude)
* (new improved) stage 1 model: rank association scores instead of actual scores, unlinked word features (un aligned words?)
* (new improved) stage 2 model: replaced log prob with log odds (not clear why this helps??)
* model 4: baseline, does very well
not clear whether features help to bring poor alignments to ok ones or good alignments to perfect ones.
seems to be a bunch of tricks without giving us an understanding of what these features are helping at. ugh.
- ask hendra to read? can we get this system for input to the reordering task?
Sooyoung Yoo and Jinwook Choi
introduce rank-group-kX
formula driven by rank position of weight
Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka and Jun’ichi Tsujii
* biomedical NER, general NER already doing okay.
* use semi markov CRF to allow sequential IOB labeling
* but smCRF takes training time proportional to K^^2^^
contribution:
* add features to summarize previous entities or chain information
* use forest/lattice representation to pack features
* use Naive bayes to filter out some nodes? doesn't seem to be that well motivated -- just a shortcut in model training.
Albert Gatt and Kees van Deemter
set reference
expanding the incremental algorithm reiter/dale?
TUNA corpus
spatial and visual cues?
eye track data?
how about a ml version of this?
Erdong Chen, Benjamin Snyder and Regina Barzilay
section and para level model. Does 2 level insertion.
q: (liang huang): hard classification ? a: no aggregate score.
q: how does this simple model do on other non-wikipedia domain?
data available: http://people.csail.mit.edu/edc/emnlp07
Nan Zhou and Gerry Stahl (Drexel)
Virtual Math Teams + mathforum.org
community building
real transcripts of how kids use or need to find
questions: bill kules: will ui be used outside of projects? as ui is quite complex
Rebecca E. Grinter and Leysia Palen
- Published in: CSCW 02
- Relevant to: 5244 session on new media
- Printed, highlighted
16 teenagers were studied for their use of IM. Notable things from this survey include:
* differences between dorm use and earlier, home-based use in terms of connectivity, and thus whether being online is an active decision or not. E.g., for teens at home, connecting is a conscious decision and often really does mean that the person is available to chat. The study relates this to the use of customizable status strings for showing unavailability.
* Blocking a person makes the blocker appear "unavailable", which is indistinguishable from actually not being available. This has social implications: it can be used to shut unwanted people out of a clique.
* IM text is easily copied and forwarded, like email, forcing chatters to be aware that their conversations might be forwarded outside of the chat. Teens prefer the phone or other harder-to-track technologies for sensitive correspondence.
* The IM interface can be improved, as several participants wrote in the wrong IM window, sometimes causing embarrassment.
* Used by teens to gossip, coordinate events (especially helpful here) and to do collaboration (on schoolwork).
* Email, phone used to coordinate IM sessions in some way.
* Somehow complementary to SMS. IM used more in the US where SMS isn't widely used.
Wee Ser NTU
collect speech/voice acquisition using camera/microphone array
twin microphone 3d noise cancelling
(working with li haizhou)
multimodal interface with scene activity analysis
Robert Sanderson and Paul Watry
python based
iRODS
(collaborator Ray Larson)
put grid infrastructure to work on DL apps: IR, DM and Text Mining
see overview diagram of architecture in pg 76 (jcdl 2007)
Ingest processor to convert text to xml for document store
record processing for xml processing
process records to create terms
allow for incremental processing of information (for support of data mining)
''genia'' give linguistic stemming rather than porter stemmer
{mesh|www}.cheshire3.org
Ah Hwee Tan
adaptive resonance theory (ART)
Plasticity Stability dilemma
- add in user preferences and other pieces of information
leslie kaelbling
knowledge selection
dynamic rules - domain adaptation
michael collins
structured prediction problems
- nlp to logical form
tommi jaakkola
- big model inference
- optimality error and computational
madhu sudan
- coding theory [Juba & Sudan]
Trevor Darrell
lipreading
appearance variation
caltech 101 dataset
- similar to shih-fu's work? part recognition
open content alliance (OCA) - alternative
warc = arc + other metadata (resources aside from just responses)
warc file format = concat warc records
q: sitemaps?
q: hidden web?
Kahle => "kale"
* community web archiving -
* youtube videos -
web archiving for partners - what's a partner?
- contract crawl 300+ URL - LoC
- archive-it collections (smaller)
1 5 min
- greetings and why apply?
2 10 min
- job scope -
- recommender systems
3 10 min
- ajax
- iis vs apache
- python: Questions about map/reduce/sum/etc family of functions
- unit test and integration test
- svn vs git
4 5 min
- questions and pay
- what do you want?
Welcome. It's now <<today>>.
This TW is out of date -- but kept online for archiving reasons.
You might want to see the more recent versions.
-M
John Willinsky (UBC)
journals that are free - are not the answer
need to handle context encoding - need to see related work
* from newspaper to research articles that need to be free
* from free journals to newspapers (layman)
Laypeople who come to the DL arrive already motivated to use it
Lead to new economic model for knowledge
Stanford encyclopedia for philosophy
* increasing useful metaphors
bureaucrats need to build policy very fast (20 minutes?)
* use open access articles instead of calling last professor at local univ.
questions:
*christine borgman: data needs also to be free. how can we make open access to data as a recognizable and rewarded activity?
* andreas paepcke: how can we make the data free but stay protected from lawsuits?
* dan russell (google): how can we ensure that disinformation doesn't become the new history? Ans: availability and context don't lead to it; let the reader decide
* ed fox: how to take out laziness of people to make it visible? Ans: patience
"first" top most 50 queries worldwide
google - query refinement rate 30% / top level weekly stats for google
33% do one search per week -> less chance to tell people what's going on
00:12 actor most oscars -> 1:15 actor most oscars Academy (missed visual cue)
visually literate user of an interface
ethnographic studies very useful
scholar -> why can I sort by date?
inattention vs. low-signal density
3M model (micro, meso, macro)
* lowest level detail - milliseconds (what are people doing at the millisecond level) - eye tracking
** physical cueing -
** "boolean" -> scary (affective mode of search)
** eye tracking -> going back results box -> dissatisfied with search results
** partial learning when reading search results -> need to go back to re-read search results from before
* mid-level - focus group - minutes to days (in control)
** notice something different -> making stuff up. didn't see anything.
** accessibility -> people go back to resources they use
* millions of observations - days to months - sessions analysis
** monitor size indicates page break
** contrastive reading results - exponential decay less noticeable
** idiosyncratic search behavior - exhibit inventory of behaviors
** well known search tasks -> users use google to do teleporting
informational / directed / closed task - find a painting by georges seurat
informational / locate - search for a man's watch less than 100 and water proof
mental model of search engine is quite weak but actually not very necessary to be able to use the search engine
* hard for people to replicate the search results to get a previous search, partially because of ambiguity of natural language
* cannot predict when someone does an actual search
4 levels of knowledge
* pure engine technique -> grammar for a specific search engine
* information mapping -> inverse indexing, wikipedia, keyword frequency, reverse dictionary
* domain knowledge -> medical knowledge
* search strategy -> knowing when to shift/stop, narrow/wide
conclusion: search engines need to create a system that behaves predictably
need to educate users in broadly effective models of research, content and organization
* tele-immersion - human factors problem - initial tech adoption then abandonment - why?
** eye contact prob. even with small jitter throws off (human cog sys requires small jitter for perception)
** stereo vision needed for reaching distance -> jl's soln: cocodex display/capture on moving platform
* haptics - touch and body sensing takes up a lot of area in the brain
** can control virtual limbs by combining multiple inputs from different dofs (and quickly?)
** arm fatigue from glove use need force feedback -> butler robot
* augmented reality 3 grand challenges - 1) sensor 2) synth 3) speed
* 1908 - e m forster
* 2nd life - tree of casual interaction, building avatars and actions can be seen within 2nd life, jl says a great deal of success conditioned on this
phrase based HMM.
should allow jumps to some para and write the sequence of slides and paras simultaneously, where the slide and para sequences have to be contiguous.
sim (s_seq[i..j], p_seq[i'..j'])
then the jump probability: by the jing model.
run viterbi to get it. This would be an unsupervised method.
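A rough sketch of the decoding step, under simplifying assumptions of my own: each slide aligns to exactly one paragraph (word-level rather than phrase-level states), `sim[s][p]` is any positive similarity between slide s and paragraph p, and `jump_logprob` is a hypothetical stand-in for the Jing-style jump model that merely prefers staying put or advancing by one paragraph.

```python
import math


def jump_logprob(distance):
    """Hypothetical jump model: staying or moving forward by one is cheap."""
    return math.log(0.4) if distance in (0, 1) else math.log(0.05)


def viterbi_align(sim, jump_logprob):
    """Return the best paragraph index for each slide.

    sim[s][p] must be > 0; scores are combined in log space.
    """
    n_slides, n_paras = len(sim), len(sim[0])
    score = [math.log(sim[0][p]) for p in range(n_paras)]
    back = []  # back[s-1][p] = best previous paragraph for slide s at paragraph p
    for s in range(1, n_slides):
        new_score, pointers = [], []
        for p in range(n_paras):
            best_prev = max(range(n_paras),
                            key=lambda q: score[q] + jump_logprob(p - q))
            new_score.append(score[best_prev] + jump_logprob(p - best_prev)
                             + math.log(sim[s][p]))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(range(n_paras), key=lambda p: score[p])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy similarity matrix (made up): 3 slides x 3 paragraphs.
sim = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.7],
]
alignment = viterbi_align(sim, jump_logprob)  # [0, 1, 2]
```

Extending this to spans s_seq[i..j] and p_seq[i'..j'] would mean making the states contiguous spans and scoring them with the span similarity above; that part is not sketched here.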
Access keys are shortcuts to common functions accessed by typing a letter with either the 'alt' (PC) or 'control' (Mac) key:
|!PC|!Mac|!Function|
|Alt-F|Ctrl-F|Search|
|Alt-J|Ctrl-J|NewJournal|
|Alt-N|Ctrl-N|NewTiddler|
|Alt-S|Ctrl-S|SaveChanges|
These access keys are provided by the associated internal [[Macros]] for the functions above. The macro needs to be used in an open tiddler (or the MainMenu or SideBar) in order for the access keys to work.
While editing a tiddler:
* ~Control-Enter or ~Control-Return accepts your changes and switches out of editing mode (use ~Shift-Control-Enter or ~Shift-Control-Return to stop the date and time being updated for MinorChanges)
* Escape abandons your changes and reverts the tiddler to its previous state
In the search box:
* Escape clears the search term
chuats group meeting 22 Mar (Zheng Yan-Tao)
abbrev NKD
NKD relationship is transitive and symmetric
but computationally too inefficient
temporal window to cut down keyframe comparison to window w
sol'n: use color autocorrelation, don't use viewpoint or object pose changes because these change easily.
sol'n: use asymmetric hierarchical k-means then use SIFT keypoints (more expensive).
Says might be helpful to do index
k-means filters out noise (filters for precision -> results in high precision but low recall).
Frederick G. Kilgour
JASIST 52(14):1203-1209 2001.
- Available through LINC
- Printed and Filed
- Relevant to: known item query project
Part of a long series of (I think six) articles concerning retrievability of book titles in OPACs using various approaches. This constitutes known item searches. Kilgour and his colleagues are trying to identify and prescribe a useful pattern to use to perform known item query searches.
This paper redoes an earlier experiment when Kilgour used a normal keyword search to do a retrieval experiment using "surname plus first and last title words (not including stop words)" to retrieve books. The main finding of the paper suggests that in 98% of the cases in which a monograph (single author's work) is sought the surname + title word retrieves the item's record (if it does exist in the catalog / database).
In this later experiment, Kilgour uses limits on the fields in which the words can be matched by using MARC field restrictions. The new experiments concur with the first, and do not show additional benefit. As such there is little that is new in the experimental results.
A limitation of the first experiment is that the work only examines monographs. Kilgour addresses this by examining multiple authored / edited works. Surprisingly it is shown in an exploratory experiment that the additional author surnames do not assist in retrieval (Table 5).
Kilgour does suggest, as I have also been musing about, that the search results and the record display (Kilgour uses the terms first and second screens) can be combined in certain cases. It takes only a little bit of inference to see that known item query searches are such cases.
To do: would be good to look at our local LINC transaction logs to see how many of our queries match the prescribed patterns.
N.B.: Kilgour uses the NOTIS system, different than our local INNOPAC and different from Slone's DYNIX study.
Anupam Basu (Kharagpur)
NLP/Speech group: Srinivas Rao, Pabitra Mitra, Sudeshna Sarkar
Linguistic perspective on pronunciation model but not yet computationalized
rule based systems for nl tts
nl interface for cerebral palsy patients
basically ask the patient to construct interlingual / karaka
corpus creation and annotated for emotion
1000 sentences annotated using fuzzy classification from volunteers
sents drawn from times of india
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean
stupid backoff model - not context sensitive and also not a probability, don't normalize
this gives you easier space/time complexity to be used in map/reduce
batched n-gram requests
rw: emami icassp 2007
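A minimal sketch of the stupid backoff score as I understand it from the paper: use the raw relative frequency if the n-gram was seen, otherwise back off to the shorter context and multiply by a fixed factor (0.4 in the paper). No discounting and no normalization, so the result is a score, not a probability. The helper names and the toy corpus are my own.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff factor used by Brants et al. (2007)


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Unnormalized score S(word | context)."""
    context = tuple(context)
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the earliest context word
        penalty *= alpha
    return penalty * counts[(word,)] / total_tokens


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)
n = len(tokens)

seen = stupid_backoff("cat", ["the"], counts, n)            # 1/2
backed_off = stupid_backoff("mat", ["sat", "the"], counts, n)  # 0.4 * 1/2
unigram = stupid_backoff("sat", ["the"], counts, n)         # 0.4 * 1/6
```

Because each lookup only needs raw counts, the counts can be sharded and the requests batched, which is what makes this convenient in a map/reduce setting.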
large margin: tsochantaridis et al. 2004, 2005
soft margin svm
maximize margin between best and 2nd best
auto classification
word fidelity plus brevity penalty (or penalty for being too long)
mcdonald 2006 as competitive baseline
q: why not crf?
a: better to max 1-best for compression
q: why care about target side tree features?
a: better features
q: large margin, but forest based?
Malvina Nissim
new / given
then add ''inferrable'' based on previous work
* baseline was human crafted decision tree
* their decision tree system does better but looks like that it can be improved much more
Rada Mihalcea, Carmen Banea and Janyce Wiebe
subjective/objective distinction
using multilingual
general inquirer, sentiwordnet
Youzheng Wu, Ruiqiang Zhang, Xinhui Hu and Hideki Kashioka
answer selection
sim between answer snippet and question
ml-based, unsupervised method
web as corpus style
candidate extraction -> clustering web search results -> classification (all done online?)
clustering by candidate
reinforcement by secondary search query to retrieve more data
use svm classifier using 1) sim based features 2) NE type features 3) question dependent n-gram window features to do the classification
second stage of google search seems to help a lot
q: any idea of when second stage google search helps even more? when fgs doesn't return lots of results?
q: do you control for answers present in TREC itself?
q: (rebecca dridan) what about snippets with multiple assignments?
q: (jason eisner) how to integrate syntax?
q: different training data for different questions?
Dell Zhang and Wee Sun Lee
see lafferty and zhai 02. background sep
treat doc as gen from 2 diff GMs. coin toss unobserved.
need to compare versus standard eval of webkb.
q: hmm modeling of adjacent words?
Le Song (NICTA SMLP, Univ. of Sydney)
key idea: kernel on labels not just on data
KL div (MI) not commutative, introduce Hilbert space embedding of distribution instead
measure sim using mean map in RKHS rather than using MI|KL div
can measure statistical independence
Use P(x,y) in mean map then P(x)P(y) in mean map and measure their difference.
when joint k((x,y)(x',y')) factorizes to k_x(x,x')k_y(y,y') then standard unsupervised problems
* feature selection
* clustering, etc.
then delta is equal to cross-covariance operator Hilbert Schmidt norm
Feature selection algo
* find feature that is most indep to the labels, remove this feature, repeat
* K_x is kernel on instances, K_y is kernel on labels
says that most prev work use linear K_y kernel, but could use others, e.g., K_y= gaussian kernel gives distance as real so can do regression
Clustering
* no label! Have to generate labels yourself. y* = argmax tr K_x H K_y H subject to constraints on y
* algo: K_y = Pi A Pi^t
* k-means uses diagonal A
* use tensor product of two cost matrices at 1st and 2nd level to produce hierarchical or manifold clustering (this part isn't so interesting)
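A small sketch of the dependence measure behind these notes, the biased empirical estimate tr(K_x H K_y H) / (n-1)^2 with H the centering matrix. Written in plain Python for 1-D data; the Gaussian kernel, sigma = 1.0, and all function names are my own assumptions for illustration.

```python
import math


def gaussian_kernel_matrix(values, sigma=1.0):
    """Kernel matrix k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for 1-D values."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
            for a in values]


def center(K):
    """Return H K H, where H = I - (1/n) 1 1^T."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + grand for j in range(n)]
            for i in range(n)]


def hsic(K_x, K_y):
    """Biased estimate tr(K_x H K_y H) / (n - 1)^2."""
    n = len(K_x)
    Kc = center(K_x)
    # tr(H K_x H K_y) equals tr(K_x H K_y H) by cyclicity of the trace.
    trace = sum(Kc[i][j] * K_y[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_dependent = [2 * v for v in x]
y_constant = [1.0] * len(x)

K_x = gaussian_kernel_matrix(x)
dep = hsic(K_x, gaussian_kernel_matrix(y_dependent))
const = hsic(K_x, gaussian_kernel_matrix(y_constant))
```

Here `const` comes out as zero up to rounding (a constant label carries no information about x), while `dep` is clearly positive. The backward-elimination feature selection in the notes would repeatedly drop the feature whose removal keeps this value highest.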
lee wee sun
Inference model for ising models
pomdp
/***
|''Name:''|LegacyStrikeThroughPlugin|
|''Description:''|Support for legacy (pre 2.1) strike through formatting|
|''Version:''|1.0.2|
|''Date:''|Jul 21, 2006|
|''Source:''|http://www.tiddlywiki.com/#LegacyStrikeThroughPlugin|
|''Author:''|MartinBudden (mjbudden (at) gmail (dot) com)|
|''License:''|[[BSD open source license]]|
|''CoreVersion:''|2.1.0|
***/
//{{{
// Ensure that the LegacyStrikeThrough Plugin is only installed once.
if(!version.extensions.LegacyStrikeThroughPlugin) {
version.extensions.LegacyStrikeThroughPlugin = {installed:true};
config.formatters.push(
{
name: "legacyStrikeByChar",
match: "==",
termRegExp: /(==)/mg,
element: "strike",
handler: config.formatterHelpers.createElementAndWikify
});
} //# end of "install only once"
//}}}
The lib directory contains the configuration information for customizing cs in the Config/ subdirectory.
When a query is done to cs, it invokes the QueryPM module to do the bulk of the work.
Athanasios Kehagias, Fragkou Pavlina, Vassilios Petridis
EACL 03
- Also see: http://citeseer.ist.psu.edu/580132.html, and http://acl.ldc.upenn.edu/eacl2003/html/main.htm, and their Int'l J of Intelligent Systems paper
- Relevant to: text segmentation, partition product models.
- Printed, highlighted and filed
A linear segmentation approach that takes into account within-segment cohesion (they call this homogeneity) as well as global information about segment length. They characterize global segment length as a normal distribution with parameters (mu, sigma), and use a balancing parameter, gamma, to weight the two sources of evidence.
They experimented with Choi's dataset (who consolidated much earlier work on segmentation) and use Beeferman's Pk to rate its success. Note that Pk penalizes segmentation mistakes near true boundaries much less than other mistakes (definitely a good thing).
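For reference, a minimal sketch of Pk as I understand it: slide a window of width k over the text and count how often the reference and the hypothesis disagree on whether the two window ends fall in the same segment. Segmentations are given as lists of segment lengths; the function names and the toy example are mine, and k defaults to half the mean reference segment length.

```python
def segment_ids(lengths):
    """Expand segment lengths into one segment id per unit (word or sentence)."""
    ids = []
    for seg, length in enumerate(lengths):
        ids.extend([seg] * length)
    return ids


def pk(reference, hypothesis, k=None):
    """Fraction of width-k windows where reference and hypothesis disagree."""
    ref, hyp = segment_ids(reference), segment_ids(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")
    if k is None:
        k = max(1, round(len(ref) / (2 * len(reference))))
    windows = len(ref) - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(windows)
    )
    return errors / windows


reference = [5, 5]                       # true boundary after unit 5
perfect = pk(reference, [5, 5], k=2)     # 0.0
near_miss = pk(reference, [6, 4], k=2)   # 0.25, boundary off by one
far_miss = pk(reference, [9, 1], k=2)    # 0.375, boundary off by four
```

The near miss is penalized less than the far miss, which is the behaviour noted above.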
As this approach has a model component that relies on the global text segment length information, it is important to see whether its performance could be an incidental result of using a training/testing corpus that is rather uniform. From the data in Table 1 it appears that this might be the case, as the average segment length is quite uniform in sets 1, 2 and 3.
From the results, the optimal gamma is set quite low, between .08 and .4, indicating that the segment length model is used as a refinement to the stronger cohesion/homogeneity component.
Two other results of the paper are interesting and need to be investigated in more depth. One is that the generalized density (parameter r) seems to improve performance. Why this might be isn't immediately obvious to me. Secondly, they note that the segment length is better modeled by words rather than sentences. This intuitively makes sense to me, but again, it's not obvious why sentences perform significantly worse.
Finally, the authors end with a memorable sentence which should be considered: Choi uses local optimization of global information, and Heinonen uses global optimization of local information.
Relevance:
- how do we adapt this segmentation algorithm to do hierarchical segmentation for documents on the web?
- what exactly is the role of a PPM in the algorithm?
This directory holds the logs of transactions to the cs system. They are written by various components of cs, which append timestamped entries to the files. Important logs (the ones that happen to be appended to often) include:
* LogCiteError
* LogCiteseer
* LogCSD
* LogEvent
* and others, list unfinished.
Rajeev Sangal, IIIT-Hyderabad
dependency analysis good for BI, in comparison to standard chunking / parsing
Paninian framework grammar for Hindi other Urdu/Sanskrit langs
karaka (demand) framework -> sim to dependency parsing but closer to SRL (but SRL perhaps domain specific)
predicate gives demands on args
sometimes case is marked in morphology, but case markings are ambiguous
they solved using CSP type searching?
use tuple (perhaps RDF?) for doing dependency parsing
empirical observation of practical svm solver gives O(n) - O(n^2.3) - but calculates approximate solution
qp problem with certain conditions can be formulated as a minimum enclosing ball (MEB) problem
including lots of svm type problems (reranking svm, lagrangian svm, etc)
''core set'' used to calc MEB.
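To get a feel for core sets, here is a sketch of the simple Badoiu-Clarkson iteration for an approximate MEB in Euclidean space: repeatedly pull the center toward the current farthest point with a shrinking step. This is a swapped-in textbook method, not necessarily the solver from the talk (which works in kernel feature space); the function name and the toy points are my own.

```python
import math


def approx_meb(points, epsilon=0.1):
    """Approximate minimum enclosing ball.

    Returns (center, radius, core_set); the core set is the set of farthest
    points visited, which alone determines the center.
    """
    center = list(points[0])
    core_set = [points[0]]
    iterations = math.ceil(1 / epsilon ** 2)
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        if farthest not in core_set:
            core_set.append(farthest)
        # Step toward the farthest point with step size 1 / (i + 1).
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius, core_set


# Toy data: corners of a square plus its midpoint; the exact MEB has
# center (1, 1) and radius sqrt(2).
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
center, radius, core_set = approx_meb(points, epsilon=0.1)
```

The point of the exercise: the number of iterations (and hence the core-set size) depends on epsilon, not on the number of points, which is where the near-linear scaling comes from.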
Trevor Cohn and Mirella Lapata
for low density languages
use pairs of bitexts (europarl, un docs, etc.) to help in this.
very similar to a few other works this year (utiyama and isahara 2007, callison-burch et al. 2005, 2006).
helps to get better coverage when paired with standard translation model as well.
related language families work best.
q: what smoothing model do they use for dealing with zeros in the log linear model?
q (): why did interpolation do that much better than mean?
CiteSeer
CRFs
SlideSeer
[[2008|http://www.comp.nus.edu.sg/~kanmy/wiki/knmnynWiki08.html]]
[[Public|http://www.comp.nus.edu.sg/~kanmy/wiki/knmnynWiki.html]]
[[(M:)|file:///M:/public_html/wiki/knmnynWiki.html]] [[(Mac)|file:///Users/NUS/Desktop/knmnynWiki.html]]
[[PrivateWiki|http://www.comp.nus.edu.sg/~kanmy/wiki/private/privateWiki.html]]
[[(M:)|file:///M:/public_html/wiki/private/privateWiki.html]] [[(Mac)|file:///Users/NUS/Desktop/privateWiki.html]]
Deyi Xiong, Qun Liu and Shouxun Lin
goal: reordering
* feature based reordering rather than based on phrase model
* integrates phrase and word based models in one
* similar to zens and ney, but without ibm model constraints
* models boundary words as n grams as features in a log linear max ent model
* lexical features helps but collocation features (although plentiful) doesn't seem to help much. need to think about why that is.
* how to extract committee members?
cscr computer science conference rankings
features w/ ml
* # of PC
* average number of pubs
* avg coauthors of PCs
* avg closeness, betweenness -> how affects workshops
http://pike.psu.edu/confranking/ give pc list
Sanda Harabagiu and Andrew Hickl
* systems try to deal with filtering spurious answers
* post processor, validator of final answers
* very strong pre-processing module(s): tons of modules and NE types tagged, (non) factive verb tagging (think, deny, plot)
* used additional corpus derived from the AQUAINT corpus.
* using the textual entailment helps to re-rank (as opposed to binary classification)
* get about 25% and 100% performance boost
* BUG: QUAB Harabagiu et al. ACL 2005 -> Shiren: auto generate q/a pairs
* they didn't use a TE for answer proving before??
That's me. [[academic|http://www.comp.nus.edu.sg/~kanmy]] [[research group|http://wing.comp.nus.edu.sg]] [[personal|http://www.knmnyn.com/]]
Yin Leng Theng et al. (+11 authors)
mobile geography stuff, can geotag, annotate and upload photos
java based as well as web based, added pda version of tagging.
pilot study - perceived usefulness
use Technology Acceptance Model and Task-Technology Fit model (TAM/TTF)
was shown to be useful, and seems to be highly effective given the models used for analysis.
how about control variables?
kids may want cell phone interface, sms interface?
James Clarke and Mirella Lapata
rw: subtitle generation, Vandeghinste et al. 2004
problem: context important - apply to whole doc rather than sentences only
move to discourse oriented model - use centering theory (grosz) use lingpipe to do this.
use lex chains to further do co-ref, basically np chains w/ lex chains
use ILP to do compression
use hard constraints in ILP to ensure modifiers not dropped. see page 5
must have at least one verb.
interesting part: model discourse constraints, keep centers in compression (these are used 68% of time, backed off to lex chains in 19%)
has corpus
not really a summary, just a compression
q: log linear model from mt , similar but introduce hard constraints
q: model
James Clarke, Mirella Lapata
based on word deletion
* supervised , unsupervised (hori and furui 04, charniak turner 2004)
* requires words in the same sequence
* use parallel corpus.
Results:
* decision tree version sensitive to the data, had to port model to other domain since the learned model didn't work at all (rebuilt same sentence or dropped only one word)
* not sure whether riezler et al.'s evaluation measure makes sense.
* grammaticality by parse tree?
evaluating compressions - no bleu, rouge?
perhaps relevant to slide compression, bullet building?
q: kathy: importance of context?
* full blown syntax vs. grammatical relations
* dt: model
* parse trees?
Hans Friedrich Witschel
f(d,q) and f(q,d) =
lm punish doc for absence of rare terms whereas vsm reward presence of rare terms
einat mitkov
Ariel Schwartz, Anna Divoli and Marti Hearst
nakov, schwartz and hearst 04 sigir 2004 bio.
see blunsom & cohn 2006
multiple sequence alignment (MSA)
use crfs - a score
Shingo Ono, Minoru Yoshida and Hiroshi Nakagawa
disambiguation of person names.
Get dataset from Ono. Yes, for Japanese only.
What filtering technique used? pg 344.
Pushpak Bhattacharyya
Rajat Mohanty, M Krishna, S Limaye, Anupama
high precision difficult problem english parsing -> how to claim?
interlingua based on unl (over many international groups)
semantically related sequences
srs produces pairs (like dependency parsing) as well as triples
do this by using charn00 as base and do parse tree mod
srs to unl process requires 4 knowledge bases: eg subcat frames
solving verb structure problems
parser -> parse tree -> srs -> @attribute -> relation -> UNL
believes SRL system cannot be done purely by statistics, hypothesis space too large
CD has several functions:
* F:\google_(map|photo|figure)_retriever: get images from Google
* F:\photo_urop_feature_extractor: textual feature extraction, program imageMetaExtract creates raw.data 1 line per image with text feature vector
* F:\all_vector_gen: all vector gen, java file, calls opencv, produces raw.data (similar file) with 1 line per image with image feature vector
* final vectors: contains finished training vectors of urop, hyp, and urophyp types.
- ''7 jun'' - got prelim version of urop-feature-extractor done in ruby
- ''8 jun'' - got circle, rectangle, corner and line detectors working.
problem: calculating neighbor features in ruby or opencv. what about histogram features?
- ''13 jun'' - connected to Boostexter - but still not the correct number of features
Martin (IDM director?)
1. convergence
2. content-orientation
3. user as producer
media = content + technology (content research + systems research)
Matt Jones (Swansea)
FIT lab - mobile hci, real time systems reactive systems, DL
research fellowships for DLs / HCIs
Allison Druin's international children's storytelling project
- low textual literacy, high visual literacy
- women's groups, school groups, teachers ( to do self-help dialogue )
- arch: lend cameravid-phone. go to central part and share story (projection), push to other phone
- arch: no text menus, use iconic tags
- radio knob as radio ubiq => but failed because not collab device
- collage: collab and compete
- no single user evaluation, won't talk -> do focus group with leaders.
- tags as filter buttons on side
cs.swan.ac.uk/storybank
Mobile Interaction Design, Matt Jones, and Gary Marsden
q: change of scope of electronic stories (when int'l cable tv came in)
q: bluetooth redo ?
q:
Wei Gao and Kam-Fai Wong
finds number of clusters after setting threshold for similarity
summarization application. what about directed graphs?
Depends on thresholds set for similarity.
sort of like SGT in the sense that the percolation allows clustering of distant points iff related by intermediate points.
seems to kill k means in performance. should check on whether comparable with others reported work.
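The percolation intuition (distant points end up together iff a chain of similar intermediate points links them) can be sketched as connected components of the thresholded similarity graph. This is a generic stand-in, not the paper's algorithm; the names and the 1-D toy similarity are my own.

```python
def threshold_clusters(items, similarity, threshold):
    """Group items whose similarity graph (edges >= threshold) is connected."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


def sim(a, b):
    """Toy similarity on numbers: 1 at distance 0, falling toward 0."""
    return 1 / (1 + abs(a - b))


points = [0.0, 1.0, 2.0, 10.0, 11.0]
loose = threshold_clusters(points, sim, threshold=0.5)  # 2 clusters
tight = threshold_clusters(points, sim, threshold=0.6)  # 5 singletons
```

With threshold 0.5, the points 0.0 and 2.0 land in the same cluster only because 1.0 links them, and the number of clusters falls out of the threshold rather than being fixed in advance as in k-means.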
30-60% wer in tele conversations, oov
with only asr about 40% text class error
affects ir less: 7-9% add error
classification: 20% error rate
idea: use repeated recog results to extend training...
call helpdesk routing -> topic taxonomy with associated QAs and actions
use simple n gram clustering of utterances, cull low freq clusters, arrange cluster in sizes to build hierarchy.
A*STAR - trade and industry
MOE - funding univ.
NRF - pm office - coordinate research
Next call - early-mid nov, need 5-pg white paper
innovative, use knowledge created by R&D
office is small 12-13 pax, outsource many things
2010 warren campus - CREATE centers
SAB meets 1/yr
5 strategic thrusts
1. intensify R&D spend
2. ID strategic R&D
3. balance basic and applied
4. provide resources for private sector (2/3rd)
5. strengthen linkage between public and private
Nat'l Innovation System
1. knowledge creation
2. knowledge diffusion
3. knowledge usage
Strat R&D areas (big$$$, top down)
Env and Water tech
IDM
Biomed Phase 2
NRF Research fellowships (small$$$$, bottom up)
CRP - seed fund areas for high-impact research. Funding cross-project
programs. Whole problem and not parts/aspects of problem.
Criteria
* excellent science
* maximum societal and economic benefit
** use inspired - eg. biomed
** private/public
* track record for large grants
** coherent program not disparate projects
* program director
observations from first round
* active participation not just big names
* no funding overseas
* industry to be actively involved, up 70% costs ok
declaration of other funding
* not what is hypothetical
don't send to multiple funding agencies, they will cross check
funding
* 20% for overhead
* 10% for IP and tech transfer -> given to tech transfer office?
submission budget
* just submit direct costs cap of 10M
* 2 pg cv needed
IP Policy
* Foreground
** owned by organizations (NOT inventors)
** no assignment to 3rd parties in general
** managed and commercialized from Singapore
* Background
** free access to all collaborators if necessary
NRF
* star search??
to be hired into NUS/NTU/SMU?
Scenario-based call - specific to SG.
CREATE - Campus for Research Excellence and Technological Enterprise
bring other research institutions from other inst. to network w/ local
Research Centres of Excellence
Luke Zettlemoyer and Michael Collins
CCG from previous work (by same 05)
allow relaxation of word order
relaxation is counted up as soft constraint that ends up in penalty feature that goes into the log linear model
see also wong and mooney 2007 in acl
with really simple penalty model - can this be improved?
q (bob moore): data - the data is all context independent - automated with manual checking
q (): not lang indep -
q (fernando pereira) - higher order identification ?
q (jason eisner) - lattice parsing?
q - why crossed relaxation rather than creating new cats?
q - why word order in the first place?
why can't we get rid of CCG constraints and learn from data with pref and distortion model?
Marco Pennacchiotti and Patrick Pantel
extending a lexical resource (e.g., wn) with surface relations. surface relation fillers don't have senses tagged.
* algo 1: fix one of the two terms as anchor and search for targets that are similar to the original target
* algo 2: generalize relation anchors in wordnet then clustering?
* have to disambiguate between attachment points
* seems to be a difficult problem
*BUG: look up and read Espresso paper (pantel et al.)
* Video archives: with lectures and conference talks and tutorials
* Linked Anthology
* Extended anthology: with tech reports that are copyright cleared
* Open Access CL
Currently, this is divided into three subdirectories (as the original has overflowed). Within each are separate numbered directories with different files.
- pdf
- ps.gz
- txt: these are not normal text files.
Kazunari Sugiyama and Manabu Okumura
idea: regulate centroid movement wrt seed pages
get slides from them
phonetic mapping pmm - source channel model
time sensitivity of chat text - changes quickly within 1 yr
dialect sensitivity in Chinese: chat language P(C) * chat normalization P(N|C) * phonetic chat model P(P|N,C)
train both on time sensitive data and standard data.
Tetsuya Sakai, Tatsuya Uehara, Kazuo Sumita and Taishi Shimomori
* cue phrase detection: semantic role analysis (SRA) from RIAO 2004.
* mod'ed text tiling for topic seg in closed captions.
* simple evaluation so far
Barney Pell
nl search vs keyword metaphor of auto vs manual shifting
architectural considerations:
* unsupervised methods are inaccurate for even shallow syntax
* supervised methods require complex? and costly? data
* new labeled data
says that syntax is largely common across all domains
- I disagree. mobile search / im / other chat is more semantics based and not syntactically composed.
- what about bootstrapping? and why is it costly and complex?
ambiguity treatment is critical
- avoid cascaded failures,
=> to me, this means that ambiguity resolution must be probabilistic -> tractable
- evaluate how reporting confidence helps
they do parsing of all of the web into a KR and then do match directly unlike most QA.
- open platform apis in development, when will they be ready?
www.powerset.com
- powerlabs / sept 2007
working with lots of good univ. people: hearst, mccallum, andrew ng, etzioni
resource query - white house ZIP code example - target substituted by instance, also keyword not actually present in document.
inversion helps for proving nil answers.
use constraints to - delta a score. does the delta take into account certainty of answer? flexibility of answers?
asking questions in enterprise search with little corpora?
dossier method looks like previous work at ISI?
question inversion like LM source / target in MT
effects of top rank in dt? probabilistic framework for all cascading details?
You can now link to [[external sites|http://www.osmosoft.com]] or [[ordinary tiddlers|TiddlyWiki]] with ordinary words, without the messiness of the full URL appearing. Edit this tiddler to see how.
You can also LinkToFolders.
Ming-Hung Hsu, Ming-Feng Tsai and Hsin-Hsi
* ConceptNet from MIT Media Lab
* WordNet better for expansion
* ConceptNet better for topic diversity
* sakai: formula for discrimination: doesn't diff between queries with lots/few query terms.
Contains the proc query. It routes queries by type (called {{{qtype}}}), where types are described below. Some of these actions can be seen in the links that citeseer provides, in the url rewriting module.
These work in conjunction with the server's HttpdServerConfiguration to rewrite the urls as a query (usually back to the {{{cs}}} script).
We can see the two most common types of page accesses after a front page or redirected search. They appear quite often in the search logs kept by cs, in the LogCiteseer file.
They are:
{{{
* document: an article detail page.
* details: article details page
}}}
Let's look at a document fetch, in the DocumentQueryWalkthrough.
Other types of queries to the QueryPM module include:
{{{
* citation
* makecitecluster
* similar
* homepage
* article: the article entry page
* track
* trackdid
* author
* highlight
* deletehighlight
* trackgid
* deletetrack
* listtrack
* getemail
* setemail
* newuid
* newuidnoc
* showhighlighted
* relatedgif
* collab
* cocitation
* active
* textsim
* check
* rawtext
* rawcitations
* cached
* pdf
* ps
* page
* correctdid
* correctabstract
* correctcid
* correctgid
* addcomment
* comment
* samesite
* dcache
* updatecache
* docadmin
* gid
* comment
* showdoc
}}}
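A minimal sketch of how routing by {{{qtype}}} could look, as a dispatch table. It is written in JavaScript for illustration only; the actual QueryPM code is not reproduced in these notes, and the handler bodies, the {{{did}}} parameter and the function name {{{routeQuery}}} are hypothetical.

```javascript
// Hypothetical dispatch table keyed by qtype; only three of the types
// listed above are filled in, and the handler bodies are placeholders.
const handlers = {
  document: (q) => "document page for " + q.did,
  details: (q) => "details page for " + q.did,
  cached: (q) => "cached copy of " + q.did,
};

// Route a parsed query (e.g. from a rewritten url) to its handler.
function routeQuery(query) {
  // Own-property check so inherited names like "toString" are not routed.
  const known = Object.prototype.hasOwnProperty.call(handlers, query.qtype);
  if (!known) {
    return "unknown qtype: " + query.qtype;
  }
  return handlers[query.qtype](query);
}

console.log(routeQuery({ qtype: "document", did: 42 }));
```

Keeping the table as data means a new qtype is one more entry rather than another branch in a long conditional.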
Bryce Allen
JASIS 40(4):246-252, 1989
Describes a study where subjects were to read a paper carefully and then answer a series of questions on the paper at a later date. The questions ("cues" in the paper) were designed in three different forms: structural, bibliographic and free-response. The cues were then mined for keywords and matched with the paper's actual keyword and index terms to assess their suitability for known-item recall.
Bibliographic cues produced the shortest but most targeted entries. Allen suggests that longer cues need to be processed to remove unnecessary and (potentially) non-matching words. Longer cues may be more useful when the search is transferred to a human intermediary.
Note that the known-item retrieval is implicit, the test wasn't actually done with a retrieval system, just evaluated on the basis of cosine similarity.
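The cosine-similarity scoring mentioned above can be sketched as follows. This is an illustration only: the tokenization, the raw term-frequency weighting and the sample strings are assumptions, not details taken from Allen's paper.

```javascript
// Build a term-frequency map from a string (lowercased, split on non-letters).
function termFreq(text) {
  const tf = new Map();
  for (const tok of text.toLowerCase().split(/[^a-z]+/)) {
    if (tok) tf.set(tok, (tf.get(tok) || 0) + 1);
  }
  return tf;
}

// Cosine similarity between two term-frequency maps (0 if either is empty).
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [term, w] of a) {
    na += w * w;
    if (b.has(term)) dot += w * b.get(term);
  }
  for (const w of b.values()) nb += w * w;
  return (na === 0 || nb === 0) ? 0 : dot / Math.sqrt(na * nb);
}

// Made-up cue and index terms, for illustration only.
const cue = termFreq("study of known item recall in catalog search");
const indexTerms = termFreq("known item search recall");
console.log(cosine(cue, indexTerms).toFixed(3));
```

A long cue with many non-matching words inflates its own norm and lowers the score, which is consistent with the suggestion that longer cues need pruning first.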
Hari Sundaram
1. acquire - through partial observations - group distribution (scent trails)
2. represent - as unstable context-sensitive, evolving
3. learnable - long tail makes distribution un-learnable.
David Marr - 3 levels of representation
Brett
* Matching Ref + Cited
* two other chapters
** Scope of citations
** Multidoc sum of citations
** context / UI
** PDF engineering: font encoding, text extraction in pdfbox
** spelling correction for pdf
Dataset
* degree of processing
** preempted by Linked Anth proposal: D0: text D1: link data D2: tools D3: webservices
** D0: PDF + text extracts using pdfbox, drive file listings using mtjoseph
** D1: mtjoseph's C->D, bpowley's R->C, min and isaac's ref segmenter
* size of corpus
** ACL corpus timestamped or using file list?
** what corrections?
** select subsets: by popularity by cites, by any cite, by presence of some figure/table, by technology to produce, by technology failure (garbled fonts)
LREC paper
* Intro (Drago, Steven)
* Data - phase D0 D1
* Plans - phase D2, D3, replacing D0 if needed
* Eval
* Call for participation/arms
* Related Work
* Conclude
Tools
* Hal Daume - recommender
* Dale's student's - HAC clusterer, UN corpus complex reference
* David Yarowskys - similar paper
* our NUS projects: slides, ui/framework, keyword/tagging, citations string
Melissa Cefkin, IBM Almaden
Design anthropology
Lucy Suchman
Ethnographic Fieldwork method: in situ, actual activity, from member's perspective (especially the context of their work ''situated'')
- observation
- interview
- self-report
- artifact inventories and analysis, thinkaloud
ex. cancer society website
the premier provider of cancer-related information
- use ethnographic research methods
* found that 1) daily life with cancer not covered 2) was not covered in website
** daily life meaning: when to tell people, dealing with hair loss, etc. daily concerns
** understand user before worrying about the website
** fieldwork 6 wks; site redesign 10wks. user-centric view
* services co-produce value
philip has oracle database with registration info hosted on sunfire, suna
use cron job to dump rel fields from oracle to local machine
use updatedb.py to place into mysql db accessible from RoR
scripts to manufacture latex versions of badges, attendance, attendee list, etc.
Todo: map paper accepted to registered attendees.
dba: sidp - production
sidt - testing
lots of generic fields to be mapped by registration chair
Matthias Jarke, X. Tung Bui and John M. Carroll
- Available at: Requirements Engineering (1998) 3:155-173
- Printed and Filed
- Relevant to: Scenario, Scenario Management
Examines three case studies of the use of scenario management: HCI, RE, and SM. Not read in detail. Posits four different views (facets) of the use of scenarios: form, content, purpose and life cycle views.
Daniel Marcu's invited talk (Tuesday)
3 types of normal search
* dijkstra-knuth
* dp-like-viterbi
* greedy - hill climbing - look for contour of search space (especially smooth search space; e.g., syntax based MT)
pcfg binarization to reduce search space.
evaluate search space with task not equivalent to search algorithm.
input: bow
output: best sequence, ngram lm, syntax based lm (collins 97)
''hendra'': shuffling bag of words: model stress-testing
searn
* optimal policy -
* learning reduction -
learn and search as phd topics?
* bob - designing search spaces.
Bill Freeman
use context for faster recognition times
high resolution from low resolution
microscope for motion - motion magnification
Zhenmei Gu and Nick Cercone
* split task into segment determination and segment labeling
* use older HMM context as done in Freitag and McCallum
* document extraction redundancy: ??
Yusuke Miyao, Tomoko Ohta, Katsuya Masuda, Yoshimasa Tsuruoka, Kazuhiro Yoshida, Takashi Ninomiya and Jun’ichi Tsujii
* HPSG parser
* region algebra for representing queries.
* detailed error analysis
* much better performance, but it's not clear which helps (
Li Haizhou
2:02 - 2:20 - 2:27
Joint Source Channel model as base
interesting: use subset of chinese words for lexicon for transliteration
try using: oracle to give name, guess hard, guess soft
4-grams language model using standard letter perplexity
gender detection - harder in chinese?
- chain gender after language, why? slide 20
- what is connoted by semantics? eisenhower?
* Hu and Xu 2003 Semantic Transliteration: A Good Tradition in Translating Foreign Words. IJ Translation,
Q
- 14 alexandra -> alexander
Qinfeng Shi, Yasemin Altun, Alex Smola and S.V.N. Vishwanathan
problem: insert para breaks in running text
using semi-markov model
decompose features into
looks like a good paper to re-read
Jun Suzuki, Akinori Fujino and Hideki Isozaki
put together hmm + crfs
use hmm for unlabeled data
crf for labeled (supervised) data
combine in discriminative framework
first do supervised training of crf fixing lambda
then estimated theta for hmm
re-estimate in hybrid model?
like smith's 2005 lop-crf (log opinion pool-crf)
q (miles osborne): features used the same?
q (fernando pereira): not best crf results?
Ann Devitt, Khurshid Ahmad
not a machine learning alg
does lexical chaining, but I'm not very clear on it at this point.
uses the chaining to weight the terms in a more principled way?
q (robert dale): position and headline extremely influences the work, did you factor this?
q: how does your lexchain alg work?
- in country particular profile for hospital - localized
-
attitudes towards evidence based practice
evidence based practice
implementation of practices
changing of practices
end dec - get filled templates from Siti
tuesday 2-3pm feb 5 - meet with jin and telephone at NUH
siti z - assistant director - managing education and training / evidence based nursing / nursing research
heidi - operations senior manager for nursing - computer projects / hospital management
wong kok cheong - assistant director for nursing - quality / ordering / procurement /
kevin - it manager seconded from NHG
Steps in ss.
* Learn Lucene - sort of done now, got demo partially working. 1.9.1 installed into slideseer account. -- oops now 2.0
* bought ''czppt2gif'' and hooked it in through rdc on ETAP machine - cz seems to freeze for user input after the end of a run. will have to pursue them by email about this.
* Get ''pptextractor'' to interface: partial progress through Win XP Remote Desktop Connection with etap project.
* Learn correspondence rules for matching PPT to PDF - still working on this.
* Build UI - was working on this a while ago.
* Put aligner in
Progress with cs papers in ssSpider:
* 26 May: up to 35000 (that's about 5%)
* 14 June: up to 200000 (that's about 30%)
* 22 June: up to 250000 (about 35%)
* 4 July: up to 380000 (past 50% now)
slideshow - detailed slideViewing
gallery - thumbnail images
galMode - turns on gallery mode (vs slideshow)
* finished doc View
* finished slide View
* random accesses
printView - should be generated by script
doc and slide view - should load js data structures from server.
''Todo:''
(done)
Let's think about the correct data structure.
slideView needs: slide # -> {para set}, slide # -> slide text
docView needs: section (not para; basically set of paras) # -> {slide set}
printView needs vertical sliding nav to major text sections.
Assume m to n mapping
printView needs script to generate slide to paragraph, with document in logical order
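One possible shape for this data, sketched in JavaScript since the views are meant to load js data structures. It assumes the m to n alignment is stored once as (slide, paragraph) pairs and the per-view lookups are derived from it; the names {{{buildViews}}}, {{{pairs}}} and {{{paraToSection}}} and the sample values are hypothetical.

```javascript
// Derive the lookups the views need from one list of [slide, para] pairs.
// paraToSection maps each paragraph number to its section number.
function buildViews(pairs, paraToSection) {
  const slideToParas = new Map();     // slideView: slide # -> {para set}
  const sectionToSlides = new Map();  // docView: section # -> {slide set}
  for (const [slide, para] of pairs) {
    if (!slideToParas.has(slide)) slideToParas.set(slide, new Set());
    slideToParas.get(slide).add(para);
    const section = paraToSection[para];
    if (!sectionToSlides.has(section)) sectionToSlides.set(section, new Set());
    sectionToSlides.get(section).add(slide);
  }
  return { slideToParas, sectionToSlides };
}

// Made-up alignment: slide 1 covers paras 1-2, slides 2 and 3 both cover para 3.
const views = buildViews(
  [[1, 1], [1, 2], [2, 3], [3, 3]],
  { 1: 1, 2: 1, 3: 2 }
);
console.log([...views.slideToParas.get(1)]);     // paragraphs shown for slide 1
console.log([...views.sectionToSlides.get(2)]);  // slides shown for section 2
```

Storing the pairs once and deriving both directions keeps slideView and docView consistent with each other when the alignment is edited by hand.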
all need to have paper title and authors in url
surname/year/titlekeywords(omit leading stop words)
this is supported by thien an's thesis that AYT is most useful form
url should be less than 60 chars
hostname should be http://dn.tld/genre/surname/year/titlekeywords - already 15 chars
should work like quicksilver's string edit distance for keywords and surnames
left with 45 chars for remainder of url
use error document for this instead.
use permalink idea?
use 2-5 letter abbrev for genre/field
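A sketch of the url scheme described above: surname/year/titlekeywords with leading stop words dropped and the path kept within a length budget. The stop-word list, the function name {{{paperPath}}}, the rule of truncating at word boundaries and the example paper are all assumptions for illustration.

```javascript
// Small illustrative stop-word list; a real one would be longer.
const STOP = new Set(["a", "an", "the", "on", "of", "in", "for", "to"]);

// Build "surname/year/titlekeywords", dropping leading stop words and
// adding title words only while the path stays within maxLen characters.
function paperPath(surname, year, title, maxLen) {
  const words = title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  while (words.length > 0 && STOP.has(words[0])) words.shift();
  let path = surname.toLowerCase() + "/" + year + "/";
  for (let i = 0; i < words.length; i++) {
    const next = path + (i > 0 ? "-" : "") + words[i];
    if (next.length > maxLen) break;
    path = next;
  }
  return path;
}

// 45 chars is the budget left after hostname and genre, per the note above.
console.log(paperPath("Doe", 2007, "The Alignment of Slides and Papers", 45));
```

Only leading stop words are removed, as in the note; stop words inside the title are kept so the keywords still read as a phrase.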
default page needs to be printView, as this page can be indexed.
Issues: how to push data to pages? (solved, yay!)
(this is now done) - check url and use url component to compose javascript to dynamically load.
ruby script to create javascript for work.
(this is now done) - printView needs to be a physical page - no metadata in script if possible. Need to rewrite the printView for that.
Priorities:
* context in title
* (done) make script to generate printView from template, with tsv input
* (mostly done - bit buggy) make script to generate js data structure from, with same tsv input
use ppt layout for slideseer
use acrobat layout for slides (left slide headers, right slide text + slides?), slide flow around?
create tools for hand creation now.
assume headers and text correctly extracted and tables and figures bounded
once this is done, go back and create model for alignment.
what alignment processes are good?
must be reversible
learn what words are aligned
for spidering
let's say about 2.2 M urls
5% have presentations?
2.2 * .05 = 110K presentation/paper pairs
1 K per day is 1/3 year
must do more than that
10K per day is 11 days
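The estimate above as a small calculation, so the rates can be varied. The figures are the ones in the note (about 2.2 M urls, a guessed 5% with presentations); the function name is just for illustration.

```javascript
// Days needed to process all paper/presentation pairs at a given daily rate.
function daysToSpider(totalUrls, fractionWithSlides, pairsPerDay) {
  // Round to a whole number of pairs to avoid floating-point residue.
  const pairs = Math.round(totalUrls * fractionWithSlides);
  return Math.ceil(pairs / pairsPerDay);
}

console.log(daysToSpider(2.2e6, 0.05, 1000));   // at 1K per day
console.log(daysToSpider(2.2e6, 0.05, 10000));  // at 10K per day
```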
How to build the UI for annotation?
* via an automatically constructed TiddlyWiki type page? e.g., by saving a standalone HTML page. One TiddlyWiki per page. Text could be downloaded while images reside on server.
Page when loaded on computer could try to sync with server.
KLEIO
- acromine -> to do acronym disambiguation
- Okazaki N and S Ananiadou Bioinformatics 2006
- Tsuruoka Y, McNaught J and S Ananiadou BMC Bioinformatics 2007
text mining applications next: summarization and question answering
hyphenation part or not part of name?
tagging done concurrently with
standard features for crf training
''ask yeefan to look at this'':
term mapping like yee fan's problem of entity linkage
- done using "automatic normalization" - learn rewrite rules like morphological cfg rules
/***
|''Name:''|SparklinePlugin|
|''Description:''|Sparklines macro|
***/
//{{{
if(!version.extensions.SparklinePlugin) {
version.extensions.SparklinePlugin = {installed:true};
//--
//-- Sparklines
//--
config.macros.sparkline = {};
config.macros.sparkline.handler = function(place,macroName,params)
{
var data = [];
var min = 0;
var max = 0;
var v;
for(var t=0; t<params.length; t++) {
v = parseInt(params[t],10); // explicit radix so values are always read as decimal
if(v < min)
min = v;
if(v > max)
max = v;
data.push(v);
}
if(data.length < 1)
return;
var box = createTiddlyElement(place,"span",null,"sparkline",String.fromCharCode(160));
box.title = data.join(",");
var w = box.offsetWidth;
var h = box.offsetHeight;
box.style.paddingRight = (data.length * 2 - w) + "px";
box.style.position = "relative";
for(var d=0; d<data.length; d++) {
var tick = document.createElement("img");
tick.border = 0;
tick.className = "sparktick";
tick.style.position = "absolute";
tick.src = "data:image/gif,GIF89a%01%00%01%00%91%FF%00%FF%FF%FF%00%00%00%C0%C0%C0%00%00%00!%F9%04%01%00%00%02%00%2C%00%00%00%00%01%00%01%00%40%02%02T%01%00%3B";
tick.style.left = d*2 + "px";
tick.style.width = "2px";
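// scale to the box height; when all values are equal (max == min) draw a zero-height tick instead of dividing by zero
v = (max > min) ? Math.floor(((data[d] - min)/(max-min)) * h) : 0;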
tick.style.top = (h-v) + "px";
tick.style.height = v + "px";
box.appendChild(tick);
}
};
}
//}}}
we have a ''hard'' clustering problem. bipartite.
SpectralClustering can it be used on disconnected components (e.g., isolated slides or paragraphs?) - seems so. Can force to be bipartite by connecting all S nodes to P nodes by some default value. but it looks like SpectralClustering will work even on such cases.
It's easy for us to build the Laplacian matrix L. The question is whether an arbitrary criterion can be used with it. Or whether the graph must be changed to cater for it.
we can apply SpectralClustering. However, need to modify cut criterion to guarantee that the nodes in each cut are contiguous. Add some partition function that assigns 1 if partition is contiguous, 0 if otherwise. Or better yet, define some segmentation function on contiguity between edges, taking into account local cohesion and global coherence of the segment. so not a partition function but a function that:
takes a range i,j,k where i < j < j+1 < k and i to k is an originally contiguous sequence, and j, j+1 is the cut edge.
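The 0/1 contiguity indicator suggested above could be as simple as the following sketch. It assumes each node (slide or paragraph) carries its position in the original sequence; treating an empty cluster as non-contiguous is an arbitrary choice made here, and the softer cohesion-based segmentation function is not attempted.

```javascript
// Returns 1 if the cluster's positions form one unbroken run, else 0.
function contiguity(positions) {
  if (positions.length === 0) return 0;
  const sorted = [...positions].sort((a, b) => a - b);
  for (let k = 1; k < sorted.length; k++) {
    if (sorted[k] !== sorted[k - 1] + 1) return 0;
  }
  return 1;
}

console.log(contiguity([3, 4, 5]), contiguity([1, 3]));
```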
CutCriterion included:
minCut = cut(A,B)
normalizedCut = cut(A,B) / volume(A) + cut (A,B) / volume (B) where volume is sum W~~ij~~
minMaxCut = cut(A,B) / max (A) + cut (A,B) / max (B), where max is max W~~i,j~~ within a cluster
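The first two criteria can be computed directly from a weight matrix W. A minimal sketch, assuming a symmetric matrix given as an array of arrays and clusters given as arrays of node indices; volume(A) is taken as the sum of W~~ij~~ over i in A and all j, which is one common reading of the definition above. The toy matrix is made up.

```javascript
// cut(A,B): total weight of edges crossing from cluster A to cluster B.
function cut(W, A, B) {
  let total = 0;
  for (const i of A) for (const j of B) total += W[i][j];
  return total;
}

// volume(A): sum of W[i][j] for i in A over all nodes j (degree sum).
function volume(W, A) {
  let total = 0;
  for (const i of A) for (let j = 0; j < W.length; j++) total += W[i][j];
  return total;
}

// normalizedCut = cut(A,B)/volume(A) + cut(A,B)/volume(B)
function normalizedCut(W, A, B) {
  const c = cut(W, A, B);
  return c / volume(W, A) + c / volume(W, B);
}

// Toy graph: nodes 0,1 tightly linked; nodes 2,3 tightly linked; weak bridge 1-2.
const W = [
  [0, 4, 0, 0],
  [4, 0, 1, 0],
  [0, 1, 0, 4],
  [0, 0, 4, 0],
];
console.log(cut(W, [0, 1], [2, 3]), normalizedCut(W, [0, 1], [2, 3]));
```

On this toy graph the split across the weak bridge scores far lower than a split that separates the tightly linked pairs, which is the behaviour the criterion is meant to reward.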
SpectralClustering still doesn't model the probability of insertion (extra unmapped words)
ji and zha sigir 03 said that you can apply anisotropic diffusion to clarify the sim matrix. do we want to try to do this first? but it will disconnect the components but will be able to knit them back together?
Adrian Yap (DSO)
Keyword spotting in speech
asian language
broadcast news transcription (with NTU CE) / sparse language annotation resources (csail)
Byron Marshall, Rene F Reitsma and Martha N Cyr
TeachEngineering
http://www.teachengineering.com/about.php
Andrew McCallum
Author-recipient-model 2004
- add role
- add timestamps KDD 2006
- discriminative crf
topical n-gram topics
steyvers et al 2005
wallach 2006
- condition with ngram/unigram - as a unit
future work -
apply to lm problems in mt, other nlp
mine rexa data
Welcome - CAnantaram
Overview - GautamShroff
VideoCon with TcsTCO - KAnanthKrishnan
NoisyTextAnalytics - LVSubramaniam (Venkat)
MLinNLParsing - RajeevSingal
LangProcessingForNaturalInterfaceDesign - AnupamBasu
NLAnalysisForSemanticExtraction - PushpakBhattacharyya
HindiGenFromInterlingual - OmPDamani
NLInterfaceToBusinessApps - CAnantaram
BiomedIE - NaveenSivadasan
Keynote - ChrisManning
IREval - MandarMitra
CrossLingualIR - VasudevaVarma
Demos2 - PrasadPingale
* qa opinion tasks / email discussion tasks are still just opinion keyword based (+ small amounts of polarity)
* expert track / use name mentions as hyperlinks to enrich pagerank calc
* legal - baseline use simple ir / manual run does sig better
NUS QA
* event based -> tried temporal relevance (window of time with ext resource expansion)
* Human interest model for entropy of interesting issues
* noisy channel qa for blog + newswire corpus
namespace: security and public interfaces
openAJAX
zip as one file
needs a scope definition
security and other things as an abstraction layer
plugin manager - is an os-in-one-page?
tiddler as a base class instances - so that any derived class can use a different rendering system. don't rely on standard wikiword, have xml, html, txt formatting views instead of just tiddler wikiword syntax.
docbook xml -
Ying Liu
use small font to do detection of the tables
xinxin wang
numeric search in table?
Robert Dale
RW:
*early days: Doug Appelt's KAMP - introduced NP ref and ref functions; McKeown's thesis
*85-00 formal definition: unifying frameworks - subgraph construction [Krahmer et al], parameterized search [Bohnet and Dale 2005],
** unifying framework for RE gen.
** Winograd SHRDLU 71, 72
*00+ empiricism: Belz 07 STEC shared task
* NLG - Wilks - input/output for most NLG undefined/not agreed upon, but not true for RE gen.
* Fut work: RST in RE gen. -> Micro theories of ref in some specific domains.
Questions:
* Vincent: genre (IM chat), domains (law): how different in terms of reference.
* Martin: drawer domain, no relative description, how about in terms of landmark?
Percy Liang, Slav Petrov, Michael Jordan and Dan Klein
idea: to decide complexity of grammar
em doesn't solve this by itself
use HDP to group PCFG
rw: clustering of symbols to create compact grammar
rw: nonparametric grammars
3 step em.
exp phi for first step as a type of thresholding to limit symbol production
q (noah smith): like em, bumpy nonconvex? yes
q (jenny finkel): binary nodes?
Daniel T Lee
flickr (yahoo) used in tube
myspace for bands
51show.com - chinese version of myspace
facebook for schools student but restricted by school
youtube.com - bus uncle
knowledge search - using human instead of machines (reverse captcha)
driven by prestige
drive 2.0 by
- community effect (niche)
- bandwidth available (music, photos, video)
- long tail
- 2010E 1.8 b internet users expected. 35% from asia pac. (morgan stanley data)
emp on sep good from bad citizens
internet years very fast - 6 months is several trad years
ipod - do programming yourself. podcast.
also micro payment in ipods
predict - tivo in ip will be new channel
predict - netflix distr changed to broadband d/l
charge or no charge decision?
role of search - help find but also help ads
content and ad -> split revenue of ads to content creator, do price leveling by auction (bill gross)
star search
q: mobile ? adds to geo specific long tail
mobile business regulations.
Rafael Banchs
Speech to text translation system
* union set alignment better for phrase based systems - why?
* pruning
* word-ordering - target long range reordering (before alignment??), word bonus model (length normalization?) -
de Gispert and Marino 2006
- "embedded words" create a bilingual dictionary from intersection of s2t t2s models
- problem lessened when large data source given (conditioned on existence rather than probability of alignment)
- phrase based system can generate 52% of n grams
- ngram system can generate only about 37% of phrases
- ngram based is simpler and for very dialect pairs seems to be sufficient
Future work:
- discriminative alignment training; too intensive many times
- class or POS based alignment
- zh-es translation using bootstrap using interlingual that is filtered by "good" quality
Rules for TiddlyWiki. This seems to be version 2.0.10
Lower priority tasks
sum blog sum to kathy
Shroff: SusanFeldman / Autonomy
5244
* (later) metadata / identifiers lecture to be resorted
* (later) same with copyright, policy lecture
Demos
* NPIC more features
3243
* take down peace please
Systems
* mount of cte on tembusu
Projects
* fix up sms collect / email wing-eval
* squid cache for arxiv
* read CRF Galen Andrew's
* try out mallet from mccallum, ask galley about CRF other systems
Yu-Han Chang, Paul Cohen, Wesley Kerr, Daniel Hewlett and Shane Hoversten
wubbles as taught by preteens in a virtual world
standard stanford parser
learn prep, noun and adjective
using regret algorithm
www.wubble-world.com
Su Yan, Dongwon Lee
Publication Venue Ranking Mann06, Bollen05
citation-free def
* idea: use 1) best papers overall over all conf and 2) eval venue based on best papers
answ to 1 define: good/seed papers: 1) provided by seed 2) by cite count
answ to 2 define: goodness via author sim
browsing based measure - ranks from readers perspective
Yuen-Hsien Tseng, Chi-Jen Lin, Hsiu-Han Chen and Yu-I Lin
Clustering then keyword extraction.
then map to hypernym in wordnet cluster
evaluate against infomap from stanford
comparable results
Faisal Ahmad, Sebastian de la Chica, Kirsten Butcher, Tamara Sumner, James H. Martin
personalizing instruction
use knowledge maps (concept map-base visual display) to give high level overview
for plate tectonics 564 concepts / 578 relationships (1.+ links per concept)
- algorithm creates maps on their own , but how is it mapped?
- different types of recommended content for different misconception types (eg. spatial problems need visual resources).
teachers feel that the essay and concept map side by side is most useful.
alignment
q: how about interactivity. eg. prompt user to write more about a subject, as continuous assessment, adaptive
randall davis
smart paper
synthetic diagrams - chemical diagrams
magic paper
- what's best drawn and what's best said
- uml diagrams, circuit analysis, execute physics/sim/wargame for that
The vision
- modality-agnostic/opportunistic
- non-distracting
- application: intelligent textbooks (like young lady's primer, neal stephenson)
Kuzman Ganchev and Fernando Pereira
term: metric labelling problem
label propagation
figure 2 or 3 inconsistent
applied to cora dataset
Qing Li, Sung Hyon Myaeng, Yun Jin and Bo-Yeong Kang
filter web snippets using related works that should be required to be on target web page
* to get translation: freq, close, length (longer)
* really big jump in performance in korean term translation
* cross lingual translation helps in cases where multi correct answers are possible
todo: look up hungarian algorithm
q: how to get related terms for anchor? a: use tf*idf
* feature selection
* instance weighting
comprises two talks together (acl 07 and hlt-naacl 06)
related work
* daume - feature replication acl 07
* structural correspondence learning blitzer et al 06
* ando and zhang 07
* caruana 97
* entropy minimization: grandvalet and bengio 04
labeling difference p(y|x)
source distribution difference p(x)
instance removal gives better results
instance reweighting gives better results
but adding both together doesn't give better results (conflicting?)
Bruno Jedynak and Damianos Karakos
Helena Gao
accent problems over two or more different native / non-native accents
- vision to be commercialized into mobile devices
See-Kiong Ng
* Look at evolutionary conservation of PPI across species
* "Domain domain interaction" - domain is a part of a protein that folds independently of others
* domains are highly conserved
* multi-domain needs chaperones to catalyze reaction
* find high support, non-reducible sets of > 2 dom that are PPI
* approach (baseline) association rules
Holger Luczak, Aachen
seven levels, from basic, individual level to macro-organizational levels
traditions / trends / visions wrt physical ergonomics, -> work structuring / classical management, market as procurement strategies.
* linking ergonomics with business goals in the company workplace: ia conference
* reverse engineering the requirements of tools to derive an ideal model of man (cranfield man)
* rasmussen 1986 : decision step ladder
* autonomous production cell
* trend: i do it my way -> individuals dictate their own growth development, skillsets are duplicated across workers, awareness of multiple ethnicities/diversity management
* T W Malone (2004) Sloan -> central to loose hierarchy of management to democratic, market-driven management
* extend understanding the product cycle as incorporating the long tail of services necessary to deal with a range of products
Sun Chengjie (Min Zhang)
need origin model then activate LM for transliteration
how to decide which to use: tried bi-gram based perplexity and n-gram
summarization measure, max ent
NEOR / NCOR data (zh -> en / en -> zh)
how big is the dataset?
current direction to integrate with name translit. currently have name origin demo.
Ying Zhao, Justin Zobel and Phil Vines
* tested on a number of corpora
* function words best
* svm worked best using kl divergence to choose
Dan Shen and Mirella Lapata
framenet based srl, global assignment using string kernel sim function that weights using tfidf
sim function uses predicate as anchor
use a graph matching model to do global assignment of roles
soft labeling of argument to more than one role
assume question only has 1 predicate
srl only doesn't do well because of coverage of framenet.
q (steve): what about support verb constructions?
Yap P. Tan NTU
- video analysis (basketball via camera views, drowning spotting via ellipse and movement)
- shape info for pose recognition as well as sign language
- crowd behavior
Chan Yee Seng and Hwee Tou Ng
1:08 - 1:26 1:33
- 3 sent window for wsd training
- apply wsd only on match
- penalty features is negative, preferring wsd'ed phrases.
Q:
- how often does longest, most probable happen during decoding?
- what are any characteristics of the problems?
- too much text on analysis slide
give a concrete example
Ben Medlock and Ted Briscoe
kappa = .65 (speculative versus non-spec, in medline, by Light et al.)
idea: clear up and give more guidelines for annotation, gives much better kappa
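For reference, a sketch of how a two-annotator kappa (Cohen's kappa) is computed from a 2x2 agreement table. Which kappa variant the .65 figure refers to is not recorded in these notes, and the counts below are made up purely to exercise the formula.

```javascript
// Cohen's kappa for two annotators and two labels (spec / non-spec).
// table[i][j] = number of items annotator 1 labelled i and annotator 2 labelled j.
function cohenKappa(table) {
  const n = table[0][0] + table[0][1] + table[1][0] + table[1][1];
  const observed = (table[0][0] + table[1][1]) / n;
  // Chance agreement from each annotator's marginal label proportions.
  const row0 = (table[0][0] + table[0][1]) / n;
  const col0 = (table[0][0] + table[1][0]) / n;
  const expected = row0 * col0 + (1 - row0) * (1 - col0);
  return (observed - expected) / (1 - expected);
}

// Made-up counts, not data from the paper.
console.log(cohenKappa([[20, 5], [10, 65]]).toFixed(3));
```

Because chance agreement is subtracted out, raw agreement of 85% on a skewed label distribution like this one still yields a kappa well below 0.85.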
non-hedges only differentiated by absence of hedge cues, but hedge cues are hard to pin down to some lexical form.
uses nb based ml in features. but not yet clear what features are -- seems just unigram features
data available on the web.
q: where does speculation occur wrt sections?
q:
Steve DeNeefe, Kevin Knight, Wei Wang and Daniel Marcu
benefits
* syntax based allowed to carry extra constraints, non contiguity
** unaligned words get absorbed in syntax rules at higher levels
* string ones can carry nil productions
looking at phrase table coverage only
* look at important phrase
ghkm rules missing significant 1-best rules used in ats pb style mt
Mengqiu Wang, Noah A. Smith and Teruko Mitamura
Both for semantic and syntactic transformation based answering
Idea is to extend with semantic transformation
P(A|Q) ∝ P(Q|A)P(A) (P(Q) is constant for a given question)
first part is jeopardy model = prob of q given answer
use parse tree, dependent on the types of configuration (limited to 6; same as smith and eisner 06)
non-convex !! do conditional EM
q: why integrate alpha outside of base alignment model? how is this integration?
a: not really sure.
q: why sum over all alignments and not take max?
Current idea:
Use scheduled task in WinXP to scp stuff from cte and upload it back when finished.
* Test whether scheduler works - done, works.
* Use scheduler to copy and copy back files - done, works.
* Create new cte account for SlideSeer - in progress.
* Migrate PPT extract app to new account
* Connect to scheduler
Note: WinXP Remote Desktop Connector uses a certain port to connect. Apparently traffic on this port is blocked between research and staff segments (but I have yet to verify this).
Liqi Gao, Yu Zhang, Ting Liu and Guiping Liu
use simple wsd model.
wsd ir model doesn't work better but when combined seems to help more.
q: to what extent are queries polysemous?
(Mondays)
Jan 7 - ijcnlp / vacation / load balancer?
14 - web services for existing things?
21 - web frontend to web services
28 - editing jin, jesse, gm, yeefan for jcdl
Feb 4 - importing acl anthology data, start importing citeseer data
11 - crawler work, semi structure work
18 - connect to IA web api soap /msra nlp
25 -
Mar 3 -
10 -
17 -
24 -
31 - HYP edits
Apr 7 - US Vacation
14 - US Vacation
21 -
28 -
May 5 -
12 -
19 - late LREC
26 -
Jun 2 -
9 -
16 - mid JCDL/ACL USA
23 -
30 -
Jul 7 - end Sigir
14 -
21 -
28 -
Aug - term starts
Sep - tenure app / GH move?
Oct
Nov
Dec
parsehed parscit
Slideseer integration
webpage parse integration
keyword integration
editing integration
az integration
--
wikipedia term extraction
victor zue
4 areas - zue leads
Virtual Singapore (language learning)
http://www.csail
user:singapore507
pasword:507a
leslie kaebling
interactive aids - workshop theme
* software aides for edutainment / travel aides
* adversarial agents
* 1. probabilistic representations - for all foundational technology
* 2. NLP/Vision/robotics/human interface/etc. -> "modes"
* 3. apps
"measurable progress" -> probabilistic?
- will have permanent project staff, jointly supervised
- csail PI to: 60 months in five year
- Singapore PIs to go 10 per year
* Abstract - First paragraph seems unnecessary.
* ToC - You have some sections in Ch 4 that are not correctly capitalized
* Ch 1 - No mention of summarization algorithms. You need to introduce the structure of your thesis here and to make a clear statement of the hypothesis that you are going to investigate in your research.
* Ch 2 - 2.1 starts off referring to the internet but then 2.2 is totally independent and reads like an excerpt of a textbook. If it is textbook material, why include it in your thesis? I'm not sure that you need all of the terminology that you actually introduce. Section 2.3 is better but still needs lots of proofreading. However, these sections (2.1-2.3) are all preliminaries; you have yet to discuss how graphical models and random walks have been applied in related research for web and summarization.
* Ch 3 starts with some related work that may have been a part of your original paper but it doesn't belong in your thesis since you have already discussed this in Ch 2. I think you should move all the discussion of PageRank to Ch 2.
* Ch 4 evaluation doesn't seem very well connected to your methodology. It shows that the summarization works but can't be directly attributed to your use of Random Walk. It would be better to compare against another graphical method that doesn't use random walk to achieve its summarization. Finally, the time and memory efficiency of the algorithms don't seem related at all. You have to somehow weave these goals into your thesis from the start rather than keeping them independent of your main contribution and "popping" them in when necessary.
* Ch 5 on summarization - much improved from what I saw earlier. 5.2.1 and the beginning seem a bit redundant. Also, there are some font changes where it looks like you changed editors or you cut and pasted. Make sure you write original material, don't copy from other sources -- no matter how tempting. Also, your claims are not substantiated by our testing so far -- you have only tested on news articles and we can't say that your method will work for internet articles with so many different types of authors. You need to tone down your claims so that your evaluation results support them. I can see you put some time into creating the graphs in 5.5.4, they look great. Overall, this chapter is coming together nicely -- it has all the material for the journal. What is missing is the related work and the comparison with other systems -- TextRank and LexRank. Also, this chapter doesn't talk about mobile devices at all in the evaluation, so it isn't very well linked to your thesis.
* Ch 6 - I don't have much to say yet. Generally I think I would write your conclusion, introduction and abstract as the last chapters, not now. In its current state, it is missing much connection with document summarization.
The requirements for a summarization journal paper and a thesis chapter are *very* different, so don't expect to be able to re-use most of the work without substantial editing. If you try to write both at the same time you will end up not doing well at either. I suggest targeting one first and, when finished, switch to the other. Because of time pressures, you might want to finish a preliminary version of the chapter, convert it to the journal submission and come back to finalize your chapter later.
Joachim Wermter and Udo Hahn
* collocations / idioms / terminology (mostly NPs)
* ranking (MRR?)
* ''BUG'': try out wermter's assoc measure LSM (that collocs don't get modified as easily); LPM "less distributional variation" (check termhood score that slots within n grams can be subbed) -> as add'n feature for keyphrase extraction
* statistic based version: used t-test b/c supposedly the best statistic
* no error analysis done
* idea: ask emma about this: look at prev use in other terminologies (as head or modifiers)
* look at document distribution?
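For reference, a rough sketch of the t-test association score for bigram collocations mentioned above. It uses the common textbook approximation (variance roughly equal to the observed bigram probability), not Wermter and Hahn's LSM/LPM measures. The function name and the toy sentence are made up for illustration.

```python
import math
from collections import Counter


def bigram_t_scores(tokens):
    """t-score for each adjacent bigram under an independence null hypothesis.

    t = (observed - expected) / sqrt(variance / n), with variance
    approximated by the observed bigram probability (valid when that
    probability is small).
    """
    n = len(tokens) - 1  # number of bigram positions
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        observed = count / n
        expected = (unigrams[w1] / total) * (unigrams[w2] / total)
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed / n)
    return scores


# Toy corpus, invented for illustration only.
TOKENS = "new york is big . new york is old . the city is new .".split()
SCORES = bigram_t_scores(TOKENS)
RANKED = sorted(SCORES, key=SCORES.get, reverse=True)
```

On such a tiny sample the scores are not statistically meaningful; the point is only the ranking behaviour, where the repeated pair ("new", "york") outranks a one-off pair such as ("is", "new").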
aclweb.org
section on organizing a conference, acl handbook, officer duties
organizers: 4 afnlp, 4 from acl -> to approve general chair
"shadow group" -> oversight
whether emnlp/conll or other conf
hotel contract / contract to go through with treasurer
local account needs to be created @ non-profit
* acl
* local committee
* local arrangements committee big (20+ people) : people for booklet, lanyards, bag, 5-10 range souvenir
* excursion during tutorials
* not just web access, ssh level, email room (if needed)
* BoF: 6 and 20 rooms for these
acl-ijcnlp 2009
* share experience - 8 am mojave restaurant
* omnipage publication if to be printed locally
* 6-8 months / site visit, changing the venues.
NIE
oral corpus in chinese - preschool
data sets - 300 interviews
asian languages and cultures (ALC)
- announcements: CM and lab, emph art of programming
- scanner
- do chapt 4 first?
- walk through development do together
come up with a couple of different classes
CD tracks
mobile phone model
static mod in constant sections on pg 183
typo on 108 no readLong method
131 printLine -> println
quick check 117
question with template and free narrative
question template
submit form to assessor, assessor returns response.
e.g. "What effect does [entity] have on [entity]?"
what [financial relationships] exist between [drug companies] and [universities]?
stephanie seneff
language teaching for chinese, english (bi-directional)
1. translation
2. eavesdropping (learning how to communicate; listen to two different agents)
3. conversation (practice)
scaffolding for language learning
preprocess: pp2post, rawText2pp, rawText2mappedSection
az: pp2az, pp2cfc
meurlin
wordsim
sentsim
ingest: spiderSingleWebPage, spiderPDF, PDF2rawText
slideseer: ppt2rawText, alignS2D
ruby load balancer connected to gmond
xmltape arcfile for metadata store? Or at least interface for it?
read oai pmh again
language dislikes synonyms - absolute synonym is rare, useless
differ on one or more dimensions - biber?
denotation, emphasis, connotation, register, evaluative
diff languages slice nuances differently (argue for words that are independent)
acquisition from dict: structure (diana inkpen and hirst in CL), P/R = .8/.7
commit to lisplike KB with symbol+NL phrases that are extracted
compare and contrast (sim then diff) of near syns
collocational judgment as well.
used HALogen baseline used lm 38.4 worst dataset 56.5% acc
big source of error in baseline was poor collocation evidence
meaning in
*text
*writer
*reader
* argue for using negative evidence
* argue for words?
* argue for centralized KR/interlingual? not corpus based evidence?
* can't use web corpora for acquisition, not just verification? ken church - can't learn from text
Victor Zue
speech based interface
zue, glass, seneff
* adaptive in terms of environment, speaker, speaker's vocabulary
* smt (collins) - parallel corpora-aligned training data
* review sum (barzilay) - sentiment analysis - via voice discourse (dialogue)
review mining -> IE to database
- "read me a representative review for movie"
- "find me a good and inexpensive thai restaurant near here"
- "what are people saying about the ipod nano"
* lecture segmentation (barzilay)
lyra - query expansion
is query an expansion or refinement or a new session?
content analysis - co-occurrence of NEs
webstudio - microsoft
ken burns type ad creation from web page content using block rank.
* via phonetic transliteration
* via temporal distribution (this is pretty straightforward)
* combining (using score propagation -- similar to pagerank prop)
nurse interest
offline - crawl get info from predetermined sources
star rating
mrs li
"extra bits"
8-11 pm at night run to have results ready
barry eaglestone
alberto gianni
lawrence - CIT
wang xiangyu <wang06>: mail
hci
first at nus
or at singapore nlb
uci participation
- context must play a weight
-- favor monotonic
-- long jumps log prop with short
-- must handle nils
(these two like jing's paper)
- length of text in slide should factor for uncertainty
- atomic sim should favor words in front more (likely to be titles)
- longer span should be favored.
- doesn't have to be perfect, better to have higher recall (capture context)
- a (context factor) + (1-a) (sim factor) (length confidence factor)
- use span jaccard sim (w and w/o length damping)
context = 0,1,2 sided context
- use baseline as distortion model
- use baseline as cosine model.
But both baselines don't do spans...
Use baseline for starting point for GA to explore and optimize space.
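A rough sketch of the combined score `a (context factor) + (1-a) (sim factor) (length confidence factor)` from the notes above. All function names, the saturation length of 10 tokens, the doubled penalty for backward jumps, and the default `a=0.3` are assumptions for illustration, not settled design. Nil alignments and the front-of-slide word weighting are not handled here.

```python
import math


def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between the word sets of a slide and a document span."""
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def length_confidence(slide_tokens, saturation=10):
    """Less text on the slide means less certainty; saturates at 1.0 (assumed)."""
    return min(1.0, len(slide_tokens) / saturation)


def context_factor(prev_start, span_start):
    """Favor monotonic, short jumps; cost grows with the log of the distance.

    Backward jumps count double (an assumed way to favor monotonic order).
    """
    if prev_start is None:
        return 1.0
    jump = span_start - prev_start
    distance = jump if jump >= 0 else 2 * -jump
    return 1.0 / (1.0 + math.log1p(distance))


def alignment_score(slide_tokens, span_tokens, prev_start, span_start, a=0.3):
    """a * context + (1 - a) * similarity * length confidence, in [0, 1]."""
    sim = jaccard(slide_tokens, span_tokens)
    conf = length_confidence(slide_tokens)
    ctx = context_factor(prev_start, span_start)
    return a * ctx + (1 - a) * sim * conf


# Invented example: one slide, a nearby matching span, a distant unrelated span.
SLIDE = "noisy channel model for question answering".split()
NEAR = "we use a noisy channel model for answering".split()
FAR = "the weather was nice".split()
S_NEAR = alignment_score(SLIDE, NEAR, prev_start=3, span_start=4)
S_FAR = alignment_score(SLIDE, FAR, prev_start=3, span_start=40)
```

A score like this could serve as the fitness function for the GA, with `a` and the saturation length among the parameters to optimize.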
tomcat on aye installed on /home/tomcat/
webapps in /home/tomcat/apache-tomcat-5.5.17/webapps
index to be stored in /home/slideseer/index (should then be visible to tomcat user)
(done) Fix java code in luceneweb to point to slideseer documents in correct place.
Decided:
* tomcat
** put java jsp code for running just lucene interface in tomcat home /home/tomcat (you need to be root to write to this, maybe try to fix this later)
** should copy all of the modified code out to src/jsp or src/java (which needs to be replicated into rpnlpir somehow, note where it originally comes from)
** I had big problems getting things to compile (hate java classpath, never works the way you want it to), so the 2nd step is really important
* put all remaining code into ~slideseer
* put indexHTML index into ~slideseer/index
Agenda:
* (done) index command uses anything with .../public_html/; tomcat jsp script replaces public_html with http://wing.comp.nus.edu.sg/~slideseer/
* need code on ss side to dump document and slides into sep html for indexing.
* fix print view to not have extra hyperlink in nav
* (done) commit latest copies of stuff on slideseer account
* (done) fix fsv to be ssv
* fix title and other metadata fetch
* fix key tsv to finished.
* need similar rewrite for slides to img thumbnails. Figure out later.
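A small sketch of the public_html-to-URL rewrite from the first agenda item, written in Python rather than JSP just to pin down the logic. The function name and the example path are hypothetical; only the `public_html` marker and the base URL come from the notes.

```python
from urllib.parse import quote

BASE_URL = "http://wing.comp.nus.edu.sg/~slideseer/"


def path_to_url(path, marker="/public_html/", base=BASE_URL):
    """Map an indexed file path to its public URL.

    Everything up to and including the first '/public_html/' is replaced
    by the base URL; the remainder is percent-encoded for safety.
    """
    head, sep, tail = path.partition(marker)
    if not sep:
        raise ValueError("path is not under public_html: " + path)
    return base + quote(tail)


# Hypothetical indexed path, for illustration only.
EXAMPLE = path_to_url("/home/slideseer/public_html/docs/a.html")
```

The same mapping, with a different base, might cover the slide thumbnail rewrite once that layout is decided.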
Need a timeline
* when to be deployed?
* what dataset to try?
* where is dataset stored? into mysql (phpmyadmin)
- k t frantzi and s ananiadou "extracting nested collocations" 1996. in 16th conf. on computational linguistics.
- ________ "the c-value/nc-value domain independent method for multi-word term extraction", journal of natural language processing 6(3) 145-179 (1999).
"Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering"
* Ricardo Baeza-Yates, Universidad de Chile
* Carlos Castillo, Universidad de Chile
* Mauricio Marin, Universidad de Magallanes
* Andrea Rodriguez, Universidad de Concepción
"User-Centric Web Crawling" Sandeep Pandey, CMU Christopher Olston,
WWW 2006 "Information Search and Re-access Strategies of Experienced Web Users"
* Anne Aula, University of Tampere
* Natalie Jhaveri, University of Tampere
* Mika Käki, University of Tampere
WWW 2006 Towards Practical Genre Classification of Web Documents George Ferizis Peter Bailey
Rosenberg, J.B., & Borgman, C.L. (1992). Extending the Dewey Decimal Classification via keyword clustering: The Science Library Catalog project. Proceedings of the 54th American Society for Information Science Annual Meeting, 29. October 26-29, 1992, Pittsburgh. Medford, NJ: Learned Information, 171-184.
CiteSeerX: an Architecture and Web Service Design for an Academic Document Search Engine
Huajing Li
Isaac Councill
Wang-Chien Lee
C. Lee Giles
Mining Search Engine Query Logs for Query Recommendation
Zhiyong Zhang
Olfa Nasraoui
SIGIR 2006 Question Classification with Log-Linear Models
Phil Blunsom, University of Melbourne
James Curran, Krystle Kocik, University of Sydney
SIGIR 2006 P29 - Authorship Attribution with Thousands of Candidate Authors
Moshe Koppel, Jonathan Schler, Bar Ilan University
Shlomo Argamon, Illinois Institute of Technology
SIGIR 2006 P35 - Impact of interaction design for search features in digital libraries on user searching experience
Yuelin Li, Xiangmin Zhang, Ying Zhang
Rutgers University
SIGIR 2006 P39 - Combining Fields in Known-Item Email Search
Craig Macdonald, Iadh Ounis
University of Glasgow
SIGIR 2006 P47 - Give Me Just One Highly Relevant Document: P-measure
Tetsuya Sakai
Toshiba
Author Badre, Albert.
Title Shaping Web usability : interaction design in context / Albert N. Badre.
Imprint Boston, MA : Addison-Wesley, 2002.
LOCATION CALL # STACK# STATUS
SC Books QA76.76 Dev.Bad
J. Shi, J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.
ACL
* bing liu's tutorial materials (search on the web)
* attardi giuseppi (emnlp/conll paper), parsing
* how about getting dataset from james clarke:
http://homepages.inf.ed.ac.uk/s0460084/data/.
* data for hierarchical segmentation?
http://people.csail.mit.edu/edc/emnlp07
* shalmaneser erk and padó 2006 free SRL
* re-read Semi-Markov Models for Sequence Segmentation
JCDL
* find tammy's contact list
* check spider to make sure dataset links are in the code
* email peter with link to talk and slideseer
* jeff pomerantz (jcdl 2006 as well as d-lib nov 2006 as well as jcdl 2007), check out
* get flux-cim dataset
* ask david about rexa spidering
* read all best paper candidates
EARLIER
* read CRF Galen Andrew's
* moves jien-chen wu et al. computational analysis of move structures in academic abstracts
Yin-Leng Theng
HCI (Dix et al. 93, Preece et al 94)
support people
user model (system image vs. mental models)
applying scenario-based design (in IPM)
* design patterns analysis to deconstruct and recreate the dialogue modeling.
* executable user models
P.S. I'd also like to add you to my group's news mailing list. This is a very low (4 emails per year) mailing list (with the subject [WINGnews]) documenting recent work from my research group, WING. If you're interested, just reply to this message and let me know, otherwise, no worries -- I'll only add you to the list if you explicitly agree to it.