1
|
- Wei Lu and Min-Yen Kan
- luwei@nus.edu.sg kanmy@comp.nus.edu.sg
- Computer Science Program, Singapore-MIT Alliance
- School of Computing, National University of Singapore
|
2
|
- Most crawlers and indexers ignore crucial information conveyed by external programs such as JavaScript™
- Can we have them summarized automatically?
|
3
|
|
4
|
- Introduction
- Categorization Scheme
- Feature Extraction Techniques
- Related Work
- Our Work
- Lexical Analysis
- Syntax Analysis
- Metrics Analysis
- Object Communication Analysis
- Contextual Analysis
- Conclusion
|
5
|
- Source code categorization: Ugurel et al.’s work, published in 2002
- Seminal work on the subject
- Consists of two tasks
- Programming language classification: identify the programming language using keywords, bi-grams… not interesting!
- Topic classification: identify the topic of the code; relies heavily on external resources (e.g. README file, code header) and analyzes the source code itself very little
- Can we extract features from source code itself?
|
6
|
- Vector Space Model (VSM) is used in Text Categorization
- Bag of words approach
- Our baseline, using the text categorization bag-of-words approach: accuracy 87.47%
- A high baseline, but it still leaves a gap of about 12.5% to close
- Also important to measure error reduction rate
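
To make the two ideas on this slide concrete, here is a minimal sketch of a bag-of-words count and of the error reduction rate. The function names and the 90.0% figure are assumptions for illustration only; 90.0% is not a reported result, and the actual baseline features and weighting are not given on the slide.

```javascript
// Bag of words: count how often each token occurs, ignoring token order.
// Splitting on non-identifier characters is an assumed, simplistic tokenizer.
function bagOfWords(text) {
  const counts = new Map();
  for (const token of text.toLowerCase().split(/[^a-z0-9_$]+/)) {
    if (token === "") continue;
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

// Relative error reduction of a new accuracy over a baseline accuracy,
// both given in percent: (baseline error - new error) / baseline error.
function errorReduction(baselineAcc, newAcc) {
  const baselineErr = 100 - baselineAcc;
  const newErr = 100 - newAcc;
  return (baselineErr - newErr) / baselineErr;
}

const bag = bagOfWords("index = index + 1");
// Hypothetical improved accuracy of 90.0% against the 87.47% baseline.
const reduction = errorReduction(87.47, 90.0);
console.log(bag.get("index"), reduction.toFixed(3));
```

With a baseline this high, a gain of a few accuracy points corresponds to a much larger share of the remaining errors, which is why the error reduction rate is worth reporting alongside accuracy.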
|
7
|
- How to tokenize JavaScript more reasonably?
- First attempt at improving over the baseline: we believe a compiler-based approach is more reasonable
- Similar to POS tags used in Natural Language Processing, we introduce a tagset for JavaScript tokenization/tagging
- The tagging process includes token normalization
- Tags used:
- KEY: keyword; VAR: variable; SYM: symbol; NUM: number; STR: string; CMT: comment; REG: regexp
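
A rough sketch of such a tagger, under several assumptions: the keyword list is a small hypothetical subset, normalization is taken to mean replacing literals and comments by their tag, and the REG tag is left out because recognizing a regular-expression literal requires syntactic context that a single pattern cannot provide. This is not the authors' implementation.

```javascript
// Hypothetical keyword subset; a real tagger would use the full reserved-word list.
const KEYWORDS = new Set(["var", "function", "if", "else", "for", "while", "return", "new"]);

// Alternatives are ordered so comments and strings are consumed whole
// before their contents can be mistaken for other tokens.
const TOKEN_RE = /(\/\/[^\n]*|\/\*[\s\S]*?\*\/)|("(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')|(\d+(?:\.\d+)?)|([A-Za-z_$][\w$]*)|([^\s\w])/g;

function tagTokens(source) {
  const tagged = [];
  for (const m of source.matchAll(TOKEN_RE)) {
    if (m[1] !== undefined) tagged.push({ tag: "CMT", text: "CMT" });      // normalized
    else if (m[2] !== undefined) tagged.push({ tag: "STR", text: "STR" }); // normalized
    else if (m[3] !== undefined) tagged.push({ tag: "NUM", text: "NUM" }); // normalized
    else if (m[4] !== undefined) {
      tagged.push({ tag: KEYWORDS.has(m[4]) ? "KEY" : "VAR", text: m[4] });
    } else tagged.push({ tag: "SYM", text: m[5] });
  }
  return tagged;
}

const tags = tagTokens('var msg = "Welcome"; // greeting').map((t) => t.tag);
console.log(tags.join(" ")); // KEY VAR SYM STR SYM CMT
```

Normalizing literals this way keeps the feature space small: two scripts that differ only in their string contents produce the same tagged sequence.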
|
8
|
|
9
|
|
10
|
|
11
|
- Metrics: measurement of program source code
- Complexity: statistical measurements, e.g. CC, IFIN
- We proposed our own metrics in addition to published metrics
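
As an illustration of a complexity metric, the sketch below approximates cyclomatic complexity, assuming that is what "CC" abbreviates here. It counts branching keywords and short-circuit operators in the raw text, so it is only a crude estimate: keywords inside strings or comments are counted too, and the ternary operator is ignored. The function name is hypothetical.

```javascript
// Approximate cyclomatic complexity: 1 + number of branching points found.
function approxCyclomaticComplexity(source) {
  const branches = source.match(/\b(?:if|for|while|case|catch)\b|&&|\|\|/g);
  return 1 + (branches ? branches.length : 0);
}

const sample = "function f(x) { if (x > 0 && x < 9) { return 1; } for (;;) {} return 0; }";
console.log(approxCyclomaticComplexity(sample)); // 4: one path plus if, &&, for
```

A parser-based count over the syntax tree would avoid the false positives of text matching; the pattern version is shown only because it fits in a few lines.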
|
12
|
|
13
|
- Introduction
- Categorization Scheme
- Feature Extraction Techniques
- Related Work
- Our Work
- Lexical Analysis
- Syntax Analysis
- Metrics Analysis
- Object Communication Analysis
- Contextual Analysis
- Conclusion
|
14
|
- var msg = "Welcome to this page";
- banner(0);
- function banner(index) {
- var newWin = window.open();
- frm.txt.value = "ok";
- window.status = msg.substring(0, index);
- index++; // advance to the next character
- if (index >= msg.length) index = 0;
- window.setTimeout("banner(" + index + ")", 100);
- }
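
The slide does not state what is extracted from this snippet. Assuming that object communication analysis looks at which objects a script touches (here `window`, `frm`, and `msg`), one simple way to collect such references is sketched below. The function name and the pattern-based approach are assumptions, not the authors' tool.

```javascript
// Collect object -> set of accessed member paths, e.g. window -> {open, status}.
function memberReferences(source) {
  const refs = new Map();
  // Blank out string literals so text such as "banner(" is not scanned.
  const stripped = source.replace(/"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/g, '""');
  const pattern = /\b([A-Za-z_$][\w$]*)((?:\.[A-Za-z_$][\w$]*)+)/g;
  for (const m of stripped.matchAll(pattern)) {
    if (!refs.has(m[1])) refs.set(m[1], new Set());
    refs.get(m[1]).add(m[2].slice(1)); // drop the leading dot
  }
  return refs;
}

const snippet = `
var msg = "Welcome to this page";
function banner(index) {
  var newWin = window.open();
  frm.txt.value = "ok";
  window.status = msg.substring(0, index);
  if (index >= msg.length) index = 0;
  window.setTimeout("banner(" + index + ")", 100);
}`;

const refs = memberReferences(snippet);
console.log([...refs.get("window")]); // [ 'open', 'status', 'setTimeout' ]
```

References to `window.open`, `window.status`, and a form field hint at pop-up, status-bar, and form-handling behavior respectively, which is the kind of signal a functionality-based categorization could use.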
|
15
|
- Extracting information from the context of the enclosing web page
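
As a hedged illustration of what page context can offer, the sketch below pulls the page title and inline event-handler attributes out of an HTML string. Which contextual features the authors actually used is not stated on the slide; the function name, the choice of features, and the pattern-based parsing are all assumptions.

```javascript
// Extract simple context from the enclosing page: its title and any
// double-quoted inline event handlers such as onload="...".
function pageContext(html) {
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const handlers = [];
  for (const m of html.matchAll(/\son([a-z]+)\s*=\s*"([^"]*)"/gi)) {
    handlers.push({ event: m[1].toLowerCase(), code: m[2] });
  }
  return { title: titleMatch ? titleMatch[1].trim() : null, handlers };
}

const page =
  '<html><head><title> Banner demo </title></head>' +
  '<body onload="banner(0)"><form name="frm"></form></body></html>';
const context = pageContext(page);
console.log(context.title, context.handlers);
```

A real extractor would use an HTML parser rather than patterns, since attribute quoting and nesting vary widely across pages.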
|
16
|
|
17
|
- Introduction
- Categorization Scheme
- Feature Extraction Techniques
- Related Work
- Our Work
- Overall Evaluation
- Conclusion
|
18
|
|
19
|
- Categorization Scheme
- Feature Extraction Techniques
- Conclusion
|
20
|
- Showed that program analysis can enhance source code categorization performance
- Both context-free and context-sensitive analysis
- Case study of JavaScript categorization
- New, functionality-based categorization
- Tool for feature extraction from JavaScript
|
21
|
- Limitations:
- Annotator Agreement
- Dynamic Analysis Incompleteness
- Choice of Classifier
- Future Work
- Source code classification of other languages
- Firefox extension / IE plug-in
- Dataset and system prototype available at:
- http://wing.comp.nus.edu.sg/~luwei/SMART
|
22
|
- Dataset and system prototype available at:
- http://wing.comp.nus.edu.sg/~luwei/SMART
- First author’s undergraduate honours year project thesis:
- http://wing.comp.nus.edu.sg/publications/theses/weiLuThesis.pdf
- Contacts:
- Wei LU: luwei@nus.edu.sg
- Min-Yen KAN: kanmy@comp.nus.edu.sg
|