Supervised Categorization of JavaScript™ using Program Analysis Features
Wei Lu and Min-Yen Kan
Wei Lu: luwei@nus.edu.sg, Computer Science Program, Singapore-MIT Alliance
Min-Yen Kan: kanmy@comp.nus.edu.sg, School of Computing, National University of Singapore

Why categorize JavaScript?
Most crawlers and indexers ignore crucial information conveyed by external programs such as JavaScript™
Can these programs be summarized automatically?

Categorization Scheme

Outline
Introduction
Categorization Scheme
Feature Extraction Techniques
Related Work
Our Work
Lexical Analysis
Syntax Analysis
Metrics Analysis
Object Communication Analysis
Contextual Analysis
Conclusion

Related Work
Source code categorization: Ugurel et al.'s work, published in 2002
Seminal work on the subject
Consists of two tasks
  Programming language classification: identify the programming language from keywords, bi-grams… (of little interest here)
  Topic classification: identify the topic of the code; relies heavily on external resources (e.g. README file, code header) and analyzes the source code itself very little
Can we extract features from the source code itself?

Text Categorization Baseline
The Vector Space Model (VSM) is used in text categorization
Bag-of-words approach
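As a rough illustration of the bag-of-words representation, the sketch below turns scripts into term-frequency vectors over a shared vocabulary. The naive identifier tokenizer and the function names (`tokenize`, `buildVocabulary`, `toVector`) are assumptions made for this example, not the tokenizer or term weighting used in the actual baseline.

```javascript
// Split raw text into lowercase word-like tokens (a naive, assumed tokenizer).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z_$][a-z0-9_$]*/g) || [];
}

// Collect the sorted set of distinct terms across all documents.
function buildVocabulary(docs) {
  const terms = new Set();
  for (const doc of docs) {
    for (const tok of tokenize(doc)) terms.add(tok);
  }
  return [...terms].sort();
}

// Map one document to a term-frequency vector aligned with the vocabulary.
function toVector(doc, vocabulary) {
  const counts = new Map();
  for (const tok of tokenize(doc)) {
    counts.set(tok, (counts.get(tok) || 0) + 1);
  }
  return vocabulary.map((term) => counts.get(term) || 0);
}

// Two toy "documents" (illustrative snippets, not items from the dataset).
const docs = [
  "window.status = msg; window.open();",
  "document.write(msg);",
];
const vocabulary = buildVocabulary(docs);
const vectors = docs.map((d) => toVector(d, vocabulary));
console.log(vocabulary, vectors);
```

Word order and syntax are discarded in this representation, which is the limitation the program analysis features aim to address.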
Our baseline, a text categorization bag-of-words approach, reaches 87.47% accuracy
A high baseline, but it still leaves a gap of about 12.5% to close
It is also important to measure the error reduction rate
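Error reduction rate relates an improvement to the errors the baseline still makes, which matters when the baseline is already high. A minimal sketch, assuming the usual definition (baseline error minus new error, divided by baseline error); the 90% figure below is a made-up value for illustration only, not a result from this work.

```javascript
// Relative error reduction between a baseline accuracy and a new accuracy,
// both given as percentages. Assumed definition: (E_base - E_new) / E_base.
function errorReductionRate(baselineAccuracy, newAccuracy) {
  const baselineError = 100 - baselineAccuracy;
  const newError = 100 - newAccuracy;
  return (baselineError - newError) / baselineError;
}

const baseline = 87.47;    // bag-of-words baseline accuracy from the slide
const hypothetical = 90.0; // hypothetical improved accuracy, illustration only
const rate = errorReductionRate(baseline, hypothetical);
console.log((rate * 100).toFixed(1) + "% of baseline errors removed");
```

With a baseline this high, an absolute gain of a few points already removes a sizable share of the remaining errors, which is why the two numbers are worth reporting together.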

Lexical Analysis
How can JavaScript be tokenized more sensibly?
First attempt at improving over the baseline: we believe a compiler-based approach is more appropriate
Similar to the POS tags used in Natural Language Processing, we introduce a tagset for JavaScript tokenization/tagging
The tagging process includes token normalization
Tags used:
  KEY: keyword; VAR: variable; SYM: symbol; NUM: number; STR: string; CMT: comment; REG: regexp
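The tagging step can be pictured with a small regular-expression tagger that labels each token with one of the tags above. This is a simplified sketch, not the compiler-based tokenizer used in this work: the keyword list is abbreviated, multi-character operators are split into single symbols, and regexp literals are detected with a common heuristic. The normalization shown (literals and comments collapse to their tag name) is an assumption, since the slide does not say what the normalization does.

```javascript
// Abbreviated keyword list (assumption: a real tagger would use the full set).
const KEYWORDS = new Set([
  "var", "function", "if", "else", "for", "while", "return", "new", "this",
]);

// One pattern per token class, tried in order at the current position.
const TOKEN_PATTERNS = [
  ["CMT", /^(?:\/\/[^\n]*|\/\*[\s\S]*?\*\/)/],
  ["STR", /^(?:"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')/],
  ["NUM", /^(?:0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?)/],
  ["VAR", /^[A-Za-z_$][A-Za-z0-9_$]*/],
];
const REGEXP_LITERAL = /^\/(?:[^\/\\\n]|\\.)+\/[gimsuy]*/;

// Heuristic: "/" starts a regexp literal only where an operand is expected,
// i.e. at the start of input, after a keyword, or after a symbol other than ")" or "]".
function regexpAllowed(prev) {
  if (!prev) return true;
  if (prev.tag === "KEY") return true;
  return prev.tag === "SYM" && prev.text !== ")" && prev.text !== "]";
}

function tagTokens(source) {
  const tokens = [];
  let rest = source;
  let prev = null; // last non-comment token
  while ((rest = rest.replace(/^\s+/, "")).length > 0) {
    let tag = null;
    let text = null;
    for (const [name, pattern] of TOKEN_PATTERNS) {
      const m = pattern.exec(rest);
      if (m) { tag = name; text = m[0]; break; }
    }
    if (tag === null && rest[0] === "/" && regexpAllowed(prev)) {
      const m = REGEXP_LITERAL.exec(rest);
      if (m) { tag = "REG"; text = m[0]; }
    }
    if (tag === null) { tag = "SYM"; text = rest[0]; } // any other single character
    if (tag === "VAR" && KEYWORDS.has(text)) tag = "KEY";
    // Assumed normalization: literals and comments collapse to their tag name.
    const keepText = tag === "KEY" || tag === "VAR" || tag === "SYM";
    const token = { tag, text, normalized: keepText ? text : tag };
    tokens.push(token);
    if (tag !== "CMT") prev = token;
    rest = rest.slice(text.length);
  }
  return tokens;
}

const tagged = tagTokens('var msg = "Welcome"; // greet');
console.log(tagged.map((t) => t.tag + ":" + t.normalized).join(" "));
// prints: KEY:var VAR:msg SYM:= STR:STR SYM:; CMT:CMT
```

Unlike a plain word split, the tagger keeps a string literal as a single token and distinguishes a division sign from the start of a regexp literal.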

Evaluation on Lexical Analysis

Syntax Analysis

Evaluation on Syntax Analysis

Code Metrics Analysis
Metrics: measurements of program source code
Complexity: statistical measurements, e.g. CC, IFIN
We propose our own metrics in addition to published metrics
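As one example of a published complexity metric, the sketch below approximates cyclomatic complexity (assuming that is what CC abbreviates here) by counting decision points and adding one. It works on raw text after stripping comments and string literals, so it is only an approximation. The function names and the list of decision constructs are choices made for this example, not the metric definitions used in this work.

```javascript
// Remove comments and string literals so keywords inside them are not counted.
function stripCommentsAndStrings(source) {
  return source.replace(
    /\/\/[^\n]*|\/\*[\s\S]*?\*\/|"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/g,
    " "
  );
}

// Approximate cyclomatic complexity: 1 + number of decision points.
// Counted constructs (assumed list): if, for, while, case, catch, &&, ||, ?
function approxCyclomaticComplexity(source) {
  const code = stripCommentsAndStrings(source);
  const decisions =
    code.match(/\b(?:if|for|while|case|catch)\b|&&|\|\||\?/g) || [];
  return 1 + decisions.length;
}

// A small scrolling-banner function used as sample input.
const banner = `
function banner(index) {
  window.status = msg.substring(0, index);
  index++;
  if (index >= msg.length) index = 0;
  window.setTimeout("banner(" + index + ")", 100);
}`;
console.log(approxCyclomaticComplexity(banner)); // prints 2 (one "if" plus one)
```

A parser-based implementation would count the same constructs on the syntax tree and avoid false matches, for example inside regexp literals.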

Evaluation on Metrics Analysis

Object Communication Analysis
var msg = "Welcome to this page";
banner(0);
function banner(index) {
  var newWin = window.open();
  frm.txt.value = "ok";
  window.status = msg.substring(0, index);
  index++;  // advance to the next character
  if (index >= msg.length) index = 0;
  window.setTimeout("banner(" + index + ")", 100);
}
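One way to read object communication features off a script like this is to list which browser objects it touches and how: by calling a method, writing a property, or reading one. The sketch below does that with a regular expression over dotted access chains. The list of root objects, the feature naming (for example `window.open:call`) and the function name `objectCommunicationFeatures` are assumptions for illustration and may differ from the features actually extracted in this work.

```javascript
// Host object roots to track (assumed list; a form name such as "frm" can be added).
const DEFAULT_ROOTS = ["window", "document", "navigator", "location", "history"];

// Return sorted, de-duplicated features such as "window.open:call".
function objectCommunicationFeatures(source, roots = DEFAULT_ROOTS) {
  // A dotted chain, optionally followed by "(" (call) or a single "=" (write).
  const chain = /\b([A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)+)\s*(\(|=(?!=))?/g;
  const features = new Set();
  let m;
  while ((m = chain.exec(source)) !== null) {
    const [, path, next] = m;
    if (!roots.includes(path.split(".")[0])) continue; // not a tracked object
    const kind = next === "(" ? "call" : next === "=" ? "write" : "read";
    features.add(path + ":" + kind);
  }
  return [...features].sort();
}

const sample = `
var newWin = window.open();
frm.txt.value = "ok";
window.status = msg.substring(0, index);
window.setTimeout("banner(" + index + ")", 100);
`;
console.log(objectCommunicationFeatures(sample, [...DEFAULT_ROOTS, "frm"]));
```

For this sample the script opens a window, writes to a form field and the status bar, and schedules a timer, while the access to `msg.substring` is ignored because `msg` is an ordinary variable rather than a tracked object.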

Contextual Analysis
Extracting information from the context of the enclosing web page
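A minimal sketch of what gathering context might look like: given the HTML of the enclosing page, collect the page title and the event-handler attributes that invoke script code. The slide does not state which contextual signals are used, so the choice of the title and `on*` attributes, and the function name `pageContext`, are assumptions for illustration. Regular expressions stand in for a real HTML parser to keep the example self-contained.

```javascript
// Extract a few contextual signals from the HTML that encloses a script.
function pageContext(html) {
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = titleMatch ? titleMatch[1].trim() : "";

  // Event-handler attributes such as onload="..." or onClick='...'.
  const handlers = [];
  const handlerPattern = /\s(on[a-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  let m;
  while ((m = handlerPattern.exec(html)) !== null) {
    handlers.push({
      event: m[1].toLowerCase(),
      code: m[2] !== undefined ? m[2] : m[3],
    });
  }
  return { title, handlers };
}

// Hypothetical page used only to exercise the function.
const page = `<html><head><title> Welcome Banner </title></head>
<body onload="banner(0)"><a href="#" onClick='window.open()'>open</a></body></html>`;
console.log(pageContext(page));
```

Signals like these describe how and when the page triggers a script, which the script text alone does not reveal.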

Evaluation on Object Communication and Contextual Analysis

Outline
Introduction
Categorization Scheme
Feature Extraction Techniques
Related Work
Our Work
Overall Evaluation
Conclusion

Evaluation on All Components

Outline
Categorization Scheme
Feature Extraction Techniques
Conclusion

Contributions
Showed that program analysis can enhance source code categorization performance
Both context-free and context-sensitive analysis
Case study of JavaScript categorization
New, functionality-based categorization
Tool for feature extraction from JavaScript

Conclusions
Limitations:
  Annotator Agreement
  Dynamic Analysis Incompleteness
  Choice of Classifier
Future Work:
  Source code classification of other languages
  Firefox extension / IE plug-in
Dataset and system prototype available at:
  http://wing.comp.nus.edu.sg/~luwei/SMART

Questions?
First author's undergraduate honours year project thesis:
  http://wing.comp.nus.edu.sg/publications/theses/weiLuThesis.pdf
Contacts:
  Wei LU: luwei@nus.edu.sg
  Min-Yen KAN: kanmy@comp.nus.edu.sg