Supervised Categorization of JavaScript™ using Program Analysis Features
Wei Lu (luwei@nus.edu.sg), Computer Science Program, Singapore-MIT Alliance
Min-Yen Kan (kanmy@comp.nus.edu.sg), School of Computing, National University of Singapore

Why categorize JavaScript?
Most crawlers and indexers ignore crucial information conveyed by external programs such as JavaScript™
Can such programs be summarized automatically?

Categorization Scheme

Outline
Introduction
Categorization Scheme
Feature Extraction Techniques
Related Work
Our Work
Lexical Analysis
Syntax Analysis
Metrics Analysis
Object Communication Analysis
Contextual Analysis
Conclusion

Related Work
Source code categorization: Ugurel et al.'s work, published in 2002
Seminal work on the subject
Consists of two tasks
Programming language classification: identify the programming language from keywords, bi-grams, etc.; not of interest here
Topic classification: identify the topic of the code; relies heavily on external resources (e.g. README files, code headers), with little analysis of the source code itself
Can we extract features from the source code itself?

Text Categorization Baseline
Vector Space Model (VSM) is used in Text Categorization
Bag of words approach
Our baseline, a text categorization bag-of-words approach: 87.47% accuracy
A high baseline, but it still leaves a gap of about 12.5 percentage points
Also important to measure error reduction rate
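
To make the baseline and the error reduction rate concrete, here is a minimal sketch. The helper names `bagOfWords` and `errorReduction` are hypothetical, the splitting rule is an assumption (the slides do not say how the baseline tokenizes), and the formula is the usual relative error reduction, which the slides name but do not define.

```javascript
// Hypothetical helper: split raw script text on non-identifier characters
// and count each resulting "word", ignoring the program's structure.
function bagOfWords(source) {
  const counts = new Map();
  for (const word of source.toLowerCase().split(/[^a-z0-9_$]+/)) {
    if (word === "") continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Relative error reduction between two accuracies given in percent
// (assumed definition: share of the baseline's errors that is removed).
function errorReduction(baselineAcc, newAcc) {
  const baselineErr = 100 - baselineAcc;
  const newErr = 100 - newAcc;
  return (baselineErr - newErr) / baselineErr;
}

const bag = bagOfWords("window.status = msg; window.open();");
console.log(bag.get("window")); // 2

// 90 is a made-up accuracy, used only to illustrate the formula.
console.log(errorReduction(87.47, 90));
```

With a baseline this high, a small gain in accuracy corresponds to a much larger share of the remaining errors, which is why the error reduction rate is worth reporting next to accuracy.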

Lexical Analysis
How can JavaScript be tokenized more sensibly?
First attempt at improving over the baseline: we believe a compiler-based approach is more appropriate
Similar to the POS tags used in Natural Language Processing, we introduce a tagset for JavaScript tokenization and tagging
The tagging process includes token normalization
Tags used:
KEY: keyword; VAR: variable; SYM: symbol; NUM: number; STR: string; CMT: comment; REG: regexp
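
The tagging idea can be sketched as a small tokenizer. This is an illustration under stated assumptions, not the authors' tool: the keyword list is abbreviated, the function name `tag` is hypothetical, and the normalization rule (variables, numbers, strings and comments collapse to their tag, while keywords and symbols keep their text) is a guess at what the normalization step might do. REG is left out because telling a regexp literal from a division sign needs parser context.

```javascript
// Abbreviated keyword list (assumption: a real tagger would use the full set).
const KEYWORDS = new Set(["var", "function", "if", "else", "for", "while", "return", "new"]);

// Ordered [tag, sticky regex] pairs; the first pattern matching at the
// current position wins.
const RULES = [
  ["CMT", /\/\/[^\n]*|\/\*[\s\S]*?\*\//y],
  ["STR", /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/y],
  ["NUM", /\d+(?:\.\d+)?/y],
  ["WORD", /[A-Za-z_$][\w$]*/y],
  ["SYM", /[^\s\w]/y],
];

function tag(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    if (/\s/.test(source[pos])) { pos++; continue; }
    let matched = false;
    for (const [name, re] of RULES) {
      re.lastIndex = pos;
      const m = re.exec(source);
      if (!m) continue;
      const text = m[0];
      if (name === "WORD") {
        // Keywords keep their text; every other identifier becomes VAR.
        tokens.push(KEYWORDS.has(text) ? "KEY:" + text : "VAR");
      } else if (name === "SYM") {
        tokens.push("SYM:" + text);
      } else {
        tokens.push(name); // NUM, STR and CMT are normalized to the bare tag
      }
      pos += text.length;
      matched = true;
      break;
    }
    if (!matched) pos++; // defensive: skip a character no rule recognizes
  }
  return tokens;
}

console.log(tag("var x = 10; // counter").join(" "));
// KEY:var VAR SYM:= NUM SYM:; CMT
```

Normalizing this way means two scripts that differ only in variable names or literal values produce the same token sequence.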

Evaluation on Lexical Analysis

Syntax Analysis

Evaluation on Syntax Analysis

Code Metrics Analysis
Metrics: measurements of program source code
Complexity: statistical measurements, e.g. CC, IFIN
We proposed our own metrics in addition to published metrics
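
As a rough illustration of a complexity metric, the sketch below approximates cyclomatic complexity by counting branch points. It assumes "CC" on the slide stands for cyclomatic complexity; the function name and the counting rule are hypothetical simplifications, and the authors' own metrics are not shown in the slides.

```javascript
// String literals and comments, matched in one pass so that "//" inside a
// string is not mistaken for a comment.
const LITERALS = /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'|\/\*[\s\S]*?\*\/|\/\/[^\n]*/g;

// Tokens treated as decision points (a crude approximation).
const BRANCHES = /\b(?:if|for|while|case|catch)\b|&&|\|\||\?/g;

function approxCyclomaticComplexity(source) {
  // Blank out literals and comments so their contents are not counted.
  const code = source.replace(LITERALS, " ");
  const hits = code.match(BRANCHES) || [];
  return 1 + hits.length;
}

const sample = `
function banner(index) {
  window.status = msg.substring(0, index);
  index++;
  if (index >= msg.length) index = 0;
  window.setTimeout("banner(" + index + ")", 100);
}`;

console.log(approxCyclomaticComplexity(sample)); // 2
```

A regex count like this miscounts some constructs (for example, a `?` that is not a conditional), so a parser-based count would be more reliable.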

Evaluation on Metrics Analysis

Object Communication Analysis
var msg = "Welcome to this page";
banner(0);
function banner (index){
  var newWin = window.open();
  frm.txt.value="ok";
  window.status = msg.substring(0, index);
  index++;
  if (index >= msg.length) index = 0;
  window.setTimeout("banner(" + index + ")", 100);
}
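
One way to picture object communication features for a script like the one above is to count which members of browser host objects it touches. The sketch below is a static, regex-based approximation under stated assumptions: the host object list and the function name `hostObjectUses` are hypothetical, and the slides do not describe how the authors implement this analysis.

```javascript
// Assumed list of browser host objects worth tracking.
const HOST_OBJECTS = ["window", "document", "navigator", "location", "history"];

function hostObjectUses(source) {
  // Matches e.g. "window.open" and captures the object and the member name.
  const pattern = new RegExp(
    "\\b(" + HOST_OBJECTS.join("|") + ")\\.([A-Za-z_$][\\w$]*)", "g");
  const counts = {};
  for (const m of source.matchAll(pattern)) {
    const key = m[1] + "." + m[2];
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

const script = [
  'var newWin = window.open();',
  'frm.txt.value = "ok";',
  'window.status = msg.substring(0, index);',
  'window.setTimeout("banner(" + index + ")", 100);',
].join("\n");

console.log(hostObjectUses(script));
// { 'window.open': 1, 'window.status': 1, 'window.setTimeout': 1 }
```

References that do not start from a listed host object, such as `frm.txt.value`, are missed by this pattern, as is anything reached only at run time.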

Contextual Analysis
Extracting information from the context of the enclosing web page
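
A minimal sketch of what page context might look like, assuming the page title and inline event handlers are among the useful signals. The slides do not list which contextual features are used, so the function name `pageContext`, the choice of features, and the sample page are all hypothetical. Regex matching on HTML is crude and a real HTML parser would be more robust.

```javascript
function pageContext(html) {
  // Page title, if present.
  const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);

  // Inline event handlers such as onload="..." show where the page
  // calls into its scripts (double-quoted attribute values only).
  const handlers = [];
  for (const m of html.matchAll(/\s(on[a-z]+)\s*=\s*"([^"]*)"/gi)) {
    handlers.push({ event: m[1].toLowerCase(), code: m[2] });
  }
  return { title: title ? title[1].trim() : "", handlers };
}

// Made-up page used only for illustration.
const page =
  "<html><head><title> News ticker </title></head>" +
  '<body onload="banner(0)"><form name="frm"></form></body></html>';

console.log(pageContext(page));
// { title: 'News ticker', handlers: [ { event: 'onload', code: 'banner(0)' } ] }
```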

Evaluation on Object Communication and Contextual Analysis

Outline
Introduction
Categorization Scheme
Feature Extraction Techniques
Related Work
Our Work
Overall Evaluation
Conclusion

Evaluation on All Components

Contributions
Shown that program analysis can enhance source code categorization performance
Both context-free and context-sensitive analysis
Case study of JavaScript categorization
New, functionality-based categorization
Tool for feature extraction from JavaScript

Conclusions
Limitations:
Annotator Agreement
Dynamic Analysis Incompleteness
Choice of Classifier
Future Work
Source code classification of other languages
Firefox extension / IE plug-in
Dataset and system prototype available at:
http://wing.comp.nus.edu.sg/~luwei/SMART

Questions?
First author’s undergraduate honours year project thesis:
http://wing.comp.nus.edu.sg/publications/theses/weiLuThesis.pdf
Contacts:
Wei LU: luwei@nus.edu.sg
Min-Yen KAN: kanmy@comp.nus.edu.sg
