Supervised Categorization of JavaScript™ using Program Analysis Features

Wei Lu (luwei@nus.edu.sg), Computer Science Program, Singapore-MIT Alliance

Min-Yen Kan (kanmy@comp.nus.edu.sg), School of Computing, National University of Singapore

Why categorize JavaScript?

Most crawlers and indexers ignore crucial information conveyed by external programs like JavaScript™

Can these programs be summarized automatically?

Outline

Introduction

Categorization Scheme

Feature Extraction Techniques

	Related Work

	Our Work

		Lexical Analysis

		Syntax Analysis

		Metrics Analysis

		Object Communication Analysis

		Contextual Analysis

Conclusion

Related Work

Source code categorization: work by Ugurel et al., published in 2002

Seminal work on the subject

Consists of two tasks

Programming language classification: identify the programming language from keywords, bi-grams, etc. (not the interesting part)

Topic classification: identify the topic of the code. It relies heavily on external resources (e.g. README files, code headers) and barely analyzes the source code itself

Can we extract features from source code itself?

Text Categorization Baseline

The Vector Space Model (VSM) is commonly used in text categorization

Bag-of-words approach
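
As a rough illustration of the bag-of-words representation, the sketch below turns a script into a term-frequency map, where each distinct term would become one dimension of the vector space. The function name bagOfWords and the splitting rule are assumptions made for illustration; the slides do not specify how the baseline tokenizes its input.

```javascript
// Hypothetical helper: builds a term-frequency map (one VSM dimension per term).
function bagOfWords(source) {
    // A prototype-free object, so terms like "constructor" are counted safely.
    var counts = Object.create(null);
    // Assumed tokenization: split on anything that is not a letter, digit, "_" or "$".
    var terms = source.split(/[^A-Za-z0-9_$]+/);
    for (var i = 0; i < terms.length; i++) {
        var term = terms[i];
        if (term === "") continue; // skip empty pieces at the string boundaries
        counts[term] = (counts[term] || 0) + 1;
    }
    return counts;
}

var vector = bagOfWords("window.status = msg.substring(0, index); index++;");
console.log(vector);
```

Note that this representation discards token order and treats keywords, identifiers, and literals alike.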

Our baseline, using the text categorization bag-of-words approach: 87.47% accuracy

A high baseline, but it still leaves an error of about 12.5% to reduce

It is also important to measure the error reduction rate
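
To make the error reduction rate concrete, here is a minimal sketch assuming the usual definition: the fraction of the baseline's errors that an improved system removes. The 90.0% accuracy passed in below is a made-up input used only to exercise the function; it is not a result from this work.

```javascript
// Error reduction rate = (baseline error - new error) / baseline error.
// Both arguments are accuracies expressed in percent.
function errorReductionRate(baselineAccuracy, newAccuracy) {
    var baselineError = 100 - baselineAccuracy;
    var newError = 100 - newAccuracy;
    return (baselineError - newError) / baselineError;
}

// 87.47 is the baseline from the slide; 90.0 is a hypothetical improved accuracy.
var rate = errorReductionRate(87.47, 90.0);
console.log((rate * 100).toFixed(1) + "% of the baseline errors removed");
```

With a baseline this high, a gain of a few accuracy points corresponds to a much larger share of the remaining errors.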

Lexical Analysis

How can JavaScript be tokenized more sensibly?

First attempt at improving over the baseline: we believe a compiler-based approach is more reasonable

Similar to the POS tags used in Natural Language Processing, we introduce a tagset for JavaScript tokenization and tagging

The tagging process includes token normalization

Tags used:

KEY: keyword; VAR: variable; SYM: symbol; NUM: number; STR: string; CMT: comment; REG: regexp
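
The tagset above can be sketched as a small pattern-based tagger. This is an illustration under stated assumptions, not the authors' tool: the keyword list is a small subset, the normalization rule (replacing numbers, strings, and comments by their tag name) is a guess at what "token normalization" might mean, and the REG tag is left out because telling a regexp literal from a division sign needs parser context.

```javascript
// Assumed keyword subset; a real tagger would use the full reserved-word list.
var KEYWORDS = ["var", "function", "if", "else", "for", "while", "return", "new"];

// Patterns are tried in order at the current position (sticky flag "y").
var RULES = [
    { tag: "CMT", re: /\/\/[^\n]*|\/\*[\s\S]*?\*\//y },
    { tag: "STR", re: /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/y },
    { tag: "NUM", re: /\d+(?:\.\d+)?/y },
    { tag: "VAR", re: /[A-Za-z_$][A-Za-z0-9_$]*/y },
    { tag: "SYM", re: /[^\sA-Za-z0-9_$]/y }
];

function tagTokens(source) {
    var tokens = [];
    var pos = 0;
    while (pos < source.length) {
        if (/\s/.test(source[pos])) { pos++; continue; } // skip whitespace
        for (var i = 0; i < RULES.length; i++) {
            var re = RULES[i].re;
            re.lastIndex = pos;
            var m = re.exec(source);
            if (!m) continue;
            var tag = RULES[i].tag;
            if (tag === "VAR" && KEYWORDS.indexOf(m[0]) !== -1) tag = "KEY";
            // Assumed normalization: collapse literal values to their tag name.
            var text = (tag === "NUM" || tag === "STR" || tag === "CMT") ? tag : m[0];
            tokens.push(tag + ":" + text);
            pos = re.lastIndex;
            break;
        }
    }
    return tokens;
}

var tagged = tagTokens('var msg = "Welcome"; // greet');
console.log(tagged.join(" "));
```

Normalizing literals this way keeps the feature space small: two scripts that differ only in their string contents produce the same tagged token stream.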

Evaluation on Lexical Analysis

Syntax Analysis

Evaluation on Syntax Analysis

Code Metrics Analysis

Metrics: measurements of program source code

Complexity: statistical measurements, e.g. CC, IFIN

We propose our own metrics in addition to published metrics
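
As an example of a code metric, the sketch below estimates cyclomatic complexity, assuming that is what CC stands for here. It counts branching constructs by pattern matching and adds one. The function name and the list of constructs are assumptions; a faithful implementation would work on the parse tree, since this version also counts matches inside strings and comments.

```javascript
// Rough estimate: 1 + number of branching constructs found by pattern matching.
function approxCyclomaticComplexity(source) {
    var branchPattern = /\b(?:if|for|while|case|catch)\b|&&|\|\||\?/g;
    var matches = source.match(branchPattern);
    return 1 + (matches ? matches.length : 0);
}

var sample = "function banner(index) { if (index >= msg.length) index = 0; }";
console.log(approxCyclomaticComplexity(sample)); // prints 2 (one "if" plus one)
```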

Evaluation on Metrics Analysis

Object Communication Analysis

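var msg = "Welcome to this page";
banner(0);

// Scrolls msg into the browser status bar, one more character every 100 ms.
function banner(index) {
    var newWin = window.open();               // opens a new browser window
    frm.txt.value = "ok";                     // frm is assumed to be a form on the enclosing page
    window.status = msg.substring(0, index);  // shows the first index characters
    index++;                                  // advance to the next character
    if (index >= msg.length) index = 0;       // wrap around at the end of the message
    window.setTimeout("banner(" + index + ")", 100);
}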
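
One way to picture object communication features for a script like the one above is to count which objects it talks to. The sketch below counts dotted member-access chains by their root object. The function name and the choice of feature are assumptions for illustration; the slides do not define the actual features, and this pattern-based version would also count dotted names that appear inside string literals.

```javascript
// Hypothetical feature extractor: counts dotted member-access chains by root object.
function memberAccessCounts(source) {
    var counts = Object.create(null);
    var chain = /[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)+/g;
    var m;
    while ((m = chain.exec(source)) !== null) {
        var root = m[0].split(".")[0]; // e.g. "window" for "window.status"
        counts[root] = (counts[root] || 0) + 1;
    }
    return counts;
}

var code = 'var newWin = window.open(); frm.txt.value = "ok"; ' +
           'window.status = msg.substring(0, index);';
var access = memberAccessCounts(code);
console.log(access);
```

Here the script touches window twice and frm and msg once each, which hints that it manipulates the browser window and a form on the page.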

Contextual Analysis

Extracting information from the context of the enclosing web page
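
A minimal sketch of what contextual extraction could look like, assuming the page is available as an HTML string. The two features chosen here (the page title and the names of inline event-handler attributes) and the function name pageContext are hypothetical; the slides do not say which contextual features are actually used.

```javascript
// Hypothetical context features: page title plus inline event-handler attribute names.
function pageContext(html) {
    var titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    var handlers = [];
    var handlerPattern = /\s(on[a-z]+)\s*=/gi;
    var m;
    while ((m = handlerPattern.exec(html)) !== null) {
        handlers.push(m[1].toLowerCase());
    }
    return { title: titleMatch ? titleMatch[1].trim() : "", handlers: handlers };
}

var page = '<html><head><title> Demo page </title></head>' +
           '<body onload="banner(0)"><form name="frm"><input name="txt"></form></body></html>';
var context = pageContext(page);
console.log(context);
```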

Evaluation on Object Communication and Contextual Analysis

Evaluation on All Components

Contributions

Showed that program analysis can improve source code categorization performance

Using both context-free and context-sensitive analysis

Case study of JavaScript categorization

New, functionality-based categorization

Tool for feature extraction from JavaScript

Conclusions

Limitations:

Annotator Agreement

Dynamic Analysis Incompleteness

Choice of Classifier

Future Work

Source code classification of other languages

Firefox extension / IE plug-in

Dataset and system prototype available at:

http://wing.comp.nus.edu.sg/~luwei/SMART

Questions?

First author’s undergraduate honours year project thesis:

http://wing.comp.nus.edu.sg/publications/theses/weiLuThesis.pdf

Contacts:

Wei LU: luwei@nus.edu.sg

Min-Yen KAN: kanmy@comp.nus.edu.sg


