Hi, I am Wei Lu from the National University of Singapore.
I am sure everybody here knows text categorization
quite well.
Today, I am going to present a study on source code categorization.
We will see why source code categorization is important, and how a source code categorization task can differ from a conventional text categorization task.
Specifically,
in this paper, we are going to present a case study on JavaScript categorization.
You probably want to know why we chose to categorize JavaScript rather than some other language. There are a few reasons for that.
As we all know, JavaScript is a language that is frequently used in Web pages. The current generation of Web pages relies heavily on JavaScript to achieve certain functionality: form processing, pop-up advertisements, page generation, page redirection, and so on. This information is sometimes important to end users of the Web, so JavaScript often conveys crucial information. However, it is ignored by most crawlers and indexers. For example, when you issue a query to Google, it returns a list of Web pages with brief summaries, but it fails to summarize the JavaScript information on those pages.
So if JavaScript information were also summarized by the search engine, it would help users predict the usefulness of a page. People could also build applications to block unwanted JavaScript on a page.
So the question now becomes: can we have this information summarized automatically?
One way of doing so is to categorize JavaScript code into a set of predefined categories.
But what would be the categories?
We defined 32 functionality-based categories, based on instances selected from the WT10G corpus, the standard corpus used in the TREC (Text REtrieval Conference) evaluations. All instances were carefully annotated by hand.
This distribution graph shows the number of instances in each category. For example, 264 instances were annotated as the “Dynamic Banner” class, 62 as “Web Application”, and 5 as “Calendar”.
Next I
am going to talk about various feature extraction techniques and their evaluations.
We examine the related work on source code categorization first, followed by our work.
Our work can be divided into two parts: context-free analysis, which includes lexical analysis, syntax analysis, and metrics analysis, and context-sensitive analysis, which examines JavaScript-specific contextual features.
There is previous work on source code categorization, published in 2002, which might be the seminal work on the subject.
Their work consists of two tasks.
In language classification, given a piece of source code, they try to identify the programming language. They used keywords and bi-grams as
features. This task is relatively easy
and they achieved quite high performance. However, it is not relevant to our task.
In the topic classification task, they try to identify the topic of the code. In their work, they relied largely on external resources such as README files and code headers; they did not analyze the source code itself much.
In this work, we are going to investigate whether good features can be extracted from source code itself for classification.
Firstly, I would like to introduce the baseline system and its performance on the task.
The vector-space model is often used in conventional text categorization tasks. The basic idea is to tokenize the text into a bag of words, which then serves as a feature vector representing the original text. In the baseline system, we simply treat JavaScript code as plain text and tokenize it using blanks and punctuation symbols as delimiters. The tokens are then passed to the Weka SMO classifier for classification.
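To make the baseline concrete, here is a minimal sketch of this kind of tokenization (the function is illustrative, not our actual implementation):

// Baseline: split on anything that is not a letter, digit, or underscore.
function baselineTokenize(source) {
  return source.split(/\W+/).filter(t => t.length > 0);
}

baselineTokenize('var x = 1; s = "x"; // x');
// => ["var", "x", "1", "s", "x", "x"]
// Note that all three occurrences of x collapse into the same token.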
As you can see, the baseline is relatively high compared to many other classification tasks, but there is still room for improvement. Given such a high baseline, we also measured the error reduction rate in our evaluations to assess the effectiveness of our techniques.
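To illustrate the error reduction measure (assuming the usual definition, (baseline error − system error) / baseline error, with illustrative numbers): if the baseline misclassifies 12.6% of the instances and a system misclassifies 10.5%, the error reduction is (12.6 − 10.5) / 12.6 ≈ 17%.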
In the baseline system, we used blanks and punctuation symbols as delimiters to tokenize JavaScript. But is this tokenization scheme reasonable?
As a first attempt at improving over the baseline, we used a compiler-based approach to tokenize the source code, which we believe is more reasonable than the baseline's approach.
For example, we have three statements here, and the letter x appears in all of them. In the baseline they are treated as the same token, but they actually convey different semantic meanings: in the first statement, x is a variable; in the second, x is part of a string; and in the last, x is part of a comment.
We therefore introduce a tagset for JavaScript tokenization and tagging. Each piece of code is tokenized and tagged in this way before being passed to the classifier. In addition, we employ token normalization during the tagging process. For example, 1.23e4 stands for a number, so we normalize it to 12300; similarly, bannermsg stands for “banner message”, so we split and expand such tokens.
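As an illustration of the normalization step, here is a minimal sketch (the helper names and tag labels are illustrative, not our actual tagger):

// Expand scientific notation: "1.23e4" -> "12300".
function normalizeNumber(literal) {
  return String(Number(literal));
}

// Split a compound identifier into words. A real splitter would consult
// a dictionary to handle forms like "bannermsg"; this sketch only
// handles the camelCase convention, e.g. "bannerMsg" -> ["banner", "msg"].
function splitIdentifier(ident) {
  return ident.split(/(?=[A-Z])/).map(w => w.toLowerCase());
}

// After tagging, the same surface form yields distinct tokens:
//   var x = 1;   ->  x/VARIABLE
//   s = "x";     ->  x/STRING
//   // x         ->  x/COMMENT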
We have
conducted an evaluation on lexical analysis. We measured the classification accuracy and error reduction rate.
From the result we can see that the lexical analysis discussed above removes 17% of the errors. From this we can conclude that programming language features DO positively impact program categorization.
Next, we investigate whether syntax features can enhance categorization performance.
We parse each JavaScript source file into a tree structure; syntax features can then be extracted from the tree.
In this particular example, a JavaScript function is parsed into a tree structure as shown below, and syntax (structure) features can be extracted from the parse tree.
For example, one syntax feature can be extracted as a level-2 sub-tree, like this.
Another syntax feature can be extracted as a level-3 sub-tree, like this.
Such syntax features are then serialized as text tokens and passed to the classifier.
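Here is a minimal sketch of how a sub-tree can be serialized into a text token (the node labels and depth cut-off are illustrative, not our actual scheme):

// Serialize the sub-tree rooted at `node`, truncated at `depth` levels,
// into a single string token such as "FUNC(PARAMS,BLOCK)".
function subtreeFeature(node, depth) {
  if (depth === 0 || node.children.length === 0) return node.label;
  const kids = node.children.map(c => subtreeFeature(c, depth - 1));
  return node.label + "(" + kids.join(",") + ")";
}

// A tiny hand-built parse tree for `function f(a) { return a; }`:
const tree = {
  label: "FUNC",
  children: [
    { label: "PARAMS", children: [{ label: "ID", children: [] }] },
    { label: "BLOCK", children: [{ label: "RETURN", children: [] }] },
  ],
};

subtreeFeature(tree, 1); // => "FUNC(PARAMS,BLOCK)"            (level-2 feature)
subtreeFeature(tree, 2); // => "FUNC(PARAMS(ID),BLOCK(RETURN))" (level-3 feature)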
We have
conducted an experiment to verify the effectiveness of using syntax features.
As you can see, the results show that the more complex the syntax features are, the better they perform on the categorization task.
From the experiment we can conclude that syntax features alone can perform the categorization task to a certain level. However, they cannot outperform the simple bag-of-words baseline.
Software metrics are frequently used in software maintenance tasks. Metrics are measurements of program source code.
In this
paper we examined two types of metrics.
Complexity metrics are statistical measurements of certain code properties. There are many well-known classic complexity metrics, such as CC and IFIN.
For example, CC is a complexity metric that measures the total cyclomatic complexity of the code: the total number of conditional statements plus 1. Therefore, in the example on the left, this metric is 3.
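A minimal sketch of the CC metric as just described (counting by regular expression is only for illustration; a real implementation would walk the parse tree):

// CC = number of conditional statements + 1.
function cyclomaticComplexity(source) {
  const conditionals = source.match(/\b(if|while|for|case)\b/g) || [];
  return conditionals.length + 1;
}

// Two conditional statements, so CC = 3:
cyclomaticComplexity('if (a) { f(); } else if (b) { g(); }'); // => 3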
Reuse metrics are measurements of similarity among pieces of code. The similarities are typically measured using an edit distance approach. We examined three approaches here: string-based, tree-based, and token-based edit distance.
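For the string-based approach, here is a minimal sketch of the standard Levenshtein edit distance (the textbook dynamic program, not our actual tool):

// Minimum number of single-character insertions, deletions, and
// substitutions needed to turn string a into string b.
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 },
                       () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const subst = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,          // deletion
                         d[i][j - 1] + 1,          // insertion
                         d[i - 1][j - 1] + subst); // substitution
    }
  }
  return d[a.length][b.length];
}

editDistance("banner", "banners"); // => 1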
From the evaluation results we can observe that standard complexity metrics alone cannot perform the categorization task well; certain language-specific metrics are worth investigating instead.
Another finding here is that edit distance alone is not sufficient to build a good categorizer, as it cannot outperform the baseline.
So far we have examined various features of source code. However, we have treated source code as no more than simple static text.
JavaScript is different from other texts in
several ways.
Firstly, as source code, it can be executed, so we may extract features from its execution. Secondly, JavaScript appears on Web pages, so it has context, and we may find features in that context.
There
are typically two types of approaches in a program analysis task – static analysis and dynamic analysis.
Static analysis assumes every path is executed, while dynamic analysis performs an actual execution.
We have
examined both approaches.
With the help of these analyses, we can extract useful object communication features. For example, in the following code, window.open() creates a new window, so static analysis extracts the feature NEW::WINDOW. Similarly, frm.txt refers to an INPUT object, and here it sets the value of an input field, so we extract the feature SET::INPUT.value. Many other features can be extracted from static analysis in the same way.
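A minimal sketch of this kind of extraction (the pattern rules are illustrative; the actual analysis works on parsed code, not regular expressions):

// Map simple code patterns to object communication features.
function staticFeatures(source) {
  const features = [];
  // window.open(...) creates a new window.
  if (/window\.open\s*\(/.test(source)) features.push("NEW::WINDOW");
  // An assignment to a form field's .value sets an INPUT object's value.
  if (/\.\w+\.value\s*=/.test(source)) features.push("SET::INPUT.value");
  return features;
}

staticFeatures('window.open("ad.html"); frm.txt.value = "hello";');
// => ["NEW::WINDOW", "SET::INPUT.value"]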
As for dynamic analysis, we focus on finding more advanced features that are only revealed at runtime. This example code is designed to perform a banner task: the message appearing in the status bar changes over time, behaving like a banner. The feature CHANGES::WINDOW.status can therefore be extracted with dynamic analysis.
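A minimal sketch of the kind of banner code described here (the messages are illustrative): dynamic analysis observes window.status changing between timer runs and emits CHANGES::WINDOW.status.

var msgs = ["Welcome!", "Check out our specials!"];
var i = 0;
function rotateBanner() {
  window.status = msgs[i];        // write the message to the status bar
  i = (i + 1) % msgs.length;      // advance to the next message
  setTimeout(rotateBanner, 1000); // run again one second later
}
rotateBanner();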
As JavaScript appears on Web pages, we also investigated whether helpful features can be extracted from the Web page enclosing the JavaScript.
For example, suppose we have a Web page titled “A JavaScript Tic Tac Toe Game”. The title implicitly indicates that the JavaScript on the page is used in a game. Likewise, a button with the value “Reset” is likely used to restore the default values in a form.
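A minimal sketch of extracting such contextual features from the enclosing page (the feature names are illustrative, not our actual scheme):

// Collect contextual features from the document around the script.
function contextFeatures(doc) {
  const features = [];
  // Words from the page title, e.g. "A JavaScript Tic Tac Toe Game".
  for (const w of doc.title.toLowerCase().split(/\W+/).filter(Boolean)) {
    features.push("TITLE::" + w);
  }
  // Values of buttons on the page, e.g. "Reset".
  for (const b of doc.querySelectorAll('input[type=button], input[type=reset]')) {
    features.push("BUTTON::" + b.value.toLowerCase());
  }
  return features;
}

// On the Tic Tac Toe page this might yield:
// ["TITLE::a", "TITLE::javascript", "TITLE::tic", ..., "BUTTON::reset"]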
We have done a combined evaluation of object communication analysis and contextual analysis, as both are considered context-sensitive features. The results show that dynamic analysis performs worse than static analysis; this is because dynamic analysis is incomplete compared to static analysis. Secondly, static and dynamic features perform poorly on their own, but when they are combined, the performance improves. Lastly, the performance improves further when they are combined with contextual features.
Finally, we performed an evaluation of all the components.
Compared to the baseline, lexical analysis alone removes 17% of the errors, which indicates that a careful study of language features helps categorization. As we found earlier, metrics and context-sensitive features alone do not give good performance, but when they are coupled with lexical features the overall performance increases. Using the final feature set, we can perform the categorization task with an accuracy of 93.95%, a 52% error reduction.
Our experiments also showed that syntax features are noisy when coupled with other features, so we excluded this feature set from subsequent evaluations.
As we can see, we have many weak classifiers that do not perform well alone, but when they are combined, a good classifier is built. This can be considered an ensemble method.
In this paper, we have shown evidence that program analysis can enhance source code categorization performance. We proposed various feature sets, which can generally be grouped into two kinds: context-free and context-sensitive features. In particular, we examined JavaScript categorization, a new, functionality-based categorization task that has not been attempted before, and we developed tools for extracting features from JavaScript.
There are some limitations to our work. Firstly, annotation agreement: several instances are ambiguous in their functionality and are therefore hard to assign to an appropriate category.
Secondly, dynamic analysis is incomplete, because only one path is selected during execution, so some important runtime features may not be extracted in a single run. We plan to look into this issue in the future.
Thirdly, choice of classifier: we mostly used the SMO classifier provided by the Weka machine learning toolkit, and we believe it may not be the best choice for some of our feature sets.
In the future, we plan to look at source code classification for other languages, and we hope to make our system a plug-in for Web browsers.
If you are interested, you can find our experimental dataset as well as the system prototype online.
Any
questions?