Hi, I am Wei Lu from the National University of Singapore.
I am sure everybody here knows text categorization quite well. Today, I am going to present a study on source code categorization. We will see why source code categorization is important, and how a source code categorization task differs from a conventional text categorization task.
Specifically, in this paper, we present a case study on JavaScript categorization. You probably want to know why we chose to categorize JavaScript rather than another language.
There are a few reasons for that.
As we all know, JavaScript is a language frequently used in Web pages. The current generation of Web pages relies heavily on JavaScript for certain functionality: form processing, pop-up advertisements, page generation, page redirection, and so on. This information is sometimes important to end-users of the Web, so JavaScript often conveys crucial information. However, it is usually ignored by crawlers and indexers. For example, when you issue a query to Google, it returns a list of Web pages with brief summaries, but it fails to summarize the JavaScript information on each page. If JavaScript information could also be summarized by the search engine, it would help users predict the usefulness of a page. People could also build applications to block unwanted JavaScript on a page. So the question becomes: can this information be summarized automatically? One way of doing so is to categorize JavaScript code into a set of pre-defined categories.
But what would be the categories?
We have summarized 32 functionality-based categories, based on instances selected from the WT10G corpus, the standard corpus used in the TREC (Text REtrieval Conference) evaluations. All the instances were carefully annotated manually. This distribution graph shows the number of instances in each category. For example, there are 264 instances annotated as “Dynamic Banner”, 62 as “Web Application”, and 5 as “Calendar”.
Next, I am going to talk about various feature extraction techniques and their evaluations. We examine related work on source code categorization first, followed by our own work, which can be divided into two parts: context-free analysis, comprising lexical analysis, syntax analysis, and metrics analysis; and context-sensitive analysis, which examines JavaScript-specific contextual features.
There is previous work on source code categorization published in 2002, which is perhaps the seminal work on the subject.
Their work consists of two tasks.
In language classification, given some source code, they try to identify the programming language. They used keywords and bi-grams as features. This task is relatively easy, and they achieved quite high performance; however, it is not relevant to our task. In the topic classification task, they try to identify the topic of the code. Their work largely relied on external resources such as README files and code headers; they did not analyze the source code itself much. In this work, we investigate whether good features can be extracted from the source code itself for classification.
Firstly, I would like to introduce the baseline system and its performance on the task. The vector-space model is often used in conventional text categorization tasks. The basic idea is to tokenize text data into a bag of words, which then serves as a feature vector representing the original text. In the baseline system, we simply treat the JavaScript code as plain text and tokenize it using blanks and punctuation symbols as delimiters. We then pass the tokens to the Weka SMO classifier for classification. As you can see, the baseline is relatively high compared to many other classification tasks, but there is still room for improvement. Given such a high baseline, we also measured the error reduction rate in our evaluations to assess the effectiveness of our techniques.
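As a rough illustration, the baseline's bag-of-words step might look like the following Python sketch. Our actual system uses Weka; the function name and the delimiter pattern here are just assumptions for illustration.

```python
import re
from collections import Counter

def bag_of_words(source):
    """Tokenize code as plain text, splitting on blanks and punctuation."""
    tokens = re.split(r"[\s\W]+", source)
    return Counter(t for t in tokens if t)

# Each token count becomes one dimension of the feature vector
# passed to the classifier (Weka SMO in our experiments).
vec = bag_of_words('var msg = "hello"; alert(msg);')
```

Note that this treats every occurrence of a character sequence identically, regardless of where in the code it appears, which is exactly the limitation the next step addresses.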
In the baseline system, we used blanks and punctuation symbols as delimiters to tokenize JavaScript, but is this tokenization scheme reasonable? As a first attempt at improving over the baseline, we used a compiler-based approach to tokenize the source code, which we believe is more reasonable than the baseline's approach. For example, consider the three statements here: the letter x appears in all of them. In the baseline they are treated as the same token, but they actually convey different semantic meanings. In the first statement, x is a variable; in the second, x is part of a string; and in the last, x is part of a comment. We therefore introduced a tagset for JavaScript tokenization and tagging. Each piece of code is tokenized in this way and tagged before being passed to the classifier. In addition, we employed token normalization during the tagging process. For example, 1.23e4 stands for a number, so we normalize it to 12300; similarly, bannermsg stands for “banner message”, so we split and expand such tokens accordingly.
We conducted an evaluation of the lexical analysis, measuring classification accuracy and error reduction rate. From the results we can see that the lexical analysis discussed above removes 17% of the errors. From this we can conclude that programming language features DO positively impact program categorization.
Next, we investigate whether syntax features can enhance categorization performance. We parse each JavaScript source file into a tree structure, from which syntax features can then be extracted. In this particular example, a JavaScript function is parsed into the tree structure shown below, and syntax (structure) features are extracted from the parse tree. For example, one syntax feature can be extracted as a level-2 sub-tree like this; another can be extracted as a level-3 sub-tree like this. Such syntax features are then serialized as text tokens and passed to the classifier.
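To illustrate, here is a small Python sketch of depth-limited sub-tree serialization. The nested-tuple tree representation and the node labels are made up for illustration and are not our actual JavaScript grammar.

```python
def serialize(tree, depth):
    """Serialize a subtree down to the given depth as one text token."""
    label, children = tree
    if depth <= 1 or not children:
        return label
    inner = ",".join(serialize(c, depth - 1) for c in children)
    return f"{label}({inner})"

def subtree_features(tree, max_depth):
    """Emit one feature per node per depth level (1..max_depth)."""
    feats = []
    stack = [tree]
    while stack:
        node = stack.pop()
        for d in range(1, max_depth + 1):
            feats.append(serialize(node, d))
        stack.extend(node[1])
    return feats

# Toy parse tree for: function f() { return 1; }
tree = ("FunctionDecl", [("Name", []),
                         ("Body", [("Return", [("Number", [])])])])
```

A level-2 feature at the root would serialize to `FunctionDecl(Name,Body)`; deeper levels capture more structure, at the cost of sparser features.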
We conducted an experiment to verify the effectiveness of syntax features. As you can see, the results show that the more complex the syntax features are, the better they perform on the categorization task. From the experiment we can conclude that syntax features alone can perform the categorization task to a certain degree; however, they cannot outperform the simple bag-of-words baseline.
Software metrics are frequently used in software maintenance tasks. Metrics are measurements of program source codes.
In this paper we examined two types of metrics.
Complexity metrics are statistical measurements of certain code properties. There are many well-known classic complexity metrics, such as CC and IFIN.
For example, CC is a complexity metric which measures the total cyclomatic complexity of the code: the total number of conditional statements plus one. Therefore, in the example on the left, this metric is 3.
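As a sketch, CC can be approximated by counting branching keywords. This is a crude token-level approximation in Python, not our actual metric extractor:

```python
import re

def cyclomatic_complexity(source):
    """CC = number of conditional/branch points + 1.
    Approximated here by counting branching keywords."""
    branches = re.findall(r'\b(?:if|while|for|case)\b', source)
    return len(branches) + 1

code = """
function check(x) {
  if (x > 0) { return 1; }
  if (x < 0) { return -1; }
  return 0;
}
"""
```

With two `if` statements, this function has CC = 2 + 1 = 3, matching the example on the slide.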
Reuse metrics are measurements of similarity among pieces of code. Similarity is typically measured using an edit distance approach. We examined three approaches here: string-based, tree-based, and token-based edit distance.
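For illustration, the token-based variant can be sketched as a standard Levenshtein distance over token sequences. This is a minimal Python sketch, not our actual implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance over two token sequences
    (unit cost for insert, delete, and substitute)."""
    dp = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ta in enumerate(a, 1):
        prev = dp[0]                      # dp[i-1][j-1] from the previous row
        dp[0] = i
        for j, tb in enumerate(b, 1):
            cur = min(dp[j] + 1,          # delete ta
                      dp[j - 1] + 1,      # insert tb
                      prev + (ta != tb))  # substitute (free if tokens match)
            prev, dp[j] = dp[j], cur
    return dp[-1]
```

Two scripts that differ only in a variable name are one substitution apart under this measure, which is what makes it useful for detecting reused code.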
From the evaluation results we can observe that standard complexity metrics alone do not work very well in the categorization task; certain language-specific metrics are worth investigating. Another finding is that edit distance alone is not sufficient to build a good categorizer, as it cannot outperform the baseline.
So far we have examined various features from source code. However, we have treated source code as no more than simple static text.
JavaScript is different from other texts in several ways.
Firstly, as source code, it can be executed, so we may extract features from its execution. Secondly, JavaScript appears on Web pages, so it has context, and we may find features in that context.
There are typically two types of approaches in a program analysis task – static analysis and dynamic analysis.
Static analysis assumes every execution path is taken, while dynamic analysis performs an actual execution.
We have examined both approaches.
With the help of these analyses, we can extract useful object communication features. For example, in the following code, window.open() refers to the creation of a new window, so we can extract the feature “NEW::WINDOW” with static analysis. frm.txt actually refers to an “INPUT” object, and here its value is being set, so we can extract the feature “SET::INPUT.value”. Similarly, we can extract many other features with static analysis.
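A simplified sketch of such static feature extraction: our real analysis works on the parse tree and resolves frm.txt to an INPUT object through page context, but simple pattern rules convey the idea (the rules below are illustrative assumptions, not our rule set).

```python
import re

# Hypothetical pattern -> feature rules for illustration only;
# the actual system derives these from the parse tree.
RULES = [
    (re.compile(r'window\.open\s*\('), "NEW::WINDOW"),
    (re.compile(r'\.value\s*='),       "SET::INPUT.value"),
]

def static_features(source):
    """Emit one object-communication feature per matching rule."""
    return [feat for pat, feat in RULES if pat.search(source)]

code = 'window.open("ad.html"); frm.txt.value = "hi";'
```

Running `static_features(code)` yields both “NEW::WINDOW” and “SET::INPUT.value”, the two features discussed above.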
As for dynamic analysis, we focus on finding more advanced features that are only revealed at runtime. This example code is designed to perform a banner task: the message in the status bar changes over time, behaving like a banner. The feature “CHANGES::WINDOW.status” can therefore be extracted with dynamic analysis.
Since JavaScript appears on Web pages, we also investigated whether helpful features can be extracted from the Web page enclosing the JavaScript. For example, suppose we have a Web page titled “A JavaScript Tic Tac Toe Game”. This title implicitly indicates that the JavaScript on the page is employed in a game. Likewise, a button with the value “Reset” is likely used to restore the default values in a form.
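A small Python sketch of extracting such contextual features from the enclosing page. This is a simplified illustration; the specific tags and attributes examined here are assumptions, not our full contextual feature set.

```python
from html.parser import HTMLParser

class ContextExtractor(HTMLParser):
    """Pull simple contextual features (page title, button labels)
    from the Web page that encloses the script."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.features = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "input":
            a = dict(attrs)
            if a.get("type") == "button" and "value" in a:
                self.features.append(("BUTTON", a["value"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.features.append(("TITLE", data.strip()))

page = ('<html><head><title>A JavaScript Tic Tac Toe Game</title></head>'
        '<body><input type="button" value="Reset"></body></html>')
```

Feeding the page above to the extractor yields the title and the “Reset” button label as contextual features for the classifier.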
We performed a combined evaluation of object communication analysis and contextual analysis, as both are considered context-sensitive features. The results show that dynamic analysis performs worse than static analysis, because dynamic analysis is incomplete compared to static analysis. Secondly, static and dynamic features alone perform poorly, but when they are combined, performance improves. Lastly, performance improves further when they are combined with contextual features.
Finally we have performed an evaluation on all components.
Compared to the baseline, using lexical analysis alone reduces 17% of the errors. This indicates that a careful study of language features helps categorization. As we found earlier, using metrics or context-sensitive features alone does not give good performance, but when they are coupled with lexical features, the overall performance increases. Using the final feature set, we can perform the categorization task with an accuracy of 93.95%, a 52% error reduction. Our experiments also showed that syntax features are noisy when coupled with other features, so we excluded this feature set from subsequent evaluations. As you can see, we have many weak classifiers that cannot perform well alone, but when combined, they build a good classifier. This is essentially an ensemble method.
In this paper, we have shown evidence that program analysis can enhance source code categorization performance. We proposed various feature sets, which can generally be grouped into two kinds: context-free and context-sensitive features. In particular, we examined JavaScript categorization, a new, functionality-based categorization task that has not been attempted before, and we developed tools for extracting features from JavaScript.
There are some limitations of our work.
Firstly, annotation agreement: several instances are ambiguous in their functionality and are therefore hard to assign to an appropriate category. Secondly, dynamic analysis is incomplete, because only one path is taken during execution, so some important runtime features may not be extracted in a single run. We plan to look into this issue in the future. Thirdly, the choice of classifier: we mostly used the SMO classifier provided by the Weka machine learning toolkit, which may not be the best choice for some of our feature sets. In the future, we plan to look at source code classification for other languages, and we hope to make our system a plug-in for Web browsers. If you are interested, you can find our experimental dataset as well as the system prototype online.
Any questions?