|
|
|
In the baseline system,
we use whitespace and punctuation symbols as delimiters to tokenize JavaScript.
But is this tokenization scheme reasonable?
|
|
As a first attempt
at improving on the baseline, we used a compiler-based approach to
tokenize the source code, and we believe that this approach is more
reasonable than the baseline's.
|
|
For example, we have
three statements here, and the letter x appears in all of them. In the
baseline, these occurrences are treated as the same token, but they actually
convey different semantic meanings. In the first statement, x is a variable;
in the second statement, x is part of a string; and in the last statement,
x is part of a comment.
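
To make the problem concrete, here is a minimal sketch of a baseline-style tokenizer. The three statements are hypothetical stand-ins for the ones on the slide, and `baseline_tokenize` is an illustrative name, not the actual baseline code. It only assumes what was said above: split on whitespace and punctuation.

```python
import re

# Hypothetical statements (not the ones from the slide); each contains x.
statements = [
    "var x = 1;",    # x is a variable
    'var s = "x";',  # x is part of a string literal
    "// x",          # x is part of a comment
]

def baseline_tokenize(code):
    """Split on runs of whitespace/punctuation and drop the delimiters."""
    return [tok for tok in re.split(r"\W+", code) if tok]

for stmt in statements:
    print(baseline_tokenize(stmt))
# ['var', 'x', '1']
# ['var', 's', 'x']
# ['x']
```

In all three outputs x comes out as the same bare token, so the classifier cannot tell a variable from string content or comment text.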
|
|
We therefore
introduce a tagset for JavaScript tokenization and tagging. Each code sample
is tokenized and tagged in this way before being passed to the classifier.
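
As a rough illustration of tokenizing and tagging, the sketch below uses a small regular-expression lexer. This is a stand-in, not the compiler-based tokenizer itself, and the tag names (`COMMENT`, `STRING`, `NUMBER`, `KEYWORD`, `IDENT`, `PUNCT`) and the keyword list are assumptions, since the actual tagset is not listed here. It also ignores JavaScript regex literals and template strings.

```python
import re

# Hypothetical tagset; order matters, since earlier patterns win.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("STRING",  r'"(?:\\.|[^"\\])*"' + "|" + r"'(?:\\.|[^'\\])*'"),
    ("NUMBER",  r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("IDENT",   r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("SPACE",   r"\s+"),
    ("PUNCT",   r"[^\sA-Za-z0-9_$]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # lets /* ... */ comments span several lines
)

# A small illustrative subset of JavaScript keywords.
KEYWORDS = {"var", "let", "const", "function", "return", "if", "else", "for", "while"}

def tag(code):
    """Return (tag, text) pairs for each token, skipping whitespace."""
    tagged = []
    for match in MASTER.finditer(code):
        kind, text = match.lastgroup, match.group()
        if kind == "SPACE":
            continue
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"
        tagged.append((kind, text))
    return tagged

print(tag("var x = 1;"))
# [('KEYWORD', 'var'), ('IDENT', 'x'), ('PUNCT', '='), ('NUMBER', '1'), ('PUNCT', ';')]
print(tag('var s = "x";'))
# [('KEYWORD', 'var'), ('IDENT', 's'), ('PUNCT', '='), ('STRING', '"x"'), ('PUNCT', ';')]
print(tag("// x"))
# [('COMMENT', '// x')]
```

With tags attached, x as an identifier, x inside a string, and x inside a comment become three distinct features for the classifier.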
|
|
In addition, we also
employed token normalization during the tagging process. For example, 1.23e4
is a number in scientific notation, so we normalize it to 12300. Similarly,
bannermsg stands for "banner message", so we split the token into its parts
and expand the abbreviation.
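
The two normalization steps can be sketched as follows. The word list, the abbreviation table, and the function names are all hypothetical and chosen only to cover the example above; how the actual system splits and expands tokens is not specified here.

```python
from decimal import Decimal

# Hypothetical resources, just large enough for the example.
VOCABULARY = {"banner", "msg", "btn", "img"}
ABBREVIATIONS = {"msg": "message", "btn": "button", "img": "image"}

def normalize_number(text):
    """Rewrite a numeric literal in plain fixed-point notation."""
    return format(Decimal(text), "f")

def split_identifier(name, vocabulary=VOCABULARY):
    """Cover name with vocabulary words, longest piece first, with backtracking.

    Returns [name] unchanged when no full cover exists.
    """
    def cover(start):
        if start == len(name):
            return []
        for end in range(len(name), start, -1):
            piece = name[start:end]
            if piece in vocabulary:
                rest = cover(end)
                if rest is not None:
                    return [piece] + rest
        return None

    return cover(0) or [name]

def normalize_identifier(name):
    """Split an identifier and expand any known abbreviations."""
    return [ABBREVIATIONS.get(piece, piece) for piece in split_identifier(name)]

print(normalize_number("1.23e4"))         # 12300
print(normalize_identifier("bannermsg"))  # ['banner', 'message']
```

Identifiers that cannot be fully covered by the word list, such as a single-letter x, are left as they are.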
|