In the baseline system, we use whitespace and punctuation symbols as delimiters to tokenize JavaScript, but is this tokenization scheme reasonable?
As a first improvement over the baseline, we adopt a compiler-based approach to tokenizing the source code, which we believe is more principled than the delimiter-based scheme.
For example, consider three statements in which the letter x appears. The baseline treats every occurrence of x as the same token, yet they carry different semantic meanings: in the first statement, x is a variable; in the second, x is part of a string literal; and in the last, x appears only inside a comment, as illustrated below.
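Here is a hypothetical trio of statements of the kind described above; the exact statements shown on the slide are not reproduced, only the pattern:

```ts
var x = 10;                 // x is a variable name (identifier)
var s = "x marks the spot"; // x is merely part of a string literal
// remember to reset x      // x occurs only inside a comment
```

A delimiter-based tokenizer yields the same token x in all three cases, whereas a lexer distinguishes an identifier, a string literal, and a comment.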
We therefore introduce a tagset for JavaScript tokenization and tagging. Every script is tokenized in this way and tagged before being passed to the classifier.
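A minimal sketch of this tokenize-and-tag step is shown below. It uses the esprima lexer as a stand-in for the compiler front end, and the tags are simply esprima's token types, which only approximate the tagset described here:

```ts
import * as esprima from "esprima";

type TaggedToken = { value: string; tag: string };

// Lex the source with a real JavaScript tokenizer and keep each
// token's lexical category as its tag.
function tokenizeAndTag(source: string): TaggedToken[] {
  // comment: true asks the lexer to also emit comments as tokens,
  // so an x inside a comment is tagged differently from a variable x.
  const tokens = esprima.tokenize(source, { comment: true }) as Array<{
    type: string;
    value: string;
  }>;
  return tokens.map((t) => ({ value: t.value, tag: t.type }));
}

// The three occurrences of x now receive three different tags:
// Identifier, String, and LineComment.
console.log(tokenizeAndTag('var x = 1; var s = "x"; // x'));
```

The tagged token sequence, rather than the raw delimiter-split tokens, is what the classifier consumes.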
In addition, we employ token normalization during the tagging process. For example, 1.23e4 denotes a number, so we normalize it to 12300; similarly, bannermsg stands for "banner message", so we split the identifier into its component words and expand the abbreviation.
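The sketch below illustrates these two normalizations under stated assumptions: scientific-notation literals are rewritten into plain digits, and identifiers are split with a greedy longest-prefix match against a wordlist plus an abbreviation table. Both the wordlist and the expansion table are hypothetical placeholders, not the actual resources used:

```ts
// 1.23e4 -> "12300": rewrite a scientific-notation literal as plain digits.
function normalizeNumber(literal: string): string {
  const n = Number(literal);
  return Number.isFinite(n) ? n.toString() : literal;
}

// "bannermsg" -> ["banner", "message"]: greedy longest-prefix split against
// a wordlist, then abbreviation expansion. Both tables are illustrative.
const WORDLIST = ["banner", "msg", "message"];
const EXPANSIONS: Record<string, string> = { msg: "message" };

function normalizeIdentifier(name: string): string[] {
  const parts: string[] = [];
  let rest = name.toLowerCase();
  while (rest.length > 0) {
    // Take the longest wordlist entry that prefixes the remainder.
    const hit = WORDLIST.filter((w) => rest.startsWith(w)).sort(
      (a, b) => b.length - a.length
    )[0];
    if (!hit) {
      parts.push(rest); // no known word: keep the remainder as-is
      break;
    }
    parts.push(EXPANSIONS[hit] ?? hit);
    rest = rest.slice(hit.length);
  }
  return parts;
}

console.log(normalizeNumber("1.23e4"));        // "12300"
console.log(normalizeIdentifier("bannermsg")); // ["banner", "message"]
```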