How to tokenize JavaScript more reasonably?
First attempt at improving on the baseline: we believe a compiler-based approach is more reasonable
Similar to the POS tags used in Natural Language Processing, we introduce a tagset for JavaScript tokenization/tagging
The tagging process also normalizes tokens (e.g. the number 1.23e4 becomes 12300, and the identifier bannermsg becomes banner message)
Tags used:
KEY: keyword; VAR: variable; SYM: symbol; NUM: number; STR: string; CMT: comment; REG: regexp
var x = 1.23e4;    ->  var[KEY] x[VAR] =[SYM] 12300[NUM] ;[SYM]
alert(x);          ->  alert[VAR] ([SYM] x[VAR] )[SYM] ;[SYM]
bannermsg++;//x    ->  banner[VAR] message[VAR] ++[SYM] ;[SYM] //x[CMT]
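The tagset and number normalization can be illustrated with a small sketch. This is not the compiler-based tagger itself: it is a hypothetical regex-based approximation, and the names `KEYWORDS`, `PATTERNS`, and `tagTokens` are assumptions made for illustration. It leaves out the REG tag, because telling a regexp literal from a division operator needs parser context, which is one reason a compiler-based approach is attractive. It also does not split identifiers, so `bannermsg` stays a single VAR token.

```javascript
// Hypothetical sketch: a regex-based tagger, not a real compiler front end.
// The keyword list is deliberately incomplete.
const KEYWORDS = new Set([
  "var", "let", "const", "function", "return", "if", "else", "for", "while", "new",
]);

// Ordered token patterns; the first one that matches at the current position wins.
// The sticky flag (y) forces each match to start exactly at lastIndex.
const PATTERNS = [
  ["CMT", /\/\/[^\n]*|\/\*[\s\S]*?\*\//y],
  ["STR", /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/y],
  ["NUM", /(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?/y],
  ["VAR", /[A-Za-z_$][\w$]*/y],
  ["SYM", /\+\+|--|===|!==|==|!=|<=|>=|&&|\|\||[-+*\/%=<>!(){}\[\];,.?:&|^~]/y],
];

function tagTokens(source) {
  const out = [];
  let pos = 0;
  while (pos < source.length) {
    // Whitespace separates tokens and is not tagged.
    if (/\s/.test(source[pos])) {
      pos++;
      continue;
    }
    let matched = false;
    for (const [tag, re] of PATTERNS) {
      re.lastIndex = pos;
      const m = re.exec(source);
      if (!m) continue;
      let text = m[0];
      let finalTag = tag;
      // Identifiers that are reserved words are tagged KEY instead of VAR.
      if (tag === "VAR" && KEYWORDS.has(text)) finalTag = "KEY";
      // Normalization: rewrite numeric literals in canonical form.
      if (tag === "NUM") text = String(Number(text));
      out.push(`${text}[${finalTag}]`);
      pos += m[0].length;
      matched = true;
      break;
    }
    if (!matched) {
      throw new SyntaxError(`Unexpected character at ${pos}: ${source[pos]}`);
    }
  }
  return out;
}

console.log(tagTokens("var x = 1.23e4;").join(" "));
// prints: var[KEY] x[VAR] =[SYM] 12300[NUM] ;[SYM]
```

Listing the CMT pattern before SYM makes `//x` a single comment token rather than two division symbols. Ordering tricks like this are fragile, and a real parser avoids them.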