Source code categorization: Ugurel et al.'s work, published in 2002
Seminal work on the subject
Consists of two tasks
Programming language classification: find out which programming language the code is written in
Features: keywords, bi-grams
Not interesting!
Topic classification: find out the topic the code relates to; relies heavily on external resources
(e.g. README file, code header) and did not analyze the source code itself much
Can we extract features from source code itself?
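A minimal sketch of what extracting features from the source code itself could look like, using the keyword and bi-gram features mentioned above. This is an illustration only, not the method of Ugurel et al.: the keyword list, the tokenizer regex, and the function name extract_features are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical keyword list, chosen only for illustration.
KEYWORDS = {"def", "class", "import", "return", "public", "static", "void", "int"}

# Assumed tokenizer: identifiers and keywords only; operators and literals are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def extract_features(source: str) -> Counter:
    """Count keyword occurrences and token bi-grams in a source string."""
    tokens = TOKEN_RE.findall(source)
    features = Counter()

    # Keyword features: one counter per known keyword that appears.
    for tok in tokens:
        if tok in KEYWORDS:
            features["kw:" + tok] += 1

    # Bi-gram features: counts of adjacent token pairs.
    for first, second in zip(tokens, tokens[1:]):
        features["bi:" + first + " " + second] += 1

    return features


if __name__ == "__main__":
    snippet = "public static void main(String[] args) { return; }"
    for name, count in sorted(extract_features(snippet).items()):
        print(name, count)
```

The resulting counts could be fed to any standard text classifier; which features actually help for topic classification is the open question posed above.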