A prior study on source code categorization, published in 2002, may be the seminal work on the subject.
Their work addresses two tasks: language classification and topic classification.
In language classification, they try to identify the programming language in which a given piece of source code is written, using keywords and bi-grams as features. This task is relatively easy, and they achieved high performance on it. However, it is not relevant to our task.
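To make the feature types concrete, the following is a minimal sketch of keyword and bi-gram features for language identification. It is not the implementation from the 2002 work: the keyword lists, the tokenizer, and the names `KEYWORDS`, `extract_features`, and `guess_language` are all assumptions made for illustration, and the keyword-count scorer merely stands in for whatever classifier the original authors trained.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS = {
    "python": {"def", "import", "elif", "lambda", "None"},
    "c": {"int", "void", "struct", "include", "printf"},
}

# Identifiers become one token each; every other non-space,
# non-digit character becomes a single-character token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(code):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def extract_features(code):
    """Count keyword occurrences ("kw:") and token bi-grams ("bg:")."""
    tokens = tokenize(code)
    all_keywords = set().union(*KEYWORDS.values())
    features = Counter()
    for tok in tokens:
        if tok in all_keywords:
            features["kw:" + tok] += 1
    for first, second in zip(tokens, tokens[1:]):
        features["bg:" + first + " " + second] += 1
    return features


def guess_language(code):
    """Toy scorer: pick the language whose keywords occur most often."""
    tokens = tokenize(code)
    scores = {
        lang: sum(tok in kws for tok in tokens)
        for lang, kws in KEYWORDS.items()
    }
    return max(scores, key=scores.get)
```

In practice the feature counts returned by `extract_features` would be passed to a trained classifier; the toy scorer only shows why even a handful of keywords already separates languages fairly well, which is consistent with the task being relatively easy.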
In topic classification, they try to identify the topic of the code. Here they relied largely on external resources such as README files and code headers, and did little analysis of the source code itself.
In this work, we investigate whether good features for classification can be extracted from the source code itself.