Wei Lu and Min-Yen Kan
AIRS 2005 (Jeju Island, Korea)
5/22
Related Work
•Source code categorization: Ugurel et al.’s work published in 2002
–Seminal work on the subject
–Consists of two tasks
•Programming Language classification: find out the type of programming language – keywords, bi-grams… non-interesting!
•Topic classification: find out the topic related to the code - relies heavily on external resources (e.g. README file, code header), did not analyze source code itself much
•Can we extract features from source code itself?
Previous work on source code categorization was published in 2002 by Ugurel et al.; it may be the seminal work on the subject.
Their work consists of two tasks.
In language classification, given a piece of source code, they try to identify the programming language it is written in. They used keywords and bi-grams as features. This task is relatively easy, and they achieved quite high performance. However, it is not relevant to our task.
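To make the idea of keyword and bi-gram features concrete, here is a minimal sketch. It is not Ugurel et al.'s implementation: the keyword lists, the tokenizer, and the function names (`tokenize`, `bigrams`, `extract_features`, `guess_language`) are all assumptions made for illustration, and the count-based scoring stands in for whatever trained classifier a real system would use.

```python
import re
from collections import Counter

# Hypothetical, hand-picked keyword lists (an assumption for illustration);
# a real system would derive discriminative tokens from labeled training data.
KEYWORDS = {
    "python": {"def", "import", "elif", "self", "lambda"},
    "java": {"public", "static", "void", "class", "extends"},
    "c": {"include", "struct", "typedef", "sizeof", "printf"},
}

# Identifiers/keywords, or single punctuation characters. Bare numbers are skipped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(code):
    """Split source code into identifier and punctuation tokens."""
    return TOKEN_RE.findall(code)


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def extract_features(code):
    """Count single tokens and token bi-grams as a sparse feature vector."""
    tokens = tokenize(code)
    features = Counter()
    for token in tokens:
        features[("tok", token)] += 1
    for pair in bigrams(tokens):
        features[("bi", pair)] += 1
    return features


def guess_language(code):
    """Toy scoring: pick the language whose keywords occur most often."""
    tokens = tokenize(code)
    scores = {
        lang: sum(1 for token in tokens if token in keywords)
        for lang, keywords in KEYWORDS.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    snippet = "public static void main(String[] args) {}"
    print(guess_language(snippet))
    print(extract_features(snippet).most_common(3))
```

In practice the feature vector from `extract_features` would be fed to a trained classifier; the keyword count in `guess_language` is only there to keep the sketch self-contained.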
In the topic classification task, they try to identify the topic related to the code. In their work, they relied largely on external resources such as README files and code headers; they did not analyze the source code itself much.
In this work, we investigate whether good classification features can be extracted from the source code itself.