|
|
|
A previous work on source code categorization, published in 2002, may be
the seminal work on the subject.
|
|
Their work consists
of two tasks: language classification and topic classification.
|
|
In language
classification, given a piece of source code, they try to identify the
programming language it is written in. They used keywords and bi-grams as
features. This task is relatively easy, and they achieved quite high
performance. However, it is not relevant to our task.
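
To make the feature types concrete, the following is a minimal sketch of how keyword and bi-gram features might be extracted from a code snippet. It is not the method of the 2002 work: the keyword list, the tokenizer, and the function name `extract_features` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword list; the keywords actually used in the prior work
# are not given in this text.
KEYWORDS = {"def", "import", "public", "static", "void", "function", "#include"}

# Assumed tokenizer: identifiers (optionally prefixed with '#'), integers,
# and single punctuation characters.
TOKEN_RE = re.compile(r"#?[A-Za-z_]\w*|\d+|[^\s\w]")


def extract_features(source: str) -> Counter:
    """Count keyword occurrences and adjacent-token bi-grams in source code."""
    tokens = TOKEN_RE.findall(source)
    features = Counter()

    # Keyword features: one count per known keyword occurrence.
    for tok in tokens:
        if tok in KEYWORDS:
            features["kw:" + tok] += 1

    # Bi-gram features: one count per pair of adjacent tokens.
    for first, second in zip(tokens, tokens[1:]):
        features["bi:" + first + " " + second] += 1

    return features


if __name__ == "__main__":
    feats = extract_features("def f(x):\n    return x")
    for name, count in sorted(feats.items()):
        print(name, count)
```

The resulting counts could then be fed to any standard classifier; which classifier the prior work used is not stated here.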
|
|
In the topic
classification task, they try to identify the topic related to the code. In
their work, they relied largely on external resources such as README files
and code headers; they did not analyze the source code itself much.
|
|
In this work, we
investigate whether good features for classification can be extracted from
the source code itself.
|