Related Work
• Source code categorization: Ugurel et al.'s work,
published in 2002
– Seminal work on the subject
– Consists of two tasks
• Programming language classification: identify the
programming language from keywords, bi-grams…
not interesting!
• Topic classification: identify the topic of the
code; relies heavily on external resources (e.g.
README file, code header) and hardly analyzes the
source code itself
• Can we extract features from source code itself?
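As a rough illustration of what "features from source code itself" could look like, the sketch below counts keywords and token bi-grams in a code snippet. It is a minimal example under stated assumptions, not the method of the cited work: the keyword list, the tokenizer regex, and the function names (`tokenize`, `keyword_counts`, `bigram_counts`) are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical keyword list chosen for illustration only;
# it is not taken from the cited work.
KEYWORDS = {"def", "class", "import", "return",
            "public", "static", "void", "int"}

# Identifiers become one token each; any other non-space
# character (brackets, punctuation, digits) is its own token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def keyword_counts(tokens):
    """Count how often each known keyword occurs."""
    return Counter(t for t in tokens if t in KEYWORDS)


def bigram_counts(tokens):
    """Count adjacent token pairs (bi-grams)."""
    return Counter(zip(tokens, tokens[1:]))


snippet = "public static void main(String[] args) { return; }"
tokens = tokenize(snippet)
features = {
    "keywords": keyword_counts(tokens),
    "bigrams": bigram_counts(tokens),
}
```

Counts like these could be fed to any standard classifier; which features are actually informative for topic classification is the open question the slide raises.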