Improving the tokenisation of identifier names

Simon Butler; Michel Wermelinger; Yijun Yu; Helen Sharp

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011) 6813 LNCS 130-154

DOI: 10.1007/978-3-642-22655-7_7

Abstract

Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available. © 2011 Springer-Verlag Berlin Heidelberg.
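
To make the task concrete, the following is a minimal sketch of identifier tokenisation. It is not the authors' algorithm or tool. It assumes a conventional first pass that splits on separators and internal capitalisation, followed by a longest-match lookup against a word list for single-case tokens. The names `ORACLE`, `split_conventional`, `split_single_case` and `tokenise` are hypothetical, and the tiny word list stands in for the much larger oracles a real tokeniser would use. Tokens containing digits are left intact here, although the abstract identifies them as a case needing special treatment.

```python
import re

# Hypothetical miniature oracle; a real one would be a much larger word list.
ORACLE = {"time", "stamp", "out", "put", "output", "file", "name"}

# Zero-width boundaries: lower/digit followed by upper ("fooBar"),
# and the end of an acronym followed by a capitalised word ("XMLParser").
_CAMEL = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_conventional(name):
    """Split on separator characters and internal capitalisation."""
    tokens = []
    for part in re.split(r"[_$]+", name):
        if part:
            # Splitting on empty matches requires Python 3.7 or later.
            tokens.extend(_CAMEL.split(part))
    return tokens


def split_single_case(token, oracle=ORACLE):
    """Cover `token` with oracle words, or return it lowercased and unsplit."""
    lower = token.lower()

    def search(start):
        if start == len(lower):
            return []
        # Try the longest candidate word first, backtracking on failure.
        for end in range(len(lower), start, -1):
            if lower[start:end] in oracle:
                rest = search(end)
                if rest is not None:
                    return [lower[start:end]] + rest
        return None

    return search(0) or [lower]


def tokenise(name):
    """Return the lowercased tokens of an identifier name."""
    tokens = []
    for tok in split_conventional(name):
        tokens.extend(split_single_case(tok))
    return tokens


print(tokenise("getOutputfileName"))  # ['get', 'output', 'file', 'name']
print(tokenise("TIMESTAMP"))          # ['time', 'stamp']
print(tokenise("XMLParser_v2"))       # ['xml', 'parser', 'v2']
```

The second example shows why a conventional split alone is insufficient: `TIMESTAMP` has no separator or case change, so only the word-list lookup can recover its constituents. Longest-match search is one simple strategy among several and can choose the wrong split when the word list is ambiguous.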

Citation (APA)

Butler, S., Wermelinger, M., Yu, Y., & Sharp, H. (2011). Improving the tokenisation of identifier names. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6813 LNCS, pp. 130–154). https://doi.org/10.1007/978-3-642-22655-7_7
