Unsupervised Extreme Multi Label Classification of Stack Overflow Posts


Abstract

Knowing the topics of a software forum post, such as those on StackOverflow, allows for greater analysis and understanding of the large amounts of data that come from these communities. One approach to this problem is extreme multi-label classification (XMLC), which predicts the topic (or 'tag') of a post from a potentially very large candidate label set. While previous work has trained these models on data with explicit text-to-tag information, we assess the classification ability of embedding models that have not been trained on such structured data (and are thus 'unsupervised'), in order to gauge their potential applicability to other forums or domains in which tag data is not available. We evaluate 14 unsupervised pre-trained models on 0.1% of all StackOverflow posts against all 61,662 possible StackOverflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e. without tag data) achieves the highest score overall for this task, with a recall at 1 (R@1) of 0.161. These results inform which models are most appropriate for XMLC of StackOverflow posts when supervised training is not feasible, and offer insight into these models' applicability in similar but not identical domains, such as software product forums. They also suggest that training embedding models on in-domain title-body or question-answer pairs can create an effective zero-shot topic classifier for situations where no topic data is available.
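
The zero-shot setup described in the abstract can be illustrated with a minimal sketch: embed the post and every candidate tag, rank tags by cosine similarity to the post, and score the ranking with recall at k. This is not the authors' implementation. The `embed` function below is a toy character-trigram stand-in for a pre-trained embedding model such as MPNet, the function names are hypothetical, and `recall_at_k` uses one common definition (the fraction of a post's true tags found in the top k), which may differ in detail from the paper's.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a pre-trained embedding model (assumption):
    a sparse vector of character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tags(post, tags):
    """Order candidate tags from most to least similar to the post.
    No text-to-tag training data is used, only the embeddings."""
    post_vec = embed(post)
    return sorted(tags, key=lambda tag: cosine(post_vec, embed(tag)), reverse=True)


def recall_at_k(ranked_tags, true_tags, k):
    """Fraction of the post's true tags that appear in the top-k ranking."""
    if not true_tags:
        return 0.0
    return len(set(ranked_tags[:k]) & set(true_tags)) / len(true_tags)


if __name__ == "__main__":
    candidate_tags = ["python", "java", "css"]  # the paper uses 61,662 tags
    ranking = rank_tags("How do I sort a list in python?", candidate_tags)
    print(ranking[0], recall_at_k(ranking, {"python"}, k=1))
```

In a realistic setting, `embed` would be replaced by a call to a pre-trained sentence-embedding model, and the tag embeddings would be computed once and cached, since the candidate label set is fixed while posts vary.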

Cite

Devine, P., & Blincoe, K. (2023). Unsupervised Extreme Multi Label Classification of Stack Overflow Posts. In Proceedings - 1st International Workshop on Natural Language-Based Software Engineering, NLBSE 2022 (pp. 1–8). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3528588.3528652
