Abstract
For text clustering, there is often a dilemma: one can either embed each example independently and compute pair-wise similarities from the embeddings, or use a cross-attention model that takes a pair of examples as input and produces a similarity directly. The former is more scalable, but the similarities are often of lower quality, whereas the latter produces higher-quality similarities but does not scale well. We address this dilemma by developing a clustering algorithm that leverages the best of both worlds: the scalability of the former and the quality of the latter. We formulate text clustering with embedding-based and cross-attention models as a novel version of the Budgeted Correlation Clustering (BCC) problem in which, along with a limited number of queries to an expensive oracle (a cross-attention model in our case), we have unlimited access to a cheaper but less accurate second oracle (embedding similarities in our case). We develop a theoretically motivated algorithm that leverages the cheap oracle to judiciously query the strong oracle while maintaining high clustering quality. We empirically demonstrate gains in query minimization and clustering metrics on a variety of datasets with diverse strong and cheap oracles.
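To make the two-oracle setup concrete, below is a minimal Python sketch of this query pattern; it is not the authors' actual algorithm. Cosine similarity over precomputed embeddings plays the role of the cheap oracle, a user-supplied callable stands in for the expensive cross-attention model, and the pivot loop, the top_k shortlisting heuristic, and the singleton fallback are all illustrative assumptions.

import numpy as np

def two_oracle_clustering(embeddings, strong_oracle, budget, top_k=10):
    """Sketch of budgeted clustering with a cheap and a strong oracle.

    embeddings:    (n, d) array; cosine similarity serves as the cheap oracle.
    strong_oracle: callable (i, j) -> bool, e.g. a cross-attention model;
                   each call consumes one unit of the query budget.
    budget:        maximum number of strong-oracle calls.
    top_k:         how many cheap-oracle neighbours to verify per pivot.
    """
    n = len(embeddings)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cheap_sim = unit @ unit.T          # unlimited access to the cheap, weak signal
    labels = np.full(n, -1)            # -1 means not yet assigned to a cluster
    queries = 0
    for pivot in np.random.permutation(n):   # random pivots, KwikCluster-style
        if labels[pivot] != -1 or queries >= budget:
            continue
        labels[pivot] = pivot          # pivot starts a new cluster
        # Verify only the cheap oracle's top-ranked unassigned neighbours,
        # so the budget is spent where the weak signal says it matters most.
        ranked = np.argsort(-cheap_sim[pivot])
        candidates = [j for j in ranked if labels[j] == -1][:top_k]
        for j in candidates:
            if queries >= budget:
                break
            queries += 1
            if strong_oracle(pivot, j):       # expensive cross-attention call
                labels[j] = pivot
    labels[labels == -1] = np.flatnonzero(labels == -1)  # leftovers: singletons
    return labels

A hypothetical call might look like two_oracle_clustering(emb, lambda i, j: cross_encoder(texts[i], texts[j]) > 0.5, budget=1000), where cross_encoder is whatever cross-attention scorer is available; the point of the design is that the cheap oracle filters the O(n^2) candidate pairs down to the few per pivot that are worth spending budget on.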
Silwal, S., Ahmadian, S., Nystrom, A., McCallum, A., Ramachandran, D., & Kazemi, M. (2023). KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1–31). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.sustainlp-1.1