CoSQA: 20,000+ web queries for code search and question answering


Abstract

Finding code given a natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between queries and code requires richer supervised training resources. To this end, we introduce the CoSQA dataset, which contains 20,604 labeled pairs of natural language queries and code snippets, each annotated by at least three human annotators. We further introduce a contrastive learning method, dubbed CoCLR, to enhance query-code matching; it acts as a data augmenter, producing additional artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.
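
To make the query-code matching setup concrete, below is a minimal sketch of scoring natural language queries against code snippets with a CodeBERT-style encoder and an in-batch contrastive (InfoNCE-style) loss. This is not the authors' CoCLR implementation; the checkpoint name, pooling choice, temperature, and example pairs are assumptions made purely for illustration.

# Sketch: query-code matching with a CodeBERT-style encoder and an
# in-batch contrastive loss. Not the CoCLR implementation from the paper;
# model choice, pooling, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(texts):
    """Encode a list of strings into L2-normalized [CLS] embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS] token
    return F.normalize(hidden, dim=-1)

# Toy query-code pairs standing in for CoSQA-style training instances.
queries = ["python convert string to datetime",
           "how to reverse a list in python"]
codes = ["from datetime import datetime\n"
         "dt = datetime.strptime(s, '%Y-%m-%d')",
         "lst.reverse()"]

q_emb, c_emb = embed(queries), embed(codes)

# Similarity matrix: diagonal entries are the matched query-code pairs,
# off-diagonal entries serve as in-batch negatives.
sim = q_emb @ c_emb.T / 0.05  # temperature value is an assumed hyperparameter
labels = torch.arange(sim.size(0))
loss = F.cross_entropy(sim, labels)
print(loss.item())

At inference time the same similarity matrix can be used directly for code search: rank all candidate code snippets by their cosine similarity to the query embedding.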

Citation (APA)

Huang, J., Tang, D., Shou, L., Gong, M., Xu, K., Jiang, D., … Duan, N. (2021). CoSQA: 20,000+ web queries for code search and question answering. In ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference (Vol. 1, pp. 5690–5700). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.acl-long.442
