The lack of large and diverse datasets of student code samples limits some forms of computer science education research. To address this problem, we created FalconCode, a novel collection of over 1.5 million Python programs from over two thousand undergraduate students at the United States Air Force Academy. FalconCode captures over five semesters' worth of code samples from our introduction to computing course, which is taken by every student regardless of academic major. The dataset contains student code submissions for over 800 programming assignments, as well as additional metadata such as the prompt for each assignment, the test case(s) used to evaluate student submissions, and the specific skills needed to solve each problem. In this paper, we describe the methodology used to create FalconCode and the steps taken to anonymize the data. We then describe FalconCode's data schema and show how it can support a wide range of research, including studies that use machine learning (ML) and artificial intelligence (AI). FalconCode is provided free of charge and is available upon request for computer science education research.
CITATION STYLE
De Freitas, A., Coffman, J., De Freitas, M., Wilson, J., & Weingart, T. (2023). FalconCode: A Multiyear Dataset of Python Code Samples from an Introductory Computer Science Course. In SIGCSE 2023 - Proceedings of the 54th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 938–944). Association for Computing Machinery, Inc. https://doi.org/10.1145/3545945.3569822