Creating Evolving Project Data Sets in Software Engineering

Tomasz Lewowski; Lech Madeyski

Book Chapter

Springer Verlag, (2020), 1-14

DOI: 10.1007/978-3-030-26574-8_1

Abstract

While the amount of research in software engineering is ever increasing, selecting a research data set remains a challenge. Quite a number of data sets have been proposed, but we still lack a systematic approach to creating data sets that evolve together with the industry. We aim to present a systematic method of selecting data sets of industry-relevant software projects for the purposes of software engineering research. We present a set of guidelines for filtering GitHub projects and implement those guidelines in the form of an R script. In particular, we select mostly projects from the biggest industrial open source contributors and remove from the data set any project that falls in the first quartile in any of several categories. We use the latest GitHub GraphQL API to select the desired set of repositories. We evaluate the technique on Java projects. The presented technique systematizes methods for creating software development data sets and their evolution. The proposed algorithm has reasonable precision, between 0.65 and 0.80, and can be used as a baseline for further refinements.
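
The first-quartile filter described in the abstract can be illustrated with a short sketch. This is not the authors' R script: it is a minimal Python illustration under stated assumptions. The metric names (`commits`, `stars`) are hypothetical, since the abstract does not list the categories. The inclusive quantile method and the rule that a value equal to the first-quartile cut point counts as "in the first quartile" are likewise assumptions. The `precision` helper uses the standard definition (relevant selected projects divided by all selected projects), which may differ in detail from how the chapter measures it.

```python
from statistics import quantiles


def first_quartile(values):
    """Return the first-quartile cut point (Q1) of a list of numbers."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    return quantiles(values, n=4, method="inclusive")[0]


def drop_first_quartile(projects, metrics):
    """Keep only projects that are above Q1 in every listed metric.

    `projects` is a list of dicts; `metrics` names the numeric keys to check.
    Assumption: a value equal to Q1 counts as being in the first quartile.
    """
    thresholds = {m: first_quartile([p[m] for p in projects]) for m in metrics}
    return [p for p in projects if all(p[m] > thresholds[m] for m in metrics)]


def precision(selected, relevant):
    """Fraction of selected project names that are also in the relevant set."""
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected & set(relevant)) / len(selected)
```

A project is dropped as soon as it falls in the bottom quartile of a single metric, so the filter becomes stricter with every category added.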

Citation (APA)

Lewowski, T., & Madeyski, L. (2020). Creating Evolving Project Data Sets in Software Engineering. In Studies in Computational Intelligence (Vol. 851, pp. 1–14). Springer Verlag. https://doi.org/10.1007/978-3-030-26574-8_1
