pandera: Statistical Data Validation of Pandas Dataframes

Niels Bantilan

Conference Proceedings (Open Access)

Proceedings of the 19th Python in Science Conference (2020) 116-124

DOI: 10.25080/majora-342d178e-010

Abstract

pandas is an essential tool in the data scientist's toolkit for modern data engineering, analysis, and modeling in the Python ecosystem. However, dataframes can often be difficult to reason about in terms of their data types and statistical properties as data is reshaped from its raw form to one that's ready for analysis. Here, I introduce pandera, an open source package that provides a flexible and expressive data validation API designed to make it easy for data wranglers to define dataframe schemas. These schemas execute logical and statistical assertions at runtime so that analysts can spend less time worrying about the correctness of their dataframes and more time obtaining insights and training models.

Citation (APA)

Bantilan, N. (2020). pandera: Statistical Data Validation of Pandas Dataframes. In Proceedings of the 19th Python in Science Conference (pp. 116–124). SciPy. https://doi.org/10.25080/majora-342d178e-010
