Benchmarking Spark machine learning using BigBench

Sweta Singh

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017) 10080 LNCS 45-60

DOI: 10.1007/978-3-319-54334-5_4

Abstract

Databases such as dashDB are adding High Speed Connectors for Spark to extract large volumes of data efficiently. This allows the extracted data to be combined with other, unstructured data sources so that Machine Learning (ML) can be performed on the result. Machine Learning is a key ingredient for such use cases. To assess the performance of the data connectors and Machine Learning frameworks, we sought benchmarks that can scale datasets to very large volumes and apply Machine Learning algorithms. After exploring several options, we found BigBench to be a good fit. In this paper, we describe our experience of using BigBench, with special focus on its 5 Machine Learning queries and their default implementation in Spark. We discuss how the effectiveness of BigBench for benchmarking Machine Learning could be improved by avoiding bias and including real-time analytics. We also see scope for improving the coverage of Machine Learning by adding more use cases, such as Collaborative Filtering. Lastly, we share some interesting visualizations of 4 ML queries using SPSS Modeler, along with our experiments on different Clustering and Classification algorithms.

Cite

Singh, S. (2017). Benchmarking Spark machine learning using BigBench. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10080 LNCS, pp. 45–60). Springer Verlag. https://doi.org/10.1007/978-3-319-54334-5_4
