R is a powerful programming environment for data analysis. However, R is a main-memory-based functional programming environment, so when it deals with big data, data movement and memory swapping become the major performance bottlenecks. As a result, executing a big-data-intensive R program can be many orders of magnitude less efficient than processing a SQL query directly inside the database for the same analytic task. Although a number of "parallel-R" solutions exist, pushing R operations down to the parallel database layer, while retaining the natural R interface and the virtual R analytics flow, remains a very competitive alternative. This has motivated us to develop the R-Vertica framework, which scales out R applications through in-DB, data-parallel analytics. To extend the R programming environment to the space of parallel query processing transparently to R users, we introduce the notion of the R proxy: an R object whose instance is maintained in the parallel database as partitioned data sets, while its schema (header) is retained in the memory-based R environment. A function (such as an aggregation) applied to a proxy is pushed down to the parallel database layer as SQL queries or procedures, and the query results are automatically returned and converted to R objects. By providing transparent two-way mappings between several major types of R objects and database tables or query results, the framework seamlessly integrates the R environment with the underlying parallel database. R proxies may be created from database table schemas, from in-DB operations, or from operations that persist R objects to the database. The instances of R proxies can be retrieved into regular R objects using SQL queries. With this framework, an R application is expressed as an analytics flow in which R objects bear small data and R proxies represent, but do not bear, big data. The big data are manipulated, or flow, underneath the in-memory R environment through in-DB, data-parallel operations. We have implemented the proposed approach and used it to integrate several large-scale R applications with the multi-node Vertica parallel database system. Our experience illustrates the unique features and efficiency of the R-Vertica framework. © 2012 Springer-Verlag.
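The proxy idea described in the abstract can be illustrated with a minimal sketch. It is not the R-Vertica implementation: it is written in Python rather than R, it uses an in-memory SQLite database as a stand-in for a parallel Vertica cluster, and the names `TableProxy`, `aggregate`, `fetch`, and the `sales` table are hypothetical. It shows only the general pattern: the proxy keeps the table's header in memory, an aggregation is pushed down as a SQL query, and only the small result is returned as an ordinary in-memory object.

```python
import sqlite3


class TableProxy:
    """Hypothetical proxy: keeps a table's name and header; rows stay in the DB."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table
        # Fetch zero rows: only the column names (the header) are kept in memory.
        cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
        self.columns = [d[0] for d in cur.description]

    def _check(self, column):
        # Reject names that are not part of the retained schema.
        if column not in self.columns:
            raise KeyError(column)

    def aggregate(self, func, value_col, group_col):
        """Push an aggregation down as SQL and return the small result as a dict."""
        func = func.upper()
        if func not in {"SUM", "AVG", "MIN", "MAX", "COUNT"}:
            raise ValueError(func)
        self._check(value_col)
        self._check(group_col)
        sql = (f'SELECT "{group_col}", {func}("{value_col}") '
               f'FROM "{self.table}" GROUP BY "{group_col}"')
        return dict(self.conn.execute(sql).fetchall())

    def fetch(self, limit):
        """Retrieve up to `limit` rows of the instance as regular in-memory tuples."""
        sql = f'SELECT * FROM "{self.table}" LIMIT ?'
        return self.conn.execute(sql, (limit,)).fetchall()


# Stand-in database; assumption: SQLite replaces the parallel database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)

sales = TableProxy(conn, "sales")             # proxy created from a table schema
totals = sales.aggregate("SUM", "amount", "region")  # runs inside the database
print(sales.columns)
print(totals)
```

In this sketch the only data that crosses into the client process are the column names and the per-group totals, which mirrors the abstract's distinction between objects that bear small data and proxies that represent, but do not bear, big data.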
CITATION STYLE
Chen, Q., Hsu, M., Wu, R., & Shan, J. (2012). R-proxy framework for in-DB data-parallel analytics. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7447 LNCS, pp. 266–280). https://doi.org/10.1007/978-3-642-32597-7_24