Sign up & Download
Sign in

Dremel : Interactive Analysis of Web-Scale Datasets

by Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Proceedings of the VLDB Endowment ()

Abstract

Dremel is a scalable, interactive ad-hoc query system for analy- sis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggrega- tion queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar stor- age representation for nested records and discuss experiments on few-thousand node instances of the system

Cite this document (BETA)

Available from portal.acm.org
Page 1
hidden

Dremel : Interactive Analysis of ...

Dremel: Interactive Analysis of Web-Scale Datasets Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis Google, Inc. {melnik,andrey,jlong,gromer,shiva,mtolton,theov}@google.com ABSTRACT Dremel is a scalable, interactive ad-hoc query system for analy- sis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggrega- tion queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar stor- age representation for nested records and discuss experiments on few-thousand node instances of the system. 1. INTRODUCTION Large-scale analytical data processing has become widespread in web companies and across industries, not least due to low-cost storage that enabled collecting vast amounts of business-critical data. Putting this data at the fingertips of analysts and engineers has grown increasingly important interactive response times of- ten make a qualitative difference in data exploration, monitor- ing, online customer support, rapid prototyping, debugging of data pipelines, and other tasks. Performing interactive data analysis at scale demands a high de- gree of parallelism. For example, reading one terabyte of com- pressed data in one second using today���s commodity disks would require tens of thousands of disks. Similarly, CPU-intensive queries may need to run on thousands of cores to complete within seconds. At Google, massively parallel computing is done using shared clusters of commodity machines [5]. A cluster typically hosts a multitude of distributed applications that share resources, have widely varying workloads, and run on machines with different hardware parameters. An individual worker in a distributed appli- cation may take much longer to execute a given task than others, or may never complete due to failures or preemption by the cluster management system. Hence, dealing with stragglers and failures is essential for achieving fast execution and fault tolerance [10]. The data used in web and scientific computing is often non- relational. Hence, a flexible data model is essential in these do- mains. Data structures used in programming languages, messages Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore. Proceedings of the VLDB Endowment, Vol. 3, No. 1 Copyright 2010 VLDB Endowment 2150-8097/10/09... $ 10.00. exchanged by distributed systems, structured documents, etc. lend themselves naturally to a nested representation. Normalizing and recombining such data at web scale is usually prohibitive. A nested data model underlies most of structured data processing at Google [21] and reportedly at other major web companies. This paper describes a system called Dremel1 that supports inter- active analysis of very large datasets over shared clusters of com- modity machines. Unlike traditional databases, it is capable of op- erating on in situ nested data. In situ refers to the ability to access data ���in place���, e.g., in a distributed file system (like GFS [14]) or another storage layer (e.g., Bigtable [8]). Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce (MR [12]) jobs, but at a fraction of the execution time. Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations. Dremel has been in production since 2006 and has thousands of users within Google. Multiple instances of Dremel are deployed in the company, ranging from tens to thousands of nodes. Examples of using the system include: ��� Analysis of crawled web documents. ��� Tracking install data for applications on Android Market. ��� Crash reporting for Google products. ��� OCR results from Google Books. ��� Spam analysis. ��� Debugging of map tiles on Google Maps. ��� Tablet migrations in managed Bigtable instances. ��� Results of tests run on Google���s distributed build system. ��� Disk I/O statistics for hundreds of thousands of disks. ��� Resource monitoring for jobs run in Google���s data centers. ��� Symbols and dependencies in Google���s codebase. Dremel builds on ideas from web search and parallel DBMSs. First, its architecture borrows the concept of a serving tree used in distributed search engines [11]. Just like a web search request, a query gets pushed down the tree and is rewritten at each step. The result of the query is assembled by aggregating the replies received from lower levels of the tree. Second, Dremel provides a high-level, SQL-like language to express ad hoc queries. In contrast to layers such as Pig [18] and Hive [16], it executes queries natively without translating them into MR jobs. Lastly, and importantly, Dremel uses a column-striped storage representation, which enables it to read less data from secondary 1Dremel is a brand of power tools that primarily rely on their speed as opposed to torque. We use this name for an internal project only.
Page 2
hidden
storage and reduce CPU cost due to cheaper compression. Column stores have been adopted for analyzing relational data [1] but to the best of our knowledge have not been extended to nested data mod- els. The columnar storage format that we present is supported by many data processing tools at Google, including MR, Sawzall [20], and FlumeJava [7]. In this paper we make the following contributions: ��� We describe a novel columnar storage format for nested data. We present algorithms for dissecting nested records into columns and reassembling them (Section 4). ��� We outline Dremel���s query language and execution. Both are designed to operate efficiently on column-striped nested data and do not require restructuring of nested records (Section 5). ��� We show how execution trees used in web search systems can be applied to database processing, and explain their benefits for answering aggregation queries efficiently (Section 6). ��� We present experiments on trillion-record, multi-terabyte datasets, conducted on system instances running on 1000- 4000 nodes (Section 7). This paper is structured as follows. In Section 2, we explain how Dremel is used for data analysis in combination with other data management tools. Its data model is presented in Section 3. The main contributions listed above are covered in Sections 4-8. Re- lated work is discussed in Section 9. Section 10 is the conclusion. 2. BACKGROUND We start by walking through a scenario that illustrates how interac- tive query processing fits into a broader data management ecosys- tem. Suppose that Alice, an engineer at Google, comes up with a novel idea for extracting new kinds of signals from web pages. She runs an MR job that cranks through the input data and produces a dataset containing the new signals, stored in billions of records in the distributed file system. To analyze the results of her experiment, she launches Dremel and executes several interactive commands: DEFINE TABLE t AS /path/to/data/* SELECT TOP(signal1, 100), COUNT(*) FROM t Her commands execute in seconds. She runs a few other queries to convince herself that her algorithm works. She finds an irregular- ity in signal1 and digs deeper by writing a FlumeJava [7] program that performs a more complex analytical computation over her out- put dataset. Once the issue is fixed, she sets up a pipeline which processes the incoming input data continuously. She formulates a few canned SQL queries that aggregate the results of her pipeline across various dimensions, and adds them to an interactive dash- board. Finally, she registers her new dataset in a catalog so other engineers can locate and query it quickly. The above scenario requires interoperation between the query processor and other data management tools. The first ingredient for that is a common storage layer. The Google File System (GFS [14]) is one such distributed storage layer widely used in the company. GFS uses replication to preserve the data despite faulty hardware and achieve fast response times in presence of stragglers. A high- performance storage layer is critical for in situ data management. It allows accessing the data without a time-consuming loading phase, which is a major impedance to database usage in analytical data processing [13], where it is often possible to run dozens of MR analyses before a DBMS is able to load the data and execute a sin- gle query. As an added benefit, data in a file system can be con- veniently manipulated using standard tools, e.g., to transfer to an- other cluster, change access privileges, or identify a subset of data for analysis based on file names. A B C D E * * * . . . record- oriented . . . r1 r2 r1 r2 r1 r2 r1 r2 column- oriented Figure 1: Record-wise vs. columnar representation of nested data The second ingredient for building interoperable data manage- ment components is a shared storage format. Columnar storage proved successful for flat relational data but making it work for Google required adapting it to a nested data model. Figure 1 illus- trates the main idea: all values of a nested field such as A.B.C are stored contiguously. Hence, A.B.C can be retrieved without read- ing A.E, A.B.D, etc. The challenge that we address is how to pre- serve all structural information and be able to reconstruct records from an arbitrary subset of fields. Next we discuss our data model, and then turn to algorithms and query processing. 3. DATA MODEL In this section we present Dremel���s data model and introduce some terminology used later. The data model originated in the context of distributed systems (which explains its name, ���Protocol Buffers��� [21]), is used widely at Google, and is available as an open source implementation. The data model is based on strongly-typed nested records. Its abstract syntax is given by: �� = dom | hA1 : ��[*|?],...,An : ��[*|?]i where �� is an atomic type or a record type. Atomic types in dom comprise integers, floating-point numbers, strings, etc. Records consist of one or multiple fields. Field i in a record has a name Ai and an optional multiplicity label. Repeated fields (*) may occur multiple times in a record. They are interpreted as lists of values, i.e., the order of field occurences in a record is significant. Optional fields (?) may be missing from the record. Otherwise, a field is required, i.e., must appear exactly once. To illustrate, consider Figure 2. It depicts a schema that defines a record type Document, representing a web document. The schema definition uses the concrete syntax from [21]. A Document has a re- quired integer DocId and optional Links, containing a list of Forward and Backward entries holding DocIds of other web pages. A docu- ment can have multiple Names, which are different URLs by which the document can be referenced. A Name contains a sequence of Code and (optional) Country pairs. Figure 2 also shows two sample records, r1 and r2, conforming to the schema. The record structure is outlined using indentation. We will use these sample records to explain the algorithms in the next sections. The fields defined in the schema form a tree hierarchy. The full path of a nested field is de- noted using the usual dotted notation, e.g., Name.Language.Code. The nested data model backs a platform-neutral, extensible mechanism for serializing structured data at Google. Code gen- eration tools produce bindings for programming languages such as C++ or Java. Cross-language interoperability is achieved using a standard binary on-the-wire representation of records, in which field values are laid out sequentially as they occur in the record. This way, a MR program written in Java can consume records from a data source exposed via a C++ library. Thus, if records are stored in a columnar representation, assembling them fast is important for interoperation with MR and other data processing tools.

Readership Statistics

400 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
26% Ph.D. Student
 
21% Student (Master)
 
17% Other Professional
by Country
 
27% United States
 
10% China
 
9% United Kingdom

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in