A Survey of Large Scale Data Mana...
IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 13, NO. 3, THIRD QUARTER 2011 311 A Survey of Large Scale Data Management Approaches in Cloud Environments Sherif Sakr, Anna Liu, Daniel M. Batista, and Mohammad Alomari Abstract���In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data. Moreover, the recent advances in Web technology has made it easy for any user to provide and consume content of any form. This has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. Cloud computing is associated with a new paradigm for the provision of computing infrastructure. This paradigm shifts the location of this infrastructure to the network to reduce the costs associated with the management of hardware and software resources. This paper gives a comprehensive survey of numerous ap- proaches and mechanisms of deploying data-intensive applica- tions in the cloud which are gaining a lot of momentum in both research and industrial communities. We analyze the various design decisions of each approach and its suitability to support certain classes of applications and end-users. A discussion of some open issues and future challenges pertaining to scalability, consistency, economical processing of large scale data on the cloud is provided. We highlight the characteristics of the best candidate classes of applications that can be deployed in the cloud. Index Terms���Cloud Computing, Cloud Data Storage, MapRe- duce, NoSQL. I. INTRODUCTION I N THE LAST two decades, the continuous increase of computational power has produced an overwhelming flow of data which called for a paradigm shift in the computing architecture and large scale data processing mechanisms. In a speech given just a few weeks before he was lost at sea off the California coast in January 2007, Jim Gray, a database software pioneer and a Microsoft researcher, called the shift a ���fourth paradigm��� [1]. The first three paradigms were experimental, theoretical and, more recently, computa- tional science. Gray argued that the only way to cope with this paradigm is to develop a new generation of computing tools to manage, visualize and analyze the data flood. In general, the current computer architectures are increasingly imbalanced where the latency gap between multi-core CPUs and mechanical hard disks is growing every year which makes the challenges of data-intensive computing harder to overcome [2]. However, this gap is not the single problem to be addressed. Recent applications that manipulate TeraBytes Manuscript received 28 June 2010 revised 2 December 2010. S. Sakr is with University of New South Wales and National ICT Australia (NICTA), Sydney, Australia (e-mail: sherif.sakr@nicta.com.au). A. Liu is with University of New South Wales and National ICT Australia (NICTA), Sydney, Australia (e-mail: anna.liu@nicta.com.au). D. M. Batista is with the Institute of Computing, State University of Campinas, Campinas, Brazil (e-mail: batista@ic.unicamp.br). M. Alomari is with School of Information Technologies, University of Sydney, Sydney, Australia (e-mail: miomari@it.usyd.edu.au). Digital Object Identifier 10.1109/SURV.2011.032211.00087 and PetaBytes of distributed data [3] need to access networked environments with guaranteed Quality of Service (QoS). If the network mechanisms are neglected, the applications will just have access to a best effort service, which is not enough to their requirements [4]. Hence, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future. In response, Gray argued that the new trend should focus on supporting cheaper clusters of computers to manage and process all this data instead of focusing on having the biggest and fastest single computer. Unlike to previous clusters confined in local and dedicated networks, in the current clus- ters connected to the Internet, the interconnection technologies play an important role, since these clusters need to work in parallel, independent of their distances, to manipulate the data sets required by the applications [5] Currently, data set sizes for applications are growing at incredible pace. In fact, the advances in sensor technology, the increases in available bandwidth and the popularity of handheld devices that can be connected to the Internet have created an environment where even small scale applications need to store large data sets. A terabyte of data, once an unheard-of amount of information, is now commonplace. For example, modern high-energy physics experiments, such as DZero1, typically generate more than one TeraByte of data per day. With datasets growing beyond a few hundreds of terabytes, we have no off-the-shelf solutions that they can readily use to manage and analyze these data [1]. Thus, significant human and material resources were allocated to support these data-intensive operations which lead to high storage and management costs. Additionally, the recent advances in Web technology have made it easy for any user to provide and consume content of any form. For example, building a personal Web page (e.g. Google Sites2), starting a blog (e.g. WordPress3, Blogger4, LiveJournal5) and making both searchable for the public have now become a commodity. Therefore, one of the main goals of the next wave is to facilitate the job of implementing every application as a distributed, scalable and widely-accessible service on the Web. For example, it has been recently reported that the famous social network Website, Facebook6, serves 570 billion page views per month, stores 3 billion new photos every month, manages 25 billion pieces of content (e.g. status 1http://www-d0.fnal.gov/ 2http://sites.google.com/ 3http://wordpress.org/ 4http://www.blogger.com/ 5http://www.livejournal.com/ 6http://www.facebook.com/ 1553-877X/11/$25.00 �� 2011 IEEE