Reshaping text data for efficient processing on Amazon EC2

Gabriela Turcu; Ian Foster; Svetlozar Nestorov

Journal Article, Open Access

Scientific Programming (2011) 19(2-3) 133-145

DOI: 10.1155/2011/642698

Abstract

Text analysis tools are now required to process increasingly large corpora, which are often organized as small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user's perspective, attempting to provide a scheduling strategy that is both timely and cost-effective. We derive an execution plan using an empirically determined application performance model. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. Because the output is then less fragmented, this also speeds up retrieval of the application's results. Using predictions of the performance of our application based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost. © 2011 IOS Press and the authors. All rights reserved.
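The file-merging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain first-fit pass in input order, file sizes given in bytes, and a target size that was chosen beforehand. The function name `first_fit_merge` and its parameters are hypothetical.

```python
from typing import Dict, List


def first_fit_merge(sizes: Dict[str, int], target: int) -> List[List[str]]:
    """Group files so that each group's total size stays at or below `target`.

    Hypothetical helper: `sizes` maps a file name to its size in bytes and
    `target` is the desired merged-file size. A file larger than `target`
    ends up alone in its own group.
    """
    if target <= 0:
        raise ValueError("target must be positive")

    group_totals: List[int] = []       # running size of each group
    group_files: List[List[str]] = []  # file names assigned to each group

    for name, size in sizes.items():
        # First fit: place the file in the first group that still has room.
        for i, total in enumerate(group_totals):
            if total + size <= target:
                group_totals[i] += size
                group_files[i].append(name)
                break
        else:
            # No existing group can take the file, so open a new one.
            group_totals.append(size)
            group_files.append([name])

    return group_files


if __name__ == "__main__":
    example = {"a.txt": 40, "b.txt": 70, "c.txt": 30, "d.txt": 60, "e.txt": 20}
    print(first_fit_merge(example, target=100))
```

Each returned group would then be concatenated into one input file. Sorting the files by decreasing size before the pass is a common variant that tends to fill groups more tightly; the abstract does not say which ordering was used.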

Cite

Turcu, G., Foster, I., & Nestorov, S. (2011). Reshaping text data for efficient processing on Amazon EC2. Scientific Programming, 19(2–3), 133–145. https://doi.org/10.1155/2011/642698
