Introducing Apache Mahout
Scalable, commercial-friendly machine learning for building intelligent applications

Skill Level: Intermediate

Grant Ingersoll (email@example.com)
Member, Technical Staff
Lucid Imagination

08 Sep 2009

© Copyright IBM Corporation 2009. All rights reserved.

Once the exclusive domain of academics and corporations with large research budgets, intelligent applications that learn from data and user input are becoming more common. The need for machine-learning techniques like clustering, collaborative filtering, and categorization has never been greater, be it for finding commonalities among large groups of people or automatically tagging large volumes of Web content. The Apache Mahout project aims to make building intelligent applications easier and faster. Mahout co-founder Grant Ingersoll introduces the basic concepts of machine learning and then demonstrates how to use Mahout to cluster documents, make recommendations, and organize content.

Increasingly, the success of companies and individuals in the information age depends on how quickly and efficiently they turn vast amounts of data into actionable information. Whether it's for processing hundreds or thousands of personal e-mail messages a day or divining user intent from petabytes of weblogs, the need for tools that can organize and enhance data has never been greater. Therein lies the premise and the promise of the field of machine learning and the project this article introduces: Apache Mahout (see Resources).

Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous experiences. The field is closely related to data mining and often uses techniques from statistics, probability theory, pattern recognition, and a host of other areas. Although machine learning is not a new field, it is definitely growing. Many large companies, including IBM®,
Google, Amazon, Yahoo!, and Facebook, have implemented machine-learning algorithms in their applications. Many more companies would benefit from using machine learning in their applications to learn from users and past situations.

After giving a brief overview of machine-learning concepts, I'll introduce you to the Apache Mahout project's features, history, and goals. Then I'll show you how to use Mahout to do some interesting machine-learning tasks using the freely available Wikipedia data set.

Machine learning 101

Uses of machine learning run the gamut from game playing to fraud detection to stock-market analysis. It's used to build systems like those at Netflix and Amazon that recommend products to users based on past purchases, or systems that find all of the similar news articles on a given day. It can also be used to categorize Web pages automatically according to genre (sports, economy, war, and so on) or to mark e-mail messages as spam. The uses of machine learning are more numerous than I can cover in this article. If you're interested in exploring the field in more depth, I encourage you to refer to the Resources.

Several approaches to machine learning are used to solve problems. I'll focus on the two most commonly used ones, supervised and unsupervised learning, because they are the main ones supported by Mahout.

Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam, labeling Web pages according to their genre, and recognizing handwriting. Many algorithms are used to create supervised learners, the most common being neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers.

Unsupervised learning, as you might guess, is tasked with making sense of data without any examples of what is correct or incorrect.
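
The difference between the two approaches can be made concrete with a small Java sketch. It is an illustration under stated assumptions, not Mahout code: the class `LearningModesSketch`, its methods, and the toy data (a count of suspicious words per e-mail message) are all invented for this example. The supervised part uses a simple nearest-centroid classifier rather than any of the algorithms named above, and the unsupervised part is a one-dimensional k-Means with two clusters.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Mahout.
class LearningModesSketch {

    // Supervised: learn one mean feature value per label from labeled examples.
    static Map<String, Double> train(double[] features, String[] labels) {
        Map<String, Double> sums = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < features.length; i++) {
            sums.merge(labels[i], features[i], Double::sum);
            counts.merge(labels[i], 1, Integer::sum);
        }
        Map<String, Double> centroids = new HashMap<>();
        for (String label : sums.keySet()) {
            centroids.put(label, sums.get(label) / counts.get(label));
        }
        return centroids;
    }

    // Predict the label whose learned mean is closest to the new input.
    static String classify(Map<String, Double> centroids, double x) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> entry : centroids.entrySet()) {
            double distance = Math.abs(entry.getValue() - x);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Unsupervised: 1-D k-Means with k = 2. No labels are supplied; the result
    // is a cluster index (0 or 1) per point. Assumes a non-empty input array.
    static int[] cluster(double[] points, int iterations) {
        double c0 = Arrays.stream(points).min().getAsDouble(); // initial centers:
        double c1 = Arrays.stream(points).max().getAsDouble(); // smallest and largest value
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (int i = 0; i < points.length; i++) {
                // Assign each point to the nearer center (ties go to cluster 0).
                boolean nearerTo0 = Math.abs(points[i] - c0) <= Math.abs(points[i] - c1);
                assignment[i] = nearerTo0 ? 0 : 1;
                if (nearerTo0) { sum0 += points[i]; n0++; } else { sum1 += points[i]; n1++; }
            }
            // Move each center to the mean of the points assigned to it.
            if (n0 > 0) c0 = sum0 / n0;
            if (n1 > 0) c1 = sum1 / n1;
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] suspiciousWordCounts = {0, 1, 2, 8, 9, 10};
        String[] labels = {"ham", "ham", "ham", "spam", "spam", "spam"};

        Map<String, Double> model = train(suspiciousWordCounts, labels);
        System.out.println(classify(model, 7.0));

        System.out.println(Arrays.toString(cluster(suspiciousWordCounts, 5)));
    }
}
```

Running `main` prints `spam` and then `[0, 0, 0, 1, 1, 1]`. The classifier needed the labels to learn what spam looks like, whereas the clustering step recovered the same two groups from the raw numbers alone, which is the essence of unsupervised learning.
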
This approach is most commonly used for clustering similar input into logical groups. It also can be used to reduce the number of dimensions in a data set in order to focus on only the most useful attributes, or to detect trends. Common approaches to unsupervised learning include k-Means, hierarchical clustering, and self-organizing maps.

For this article, I'll focus on three specific machine-learning tasks that Mahout currently implements. They also happen to be three areas that are quite commonly used in real applications:

• Collaborative filtering
• Clustering
• Categorization

I'll take a deeper look at each of these tasks at the conceptual level before exploring their implementations in Mahout.

Collaborative filtering

Collaborative filtering (CF) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks, and purchases to provide recommendations to other site users. CF is often used to recommend consumer items such as books, music, and movies, but it is also used in other applications where multiple actors need to collaborate to narrow down data. Chances are you've seen CF in action on Amazon, as shown in Figure 1:

Figure 1. Example of collaborative filter on Amazon

Given a set of users and items, CF applications provide recommendations to the current user of the system. Four ways of generating recommendations are typical:

• User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users.
• Item-based: Calculate similarity between items and make recommendations. Items usually don't change much, so this often can be computed offline.
• Slope-One: A very fast and simple item-based recommendation approach applicable when users have given ratings (and not just boolean preferences).