Mendeley's Reply to the DataTEL Challenge

Kris Jack, James Hammerton, Dan Harvey, Jason J. Hoyt, Jan Reichelt, Paul Foeckler, and Victor Henning

Mendeley Ltd., 144a Clerkenwell Road, London, EC1R 5DF, United Kingdom
{kris.jack, james.hammerton, dan.harvey, jason.hoyt, jan.reichelt, paul.foeckler, victor.henning}@mendeley.com

Abstract. Mendeley has built, and continues to build, a strong user community of researchers who benefit from both its desktop and web-based software. In building its community, Mendeley has recorded a considerable amount of data that can be analyzed to help researchers do better research. One key way of helping researchers is to recommend research articles that they have not yet encountered but would find interesting. Recommendation system research, while well developed in some domains, such as cinematography, lacks the kind of scientific data sets that Mendeley has been building. Mendeley has taken up the DataTEL challenge in order to provide recommendation system researchers with valuable data on users and their relationship with scientific literature. The data set has been made anonymous to protect user privacy and can only be used for non-commercial scientific purposes.

Keywords: Mendeley, Recommendations, Personalization, Data Set, Scientific Articles, Research Articles.

1 Introduction

Mendeley is a research platform that helps users to organize their research, collaborate with colleagues and discover new knowledge [1]. Mendeley records and analyzes a vast amount of data on a daily basis. Since its launch in the previous year, Mendeley's user base has grown to over 550,000 researchers, who had contributed 44 million articles as of October 2010. This paper presents researchers with access to data that can be used to test recommendation systems.
The data has been collected primarily by analyzing research articles that users have added to Mendeley Desktop's reference management tool. To protect user privacy, the data set has been made anonymous. None of the ids that appear in the data, such as article and user ids, correspond to the ids that are used in Mendeley's databases and exposed through Mendeley's API. The data set contains just under 10% of the user profiles that have been registered with Mendeley.

2 Data Set

Mendeley's data set provides information on user libraries in three files. One file lists the articles that appear in user libraries, while the other two provide usage-based information: one shows which articles users have read using Mendeley Desktop, and the other shows which articles users have marked with stars using Mendeley Desktop. The data set is intended to help researchers to test and optimize recommendation systems in the domain of scientific literature.

Researchers use Mendeley Desktop and Mendeley Web to add scientific articles to their libraries. A random selection of these libraries was entered into the data set (Table 1). The file has 50,000 user libraries that contain a total of 4,848,724 articles, 3,652,285 of them being unique. All user libraries contain at least 20 articles.

The second data file provides readership information for researchers and their articles (Table 2). Using Mendeley Desktop, users can open their articles and read them. When an article is read, the application indicates to the user that it has been read. This file covers the same articles as the first file and indicates, for each one, whether the user has read it using Mendeley Desktop. Of the articles that appear in libraries, 1,466,489 (30%) have been read using Mendeley Desktop.

Researchers can also use Mendeley Desktop to star articles that are in their libraries. This starring information is included in the third and final file, the Library Starring table (see Table 2). In the file, 615,308 of the 4,848,724 library entries (13%) have been starred by users. Mendeley does not place any requirements on why users should star articles. As a result, users may star articles for different reasons, which makes the action semantically ambiguous.

3 Obtaining the Data

Mendeley's data set is available for download from the Mendeley Developer Portal (http://dev.mendeley.com/).
To obtain a copy of the data, please write to datachallenge@mendeley.com with the following information:

• Your name
• Institutional affiliation
• Contact details (physical address and phone number)

The portal also provides an API that allows developers to gain access to much of the data that is available on Mendeley Web. Developers should note that, to ensure user anonymity, the user and article ids employed in the API do not correspond to the ids used in the data set. Mendeley may contact developers if changes to the data set are required. Mendeley's data set is being provided for non-commercial scientific use only.
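
Once the files have been obtained, the summary figures reported in Section 2 can be recomputed as a sanity check. The following Python sketch illustrates one way this might be done. The file layout is an assumption, not something specified here: it supposes tab-separated rows of the form user id, article id, flag (1 or 0), and the sample strings and function names (load_flags, summarize, percent) are hypothetical stand-ins for the real readership and starring files.

```python
import csv
import io

# Hypothetical layout (an assumption): one library entry per row,
# "user_id<TAB>article_id<TAB>flag", with flag 1 for read/starred and 0 otherwise.
SAMPLE_READ = "u1\ta1\t1\nu1\ta2\t0\nu2\ta1\t0\nu2\ta3\t1\n"
SAMPLE_STARRED = "u1\ta1\t0\nu1\ta2\t1\nu2\ta1\t0\nu2\ta3\t0\n"


def load_flags(handle):
    """Map each (user_id, article_id) library entry to a boolean flag."""
    flags = {}
    for user_id, article_id, flag in csv.reader(handle, delimiter="\t"):
        flags[(user_id, article_id)] = flag == "1"
    return flags


def summarize(read_flags, starred_flags):
    """Compute library-level counts and the read and starred shares."""
    entries = read_flags.keys()
    total = len(entries)
    return {
        "users": len({user for user, _ in entries}),
        "entries": total,
        "unique_articles": len({article for _, article in entries}),
        "read_share": sum(read_flags.values()) / total,
        "starred_share": sum(starred_flags.values()) / total,
    }


def percent(part, whole):
    """Express part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


stats = summarize(
    load_flags(io.StringIO(SAMPLE_READ)),
    load_flags(io.StringIO(SAMPLE_STARRED)),
)
print(stats)

# Cross-check the percentages quoted in Section 2 from the raw counts.
print(percent(1_466_489, 4_848_724), percent(615_308, 4_848_724))
```

With the real files, the io.StringIO objects would be replaced by opened file handles. The final line prints 30 and 13, which agrees with the read and starred percentages quoted in Section 2.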