On reconstructing and analyzing personal learning environments of scientific artifacts
Felix Mödritscher, Barbara Krumay, Edgar Kadlec, Wolfgang Taferner
Institute for Information Systems and New Media,
Vienna University of Economics and Business
Augasse 2-6, 1090 Vienna, Austria
+43-1-31336-5277
{felix.moedritscher, barbara.krumay, edgar.kadlec}@wu.ac.at, wolfgang@taferner.org
ABSTRACT
Reflecting on one’s scientific publications, one can observe that
certain papers perform better than others, i.e. they have more impact
on scientific communities and are cited more often. In this
paper we examine the circumstances of successful and
average publications by reconstructing the personal learning
environments (PLEs) which have led to them. Specifically, we
harvest the authors, the literature used, the communities
addressed, and the publications and researchers citing a paper.
This yields data-sets for technology-enhanced
learning (TEL) that allow an in-depth exploration of
specific research questions. To show the usefulness of
our data gathering approach, we examine whether important papers in a
scientific community are characterized by a high degree centrality
in its citation network and whether older publications with a similar
impact are cited more often than recently published papers.
Categories and Subject Descriptors
I.2.4 [Artificial Intelligence]: Knowledge Representation
Formalisms and Methods: semantic networks, E.2 [Data
Structures]: Data Storage Representations: linked
representations, H.2.8 [Information Systems]: Database
Applications: scientific databases.
General Terms
Algorithms, Measurement, Experimentation.
Keywords
Personal Learning Environments, Scientific Publications, PLE
Reconstruction, Citation Network, Impact, Network Analysis.
1. INTRODUCTION
When scientists look back on their career and want to
reflect on their research work, they normally evaluate their
publications with respect to their impact in scientific communities.
Some also analyze access statistics of, and links to, their
research blogs, but this aspect is not examined further here.
In any case, scientists try to figure out which communities or
businesses take up their research outcomes and how well their
papers perform in terms of popularity and impact. They have
different possibilities to achieve this.
For three decades, the Internet has provided valuable tools for
measuring the success of publications. A first generation of web
applications for this purpose includes CiteSeer (nowadays
CiteSeerX), DBLP, the ISI Web of Knowledge, and the ACM
portal. These tools (except DBLP) provide a citation
indexing mechanism and allow users to check which publications
are cited by other papers [1]. Besides, various metrics – e.g.
impact factors or acceptance rates of journals and conferences [2]
– can be used to estimate the success of publications, although
such numbers are criticized from time to time [3]. More recent tools,
like Google Scholar (see http://scholar.google.com/) or the
Scholar-based desktop application “Publish or Perish” (cf.
http://www.harzing.com/pop.htm), are useful for scientists because of
the accuracy of their data, and they allow analyzing which papers and
researchers cite one’s publications.
Finally, community-based approaches like Mendeley (see
http://www.mendeley.com/) build upon the publications and
metadata provided by end-users (the researchers!) but also track
real usage data through a desktop application, e.g. which papers a
user has read. Although still exploring possible features and being
rather a social networking platform for researchers at this time,
Mendeley has captured over 73 million articles (metadata and in
many cases also the documents, March 2011) in about 3 years.
The potential of this bottom-up approach is outstanding due to the
amount and quality of user-given data available.
In this paper we attempt to gather accurate bibliographic data
from a high-quality source and analyze it according to a PLE-
based model, namely a special kind of citation network [4]. This
data-driven analysis of scientific publications aims at identifying
characteristics and factors of successful PLE outcomes, whereby
we want to evaluate the following assumptions: (1) The success of
a paper correlates with the growth and the topology of the citation
network (‘centrality assumption’). (2) Preferential attachment [5]
plays a role (‘rich-are-getting-richer assumption’).
The next section elaborates our PLE-based approach,
outlines preliminaries, and gives an overview of related work
in this direction. Section 3 describes the (semi-automated)
method for reconstructing and analyzing
the personal learning environments of selected publications.
Section 4 summarizes a case study and its first findings.
Finally, section 5 discusses the approach as well as our two
assumptions, before an outlook on future work is given.
2. CONCEPTUAL APPROACH,
PRELIMINARIES, AND RELATED WORK
As mentioned before, we analyze the success of scientific
outcomes on the basis of a PLE perspective. A personal learning
environment (PLE) refers to “a set of learning tools, services, and
artifacts gathered from various contexts to be used by the
at empowering learners to design (ICT-based) environments for
their activities so that they can connect to learner networks and
collaborate on shared outcomes and acquire necessary
(professional and rich professional) competences.
In earlier research [8] we have elaborated the notion and the most
important concepts of PLE-based learning ecologies. Figure 1
visualizes what PLE-based collaboration looks like. Learners are
involved in different activities in which they try to achieve
personal and group goals. They use various tools to collaborate on
shared artifacts. In the context of this paper, we focus on the
outcomes of such activities, namely the publications created by
one or more scientists – and even single-authored papers could
involve other actors like reviewers or editors.
Figure 1. Example scenario for PLE-based collaboration.
On a more theoretical level and putting the learner (actor) center
stage, Klamma and Petrushyna [9] propose a model of learning
ecologies which is based on the Actor-Network Theory (ANT)
and describes five important PLE entities:
Processes: Activities carried out for educational reasons, at
workplace, or due to personal goals (e.g. a job task in a business
process, attending a course for further education, or a spare time
activity requiring the acquisition of new competences)
Media: Collection of learning resources required for or
created in these activities (e.g. the Wikipedia platform, learning
objects repository, or simply the Internet)
Artifacts: Documents and other (digital or real-world)
artifacts collaboratively created and accessed by learners (e.g.
Wiki articles or a joint paper)
Agents: Actors, no matter if humans or software (e.g. peer
learners or functionality provided via Internet)
Communities: People sharing the same environment, e.g. in
terms of having common interests, working on the same
artifacts, being connected to the same actors (e.g. a group of
learners trying to achieve a course goal or a special interest
group for a specific topic)
In the scope of this paper, the PLE related to a publication can be
described as follows. A scientific paper is an outcome of a PLE-
based activity involving several human agents with different roles
(main author, co-authors, organizer/editor, reviewers, etc.) and
using different tools (MS Word, email, conference/journal
submission system, etc.). Realistically the PLE of a paper cannot
be fully reconstructed any more, as e.g. the tools used and the
interaction sequences have not been tracked.
Thus, we could use the following information about a paper: (a)
the list of the authors, (b) the shared artifacts used (reference
section of a paper), (c) other PLE activities (publications) citing
this paper, and (d) the relation of the authors to other papers (e.g.
to analyze the impact of self-citations). Normally, a paper also
addresses one or a few scientific communities which can be
determined by the targeted journal or conference. In this
respect, it would also be possible to restrict the bibliographic data to
a community by (e) considering only certain conference or
journal series and differentiating between citations within and
outside this community.
To reconstruct the PLE behind a publication, we follow an
approach similar to Google’s PageRank algorithm [10]. We start with
the paper and try to retrieve the authors, the title, a unique
identifier (URI), and the scientific event (conference, journal). In
the next step, we try to find all papers listed in the reference
section and the ones citing the paper, e.g. through a citation index
like CiteseerX, the ACM library, or Google Scholar. For these
papers we try to retrieve the above-mentioned metadata again.
This iterative algorithm can be stopped if citations leave a
certain community (i.e. a set of certain scientific events) or if no
new links to citing or cited papers are found any more.
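The iterative procedure described above can be sketched as a breadth-first traversal. The following Python sketch is not the authors' implementation; the function name harvest, the dictionaries references, cited_by, and venue (stand-ins for lookups against a citation index), and the max_depth parameter are assumptions made for illustration only.

```python
from collections import deque


def harvest(start, references, cited_by, venue, community=None, max_depth=2):
    """Breadth-first reconstruction of a citation network around `start`.

    references / cited_by: dicts mapping a paper id to a list of paper ids
        (hypothetical stand-ins for queries against a citation index).
    venue: dict mapping a paper id to its scientific event.
    community: optional set of events; papers outside it are recorded
        but not expanded further (first stopping criterion).
    Returns a set of directed edges (citing_paper, cited_paper).
    """
    edges = set()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:  # second stopping criterion: no new papers are found
        paper, depth = queue.popleft()
        neighbours = []
        for ref in references.get(paper, []):
            edges.add((paper, ref))
            neighbours.append(ref)
        for citing in cited_by.get(paper, []):
            edges.add((citing, paper))
            neighbours.append(citing)
        if depth >= max_depth:
            continue  # links are recorded, but neighbours are not expanded
        for other in neighbours:
            if other in seen:
                continue
            seen.add(other)
            if community is not None and venue.get(other) not in community:
                continue  # the citation leaves the community
            queue.append((other, depth + 1))
    return edges
```

With max_depth=0 the sketch returns only the starting paper together with the papers it cites and the papers citing it; each additional level expands the papers found in the previous one.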
Experiences from the fields of information retrieval (e.g. various
Google services) and technology-enhanced learning (e.g. the 3A
contextual ranking system [11]) indicate promising results, as
PageRank identifies and favors hubs in scale-free networks.
Implicitly we assume that citation networks follow the
characteristics of scale-free networks – degree distribution
according to power law or at least asymptotically, preferential
attachment as underlying mechanism, etc. –, which is evidenced
by related work from research e.g. on citation distribution [12].
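For reference, the two scale-free properties named above are commonly written as follows in the Barabási-Albert model [5]. The symbols are the conventional ones and are not defined in this paper: P(k) is the fraction of nodes with degree k, \gamma is the power-law exponent, and \Pi(k_i) is the probability that a newly added node links to node i, whose current degree is k_i.

```latex
P(k) \sim k^{-\gamma},
\qquad
\Pi(k_i) = \frac{k_i}{\sum_j k_j}
```

In the original Barabási-Albert model this attachment rule leads to \gamma = 3; real citation networks need not match this value exactly.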
3. RECONSTRUCTION AND ANALYSIS
OF SCIENTIFIC PLES
Against this background we have developed a method to gather
TEL data-sets for reconstructing the PLEs behind publications
and for examining the circumstances of this special kind of PLE
outcome, e.g. for measuring its impact in a scientific
community. Specifically, we analyze citation networks and assume
that well-known papers are hubs in such a network (‘centrality
assumption’). Moreover, we examine papers having a similar
impact within one community and assume that older papers have
collected more citation links in the meantime
(‘rich-are-getting-richer assumption’).
For our PLE reconstruction method we first had to decide where
to get the bibliographic data from. Possible sources on the Internet
are CiteseerX (CX), the ISI Web of Knowledge (WoK), the ACM
Digital Library (ACM), Google Scholar (GS), and Mendeley (M).
We evaluated these five platforms on the basis of selected
publications on well-known Web concepts (i.e. data mining, the
Semantic Web, and PageRank) and compared them according to
the number and quality of the data retrieved.
Table 1 gives an overview of this comparison. On the one hand,
the statistics confirm that the data quality of CiteseerX is poor, as
it does not contain much data, or much valid data, on the four well-known
publications. On the other hand, the ISI Web of
Knowledge and the ACM Digital Library provide high-quality
bibliographic data, but their coverage seems to be poor (only one
out of four papers indexed; significantly fewer citations than other
platforms). Mendeley is not a real citation index; it contains
usage data (no. readers) rather than citations. Yet, this metric could be
interesting and valuable, as it comprises real usage data. In sum,
we decided to use Google Scholar, which contains significantly
more, and more up-to-date, data. Moreover, the quality of this
data is on a reasonable level, which is also backed up by other
studies, e.g. one on citation mining [13].
Table 1. Comparison of different sources for bibliographic
data (CiteseerX [CX], ISI Web of Knowledge [WoK], ACM
Digital Library [ACM], Google Scholar [GS], Mendeley [M])
according to the no. citations of four well-known papers (data
retrieved in January 2011; *) no. readers).
Publication on   CX     WoK    ACM    GS      M*)
Data mining      n.a.   n.a.   n.a.   10700   61
Semantic Web     n.a.   1159   n.a.   10709   323
PageRank (1)     1301   n.a.   n.a.   3670    44
PageRank (2)     2140   n.a.   1534   7245    573
Next, we implemented an algorithm to harvest the bibliographic data
on the basis of AMP technology (Apache, MySQL, PHP).
Starting with a specific paper, our approach ideally considers the
papers cited within this paper. Then, it iteratively gathers the
metadata of the papers cited by and citing these publications (i.e. the
starting paper and the references used within it).
Figure 2. Iterative growth of an example citation network
created by the reconstruction algorithm.
Figure 2 sketches the core idea of the algorithm. The publication
in the middle (grey node) is the starting point. The blue nodes are
the papers citing this paper, whereby lighter blue tones indicate
that the paper was added in a later iteration (e.g. the nodes 2 and 6
are added in the second iteration, 1 is added in the third iteration).
In contrast, the cited papers are visualized by red nodes;
again, lighter tones describe papers added after several iterations
of our algorithm. By using the information on authors and
scientific events it would also be possible to differentiate between
‘regular’ citations and self-citations as well as community-internal
and external citations. However, this aspect has not been
investigated in this paper.
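The differentiation mentioned above was not investigated in this paper, so the following Python sketch is purely illustrative. It assumes that a self-citation is a link between two papers sharing at least one author name, and that a community is given as a set of scientific events; the function name classify_citation and the dictionaries authors and venue are hypothetical.

```python
def classify_citation(citing, cited, authors, venue, community):
    """Return (is_self_citation, is_community_internal) for one citation link.

    authors: dict mapping a paper id to a list of author names
        (assumption: names are already normalized to comparable strings).
    venue: dict mapping a paper id to its scientific event.
    community: set of events that define the community.
    """
    shared = set(authors.get(citing, ())) & set(authors.get(cited, ()))
    is_self = bool(shared)
    is_internal = (venue.get(citing) in community
                   and venue.get(cited) in community)
    return is_self, is_internal
```

Matching authors by exact name is a simplification; real bibliographic data would require name disambiguation before such a classification is reliable.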
4. CASE STUDY AND FIRST FINDINGS
To try out this approach in practice, we conducted a study which is
described in the following. However, our prototype includes no
citation mining algorithm, which simply would not be possible,
as not all papers are available in full-text and such techniques
(like [13]) are restricted to one specific citation style. We therefore
simplified this step: the papers cited in the starting publication
have to be specified manually. Thus, we only
reconstruct the citation network to the first level of the cited
papers (nodes 8, 9, and 10 in Figure 2). Following Scholar’s cited-by
links for all these papers (starting paper and the cited
publications which are specified manually) works reliably
and is used for creating the citation network.
Table 2. Network size of an example paper and for different
harvesting depths, based on Google Scholar data ( *) Only 10
out of 24 cited publications are indexed by Google Scholar;
web-sites, technical documentations and little-known papers
are not in the index and thus have been ignored).
Level (depth)   Is citing   Cited by   Sum
0               10*)        22         32
1               1559        131        1690
2               16654       492        17146
3               184080      1490       185570
For our case study, we selected a lesser-known publication from the
field of ‘adaptive hypermedia’ and harvested the bibliographic
data according to the method described in the last section
over a period of 5 days in January 2011. Table 2 summarizes the
key statistics of the data retrieved along the recursion depth
of the harvesting algorithm. ‘Level 3’ (3 recursive
iterations from the starting paper) already leads to 185570 items
(i.e. publication metadata) and their semantic relations by means
of cited-by links. Figure 3 visualizes the ‘level 0’ citation
network, i.e. the starting paper plus the cited and the cited-by
publications.
Figure 3. ‘Level 0’ citation network consisting of the starting
paper and all publications being cited by it and citing it.
In Table 2, it is noticeable that the number of cited-by links of
the cited literature grows significantly more strongly than the cited-by
links of the starting paper. A plausible reason is that
some of the publications cited in the starting paper or citing one
of these references are considered to be fundamental literature in
the selected scientific field, while the starting paper and the ones
referring to it do not have so much impact in sum, as it and all the
citing follow-up papers are much younger.
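The difference in growth can be made concrete by computing the level-to-level growth factors from the Table 2 figures. The short Python sketch below only restates the numbers of Table 2; the helper name growth_factors is an assumption for illustration.

```python
# Table 2: harvesting level -> (is citing, cited by)
table2 = {0: (10, 22), 1: (1559, 131), 2: (16654, 492), 3: (184080, 1490)}


def growth_factors(counts):
    """Ratio of each count to the count of the previous level."""
    return [round(later / earlier, 2)
            for earlier, later in zip(counts, counts[1:])]


levels = sorted(table2)
is_citing = [table2[level][0] for level in levels]
cited_by = [table2[level][1] for level in levels]

print("is citing:", growth_factors(is_citing))
print("cited by: ", growth_factors(cited_by))
```

The 'is citing' side grows by factors of roughly 156, 11, and 11, while the 'cited by' side grows by factors of roughly 6, 4, and 3. The first 'is citing' factor should be read with care, since only 10 of the 24 cited publications were indexed at level 0.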
With respect to this observation, our first assumption addresses
the relevance of a paper for a community. Precisely we assume
that key publications of a scientific community (or publications
bridging different communities) are identifiable according to their
degree centrality within such a Scholar-based citation network if
it is reconstructed on the basis of sufficient data, i.e. by at least
two iterations of our PLE reconstruction algorithm. Figure 4
indicates that the degree distribution of the ‘level 2’ citation
network reconstructed for our case study follows a power law.
Figure 4: Degree distribution of the ‘level 2’ citation network.
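As a minimal sketch of how such a degree analysis could be carried out, the following Python code counts the citation links per paper and estimates the slope of the degree distribution on a log-log scale. It is not the procedure used to produce Figure 4. It assumes that degree means the number of incoming plus outgoing citation links, and it uses a simple least-squares fit, which is only a rough estimator of a power-law exponent; the function names degrees and loglog_slope are hypothetical.

```python
import math
from collections import Counter


def degrees(edges):
    """Degree (incoming plus outgoing links) per node.

    edges: iterable of (citing_paper, cited_paper) pairs.
    """
    deg = Counter()
    for citing, cited in edges:
        deg[citing] += 1
        deg[cited] += 1
    return deg


def loglog_slope(deg):
    """Least-squares slope of log(number of nodes) against log(degree).

    For a power law P(k) ~ k^(-gamma) the slope is roughly -gamma.
    Requires at least two distinct degree values.
    """
    histogram = Counter(deg.values())  # degree -> number of nodes
    xs = [math.log(k) for k in histogram]
    ys = [math.log(n) for n in histogram.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Candidate key publications are then the nodes with the highest degree, e.g. degrees(edges).most_common(5).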
In the ‘level 2’ citation network (17146 nodes), we identified the
following top-5 papers on ‘adaptive hypermedia’:
Paper 1: Brusilovsky, “Methods and techniques of adaptive
hypermedia”, 1996 (1775 citation links)
Paper 2: Brusilovsky, “Adaptive hypermedia”, 2001 (1286
citation links)
Paper 3: Cheverst et al., “Developing a context-aware
electronic tourist guide: some issues and experiences”, 2000
(797 citation links)
Paper 4: De Bra et al., “AHA! An open adaptive hypermedia
architecture”, 1998 (791 citation links)
Paper 5: Kobsa et al., “Personalised hypermedia presentation
techniques for improving online customer relationships”, 2001
(565 citation links)
To our knowledge of this field, these papers are of high
significance for the ‘adaptive hypermedia’ community.
Furthermore, our citation network creation method seems to be
applicable to more practical scenarios, like personalized
recommendations of key literature to academics who are new in a
field or the provision of information and visualizations for
reflecting on research outcomes or for identifying researchers who
build upon ideas from former publications.
Another observation from Table 2 concerns the aging of
publications. With respect to the concept of preferential
attachment [5], we assume that older publications have collected
more citation links than younger ones. Therefore we compare the
two most prominent papers according to their ‘fitness’ which can
be expressed e.g. through a paper’s citation history.
Figure 5. Citation history of paper 1 (no. citations visualized
according to the iteration levels of data harvesting)
Figure 6. Citation history of paper 2 (no. citations visualized
according to the iteration levels of data harvesting)
Figures 5 and 6 present the citation histories of these two papers. A
first issue to mention is that Google Scholar does not retrieve all
possible cited-by papers of a publication, as the number of search
results is restricted. This explains why further iterations
lead to more citations per year, as visualized in these two figures.
Our harvesting algorithm does not only increase the size of the
citation network through each iteration but also the quality of the
data gathered. Consequently, it is recommended to run more than
one iteration when reconstructing the citation network of a paper.
In addition, the two figures indicate that paper 1 (with
publication year 1996) has more citations than paper 2 (2001). In
principle, the rich-are-getting-richer assumption seems to be valid,
although the shape of the two history curves suggests that the
second paper (the younger one!) is fitter because it has collected
more citation links in a shorter period of time and the curve has a
higher gradient as well as higher citation numbers per year. Based
on the history of both papers, it is likely that
paper 2 will outpace paper 1 in a few years.
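A back-of-the-envelope check of this comparison divides the citation links of each paper in the 'level 2' network by its age at harvesting time. The Python sketch below uses only the link counts and publication years listed above. The constant average it computes is a simplifying assumption that ignores the actual shape of the curves in Figures 5 and 6, and the helper name links_per_year is hypothetical.

```python
HARVEST_YEAR = 2011  # the data were retrieved in January 2011

# paper -> (publication year, citation links in the 'level 2' network)
papers = {
    "Paper 1 (Brusilovsky, 1996)": (1996, 1775),
    "Paper 2 (Brusilovsky, 2001)": (2001, 1286),
}


def links_per_year(pub_year, links, now=HARVEST_YEAR):
    """Average number of citation links collected per year since publication."""
    return links / (now - pub_year)


for name, (pub_year, links) in papers.items():
    print(name, round(links_per_year(pub_year, links), 1))
```

Under this crude average, paper 1 has collected about 118 links per year and paper 2 about 129, which is consistent with the observation that the younger paper is the fitter one.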
Being fully in line with the concept of preferential attachment,
this observation allows drawing the conclusion that older
publications tend to collect more citation
links. However, each network node (paper) can be characterized
through a fitness factor, which implies that there can always be a
newly published paper which exceeds the prominent publications
of a scientific community in less time. Yet the probability of such
an unexpected event is incredibly low.
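One common way to formalize such a fitness factor is the fitness model of Bianconi and Barabási (entry [14] in the reference list), in which the attachment probability of a node depends on both its degree and its fitness. The notation below follows the usual presentation of that model and is not defined in this paper: k_i is the degree of paper i and \eta_i its fitness.

```latex
\Pi_i = \frac{\eta_i \, k_i}{\sum_j \eta_j \, k_j}
```

If all papers have the same fitness, this reduces to plain preferential attachment; a young paper with a high \eta_i can nevertheless attract links faster than an older, better-connected one.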
5. CONCLUSIONS, DISCUSSIONS, AND
FUTURE WORK
In this paper we explained how PLEs relate to scientific paper
writing and successfully publishing papers. Furthermore and with
respect to the user-centric nature of PLEs, we presented a method
to reconstruct the environment of a (user-selected) publication in
the form of the citation network. Our solution approach harvests
bibliographic data from Google Scholar and creates the citation
network iteratively. Through a case study we indicated possible
application scenarios (personalized recommendations of key
publications in scientific communities, facilities for reflecting
one’s former research strategy and outcomes) and tried to answer
two research questions concerning our approach.
Addressing the ‘centrality assumption’ we demonstrated how to
identify the most relevant publications in the field of adaptive
hypermedia, whereby we started with an arbitrary paper in this
field and examined the degree distribution. To our
knowledge, the top-5 papers obtained through this method are
highly relevant for the community addressed. Concerning the
‘rich-are-getting-richer assumption’, the preferential attachment
mechanism is observable within the citation network created for
our case study. In principle, this assumption is also valid, although
we identified that each node (paper) in the citation network can be
characterized by a fitness factor, which means that young
publications can collect more citation links than older ones over
the same timeframe. However, work on the fitness of publications
is considered to be out of the scope of this paper.
Overall, our approach for reconstructing (parts of) PLEs behind
scientific papers leads to valuable TEL data-sets. Yet the
evaluation of this research is restricted to the field of adaptive
hypermedia only. For future work it would be necessary to
analyze citation networks for other areas as well. Furthermore, it
would be useful to have a (web-based) tool allowing users to create
the citation network for any of their papers. Here, we face
restrictions given by Google – as Scholar is still in a beta phase,
there is no web service API for accessing the bibliographic data,
and automated extraction is allowed for research purposes only.
6. ACKNOWLEDGMENTS
The research leading to these results has received funding from
the EC's Seventh Framework Programme (FP7/2007-2013) under
grant agreement no 231396 (ROLE project).
7. REFERENCES
[1] Garfield, E., and Merton, R.K. 1979. Citation indexing: Its
theory and application in science, technology, and
humanities. Wiley, New York, NY.
[2] Garfield, E. 2006. The history and meaning of the journal
impact factor. Journal of the American Medical Association
295, 1, 90-93.
[3] Jeang, K.-T. 2007. Impact factor, H index, peer comparisons,
and Retrovirology: is it time to individualize citation
metrics? Retrovirology 4, 42, retrieved from
http://www.retrovirology.com/content/pdf/1742-4690-4-42.pdf
(2010-10-27).
[4] Batagelj, V. 2003. Efficient Algorithms for Citation Network
Analysis. arXiv:cs/0309023 (Sept 2003), retrieved from
http://arxiv.org/pdf/cs.DS/0309023 (2010-10-27).
[5] Barabási, A.-L., and Albert, R. 1999. Emergence of Scaling
in Random Networks. Science 286, 5439, 509-512.
[6] Henri, F., Charlier, B., and Limpens, F. 2008. Understanding
PLE as an Essential Component of the Learning Process. In
Proc. of ED-Media (Vienna, Austria, Jun 30-Jul 4, 2008).
AACE, Chesapeake, VA, 3766-3770.
[7] Van Harmelen, M. 2008. Design trajectories: Four
experiments in PLE implementation. Interactive Learning
Environments 16, 1, 35-46.
[8] Mödritscher, F., and Petrushyna, Z. 2009. Model and
Methodology for PLE-Based Collaboration in Learning
Ecologies. Deliverable D7.1/ID7.2, ROLE consortium.
[9] Klamma, R., and Petrushyna, Z. 2008. The Troll Under the
Bridge: Data Management for Huge Web Science
Mediabases. In Proc. of the 38. Jahrestagung der
Gesellschaft für Informatik e.V. (GI), die INFORMATIK
2008 (München, Germany, Sept 8-13, 2008), Köllen
Druck+Verlag GmbH, Bonn, 923-928.
[10] Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The
pagerank citation ranking: Bringing order to the web.
Technical report, Stanford Digital Library Technologies
Project.
[11] El Helou, S., Salzmann, C., Sire, S., and Gillet, D. 2009. The
3A Contextual Ranking System: Simultaneously
Recommending Actors, Assets, and Group Activities. In
Proc. of the ACM Conference On Recommender Systems
(New York, USA, Oct 22-25, 2009), ACM, New York, 373-376.
[12] Redner, S. 1998. How Popular is Your Paper? An Empirical
Study of the Citation Distribution. European Physical
Journal B 4, 2, Springer, 131-134.
[13] Afzal, M.T., Maurer, H., Balke, W., and Kulathuramaiyer,
N. 2010. Rule based Autonomous Citation Mining with
TIERL. Journal of Digital Information Management (JDIM)
8, 3, 196-204.
[14] Bianconi, G., and Barabási, A.-L. 2001. Competition and
multiscaling in evolving networks. Europhysics Letters 54, 4
(May 2001), 436-442.


