
Diffusion of scientific credits and the ranking of scientists

by Filippo Radicchi, Santo Fortunato, Benjamin Markines, Alessandro Vespignani
Physical Review E - Statistical, Nonlinear and Soft Matter Physics (2009)

Abstract

Recently, the abundance of digital data enabled the implementation of graph based ranking algorithms that provide system level analysis for ranking publications and authors. Here we take advantage of the entire Physical Review publication archive (1893-2006) to construct authors' networks where weighted edges, as measured from opportunely normalized citation counts, define a proxy for the mechanism of scientific credit transfer. On this network we define a ranking method based on a diffusion algorithm that mimics the spreading of scientific credits on the network. We compare the results obtained with our algorithm with those obtained by local measures such as the citation count and provide a statistical analysis of the assignment of major career awards in the area of Physics. A web site where the algorithm is made available to perform customized rank analysis can be found at the address http://www.physauthorsrank.org


Available from arxiv.org

arXiv:0907.1050v2 [physics.soc-ph] 23 Sep 2009
Filippo Radicchi,1 Santo Fortunato,1 Benjamin Markines,2 and Alessandro Vespignani2, 1
1Complex Networks and Systems, Institute for Scientific Interchange (ISI), Torino, Italy
2Center for Complex Networks and Systems Research (CNetS),
School of Informatics and Computing, Indiana University, USA
PACS numbers:
I. INTRODUCTION
Recently, the recording of social interactions and data
in the electronic format has made available datasets of
unprecedented size. This is particularly evident for bibli-
ographic data whose study has received a boost from the
information technology revolution and the digitalization
process. This has led to the definition of ranking mea-
sures which are supposed to provide objective and quan-
titative measures of the importance of journals, papers,
programs, people and disciplines [1, 2]. While the validity
of these metrics is the object of debate [3], it is now standard
practice to consider measures such as the impact factor,
the number of citations and the h-index [4] to assess the
scientific research production of individuals and institu-
tions. In this context the use of multipartite networks as
the natural abstract mathematical representation of the
data is particularly convenient and several studies have
recently focused on the study of co-authorship networks,
paper citation networks, etc. [5–9]. In general, each of
these networks is an appropriate bipartite or unipartite
network projection of the original bibliographic dataset
where authors and papers are nodes and citations, au-
thorship and other bibliographic information define the
links among nodes [9, 10].
The possibility of a system level study of these net-
works has opened new possibilities for the bibliometric
analysis aimed at evaluating the impact of scientific col-
lections, publications and scholar authors. In particular,
the field has leveraged graph-based ranking algorithms
developed in the context of the World Wide Web [11–15]
to quantify the impact and prestige of papers and authors.
The final goal of ranking bibliographic data is even more
ambitious as it ultimately concerns the possibility of pre-
dicting the evolution of impact and ranks on the basis of
past data [13].
Criticisms of the ranking mechanism are generally
rooted in the fact that the common indicators, like the
simple citation counts or the metrics derived from this
quantity, do not truly account for the actual merit of a
scientist. Citations have different values depending on
who is the citing scientist, defining a complicated mech-
anism of scientific credit diffusion from author to author.
Even at the simplest level, this is a very non-local process
in which scientists endorse each other through the process
of citing each other’s works. In order to take into account
this perspective, we have defined an approach that bases
the author’s ranking on a diffusion algorithm that mim-
ics the diffusion of scientific credits along time. Here we
take advantage of the set of all 407 236 papers published
between 1893 and 2006 in journals of the Physical Review
(PR) collection (see section II for a detailed description of
the set). This collection is surely an exceptional proxy of
the activity in the physical sciences and the impact that
individual scientists have generated in the field [16]. The
PR dataset has been already exploited to analyze paper
citation network and measure the impact of a specific
paper both with local (individual paper/author) metrics
(number of citations) and with graph-based ranking al-
gorithms [10, 15]. Here we propose a system level algo-
rithm with the aim of ranking authors by mimicking the
scientific credit spreading process. We first construct an
author-to-author citation network that fully accounts for
the bibliometric data relative to the credit given from any
author to other authors. We then define an appropriate
graph-based ranking algorithm that simulates the diffu-
sion of credits exchanged by the authors over the whole
network. The algorithm takes into account that citations
from more important authors have higher relevance than
citations from less important authors and the non-local
nature of the diffusion process in which any author can
in principle impact the score of far away nodes through
the diffusion process. Finally, the proposed ranking tech-
nique is compared with other commonly used methods,
which are based only on local properties of the citation
network.
The paper is organized as follows. We first give a
brief description of the PR dataset (section II). In section III
the weighted citation network between authors is
defined and analyzed. The Science Author Rank Algorithm (SARA)
is described in section IV.
This algorithm is used for the estimation of the scien-
tific impact of physicists along time. We compare SARA
with other ranking schemes like Citation Count and Bal-
anced Citation Count in section V. In section VI, we
test SARA by using the list of the winners of the major
prizes in Physics. This list of prominent physicists is in
fact the best benchmark on which we may test our algo-
rithm. We finally conclude and report final comments in
section VII.
II. DESCRIPTION OF THE DATASET
Our database is composed of the set of all 407 236 pa-
pers published between 1893 and 2006 in journals of the
collection of Physical Review (PR). The journals consid-
ered here are Physical Review Series I, Physical Review,
Physical Review A, Physical Review B, Physical Review
C, Physical Review D, Physical Review E, Physical Re-
view Letters and Reviews of Modern Physics. For each
paper the editorial office of PR provided an xml file from
which we can extract the names of its author(s), date,
journal, volume and page of publication, its references,
the PACS [22] numbers and other additional information.
The list of references at the end of each paper allows
us to construct a network of citations between papers. According
to our database, the total number of references
(obtained by summing all references over all papers) is
9 359 556, of which 3 866 471 [23] are internal references
(i.e., references to papers that appeared in PR journals).
In this work we have neglected all references of the
type “First author et al. ” and all references pointing to
papers written by authors without any publication in the
PR journals. Using these criteria, we identify 8 783 994
total references (including the 3 866 471 internal refer-
ences).
In the rest of the paper and all our analysis, we con-
sider all 8 783 994 references. As already stated, these
references include all papers, published or not in PR jour-
nals, referenced by papers published only in PR journals.
III. CONSTRUCTION OF THE WEIGHTED
AUTHOR CITATION NETWORK
A weighted citation network between authors (WACN)
can be easily determined as a particular projection of the
paper citation network (PCN) constructed by the list of
references described in section II [see Figure 1]. Consider
for instance a paper i, written by the n co-authors i1, i2,
. . . , in, which cites a paper j, written by the m co-authors
j1, j2, . . . , jm. A natural way to project the unweighted
directed link i → j between papers i and j into a WACN
is to create n · m directed connections from each of the
n citing authors to every of the m cited authors (i.e.,
Figure 1: (Color online) Projection of the PCN into a WACN.
(a) In the network of citations between papers, the article
i, written by two authors i1 and i2, cites two papers j and
k, written by one author j1 and two co-authors k1 and k2,
respectively. (b) The WACN is then simply generated by
connecting with a directed link both i1 and i2 to j1, each
with weight 1/2, and to k1 and k2, each with weight 1/4.
$i_k \to j_s$, $\forall k = 1, \ldots, n$ and $\forall s = 1, \ldots, m$), where every
connection has weight equal to $w_{i_k, j_s} = 1/(n \cdot m)$. Given
a set of references (i.e., directed links between papers),
the weight of a directed link between two authors will be
the sum of all the weights over all the references in the
set.
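The projection rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the analysis: the function name `project_to_wacn` and the input layout (a dictionary of author lists and a list of citing/cited paper pairs) are assumptions made here. The toy data reproduce the example of Figure 1.

```python
from collections import defaultdict
from fractions import Fraction


def project_to_wacn(paper_authors, references):
    """Project paper-to-paper citations onto weighted author-to-author links.

    paper_authors: dict mapping a paper id to its list of author identifiers
    references: iterable of (citing_paper, cited_paper) pairs
    Returns a dict mapping (citing_author, cited_author) to the summed weight.
    """
    weights = defaultdict(Fraction)  # Fraction() starts at 0
    for citing, cited in references:
        citing_authors = paper_authors[citing]
        cited_authors = paper_authors[cited]
        # Each reference is split into n * m links of weight 1 / (n * m).
        w = Fraction(1, len(citing_authors) * len(cited_authors))
        for a in citing_authors:
            for b in cited_authors:
                weights[(a, b)] += w
    return dict(weights)


# Example of Figure 1: paper i (authors i1, i2) cites j (j1) and k (k1, k2).
paper_authors = {"i": ["i1", "i2"], "j": ["j1"], "k": ["k1", "k2"]}
references = [("i", "j"), ("i", "k")]
wacn = project_to_wacn(paper_authors, references)
```

With this rule every reference contributes a total weight of one to the network, regardless of how many authors are involved on either side.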
It is important to stress here that while the list of refer-
ences does not have ambiguity, the analysis of the author
projection opens the issue of name disambiguation. Indeed,
common names may refer to different authors and
not all authors report their full names in publications. In
other words we could have a multiplicity of authors iden-
tified by the same identifier. In appendix A we provide a
detailed analysis of this and other related problems which
are common issues in bibliometry.
As an example of the network construction, in Figure 2
we show the WACN of the top-scientists in the field of
“complex networks”. In order to construct this network,
we first select out of the PR dataset only papers whose
titles contain keywords such as “complex network”, “scale-free
network”, “small-world network”, etc. We then consider
their references and based on this list we project the PCN
into a WACN.
A. Dynamical Representation of the Weighted
Author Citation Network
In principle, a single WACN may be constructed based
on the full set of the 8 783 994 total references described
in section II. This is however not very informative as
very old citations are mixed with new ones, discounting
the dynamical information contained in the longitudinal
nature of the database. In addition, the rate of citation
per unit time is steadily increasing along the years. For
this reason, we define dynamical slices of the database
containing the same number of citations. We first sort
the full list of references according to their date (i.e.,
the date of the publication of the citing paper). Then
Figure 2: (Color online) We generated the citation network based on all papers published in PR journals about the topic
“complex networks”. For clarity, only links with weight above a certain threshold have been plotted. As a consequence only
top-physicists in this field are shown. The width of each connection is proportional to its weight and the size of the nodes is
proportional to the sum of all weights of incident links.
we divide this list in MI homogeneous intervals, where
homogeneous stands for intervals with the same num-
ber of references MR. In order to avoid abrupt changes,
we consider overlapping intervals, in the sense that the
q-th interval shares its first MR/2 references with the
(q − 1)-th interval and its last MR/2 references with the
(q + 1)-th interval. It should be noticed that this sharp
division may split references of the same citing paper into
different contiguous intervals, but this “border effect”
may be considered negligible since we consider MR much
larger than the average number of references per paper
(all results have been obtained by using MI = 39 and
MR = 488 000, while on average each paper has 20 to 30
references). Moreover, we should remark that we can re-
late each interval with real time by simply associating the
average of the dates of all the references belonging to the
interval with the interval itself. However, since the rate
of citation per unit of time is increasing almost exponen-
tially with time, the homogeneity of references in each
interval does not correspond to homogeneity in time: for
instance the first interval spans more than 70 years of
publications (1893-1966), while the last interval is rep-
resentative for the publications of only one year (2006).
The choice MR = 488 000 adopted in this paper ensures
that intervals are representative of periods of time not
shorter than one year.
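A minimal sketch of the slicing procedure follows. The function names and the tuple layout of a reference are assumptions made for illustration, and the handling of a trailing remainder shorter than one window (dropped here) is a choice of this sketch that the text does not specify.

```python
def overlapping_slices(references, window):
    """Split references into equal-size windows that overlap by half a window.

    references: list of (date, citing_paper, cited_paper) tuples
    window: number of references per slice (M_R in the text), assumed even
    A trailing remainder shorter than one window is dropped (assumption).
    """
    ordered = sorted(references, key=lambda r: r[0])  # sort by citing date
    step = window // 2
    return [ordered[start:start + window]
            for start in range(0, len(ordered) - window + 1, step)]


def slice_time(references_in_slice):
    """Associate a slice with the average date of its references."""
    return sum(r[0] for r in references_in_slice) / len(references_in_slice)


# Toy data: ten references dated 1900..1909, windows of four references.
toy = [(1900 + n, f"citing{n}", f"cited{n}") for n in range(10)]
slices = overlapping_slices(toy, window=4)
times = [slice_time(s) for s in slices]
```

In the toy run each slice shares its first two references with the previous slice and its last two with the next one, which is the overlap scheme described above.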
B. Properties of the Weighted Author Citation
Network
We provide in this section a simple statistical analysis
of the WACNs. In particular we monitor the number of
authors and their indegree and instrength distributions,
where for example the instrength of a node i is defined
as
$$ s_i^{\mathrm{in}} = \sum_j w_{ji} , \qquad (1) $$
i.e., the sum of all weights of the links pointing to i [17].
First of all, it is interesting to note that quantitatively
the properties of the WACNs are not constant in time.
This is understandable since the production of scientists
has strongly changed during the last century.
Figure 3: (Color online) In the main plot, the total number
of authors $N_{tot}$ (yellow circles), the number of authors with
outstrength larger than zero $N(s^{\mathrm{out}}>0) = \sum_j \theta(s_j^{\mathrm{out}})$ (green
squares) and the number of authors with instrength larger than
zero $N(s^{\mathrm{in}}>0) = \sum_j \theta(s_j^{\mathrm{in}})$ (red diamonds) are plotted as
functions of the number of references (referenced papers),
where θ (·) is the step function equal to one when its argument
is larger than zero and null otherwise. In the inset the same
quantities as those of the main plot are considered, but now
they are plotted as functions of time. More specifically, each
x-value corresponds to the average publication year of papers
belonging to the respective dynamical slice of the main plot.
From Figure 3, one can qualitatively appreciate the
former observation: the total number of nodes in the
network (i.e., the number of scientists citing or cited in
a particular period of time) is an increasing function of
time. It should be stressed that this behavior is mainly
a consequence of the increment of scientists in physics as
one can deduce from the time-increment of the number of
nodes with non-zero instrength (i.e., cited authors) that
is growing in a much slower fashion.
The indegree distributions calculated on different
WACNs are generally different. Nevertheless, if we con-
sider the relative indicator given by the ratio of the cit-
ing authors (kin) to a scientist in a given WACN divided
by the average number (〈kin〉) of citing authors over all
physicists in the same WACN, the distributions of the
rescaled variable kin/〈kin〉 obey the same universal curve
[see Figure 4a]. This result is in accordance with the re-
markable scaling recently discovered on PCNs [18]. The
same is not valid for the instrength distribution since a
simple scale transformation does not seem to lead to a
universal behavior.
IV. SCIENCE AUTHOR RANK ALGORITHM
The author-to-author network can be used to define a
graph based ranking algorithm that uses the global fea-
Figure 4: (Color online) Probability densities for the inde-
gree (a) and the instrength (b). Calculations have been per-
formed on different WACNs based on papers published in dif-
ferent periods of time (yellow circles 1893-2006, red squares
1893-1966, gray diamonds 2005). The insets show the same
distribution as in the main plots, but opportunely rescaled by
their average values.
tures of the network to account for the impact of each
author. Analogously to various ranking algorithms such
as PageRank [11], CiteRank [15], the HITS scores [12],
etc., we define an iterative algorithm based on the notion
of diffusing scientific credits. In practice we imagine that
each author owns a unit of credit which is distributed to
its neighbors proportionally to the weight of the directed
connection. Each author thus receives a credit that is
then redistributed to neighbors at the next iteration and
so on. In other words, the SARA simulates the diffusion
of credits on the global network according to a diffusion
probability proportional to the weight of the links.
Let us be more specific. Once the WACN has been de-
fined as detailed in section III, we calculate the SARA
score for each node i according to
$$ P_i = (1-q) \sum_j \frac{P_j}{s_j^{\mathrm{out}}} \, w_{ji} + q \, z_i + (1-q) \, z_i \sum_j P_j \, \delta\left(s_j^{\mathrm{out}}\right) . \qquad (2) $$
Here $P_i$ is the score of the node i, $0 \le q \le 1$ is the damping
factor, $w_{ji}$ is the weight of the directed connection
from j to i, $s_j^{\mathrm{out}}$ is the outstrength of the node j (i.e.,
the sum of the weights of all the links outgoing from the
j-th vertex, $s_j^{\mathrm{out}} = \sum_k w_{jk}$) and finally δ(x) = 1 if x = 0
and δ(x) = 0 otherwise. The first term on the r.h.s. of
Eq.(2) represents the diffusion of credit through the net-
work: scientist i receives a portion of credit from each
citing author j and each amount of credit is linearly pro-
portional to the weight wji of the arc linking j to i. The
second and the third terms stand for the redistribution
of credits to all scientists in the network. A portion q
of the credit of each node is redistributed to everyone
else (i.e., second term), with the exception of dangling
ends (i.e., nodes with null outstrength), which distribute
their whole credit (i.e., third term). The meaning of the
redistribution of credit is that everyone is in “scientific
debt” to the whole scientific community, since a general
background is at the basis of the knowledge of every
scientist. In particular, the credit is distributed homo-
geneously among papers in the network. The factor zi
takes into account the normalized scientific credit given
to the author i based on his productivity. zi is calculated
according to the formula
$$ z_i = \frac{\sum_p \delta_{p,i} \, / \, n_p}{\sum_j \sum_p \delta_{p,j} \, / \, n_p} , \qquad (3) $$
where p represents the generic paper p and np the num-
ber of authors who have written the paper p. Moreover,
δp,i = 1 only if the i-th author wrote the paper p, oth-
erwise it equals zero. The sum runs over all different
papers (citing and cited). Basically, each paper receiv-
ing a credit is going to redistribute it equally among all
co-authors of the paper. The fact that the zis are not ho-
mogeneous (differently from the original formulation of
PageRank [11], where zi = 1/N , ∀ i with N total num-
ber of authors) is of fundamental importance: each paper
is carrying the same amount of knowledge independently
of the number of co-authors. The denominator of the
r.h.s. of Eq.(3) serves only for normalization purposes.
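As an illustration of Eq. (3), the sketch below computes the productivity factors for a small set of papers. The function name `productivity_credit` and the input dictionary are assumptions of this sketch; the three papers are those of the Figure 1 example.

```python
from collections import defaultdict


def productivity_credit(paper_authors):
    """Compute z_i of Eq. (3): each paper gives 1/n_p to each of its n_p authors."""
    raw = defaultdict(float)
    for authors in paper_authors.values():
        for a in authors:
            raw[a] += 1.0 / len(authors)
    # Denominator of Eq. (3); it equals the number of papers, because every
    # paper distributes exactly one unit among its co-authors.
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}


# Papers of Figure 1: i (i1, i2), j (j1), k (k1, k2).
z = productivity_credit({"i": ["i1", "i2"], "j": ["j1"], "k": ["k1", "k2"]})
```

Here the single-author paper j gives its author one third of the total credit, while each author of a two-author paper receives one sixth.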
The stationary values of the Pis can be easily computed
recursively, by setting at the beginning Pi = zi , ∀i (but
the results are independent of the choice of the initial val-
ues) and iterating Eqs.(2) until they converge to values
stable within a priori fixed precision [24].
The scores calculated according to Eq.(2) depend on the
particular value chosen for the damping factor q. In all
results shown in this paper, we always set q = 0.1. This
is the value for which the predictive power of SARA is
maximized. An exploration of the dependence of the pre-
dictivity of SARA as a function of the damping factor q
is reported in Appendix B.
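The iteration of Eq. (2) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the implementation used to produce the results: the function name `sara_scores`, the stopping rule (largest absolute change below `tol`) and the default values of `tol` and `max_iter` are choices of this sketch. Every author appearing in a link is assumed to be a key of `z`.

```python
def sara_scores(weights, z, q=0.1, tol=1e-12, max_iter=10_000):
    """Iterate Eq. (2) from P_i = z_i until the scores stop changing.

    weights: dict mapping (j, i) to w_ji, the weight of the link from j to i
    z: dict mapping each author to z_i, with values summing to one
    """
    out_strength = {a: 0.0 for a in z}
    for (j, _i), w in weights.items():
        out_strength[j] += w

    scores = dict(z)  # initial condition P_i = z_i
    for _ in range(max_iter):
        # Credit held by dangling ends (null outstrength), third term of Eq. (2).
        dangling = sum(scores[j] for j in z if out_strength[j] == 0)
        new = {i: q * z[i] + (1 - q) * z[i] * dangling for i in z}
        # Diffusion along weighted links, first term of Eq. (2).
        for (j, i), w in weights.items():
            new[i] += (1 - q) * scores[j] * w / out_strength[j]
        change = max(abs(new[a] - scores[a]) for a in z)
        scores = new
        if change < tol:
            break
    return scores


# Toy network of Figure 1: i1 and i2 cite j1 (weight 1/2) and k1, k2 (weight 1/4).
weights = {
    ("i1", "j1"): 0.5, ("i2", "j1"): 0.5,
    ("i1", "k1"): 0.25, ("i1", "k2"): 0.25,
    ("i2", "k1"): 0.25, ("i2", "k2"): 0.25,
}
# z_i from Eq. (3) for the three papers i (i1, i2), j (j1), k (k1, k2).
z = {"i1": 1 / 6, "i2": 1 / 6, "j1": 1 / 3, "k1": 1 / 6, "k2": 1 / 6}
scores = sara_scores(weights, z)
```

Because the three terms of Eq. (2) only move credit around, the scores keep summing to one at every iteration. In the toy network the cited single author j1 ends up with the highest score and the citing authors with the lowest.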
A. Ranking Authors
Figure 5: (Color online) Evolution of the relative rank ex-
pressed as top percentile of four Nobel laureates: “Bethe, HA”
(1967, black solid line), “Anderson, PW” (1977, red dotted
line), “Wilson, KG” (1982, blue solid line) and “De Gennes,
PG” (1992, yellow dashed line). Scientific merit is quantified
by using Eq.(4), which counts the author’s percentile as the
relative number of authors with better rank than the consid-
ered scientist. The figure shows how relative rank is related
in time with the Nobel prize (date of the award indicated by
the symbol). The diagram monitors the scientific career of
the awardees, essentially from the beginning, with the only
exception of “Bethe, HA”, whose activity began much earlier
than that of the other three scientists.
The SARA is used to provide a ranking of the authors
in the PR database. Given an author-to-author network,
we calculate the score of each author according to Eq.(2)
and assign a rank position to this scientist. The higher
is the score of a scientist, the higher is her/his rank. As
described in section III, we decided to preserve the longi-
tudinal nature of the PR database and construct WACNs
corresponding to dynamical slices of the database con-
taining the same number of citations. In this way we
can have a dynamical perspective on the evolution of the
merit of authors along the years.
As prototypical examples, we show in Figure 5 the evo-
lution of the relative rank of four Nobel Laureates. For
each author i we calculate its relative rank as
$$ R_i = \frac{1}{N} \sum_{j \neq i} \theta\left(P_j - P_i\right) , \qquad (4) $$
which basically stands as the probability to find an au-
thor with better score than author i. N is the total num-
ber of authors in the WACN, while the step function θ(·)
is equal to one only when its argument is equal to or
larger than zero, otherwise it is zero. The relative rank in
other words defines the top percentile of each scientist. It
should be stressed that the relative rank of Eq.(4) works
better than the absolute one in the case of comparison
of scientific performances in different historical periods,
Figure 6: (Color online) Scatter plots of SARA rank versus CC rank [(a) and (b)] and BCC rank [(c) and (d)]. Plots in (a) and
(c) refer to the author citation network based on papers published between 1893 and 1966, while plots in (b) and (d) have been
generated by using the author citation network based on papers published in 2005. In all insets, the same data as the ones
analyzed in the respective main plots have been logarithmically binned. For each bin we plot maximum and minimum values
(error bars), 90% confidence intervals (boxes) and median (horizontal bars inside boxes) of the SARA rank. In all plots, outlier
points stress the most significant differences between SARA and the other techniques. Authors badly ranked in CC or BCC
methods and well classified in SARA are generally very prominent physicists. By looking at figures (a) and (c) for example,
we see scientists of the caliber of “Jordan, P” and “Weyl, H” occupy the top positions in the SARA ranking, while their ranks are
two orders of magnitude worse according to the CC or BCC methods. On the other hand, the majority of authors poorly ranked
by the SARA technique and well ranked by the CC method correspond to poorly defined identifiers referring in general to multiple
physical persons [see figure (b)]: names like “Li, J” or “Yu, Z” are very common in China and for this reason their CC score is
very high; SARA, by contrast, is able to capture the low scientific relevance of all these authors, ranking them at positions about
three orders of magnitude worse than the ones obtained with the CC method.
since the number of authors in the WACN is increasing
rapidly in time (see Figure 3).
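The relative rank of Eq. (4) can be computed from a dictionary of scores as sketched below. The function name is an assumption of this sketch, and so is the treatment of ties: here another author with exactly the same score is counted as ranked above the author considered.

```python
from bisect import bisect_left


def relative_rank(scores):
    """Return R_i = (1/N) * #{j != i : P_j >= P_i} for every author i."""
    n = len(scores)
    ascending = sorted(scores.values())
    ranks = {}
    for author, p in scores.items():
        # Values >= p, including the author's own score, which is then removed.
        at_least = n - bisect_left(ascending, p)
        ranks[author] = (at_least - 1) / n
    return ranks


# Hypothetical scores for three authors.
ranks = relative_rank({"a": 0.5, "b": 0.3, "c": 0.2})
```

Sorting once and using a binary search keeps the cost at O(N log N), which matters when the network contains a large number of authors.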
From Figure 5, we can clearly see that relative rank
dynamics of Nobel laureates is qualitatively related in
time with the achievement of the prize: top-performances
are reached close to the date of the assignment of the
honor. Indeed, it is worth remarking that the method
naturally accounts for the fact that the rate of citations
per unit time is steadily increasing through the years by
defining dynamical slices of the database containing the
same number of citations. Discounting old citations, the
author’s rank becomes a dynamical quantity that changes
according to the author’s research activity as well as the
success of new research fronts. Thus, rank is related to
the actual impact of the research of an author at a given
time and is changing through the years.
V. COMPARISON WITH DIFFERENT
METRICS
Assessing the reliability and the results of any rank-
ing method is not easy. The main question is to which
extent the SARA algorithm is providing a better rank
than other ranking methods commonly used in scientific
impact analysis. For this reason, we consider two basic
measures which are commonly used to rank authors. The
first is the Citation Count (CC) with which authors are
simply ranked by the total number of citations received
in a given time window (note that the number of cita-
tions does not correspond to the indegree of the author
in the citation network). CC is traditionally the simplest
and most widely used quantity for measuring scientific impact:
popular indicators, such as the h-index [4] for instance,
are based on this simple metric. The second measure is
the Balanced Citation Count (BCC) that discounts the
effect of multiple authored papers in the citation count
by normalizing the citation weight by the total number
of authors of the cited paper [i.e., authors are ranked on
the basis of their instrength as defined in Eq. (1)]. As a
first comparison of the rankings obtained with the three
different methods, we show in Figure 6 the scatter plot
in which each author is identified by its SARA ranking
and CC or BCC rank. If the methods provide the same
ranking all the points would fall on the diagonal. Fluc-
tuations are indicated by the cloud of the scattered plot
about the line indicating the linear behavior. Indeed, it
is possible to show that, in the absence of degree-degree
correlations in the network, diffusion algorithms such as
the SARA provide a score that is on average proportional
to the indegree dependence of the diffusion process [19].
However, important fluctuations appear: some
nodes can, for example, have a top SARA rank despite a
modest indegree, whereas others can have a surprisingly
poor SARA rank despite a high indegree, as can be
seen in Figure 6. We believe that the potential refinement
offered by this method is its ability to uncover such
outliers. It is interesting to see that most of the outliers
corresponding to authors badly ranked with the CC and
BCC methods are indeed very important scientists that
are highly ranked with our method.
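The two local measures can be sketched as follows. The function name and input layout are assumptions of this sketch. For BCC the code uses the fact that, in the projection of section III, the instrength contributed by one reference to each of the m cited authors is 1/m, since the n citing authors each contribute 1/(n · m).

```python
from collections import defaultdict


def citation_counts(paper_authors, references):
    """Return (CC, BCC) per author from (citing_paper, cited_paper) pairs."""
    cc = defaultdict(int)      # Citation Count: one unit per received citation
    bcc = defaultdict(float)   # Balanced Citation Count: 1/m per received citation
    for _citing, cited in references:
        authors = paper_authors[cited]
        for a in authors:
            cc[a] += 1
            bcc[a] += 1.0 / len(authors)
    return dict(cc), dict(bcc)


# Example of Figure 1: paper i cites j (one author) and k (two authors).
paper_authors = {"i": ["i1", "i2"], "j": ["j1"], "k": ["k1", "k2"]}
cc, bcc = citation_counts(paper_authors, [("i", "j"), ("i", "k")])
```

In this toy case all three cited authors have the same CC, while BCC already separates the single author j1 from the two co-authors k1 and k2.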
VI. BENCHMARKING THE SCIENCE
AUTHOR RANK ALGORITHM
The previous analysis is not an accurate author by au-
thor analysis but a procedure to identify the most evident
outliers. In order to produce a more refined analysis on
the effectiveness of the SARA ranking, we test the pre-
dictive power of the three ranking methods by studying
the assignment of major prizes and awards (in Ref. [20]
Figure 7: (Color online) We consider some of the main prizes
in Physics (Nobel prize, Wolf prize, Boltzmann medal, Dirac
medal and Planck medal). To each prize, we associate the
best performance of the scientist who earned that honor. The
performance of an author at a given time is quantified by the
author’s percentile defined as the percentage of other authors
who have a better rank at the same time [see Eq. 4]: the
lower is this percentage, the better is the performance of the
considered scientist. SARA is more predictive than both CC
and BCC: according to the SARA ranking, 35% of the prizes
have been assigned to scientists who have reached a position
within the top 0.1%. According to SARA, 77% of the considered
honors have been earned by scientists with a best performance
rank lower than 1%. As a term of comparison, according to the CC
(BCC) ranking the latter rate decreases to 66% (67%).
it has been already shown that scientists with high CC
scores have high probability to earn a Nobel prize in their
discipline). We expect that a better performing ranking
would identify most of the award winning authors by
placing those at very top ranks. In other words we as-
sume that awards and prizes are an outcome of a peer
performed rank analysis that singles out the most highly
ranked authors. This human ranking process, obtained
with the hard work of committees and the help (in many
cases) of the whole community can be considered as a
benchmark for the ranking algorithms. We expect that
the better the algorithm is performing, the more awarded
authors will be found in the top rank brackets. In Fig-
ure 7, we see how SARA improves the prediction in the
assignments of major prizes in Physics with respect to
both CC and BCC methods. The probability to earn a
prize is consistently higher for authors who have reached
top rank positions [25] according to SARA than for sci-
entists who have occupied the same positions in CC or
BCC rankings.
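The benchmark statistic can be sketched as below: for each awarded scientist take the best (lowest) relative rank reached over the dynamical slices, then measure the fraction of awardees below a given percentile. The function names are assumptions of this sketch and the rank histories are synthetic placeholders, not values from the PR dataset.

```python
def best_performance(rank_history):
    """Best relative rank (lowest percentile) reached over all time slices."""
    return min(rank_history)


def fraction_below(best_ranks, threshold):
    """Fraction of awardees whose best relative rank is below the threshold."""
    values = list(best_ranks.values())
    return sum(1 for r in values if r < threshold) / len(values)


# Synthetic relative-rank histories for four hypothetical awardees.
histories = {
    "awardee_A": [0.02, 0.0005, 0.003],
    "awardee_B": [0.05, 0.004],
    "awardee_C": [0.2, 0.03, 0.08],
    "awardee_D": [0.0008, 0.01],
}
best = {name: best_performance(h) for name, h in histories.items()}
top_01_percent = fraction_below(best, 0.001)
top_1_percent = fraction_below(best, 0.01)
```

Running the same computation with ranks obtained from different methods on the same list of awardees gives the kind of comparison summarized in Figure 7.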
Finally, we provide a table [see Table 1] with best
ranked scientists at the end of years 1973 (period 1967-
Table 1: (Color online) Top 20 scientists according to the SARA method. The rankings are determined by
considering all papers published in the periods 1967-1973 (left) and 2003-2004 (right). We highlighted in gray
scientists who have not yet earned any of the major prizes [NP=Nobel Prize, WP=Wolf Prize, BM=Boltzmann
Medal, DM=Dirac Medal, PM=Planck Medal]. “Kohn, W” earned the NP in Chemistry in 1998.
1973) and 2004 (period 2003-2004), where we single out
those who have not yet received any of the major awards
we considered in the present analysis. It is important
to stress that some prizes are disciplinary and cannot
apply to all authors. Nevertheless, the majority of the
scientists (16 out of 20) listed in the left part of table 1
(period 1967-1973) have earned one of the prizes consid-
ered in this analysis. On the other hand, all scientists
listed in the right part of table 1 (year 2004) are, to our
knowledge, top physicists in their field of research and
probably eligible for very important prizes in Physics, not
only according to our criteria.
VII. CONCLUSIONS
In this paper we propose a new measure for ranking
scientists mimicking the spread of scientific credits
among authors. The proposed technique, called Science
Author Rank Algorithm (SARA), is similar in spirit to
the standard ranking procedure implemented for pages
in the World Wide Web [11]. SARA is based on a mixed
process, where a biased random walk is combined with
a random distribution of the credits among the nodes.
On a global level, the algorithm takes into account that
inlinks from highly ranked authors are more important
than inlinks from authors with low rank and measures
the non-local effects of the spreading of scientific credits
into the network. The non-local characteristics of this
algorithm are evident as any author can in principle
impact the score of far away nodes through the diffusion
process and the fact that the score of an author is
more affected by the score of its neighbors than the raw
number of inlinks.
We apply SARA to Weighted Author Citation Networks
(WACNs) constructed directly from the paper citation
network based on articles published in the Physical
Review (PR) collection between 1893 and 2006. This
large dataset allows us to estimate, through SARA scores,
the scientific relevance of physicists over time. The
time evolution can be monitored by exploiting the
longitudinal nature of the PR database and
constructing WACNs representative of different periods
of time. A quantitative comparison between rankings
obtained via SARA scores and via other more popular heuristics
shows the improvement that can be obtained
by considering the whole citation network instead of
only its local properties.
As a practical application of our ranking
recipe, we have developed a Web platform
(http://www.physauthorsrank.org) where the evolution
of the scientific relevance of all physicists with at
least one publication in PR journals before 2006 can be
plotted. The Web site offers several additional features,
such as the evaluation of the authors’ rank in their
specific topical area.
While we believe that the methodology exemplified by
our approach entails more information than simple
citation counts or the metrics derived from them,
including the h-index and its related measures, we
want to be the first to spell out clearly the many caveats
deriving from an uncritical use of similar ranking
approaches. First of all it is worth remarking that the
present algorithm takes into account only the PR dataset.
While this may be appropriate to rank authors within
the physics community, it clearly underestimates
the rank of authors who have had a large impact in other
areas or disciplines. This problem might be mitigated
by the inclusion of other databases or very extensive
citation repositories. The inclusion of larger reposito-
ries however would amplify the disambiguation problem
and this endeavour might not be straightforward. For
this reason we have added to our web platform the user
disambiguation process. The hope is that a collabo-
rative web2.0 approach may help in achieving progres-
sively cleaner datasets. A similar procedure has been
recently proposed by Thomson Reuters with the web site
http://www.researcherid.com [21], where authors are
asked to link their ResearcherID to their own articles.
Another issue is the fact that our scientific credit spreading
treats credits and citations only as positive
indicators of impact. It is debated in the community how
to consider the effect of the so-called negative citations
aimed at contradicting previous results or conclusions.
This is however a very subtle point, as it is almost impossible
to say to what extent citations of this kind are
negative. In many cases even flaws or errors may have
the merit of opening new directions of research or the path
to novel approaches. While we prefer not to enter this
discussion here, it has to be kept in mind that our method
could be extended to define negative scientific credit. A
final warning concerns the general use and exploitation
of global ranking approaches. It is clear that
the obtained ranking is just an indicator and cannot embrace
the multifaceted nature and the many processes at
the origin of authors’ reputation. The obtained ranking
has therefore to be considered as an extra element to be
used with a grain of salt, and in terms of “order
of magnitude” rather than absolute value.
Acknowledgments
This work is partially supported by the Lilly Endowment
grant 2008 1639-000 to A.V. and by the grant of the European
Community number 238597 ICTeCollective to S.F.
We acknowledge the American Physical Society for pro-
viding the data about Physical Review’s journals.
[1] L. Egghe & R. Rousseau, Introduction to Informetrics:
quantitative methods in library, documentation and in-
formation science, (Elsevier, Amsterdam, 1990).
[2] E. Garfield, Citation Indexing. Its Theory and Applica-
tions in Science, Technology, and Humanities, (Wiley,
New York, 1979).
[3] R. Adler, J. Ewing & P. Taylor, IMU Report: Citation Statistics,
http://www.mathunion.org/Publications/Report/CitationStatistics
(2008).
[4] J. E. Hirsch, Proc. Natl. Acad. Sci. USA 102, 16569-
16572 (2005).
[5] M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-
409 (2001).
[6] M. E. J. Newman, Phys. Rev. E 64, 016131 (2001).
[7] M. E. J. Newman, Phys. Rev. E 64, 016132 (2001).
[8] A.L. Barabási, H. Jeong, Z. Neda, E. Ravasz, A. Schubert
& T. Vicsek, Physica A 311, 590-614 (2002).
[9] S. Redner, Eur. Phys. J. B 4, 131-134 (1998).
[10] P. Chen, H. Xie, S. Maslov & S. Redner, Journal of
Informetrics 1, 8-15 (2007).
[11] S. Brin & L. Page, Computer Networks and ISDN Sys-
tems 30, 107-117 (1998).
[12] J. Kleinberg, Journal of the ACM 46, 604 (1999).
[13] C. Castillo, D. Donato & A. Gionis, Lecture Notes in
Computer Science, (Springer-Verlag, Berlin, 2007).
[14] A. Sidiropoulos & Y. Manolopoulos, Journal for Systems
& Software 79, 1679-1700 (2006).
[15] D. Walker, H. Xie, K. K. Yan & S. Maslov, J. Stat. Mech.
P0610 (2007).
[16] S. Redner, Phys. Today 58, 49-54 (2005).
[17] A. Barrat, M. Barthélemy, R. Pastor-Satorras &
A. Vespignani, Proc. Natl. Acad. Sci. USA 101, 3747-
3752 (2004).
[18] F. Radicchi, S. Fortunato & C. Castellano, Proc. Natl.
Acad. Sci. USA 105, 17268-17272 (2008).
[19] S. Fortunato, M. Boguna, A. Flammini & F. Menczer,
Proc. WAW 2006 LNCS 4936, 59-71 (2008).
[20] E. Garfield, Essays of an Information Scientist 4, 182-187
(1986).
[21] M. Enserink, Science 323, 1662-1664 (2009).
[22] PACS stands for Physics and Astronomy Classification
Scheme. This scheme is nowadays universally adopted
by the majority of Physics journals in order to well clas-
sify papers. Since 1980, Physical Review’s journals have
started to associate a set of PACS numbers (on average
three PACS numbers per paper) with every published
paper.
[23] Actually, the total number of internal references reported
by the PR database is 3 866 822, but 351 of them are
clearly wrong since they refer to papers citing newer papers
(i.e., the year of publication of the citing paper is
earlier, in some cases by as much as 30-40 years, than that
of the cited paper). We cannot a priori exclude the possibility
of other wrong internal references, but there is no
other simple method to determine whether a reference is
correct or not.
[24] If t stands for the stage of convergence, this means
$\left| P_i^{(t-1)} - P_i^{(t)} \right| < \epsilon, \ \forall i$, where $\epsilon$ represents the a priori
fixed precision. Here we set $\epsilon = 10^{-6}$; typically 20-30
iterations are needed for convergence.
[25] The best performance $R_i^m$ of scientist i is calculated according
to $R_i^m = \min_t R_i(t)$, where $R_i(t)$ is the relative
rank defined in Eq.(4) of the i-th author in the WACN
corresponding to the t-th time slice of the PR database.
Appendix A: IDENTIFICATION AND
DISAMBIGUATION OF AUTHORS
The list of references enables the construction of an
error-free network of citations between articles. However,
in this paper we are not interested in the analysis of paper
citation networks (PCNs), but in one of their particular
projections: the Weighted Author Citation Network
(WACN). We present a detailed description of the way
in which we construct the WACN in section III. Here we
would like to focus on possible sources of error, caused
by the format of the PR dataset itself, associated with
the projection of a network of citations between papers
into the corresponding WACN.
Whether authors can be reliably identified is still an
open problem. Every author in the database always has a
first and a last name. Many of them also have additional
names, generically indicated as middle names. First (and
middle) names may appear in their full version or they
may be represented only by the first letter. Writing first
(and middle) names in their complete version is typically
more common in recent papers and in papers with short
lists of authors. Out of a total of 1 916 812 author signatures
(this means the sum of all authors, not only distinct
authors, over all the papers), the first names appear
1 564 251 times with just their first letter and the remaining
352 561 times in their full version. The simplest (and
actually implemented) way to identify and distinguish
authors is to assign to each author an identifier (ID) in
accordance with the following rule
Figure 8: (Color online) We consider only the IDs of authors
with full version of their first names. Then, we count the
number of times d the same ID is obtained from authors with
different first names (plus middle names, if present). The
probability P (d) (plotted as yellow circles) of finding an ID
with “degeneracy” in the first name equal to d has a power
law decay as d increases (the dashed line has exponent equal
approximately to −3).
$$\left.\begin{array}{l}\text{LAST-NAME, F. M.}\\ \text{LAST-NAME, FIRST-NAME MIDDLE-NAME}\end{array}\right\} \Rightarrow \text{LAST-NAME, FM} \qquad \text{(A1)}$$
This means for example that according to rule A1
“Einstein, Albert” has ID equal to “Einstein, A” while
the ID of “Bethe, Hans Albrecht” is “Bethe, HA”. Es-
sentially, the last name is taken in its full version, while
for the first and the middle names we consider only the
first letters. Proceeding in this way we are able to dis-
tinguish 216 623 “different” authors.
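
As an illustration, the reduction of rule (A1) can be sketched in a few lines of Python. The function name `author_id` and the choice to split given names on whitespace and dots are assumptions of this sketch; how compound or hyphenated names were handled in the actual processing of the PR dataset is not specified here.

```python
import re


def author_id(last_name, given_names):
    """Build an ID of the form 'LAST-NAME, FM' as in rule (A1).

    Given names may be written in full ('Hans Albrecht') or as
    initials ('H. A.'); only the first letters are kept.
    """
    # Assumption: given names are separated by whitespace and/or dots.
    parts = [p for p in re.split(r"[\s.]+", given_names) if p]
    initials = "".join(p[0].upper() for p in parts)
    return f"{last_name.strip()}, {initials}"


print(author_id("Einstein", "Albert"))
print(author_id("Bethe", "Hans Albrecht"))
print(author_id("Bethe", "H. A."))
```

Both ways of writing the given names of the same author lead to the same ID, which is the purpose of the rule. By the same token, different authors with identical initials and last name also collapse onto one ID.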
This approach is however biased by two main sources
of error. First, there is a problem of identification of
the authors. Unfortunately, scientists do not always sign
their papers using the same name, and as a consequence
it is impossible to automatically relate different
names to the same physical person. This may happen
for several reasons: different order of first and
last name; presence or absence of middle names;
change of last name (for example, after marriage).
The second problem is basically the reverse of the
source of error just described: the obvious impossibility
of distinguishing authors having the same initials and the same
last name by using only this information. We did not try
to perform any more elaborate analysis, since
this is still an open problem in bibliometrics and
because it was beyond the purposes of our paper. Furthermore,
a simple analysis revealed that the number of
“pathological” cases is expected to be small enough to
be considered irrelevant for the results reported in the
paper.
In order to evaluate the relevance of the error introduced
by the impossibility of disambiguating IDs, we consider
only papers of our database signed by authors using the
full version of their first and last names (and, where present,
their middle names). Unfortunately, this happens only
in recent papers (from 1980 on) and only when the list of
authors is sufficiently short (fewer than four, in general),
so full signatures are comparatively rare. As already
mentioned, the total number of “signatures” (i.e., the
total number of non-distinct authors who have signed
all papers in our database) is 1 916 812, while the number
of times in which an author has signed with her/his
“full signature” is only 352 561. Based on this subset, we
perform the reduction described in rule (A1). We then
calculate the probability P(d) as the ratio
between the number of IDs shared by d different
scientists and the total number of IDs. The resulting distribution
is plotted in Figure 8: in 92% of the cases
an ID corresponds to a single author; the rest of the distribution
has a power law decay (i.e., $P(d) \sim d^{-\delta}$) as d
increases (with exponent $\delta \simeq 3$).
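
The counting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `degeneracy_distribution`, the input format (pairs of last name and full given names), and the case-insensitive comparison of given names are all hypothetical choices, and the example signatures are invented.

```python
from collections import Counter, defaultdict


def degeneracy_distribution(signatures):
    """Estimate P(d) from full signatures.

    signatures: iterable of (last_name, full_given_names) pairs.
    Returns a dict mapping d to the fraction of IDs that are shared
    by d distinct full given names.
    """
    names_per_id = defaultdict(set)
    for last, given in signatures:
        # Reduce each full signature to its ID, as in rule (A1).
        initials = "".join(word[0].upper() for word in given.split())
        names_per_id[f"{last}, {initials}"].add(given.lower())
    # d is the number of distinct full names collapsing onto one ID.
    counts = Counter(len(names) for names in names_per_id.values())
    total = len(names_per_id)
    return {d: c / total for d, c in sorted(counts.items())}


# Invented example: two different "Smith, J" authors share one ID.
example = [
    ("Smith", "John"),
    ("Smith", "James"),
    ("Smith", "John"),
    ("Bethe", "Hans Albrecht"),
    ("Fermi", "Enrico"),
]
print(degeneracy_distribution(example))
```

Repeated signatures by the same full name are counted once, so d measures how many distinct names map to an ID rather than how many papers were signed.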
Figure 9: The rankings calculated with SARA for q = 0.1 are plotted as function of the rankings obtained with the same
algorithm but for different values of q: (a) q = 0.01, (b) q = 0.15 and (c) q = 0.3. All plots have been generated from the
WACN based on all papers published between 1893 and 1966 (the same dataset as the one used in Figures 6a and 6c of the
main text).
Appendix B: SCIENCE AUTHOR RANK
ALGORITHM: DEPENDENCE ON THE
DAMPING FACTOR
Science Author Rank Algorithm (SARA) depends on
the so-called damping factor q [see Eq. 2]. q is a real num-
ber in the interval [0, 1] and the results calculated with
SARA for different values of q may differ. As a prac-
tical example, we report in Figure 9 some scatter plots
between SARA rankings calculated for different values
of q. As expected, SARA rankings calculated for differ-
ent q are linearly correlated and the correlation strength
decreases as the difference between the qs increases.
Figure 10: (Color online) Percentage of prizes earned by
physicists who have reached a given rank position as their
best performance. Generally, the SARA is more predictive
than the simple CC criterion since top scientists in SARA
ranking have higher chances to earn a prize than top authors
in the analogous ranking based on CC.
The decision to set q = 0.1 is based on a dedicated analysis,
which is reported graphically in Figure 10. For each
scientist who earned one of the major prizes in Physics,
we computed her/his best performance during her/his
scientific history. We then plotted the fraction of prizes
assigned to scientists with the best performance falling
in a given interval (note that the division into intervals is
arbitrary, but the results do not strictly depend
on this choice). According to any reasonable measure of
scientific impact, the probability that a scientist earns
an important prize should be related to her/his scientific
relevance. In the case of SARA ranking, we generally
observed that the majority of prizes is assigned to sci-
entists who have reached a top position in the ranking.
This allows us to justify the use of such measure for the
scientific impact of authors. Moreover, as already stated
and shown (see Figure 7), SARA is more effective than
other well known criteria like Citation Count (CC) or
Balanced Citation Count (BCC) if one wants to predict
future winners of prizes. Even in the case of
SARA, however, the predictive power of the algorithm may change
quantitatively as a function of q. Looking at Figure 10, we
see for instance that, in the top intervals, the highest
ratios are reached for values of q ≃ 0.1, while values of
q < 0.1 or q > 0.1 give lower ratios in these first two bins.
As a consequence, we take q = 0.1 as the optimal
value for SARA, since it is the value that maximizes the
predictive power of our algorithm.
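
The binning step behind this comparison can be sketched as below. The function name `prize_fraction_per_bin`, the representation of best performances as numbers where smaller means better, and the bin edges and sample values are all assumptions made for illustration; they are not the intervals or data used for Figure 10.

```python
from bisect import bisect_left


def prize_fraction_per_bin(best_ranks, bin_edges):
    """Fraction of prize winners whose best performance falls in each bin.

    best_ranks: best performance of each prize winner (smaller is better).
    bin_edges: increasing upper edges of the bins; the last edge must be
    at least as large as the largest value in best_ranks.
    Bin k covers the interval (bin_edges[k-1], bin_edges[k]].
    """
    counts = [0] * len(bin_edges)
    for r in best_ranks:
        counts[bisect_left(bin_edges, r)] += 1
    total = len(best_ranks)
    return [c / total for c in counts]


# Invented best performances for five hypothetical prize winners.
winners = [0.005, 0.01, 0.03, 0.2, 0.5]
print(prize_fraction_per_bin(winners, [0.01, 0.05, 1.0]))
```

Repeating this computation for the rankings obtained at each value of q, and comparing the fractions in the first bins, would mirror the comparison described above.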
