Power-law distributions for the citation index of scientific publications and scientists

by Hari M Gupta, José R Campanha, Rosana A G Pesce
Brazilian Journal of Physics (2005)

Abstract

The number of citations of a scientific publication or of an individual scientist has become an important factor of quality assessment in science. We report a study of the statistical distribution of the citation index of both scientific publications and scientists. We give numerical evidence that Tsallis (power law) statistics explains the entire distribution over eight orders of magnitude ($10^{-4}$ to $10^{4}$). Also, we draw Zipf plots in order to analyze the statistical distribution of the citation index of Brazilian and international physicists and chemists. The relatively small group of Brazilian scientists seems more adequate to explain the dynamics of the citation index. In this case, we find that the distribution of the citation index can also be explained by a gradually truncated power law with similar parameters. We finally discuss possible mechanisms behind the citation index of scientists and scientific publications.

Available from www.scielo.br

Brazilian Journal of Physics, vol. 35, no. 4A, December, 2005, p. 981

Hari M. Gupta, José R. Campanha, and Rosana A. G. Pesce
Departamento de Física, Instituto de Geociências e Ciências Exatas, UNESP, Caixa Postal 178, CEP 13500-970, Rio Claro, SP, Brazil

Received on 16 June, 2004. Revised version received on 15 September, 2005.

I. INTRODUCTION

In recent years physicists have turned to the study of natural systems as a whole rather than in parts [1-6]. The difficulties in understanding these "complex systems" arise from the large number of elementary interactions that take place simultaneously among a large number of components. Also, these systems are in constant evolution and do not have a usual equilibrium state [1]. Socio-economic and biological systems display these general features, and have been treated by physicists.
Scaling power laws [7,8] have been found in many biological [9-11], physical [2,12-20] and socio-economic systems [21-29], and they are now considered an important property of these systems.

Scientific publications are a primary means of scholarly communication in science. The quality of a scientific paper or of an individual scientist can be gauged by the number of citations in the work of other authors. Although this cannot be an exact measurement of the relevance of either a paper or a scientist, it can be taken as a particular and reasonable measure. One of the problems of our scientific community is to know the mechanisms and the distribution of (i) the number of publications of a scientist, (ii) the number of citations, or citation index, of a scientific publication, and (iii) the citation index of a scientist.

In 1957, in a study of the publication record of the scientific research staff at Brookhaven National Laboratory, Shockley [30] claimed that the rate of scientific publications is described by a log-normal distribution. Laherrère and Sornette [31] presented numerical evidence, on the basis of data for the 1120 most cited physicists from 1981 to June 1997, that the citation distribution of individual authors is associated with a stretched exponential form, $N(x) \propto \exp[-(x/x_0)^{\beta}]$, with $\beta \approx 0.3$. Using the technique of Zipf plots, Redner [32] has recently shown that the distribution of citations of the most cited scientific papers is described by a power law, $N(x) \propto x^{-\alpha}$, with $\alpha \approx 3.0$. Tsallis and Albuquerque [33] claim that the newly proposed "Tsallis statistics" can also account for the distribution of citations of scientific papers.

The number of publications and the citation index are different concepts. The number of publications of a scientist represents the amount of work that the scientist has done, while the citation index is much closer to representing the quality of this work.
The number of publications depends on the capacity to work and to get papers published, while the citation index is related to factors such as originality, the interest of the community, and the relevance of particular research topics. The number of scientific works published by a scientist depends on factors such as choosing a proper problem, working on this problem, choosing a proper journal, and writing ability. As pointed out by Shockley [30], a log-normal form is expected to account for the distribution of published scientific papers.

In the present paper we discuss the statistical distribution and the mechanisms behind the citation index of a scientific publication and of an individual scientist. In Section II, we present the model. In Section III, we analyze the statistical distribution of the citation index of scientific papers published in 1981 and cited between 1981 and June 1997. We also analyze and compare the distributions of the citation index of highly cited Brazilian and international physicists and chemists. In Section IV, we discuss the results and possible mechanisms underlying the citation index.

II. THE MODEL

A power-law distribution [7,8] was first observed by Pareto in economics [8] in 1897. Pareto claimed that it was related to positive feedback, namely that wealthy people can leverage their wealth more efficiently than average individuals, so they can create more wealth and achieve an even higher level of income. Recently, we have related power-law distributions to effects of competition, learning and natural selection [34]. In the context of nonextensive thermodynamics, Tsallis [35,36] was able to obtain power-law distributions with the inclusion of long-range interactions and long-range microscopic
memory. The Tsallis generalized entropy is given by

$$S = k\,\frac{1 - \sum_i p_i^{\,q}}{q - 1}, \tag{1}$$

where $k$ is a positive constant, $q$ is a parameter, and the sum is over the probabilities of the statistical states. On the basis of this definition, Tsallis and Albuquerque [4] derived the statistical distribution

$$N(x) = \frac{N_0}{[1 + (q-1)\lambda x]^{q/(q-1)}}, \tag{2}$$

where $N(x)$ is the probability density, $\lambda$ is a parameter, and $N_0$ is a normalization constant. This formula can also be written as

$$N(x) = \frac{N_0}{[1 + c_1 x]^{1+\alpha}}, \tag{3}$$

where $c_1$ is a constant and $(1+\alpha)$ is a power-law index. For large values of $x$, this distribution becomes a simple power law,

$$N(x) \sim c\,x^{-(1+\alpha)}. \tag{4}$$

In this limit, $\log N(x)$ versus $\log x$ is just a straight line.

In real systems, power-law distributions cannot continue forever. They have to be truncated in some way in order to avoid an infinite variance. For scientific publications, the research field becomes saturated or almost fully investigated after a certain time, which may be taken as roughly 20 to 100 years, depending on the particular field. The number of researchers in the area, and the number of citations as well, begin to decrease after this period of saturation. In addition to the saturation of the field, there are human limitations to the production of a large number of relevant scientific works.

Recently, we have shown that, by gradually truncating a power-law distribution after a certain critical value, it is possible to explain the entire distribution, including very large steps, in financial and physical complex systems [37-39]. In that work, the power-law distributions come from a positive feedback which gradually ceases after a certain step size due to the limited physical capacity of the components of the system or of the system itself. In this limit, these distributions approach a normal form [37].
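The crossover from Eq. (3) to the straight-line regime of Eq. (4) can be checked numerically. The following is only a minimal sketch: the function names and the parameter values ($q = 1.5$, $\lambda = 2$) are illustrative assumptions, not fitted values from the paper. Matching Eqs. (2) and (3) term by term gives $c_1 = (q-1)\lambda$ and $\alpha = 1/(q-1)$, which the helper below uses.

```python
import math

def params_from_q(q, lam):
    """Map (q, lambda) of Eq. (2) to (c1, alpha) of Eq. (3).

    Matching the two forms gives c1 = (q - 1) * lambda and
    1 + alpha = q / (q - 1), i.e. alpha = 1 / (q - 1).
    """
    return (q - 1.0) * lam, 1.0 / (q - 1.0)

def tsallis_density(x, n0, c1, alpha):
    """Eq. (3): N(x) = N0 / (1 + c1 * x) ** (1 + alpha)."""
    return n0 / (1.0 + c1 * x) ** (1.0 + alpha)

def local_loglog_slope(f, x, step=1.01):
    """Finite-difference estimate of d(log f) / d(log x) at x."""
    return (math.log(f(x * step)) - math.log(f(x))) / math.log(step)

# Hypothetical parameters, chosen only for illustration.
c1, alpha = params_from_q(q=1.5, lam=2.0)

def density(x):
    return tsallis_density(x, n0=1.0, c1=c1, alpha=alpha)

# Slope of log N versus log x at small, intermediate and large x.
slopes = {x: local_loglog_slope(density, x) for x in (1e-4, 1.0, 1e4)}
for x, slope in slopes.items():
    print(f"x = {x:g}: local log-log slope = {slope:.4f}")
```

Under these assumed values the slope is close to zero for $c_1 x \ll 1$ (a flat region) and approaches $-(1+\alpha)$ for $c_1 x \gg 1$, which is the straight line described after Eq. (4).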
This approach may also lead to a distribution of the citation index, given by

$$N(x) = c\,x^{-(1+\alpha)} f(x), \tag{5}$$

with

$$f(x) = \begin{cases} 1 & \text{if } |x| \le x_c, \\ \exp\left\{-\left[\dfrac{|x| - x_c}{k}\right]^{\beta}\right\} & \text{if } |x| > x_c, \end{cases} \tag{6}$$

where $x_c$ is the critical value of the step size at which the probability distribution begins to deviate from a power-law distribution due to physical limitations, and $k$ is related to the sharpness of the cut-off. Comparing to a normal distribution, we have

$$\beta = 2 - \alpha. \tag{7}$$

We now consider two special cases:

(i) case I, if $x \le x_c$, with

$$N(x) = c\,x^{-(1+\alpha)}, \tag{8}$$

which gives a power-law distribution;

(ii) case II, if $x > x_c$, in which the variation due to $f(x)$ is dominant, and we have

$$N(x) \propto \exp\left\{-\left[\frac{|x|}{k}\right]^{\beta}\right\}, \tag{9}$$

so $\log N(x)$ versus $x^{\beta}$ is a straight line. This gives a stretched exponential distribution.

The publication density is usually very small for highly cited papers. It is then interesting to draw a Zipf plot [40], in which the number of citations of the $n$th most cited paper out of an ensemble of $M$ papers is plotted versus the rank $n$. By its very definition, the Zipf plot is closely related to the cumulative large-$x$ tail of the citation distribution, which makes it well suited for determining that tail. Also, it smooths out the fluctuations in the high-citation tail and thus facilitates a quantitative analysis.

Given an ensemble of $M$ papers and the corresponding number of citations for each of these papers, arranged in rank order, $Y_1 \ge Y_2 \ge Y_3 \ge \dots \ge Y_n \ge \dots \ge Y_M$, the number of citations of the $n$th most cited paper, $Y_n$, may be estimated from the criterion [24]

$$\int_{Y_n}^{\infty} N(x)\,dx = \int_{Y_n}^{\infty} M\,P(x)\,dx = n. \tag{10}$$

This equation means that $n$ out of the ensemble of $M$ papers are cited at least $Y_n$ times. From the dependence of $Y_n$ on $n$ in a Zipf plot, we can test whether it agrees with a proposed form for $N(x)$.
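The gradually truncated power law of Eqs. (5)-(7) and the rank ordering behind a Zipf plot can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names, the parameter values and the small list of citation counts are all hypothetical.

```python
import math

def truncation_factor(x, x_c, k, beta):
    """Eq. (6): 1 up to x_c, stretched-exponential decay beyond it."""
    ax = abs(x)
    if ax <= x_c:
        return 1.0
    return math.exp(-(((ax - x_c) / k) ** beta))

def truncated_power_law(x, c, alpha, x_c, k):
    """Eq. (5), with beta fixed by Eq. (7): beta = 2 - alpha."""
    beta = 2.0 - alpha
    return c * abs(x) ** (-(1.0 + alpha)) * truncation_factor(x, x_c, k, beta)

def zipf_ranking(citations):
    """Return (n, Y_n) pairs with Y_1 >= Y_2 >= ... >= Y_M."""
    ordered = sorted(citations, reverse=True)
    return list(enumerate(ordered, start=1))

# Hypothetical citation counts for M = 7 papers (not data from the paper).
sample = [3, 120, 0, 45, 45, 7, 1]
ranking = zipf_ranking(sample)
for n, y_n in ranking:
    # Discrete analogue of Eq. (10): at least n papers have >= Y_n citations.
    at_least = sum(1 for c in sample if c >= y_n)
    print(f"rank {n}: Y_n = {y_n}, papers with >= Y_n citations: {at_least}")

# Illustrative parameters only: alpha = 1.5 gives beta = 0.5.
below = truncated_power_law(5.0, c=1.0, alpha=1.5, x_c=10.0, k=2.0)
above = truncated_power_law(12.0, c=1.0, alpha=1.5, x_c=10.0, k=2.0)
print(below, above)
```

Plotting `ranking` on log-log axes gives the Zipf plot. For ties in the citation counts the count of papers with at least $Y_n$ citations can exceed $n$, so the discrete check is an inequality rather than the equality of the continuous criterion.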
For a simple power-law distribution, using Equations (4) or (8) in Equation (10), we obtain

$$Y_n = c_1 M^{1/\alpha}\, n^{-1/\alpha}, \tag{11}$$

which can also be written as

$$\log Y_n = \left(-\frac{1}{\alpha}\right) \log n + b, \tag{12}$$

where $b$ is a constant, so $\log Y_n$ versus $\log n$ gives a straight line. For a stretched exponential distribution, using Equation (9), we obtain

$$Y_n^{\beta} = -a \ln n + b, \tag{13}$$

where $a$ and $b$ are constants. In this case, $Y_n^{\beta}$ versus $\ln n$ is a straight line.

III. DATA ANALYSIS

A. Scientific Publications

We now investigate one of the largest data sets of scientific publications, stored by the Institute for Scientific Information,
