
Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis

by Bela Gipp, Joeran Beel
Proceedings of the 12th International Conference on Scientometrics and Informetrics ISSI09 (2009)


Bela Gipp1 and Jöran Beel2
1 Bela@Gipp.com, 2 Joeran@Beel.org
Otto-von-Guericke University, Dept. of Computer Science, Magdeburg, Germany

Abstract
This paper presents an approach for identifying similar documents that can be used to assist scientists in finding
related work. The approach, called Citation Proximity Analysis (CPA), is a further development of co-citation
analysis that additionally considers the proximity of citations to each other within an article's full text. The
underlying idea is that the closer citations are to each other, the more likely it is that they are related.
Compared with existing approaches, such as bibliographic coupling, co-citation analysis or keyword-based
approaches, the advantages of CPA are higher precision and the possibility of identifying related sections within
documents. Moreover, CPA allows a more precise automatic document classification. CPA is used as the
primary approach to analyse the similarity of, and to classify, the 1.2 million publications contained in the
research paper recommender system Scienstein.org.

Introduction and Motivation
The search for related scientific work can be tedious, and important documents are often
missed. Difficulties are caused by the increasing number of publications (growing
exponentially at a yearly rate of 3.7 %), unclear nomenclature, synonyms and numerous other
factors [1]. In practice, most searches for related work start with some initial papers and
then navigate the citation web nearest to those papers. However, even the more advanced
approaches for identifying related work based on co-word analysis, collaborative filtering,
Subject-Action-Object (SAO) structures or citation analysis often do not deliver satisfying
results [2-8]. Therefore, we developed a new approach to determine the similarity of
documents, which we name Citation Proximity Analysis (CPA). The approach is based on
co-citation analysis and improves precision by considering the position of citations. The
presented approach was developed for the research paper recommender Scienstein¹ to assist
researchers in finding related work.

The first part of this paper gives an overview of existing methods to identify similar
documents, focusing on the most popular citation analysis approaches and their
strengths and weaknesses. The second part explains how CPA can be used to measure
similarity and the steps necessary to calculate a new metric that we call the Citation Proximity
Index (CPI). Afterwards, first results from an empirical study comparing the performance of
co-citation analysis and CPA are presented. Finally, an outlook is given on further implications
and on how CPA could be used in other fields.

¹ www.scienstein.org is a research paper recommender focusing on identifying related work, developed by the
authors.
Bela Gipp and Jöran Beel. Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis. In
Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09),
volume 2, pages 571–575, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics. ISSN 2175-1935.
Downloaded from www.sciplore.org.

Related Work
Various approaches exist to determine the degree of similarity of documents in order to
identify related work. Whereas text-mining approaches are used in cases in which references
are not stated, citation analysis approaches usually deliver superior results because, for example,
synonyms and unclear nomenclature do not lead to misleading results [3, 4, 5]. Many citation
analysis approaches exist, each with its own strengths and weaknesses for identifying similar
documents. Among the most widely used are the easily applicable 'cited by' approach, which
considers papers as relevant that cite the same input document, and the 'reference list'
approach, which considers papers as relevant that were referenced by the input document. The
best results can usually be obtained by bibliographic coupling and co-citation analysis, which
allow calculating the coupling strength [6]. These approaches, which were invented as early as
the 1960s and 1970s, are used by scientists and on academic search engine websites like CiteSeer²
[9].
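
As a minimal illustration of the 'cited by' and 'reference list' approaches, the following Python sketch looks up candidate documents in a toy citation graph. The graph, the paper identifiers and the function names `reference_list` and `cited_by` are assumptions made for this example only; they are not taken from any system described in this paper.

```python
# Hypothetical toy citation graph: citing paper -> set of papers it references.
references = {
    "P1": {"X", "Y"},
    "P2": {"X"},
    "X": {"Y", "Z"},
}


def reference_list(doc, references):
    """'Reference list' approach: papers referenced by the input document."""
    return set(references.get(doc, set()))


def cited_by(doc, references):
    """'Cited by' approach: papers that cite the input document."""
    return {citing for citing, refs in references.items() if doc in refs}


print(sorted(reference_list("X", references)))  # ['Y', 'Z']
print(sorted(cited_by("X", references)))        # ['P1', 'P2']
```

Both lookups return unranked sets, which is one reason why bibliographic coupling and co-citation analysis, which yield a strength value, are usually preferred.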

Documents are bibliographically
coupled if they cite one or more
documents in common. Figure 1
illustrates this approach: Papers A and B
are related because they both cite papers
C, D and E.

In contrast, two documents are "co-cited"
when at least one paper cites both.
This approach is illustrated in Figure 2:
Papers A and B are related because they
are both cited by papers C, D and E. The
more co-citations two papers receive, the more related they are [6].
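
The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that coupling strength is measured as the number of shared references and that co-citation strength is the number of papers citing both documents. The function names and the dictionaries `fig1` and `fig2`, which mirror the constellations of Figure 1 and Figure 2, are hypothetical.

```python
def coupling_strength(a, b, references):
    """Bibliographic coupling: number of documents cited by both a and b."""
    return len(references.get(a, set()) & references.get(b, set()))


def cocitation_count(a, b, references):
    """Co-citation: number of documents that cite both a and b."""
    return sum(1 for refs in references.values() if a in refs and b in refs)


# Citation graphs as: citing paper -> set of papers it references.
# Figure 1: papers A and B both cite C, D and E.
fig1 = {"A": {"C", "D", "E"}, "B": {"C", "D", "E"}}
# Figure 2: papers C, D and E each cite both A and B.
fig2 = {"C": {"A", "B"}, "D": {"A", "B"}, "E": {"A", "B"}}

print(coupling_strength("A", "B", fig1))  # 3
print(cocitation_count("A", "B", fig2))   # 3
```

Note that both measures use only the reference lists; the position of a citation within the citing text plays no role in either count.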

Although both approaches are suitable to identify similar papers, they serve different
purposes. Whereas bibliographic coupling is retrospective, co-citation is essentially a
forward-looking perspective [9]. However, both approaches often deliver unsatisfactory results,
since they only make use of the bibliography at the end of the document without analysing
where the citations occur in the text. Because this limitation is inherent to both approaches, it
is also not possible to determine in which part of a related document the content of interest can
be found.

Citation Proximity Analysis (CPA)
Instead of using only the bibliography, CPA uses the information derived from the proximity of
the citations to each other in the full text to calculate the Citation Proximity Index
(CPI) in three steps.

1. The document is parsed and a series of heuristics is used to process the citations, including
their position within the document³.


² http://citeseer.ist.psu.edu
³ The citations were parsed using a modified version of parsCit (http://wing.comp.nus.edu.sg/parsCit) in
combination with exclusively developed software, which is available upon request from the authors.
[Figure 1: Bibliographic coupling. The citing documents Doc A and Doc B both cite Doc C, Doc D and Doc E.]
[Figure 2: Co-citation analysis. The cited documents Doc A and Doc B are both cited by Doc C, Doc D and Doc E.]