Discovering important nodes through graph entropy: the case of Enron email database
- ISBN: 1595932151
- DOI: 10.1145/1134271.1134282
Abstract
A major problem in social network analysis and link discovery is the discovery of hidden organizational structure and the selection of interesting influential members based on low-level, incomplete and noisy evidence data. To address such a challenge, we exploit an information theoretic model that combines information theory with statistical techniques from the areas of text mining and natural language processing. The entropy model identifies the most interesting and important nodes in a graph. We show how entropy models on graphs are relevant to the study of information flow in an organization. We review the results of two different experiments which are based on entropy models. The first version of this model has been successfully tested and evaluated on the Enron email dataset.
Jitesh Shetty
University of Southern California
University Park
Los Angeles, CA 90089
jshetty@usc.edu
Jafar Adibi
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
adibi@isi.edu
Categories and Subject Descriptors
H.4 [Link Discovery, Data Mining, Social Network Analysis]:
Miscellaneous; D.2.8 [Graph Theory]: Social Networks
General Terms
Graph theory
Keywords
Entropy, Link Discovery
1. INTRODUCTION
A new challenge in the area of Link Discovery (LD) [18] and
social network analysis (SNA) is to exploit communication pattern
information and text information within knowledge discovery
processes such as the discovery of hidden organizational structure
and the selection of interesting prominent members. An interesting
example of such a challenge is to discover hidden groups and
prominent people by analyzing their email logs.
Email logs are a useful resource for research in such areas and are
of prime relevance to the study of information flow in an
organization. Email has become a vital means of communication in the
information community. Inherent advantages such as the ease of
sending an electronic mail, archiving communications and the ability
to reference past communications have made email the most accepted
and widely used means of communication. Though it is heavily used in
the business and professional domain, its scope is not confined to
it. Email is the most archived evidence data on interpersonal
communication in electronic form. It can also act as an evidence
database for law enforcement and intelligence organizations in their
effort to detect hidden groups in an organization which are engaged
in illegal activities. All these advantages make email a perfect test
bed for relevant research like the study of information flow in an
organization.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD ’2005 Chicago, Illinois
Copyright 200X ACM 1-59593-215-1...$5.00.

The study of information flow in an organization is germane to
issues of productivity and efficiency and to drawing useful
conclusions about the business processes of the organization. It can
lead to insights on interaction patterns of employees at different
levels of the organization hierarchy. Most of the experiments in this
domain are performed on synthetic data due to the lack of an adequate
real-life benchmark. The recent availability of large datasets of
human interaction like the Enron email dataset can be a touchstone
for such research. This dataset shows intercommunication between
employees of an organization, hence it is well suited to studying the
flow of information in an organization. This dataset is also similar
to the kind of data collected for fraud detection or
counter-terrorism, and hence it is a suitable test bed for testing
the effectiveness of techniques used for fraud detection and
counter-terrorism.
In this paper we adopt event based graph entropy (which we refer to
as either event based graph entropy or simply graph entropy) to
determine the most prominent yet interesting people in the Enron
email dataset.
The rest of this paper is organized as follows. We begin with
the problem of order in networks. Next, we describe our novel
event based graph entropy model. Finally, we report the results
of applying such techniques to the Enron dataset, followed by
related work and concluding remarks.
2. ORDER IN NETWORKS
Most of the work in SNA and LD represents its environment with
a graph or network; we use both terms interchangeably in this paper.
The question is what sort of mathematical model would work best. One
way to describe a threat organization or a social network is in terms
of a graph. In this model, each node represents an individual
member and an edge linking two nodes indicates direct communication
between those two members. Mathematically, we may ask how many nodes
we must remove from a given graph before it splits into two or more
separate sub-graphs. For graphs of various sorts, it is possible to
estimate the probability that the removal of a certain number of
nodes would split the graph into two or more separate units based on
a set of policies and criteria. However, a graph model might not be
the best representation of organizations such as drug dealers,
terrorist organizations and threat groups. In his recent work,
Jonathan Farley explains clearly [6] that modelling terrorist
networks as graphs does not give us enough information to deal with
the threat. Modelling these networks as graphs ignores an important
aspect of their structure: their hierarchy, and the fact that they
are composed of leaders and followers. Hence, it is not enough to
split the network, since the remnant may contain a leader and enough
followers to pursue their plans. The work in [6] assumes the network
structure is known and tries to find the optimum way to disrupt
communication between leaders and followers. In our work, by
contrast, we try to identify those important nodes as much as
possible.

Figure 1: Leaders and Followers Example. L2 is leader; M1,
M2 and M3 are middlemen; F1, F2, F3, F4 and F5 are followers.
Up: example of a network. As it shows, removing M2 splits
the graph into three disconnected subgraphs. Down: the same
network after information about leaders, middlemen and followers.
As it shows, even though removing M2 splits the graph into three
disconnected subgraphs, there are at least 2 paths from leaders to
followers, while removing L1 destroys such paths.

Figure 1 illustrates an example of such a phenomenon. The
graph on the left shows a network consisting of three leaders: L1,
L2 and L3; three middlemen: M1, M2 and M3; and three followers:
F1, F2 and F3. The graph on the right illustrates the same
network without M2. Clearly, although such a removal splits the
graph into two separate remnants, each sub-graph still has leaders,
middlemen and followers to carry orders and execute the plan. Hence,
in this type of network the relationship of one individual to
another becomes important. Leaders are represented by the topmost
nodes in a diagram of the ordered set representing a network, and
followers are the nodes at the bottom. Disrupting the organization
is equivalent to disrupting the chain of command, which allows
orders to pass from leaders to followers.
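
The distinction between splitting a graph and breaking the chain of
command can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the toy network below (two
leaders, three middlemen, two followers) is hypothetical and is not
the network of Figure 1, and the function names are ours, not the
paper's.

```python
from collections import deque

def reachable(adjacency, start, removed):
    """Return every node reachable from `start` without passing through `removed`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def count_components(adjacency, removed):
    """Count connected components after deleting the nodes in `removed`."""
    unvisited = set(adjacency) - set(removed)
    count = 0
    while unvisited:
        start = next(iter(unvisited))
        unvisited -= reachable(adjacency, start, removed)
        count += 1
    return count

def command_chain_intact(adjacency, leaders, followers, removed):
    """True if at least one surviving leader can still reach some follower."""
    return any(
        reachable(adjacency, leader, removed) & set(followers)
        for leader in leaders
        if leader not in removed
    )

# Hypothetical toy network (undirected), NOT the one drawn in Figure 1.
toy = {
    "L1": {"M1"}, "L2": {"M3"},
    "M1": {"L1", "M2", "F1"}, "M2": {"M1", "M3"}, "M3": {"L2", "M2", "F2"},
    "F1": {"M1"}, "F2": {"M3"},
}
leaders, followers = {"L1", "L2"}, {"F1", "F2"}

# Removing M2 splits the graph, yet each remnant keeps a leader and a follower.
print(count_components(toy, {"M2"}))
print(command_chain_intact(toy, leaders, followers, {"M2"}))
```

In this toy case removing M2 disconnects the graph but leaves a
working leader-to-follower path in each remnant, which is the
situation the paragraph above describes.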
Hence, the interesting problem here is to determine the important
nodes or leaders in a network. In other words, we are looking for
those nodes whose removal has the maximum effect on the command
chain.
3. GRAPH ENTROPY
We assume we have an evidence database (EDB) full of transactions
among individuals, such as emails, phone calls, etc. After
exploiting the various explicit and implicit evidence fragments given
in the EDB, we try to identify prominent members in a graph by
looking at their transactions with others. To find prominent people
in a network, we need to aggregate links between them and
discover which node has the most effect on such a network. The
entropy model can identify an entity or a set of entities which has
the most effect on the graph entropy and thus provide a ranked list
based on such effect. To do this we need to exploit facts such as
individuals sharing the same property (e.g., having the same address)
or transactions like being involved in the same action (e.g., sending
email). Since such information is usually recorded by an observer,
we refer to it as evidence. Without loss of generality, we focus only
on individuals' actions in this paper, not on their properties.
We transform the problem space into a multigraph G = <V, E>
in which each node represents an entity (such as a person or
organization) and each link (edge) between two entities represents an
action they are involved in. The term multigraph refers to a graph
in which multiple edges between nodes are permitted. For
abstraction, we summarize the set of actions (e.g., emails, phone
calls, etc.) on each edge and refer to it as a link. Hence each link
represents a set of actions in a vector. For instance, an edge e7
could be a set of two actions, e7 = [a2, a5]. Note also that it is
possible to distinguish between email sender and receiver.
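
One possible in-memory form of such links is sketched below. It is a
minimal illustration, assuming that each transaction arrives as a
(sender, receiver, action) triple; the function name
`build_links` and the sample data are hypothetical and do not come
from the paper.

```python
from collections import defaultdict

def build_links(transactions):
    """Collapse parallel edges into one link per ordered (sender, receiver) pair.

    Each link is the list (vector) of actions observed on that pair,
    kept in the order in which the transactions were supplied.
    """
    links = defaultdict(list)
    for sender, receiver, action in transactions:
        links[(sender, receiver)].append(action)
    return dict(links)

# Hypothetical transactions; the labels a1, a2, a5 are placeholders.
transactions = [
    ("v1", "v2", "a2"),
    ("v2", "v3", "a1"),
    ("v1", "v2", "a5"),
]
links = build_links(transactions)
print(links[("v1", "v2")])
```

Keying each link by the ordered pair keeps sender and receiver
distinct; an undirected variant could key on `frozenset((sender,
receiver))` instead.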
V = {v1, . . . , v|V|}    Set of vertices (|V| is the number of vertices)
E = {e1, . . . , e|E|}    Set of edges (|E| is the number of edges)
A = {a1, . . . , a|A|}    Set of action types
The EDB consists of tables representing individuals and the actions
among them at a given time. The table in Figure 2 shows an example
of such data.
Assume we have a small society of four people who have been in
contact with each other through actions. There are three possible
actions: sending an Email, making a Phone Call and participating
in a Meeting. When a person is not involved in any of the
above-mentioned actions at a particular time, we denote this with the
action ϕ. Hence V = {v1, v2, v3, v4}, E = {e1, e2, e3, e4} and
A = {Email, PhoneCall, Meeting, ϕ}. For brevity of representation
we write A = {E, C, M, ϕ}. The table in Figure 2 illustrates
actions among these individuals along with the action time.
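
To make the role of the idle action ϕ concrete, the following sketch
turns EDB-style rows into one action sequence per person. The rows
are invented for illustration (the contents of Figure 2 are not
reproduced here), the helper name `action_timelines` is ours, and
the sketch assumes at most one action per person per time step.

```python
IDLE = "ϕ"  # symbol for "not involved in any action at this time"

def action_timelines(records, people, times):
    """Return, for each person, their action at every time step.

    `records` holds (time, participants, action) rows. A person who
    appears in no row for a given time is assigned the idle action.
    """
    position = {t: i for i, t in enumerate(times)}
    timelines = {person: [IDLE] * len(times) for person in people}
    for time, participants, action in records:
        for person in participants:
            timelines[person][position[time]] = action
    return timelines

# Hypothetical rows: E = Email, C = Phone Call, M = Meeting.
people = ["v1", "v2", "v3", "v4"]
times = [1, 2, 3]
records = [
    (1, ("v1", "v2"), "E"),
    (2, ("v2", "v3"), "C"),
    (3, ("v1", "v3", "v4"), "M"),
]
timelines = action_timelines(records, people, times)
print(timelines["v4"])
```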
This graph has a major conceptual difference from well-known
Bayesian and other similar graphical representations. Unlike such
conventional techniques, in which nodes are variables and links are
statistical relations among variables (causal relations), here nodes
represent entities, and links are relations among entities.
3.1 Graph Entropy
There is no commonly used definition of graph entropy. Indeed,
one can define the graph entropy as the Kolmogorov complexity of
its adjacency matrix, and one can even use this definition to obtain
interesting theoretical bounds for several important graph
characteristics, but the Kolmogorov complexity is incomputable [4].
In the following we adopt a notion of graph entropy which is
equal to Körner's definition [13] of graph entropy when the graph is
complete. Körner's definition of graph entropy also has the
limitation that all elements of the graph are assumed to be emitted
by a discrete, memoryless and stationary information source according
to the probability distribution P. We show how we add memory
to the graph entropy definition by looking at sequences with length
greater than 1.
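
The effect of looking at sequences longer than 1 can be illustrated
with ordinary Shannon entropy over sliding windows of an action
sequence. This is a generic sketch, not the event based graph entropy
defined in the paper; the function name `block_entropy` and the
sample sequence are assumptions made for the example.

```python
import math
from collections import Counter

def block_entropy(sequence, n=1):
    """Shannon entropy, in bits, of the empirical distribution of length-n windows."""
    windows = [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    if not windows:
        return 0.0
    total = len(windows)
    counts = Counter(windows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical action sequence that strictly alternates Email and Call.
actions = list("ECECECEC")

h1 = block_entropy(actions, n=1)  # single symbols: E and C are equally frequent
h2 = block_entropy(actions, n=2)  # pairs: only (E, C) and (C, E) ever occur
print(h1, h2)
```

For a memoryless source with two equally likely symbols, pairs would
carry about 2 bits; here the pair entropy stays below 1 bit because
each action determines the next, which is the kind of dependence a
memoryless model cannot capture.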
Körner gave several descriptions of graph entropy H(G, p),
including the following.


