Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding

  • Murtagh F
  • Downs G
  • Contreras P

Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
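
The longest common prefix idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes values in [0, 1) coded as truncated decimal digit strings, and it assumes the common convention that the Baire distance between two codes is base^(-k), where k is the length of their longest common prefix. The function names (`digit_string`, `baire_distance`, `prefix_hierarchy`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_DOWN


def digit_string(x, precision):
    """Code x (assumed to lie in [0, 1)) as its first `precision` decimal digits.

    Digits are truncated, not rounded, so reduced precision maps nearby
    values to the same prefix. Decimal avoids binary floating-point drift.
    """
    scaled = Decimal(str(x)).scaleb(precision).to_integral_value(rounding=ROUND_DOWN)
    return str(int(scaled)).zfill(precision)


def common_prefix_length(a, b):
    """Length of the longest common prefix of two digit strings."""
    k = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        k += 1
    return k


def baire_distance(a, b, base=10):
    """Assumed convention: base**(-k), with k the longest common prefix length.

    Codes that agree at every retained digit are treated as identical (distance 0).
    """
    if a == b:
        return 0.0
    return float(base) ** -common_prefix_length(a, b)


def prefix_hierarchy(values, precision):
    """Group point indices by shared prefix, one level per retained digit.

    Level k-1 holds the buckets for prefixes of length k. Each bucket at a
    deeper level is nested inside exactly one bucket of the level above,
    which is what makes the result a hierarchy.
    """
    codes = [digit_string(v, precision) for v in values]
    levels = []
    for k in range(1, precision + 1):
        buckets = defaultdict(list)
        for index, code in enumerate(codes):
            buckets[code[:k]].append(index)  # hashing on the prefix
        levels.append(dict(buckets))
    return levels


if __name__ == "__main__":
    values = [0.4251, 0.4267, 0.4912, 0.1300]
    for depth, level in enumerate(prefix_hierarchy(values, 3), start=1):
        print(depth, level)
    print(baire_distance(digit_string(0.4251, 4), digit_string(0.4267, 4)))
```

Each level is built with one pass over the data and one hash lookup per point, so no pairwise distances are computed. How this sketch relates to the O(n log n) figure quoted in the abstract is not something the abstract itself spells out.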

Author-supplied keywords

  • data properties
  • hashing
  • hierarchical clustering
  • partitioning
  • tree distance
  • ultrametric
  • AMS subject classifications: 05C05, 62-07, 62H30, 62P30, 68P20
  • DOI: 10.1137/060676532
