Unsupervised morphological segmentation and clustering with document boundaries

  • Taesun Moon
  • Katrin Erk
  • Jason Baldridge


Many approaches to unsupervised morphology acquisition incorporate the frequency of character sequences with respect to each other to identify word stems and affixes. This typically involves heuristic search procedures and the calibration of multiple arbitrary thresholds. We present a simple approach that uses no thresholds other than those involved in the standard application of chi-squared significance testing. A key part of our approach is using document boundaries both to constrain the generation of candidate stems and affixes and to cluster morphological variants of a given word stem. We evaluate our model on English and the Mayan language Uspanteko; it compares favorably to two benchmark systems that use considerably more complex strategies and rely more on experimentally chosen threshold values.
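The abstract does not spell out the procedure, so the following is only a minimal sketch of the two ingredients it names: candidate generation restricted to words that share a document, and a chi-squared significance test. Everything here is an assumption made for illustration and not the authors' implementation: the function names (`candidate_pairs`, `cooccurrence_table`, `chi2_2x2`), the use of a shared prefix as the candidate stem, the choice of a document-level co-occurrence table as the thing being tested, and the conventional 0.05 significance level.

```python
from itertools import combinations

# Critical value of the chi-squared distribution with 1 degree of freedom
# at the conventional 0.05 significance level (an assumed choice of level).
CHI2_CRITICAL_DF1_P05 = 3.841


def common_prefix(w1, w2):
    """Return the longest shared prefix of two words."""
    i = 0
    while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
        i += 1
    return w1[:i]


def candidate_pairs(documents):
    """Collect (stem, suffix1, suffix2) triples, hypothetical candidates only.

    Two words are paired only if they occur in the same document, which is
    one way document boundaries could constrain candidate generation.
    """
    candidates = set()
    for doc in documents:
        vocab = sorted(set(doc))
        for w1, w2 in combinations(vocab, 2):
            stem = common_prefix(w1, w2)
            if stem:  # skip pairs that share no prefix at all
                candidates.add((stem, w1[len(stem):], w2[len(stem):]))
    return candidates


def cooccurrence_table(documents, w1, w2):
    """Count documents containing both words, only w1, only w2, or neither."""
    both = only1 = only2 = neither = 0
    for doc in documents:
        words = set(doc)
        has1, has2 = w1 in words, w2 in words
        if has1 and has2:
            both += 1
        elif has1:
            only1 += 1
        elif has2:
            only2 += 1
        else:
            neither += 1
    return both, only1, only2, neither


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero row or column total carries no evidence
    return n * (a * d - b * c) ** 2 / denom


# Toy corpus: each inner list is one tokenized document.
docs = [
    ["walk", "walked", "walks"],
    ["talk", "talked"],
    ["walk", "walks"],
    ["the", "of"],
]
table = cooccurrence_table(docs, "walk", "walks")
statistic = chi2_2x2(*table)
is_significant = statistic > CHI2_CRITICAL_DF1_P05
```

In this toy corpus "walk" and "talk" are never paired, because they never share a document, while "walk" and "walks" are both proposed as candidates and pass the test. A real corpus would need far more documents for the chi-squared approximation to be reliable.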


