Investigating Unstructured Texts with Latent Semantic Analysis
Fridolin Wild, Christina Stahl
Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration,
Augasse 2-6, A-1090 Vienna, Austria, {firstname.surname}@wu-wien.ac.at
Abstract. Latent semantic analysis (LSA) is an algorithm applied to approximate
the meaning of texts, thereby exposing semantic structure to computation. LSA
combines the classical vector-space model, well known in computational
linguistics, with a singular value decomposition (SVD), a two-mode factor analysis.
Thus, bag-of-words representations of texts can be mapped into a modified vector
space that is assumed to reflect semantic structure. In this contribution the authors
describe the lsa package for the statistical language and environment R and
illustrate its proper use through examples from the areas of automated essay scoring
and knowledge representation.
1 Introduction to Latent Semantic Analysis
Derived from latent semantic indexing, LSA is intended to enable the analysis
of the semantic structure of texts. The basic idea behind LSA is that the
collocation of terms of a given document-term vector space reflects a higher-order
latent semantic structure, which is obscured by word usage (e.g., by
synonyms or ambiguities). By using conceptual indices that are derived
statistically via a truncated singular value decomposition, this variability problem
is believed to be overcome (Deerwester et al. (1990)).
In a typical LSA process, first a document-term matrix M is constructed
from a given text base of n documents containing m terms. The term
'textmatrix' will be used throughout the rest of this contribution to denote this
type of document-term matrix. This textmatrix M of size m × n is then resolved
by the singular value decomposition into the term-vector matrix T (constituting
the left singular vectors), the document-vector matrix D (constituting
the right singular vectors), both being orthonormal, and the diagonal matrix
S. These matrices are then reduced to a particular number of dimensions k,
giving the truncated matrices T_k, S_k and D_k: the latent semantic space.
Multiplying the truncated matrices (M_k = T_k S_k D_k^T) results in a new matrix M_k,
which is the least-squares best-fit approximation of M with k singular values.
M_k is of the same format as M, i.e., rows represent the same terms, columns
the same documents.
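The decomposition and truncation described above can be sketched in base R. This is a minimal illustration using svd() rather than the lsa package; the toy matrix M, its term and document names, and the choice k = 2 are assumptions made purely for demonstration.

```r
# Toy textmatrix (assumed data): 4 terms in rows, 3 documents in columns.
M <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 1,
              0, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("t1", "t2", "t3", "t4"), c("d1", "d2", "d3")))

dec <- svd(M)   # M = T %*% diag(S) %*% t(D)
k   <- 2        # number of dimensions to keep (arbitrary for this sketch)

Tk <- dec$u[, 1:k, drop = FALSE]   # truncated term-vector matrix
Sk <- diag(dec$d[1:k], nrow = k)   # truncated diagonal matrix of singular values
Dk <- dec$v[, 1:k, drop = FALSE]   # truncated document-vector matrix

Mk <- Tk %*% Sk %*% t(Dk)          # rank-k approximation of M
dimnames(Mk) <- dimnames(M)        # same terms in rows, same documents in columns
```

The squared approximation error sum((M - Mk)^2) equals the sum of the squared singular values that were dropped, which is what makes M_k the least-squares best fit of rank k.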
To keep additional documents from influencing a previously calculated
semantic space, or simply to re-use the structure contained in an already existing
factor distribution, new documents can be folded in after the singular value
decomposition. For this purpose, the add-on documents are added to the
pre-existing latent semantic space by mapping them into the existing factor
structure. Moreover, folding-in is computationally much less costly, as no
singular value decomposition is needed. To fold in, a pseudo-document vector \hat{m}
needs to be calculated in three steps (Berry et al. (1995)): after constructing
a document vector v from the additional documents, containing the term
frequencies in the exact order constituted by the input textmatrix M, v can be
mapped into the latent semantic space by applying (1) and (2).
    \hat{d} = v^{T} T_k S_k^{-1}    (1)
    \hat{m} = T_k S_k \hat{d}    (2)
Thereby, T_k and S_k are the truncated matrices from the previously calculated
latent semantic space. The resulting vector \hat{d} of Equation (1) represents
an additional column of D_k. The resulting pseudo-document vector \hat{m} from
Equation (2) is identical to an additional column in the textmatrix representation
of the latent semantic space.
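Equations (1) and (2) can be tried out with a few lines of base R. The sketch below is not the package's fold_in() routine; the helper name fold_in_sketch, the toy matrix and the new document vector v are hypothetical. Note that Equation (1) yields a row vector, which is transposed to a column vector before it is used in Equation (2).

```r
# Toy textmatrix (assumed data): 4 terms in rows, 3 documents in columns.
M <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 1,
              0, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("t1", "t2", "t3", "t4"), c("d1", "d2", "d3")))

dec <- svd(M)
k   <- 2
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)

# Hypothetical helper: fold one document vector into an existing space.
fold_in_sketch <- function(v, Tk, Sk) {
  d_hat <- t(v) %*% Tk %*% solve(Sk)   # Equation (1): 1 x k row vector
  m_hat <- Tk %*% Sk %*% t(d_hat)      # Equation (2): d_hat used as a column
  list(d_hat = as.vector(d_hat), m_hat = as.vector(m_hat))
}

# New document: term frequencies in the same term order as the rows of M.
v   <- c(1, 0, 2, 1)
res <- fold_in_sketch(v, Tk, Sk)
```

Folding in a document that was already part of M reproduces its coordinates in D_k, which is a convenient sanity check for any fold-in implementation.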
2 Influencing Parameters
Several classes of adjustment parameters can be functionally differentiated in
the latent semantic analysis process. Every class introduces new parameter
settings that drive the effectiveness of the algorithm. The following classes
have been identified so far by Wild et al. (2005): textbase compilation and
selection, preprocessing methods, weighting schemes, choice of dimensionality,
and similarity measurement techniques (see Figure 1).
Different texts create a different factor distribution. Moreover, texts may
be split into components such as sentences, paragraphs, chapters, bags-of-words
of a fixed size, or even into context bags around certain keywords. The
document collection available may be filtered according to specific criteria
such as novelty, or reduced to a random sample, so that only a subset of the
existing documents will actually be used in the latent semantic analysis. The
textbase compilation and selection options form one class of parameters.
Document preprocessing comprises several operations performed on the
input texts such as lexical analysis, stop-word filtering, reduction to word
stems, filtering of keywords above or below certain frequency thresholds, and
the use of controlled vocabularies (Baeza-Yates (1999)).
Weighting schemes have been shown to significantly influence the effectiveness
of LSA (Wild et al. (2005)). Weighting schemes in general can be
differentiated into local (lw) and global (gw) weighting schemes, which may
be combined as follows:
    \check{m} = lw(m) \cdot gw(m)    (3)
Fig. 1. Parameter classes influencing the algorithm effectiveness.
  - textbase compilation and selection (options): documents, chapters, paragraphs, sentences, context bags, number of docs
  - preprocessing (options): stemming, stopword filtering, global or local frequency bandwidth channel, controlled vocabulary, raw
  - weighting, local weights: none (raw), binary tf, log tf
  - weighting, global weights: none (raw), normalisation, idf, 1+entropy
  - dimensionality (singular values k): coverage = 0.3, 0.4, 0.5; coverage >= ndocs; 1/30; 1/50; magic 10; none (vector m.)
  - similarity measurement, correlation measure: pearson, spearman, cosine
  - similarity measurement, method: best hit, mean of best
Local schemes only take into account term frequencies within a particular
document, whereas global weighting schemes relate term frequencies to the
frequency distribution in the whole document collection. Weighting schemes
are needed to change the impact of relative and absolute term frequencies to,
e.g., emphasize medium-frequency terms as they are assumed to be most repre-
sentative for the documents described. Especially when dealing with narrative
text, high-frequency terms are often semantically meaningless functional terms
(e.g., 'the', 'it') whereas low-frequency terms can in general be considered to
be distractors generated, for example, through the use of metaphors. See
Section 3 for an overview on common weighting mechanisms.
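As an illustration of Equation (3), the following base R sketch combines a binary local weight with an inverse document frequency as the global weight. The function names lw_binary and gw_idf_sketch, the toy data, and the particular idf variant (log2 of the number of documents over the document frequency, plus one) are assumptions for this example and are not claimed to match the package's definitions.

```r
# Toy textmatrix (assumed data): terms in rows, documents in columns.
M <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 0,
              5, 4, 6),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("t1", "t2", "t3", "the"), c("d1", "d2", "d3")))

# Local weight: 1 if the term occurs in the document, 0 otherwise.
lw_binary <- function(m) (m > 0) * 1

# Global weight (assumed idf variant): one value per term, based on the number
# of documents the term occurs in.
gw_idf_sketch <- function(m) {
  df <- rowSums(m > 0)
  log2(ncol(m) / df) + 1
}

# Equation (3): element-wise product; the per-term global weights are
# recycled down the rows, so every row is scaled by its own weight.
weighted <- lw_binary(M) * gw_idf_sketch(M)
```

In this toy example the term 'the' occurs in every document and therefore receives the lowest weight, whereas 't3', which occurs in a single document, receives the highest.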
The choice of the number of dimensions is responsible for the effect
that distinguishes LSA from the pure vector-space model: if all dimensions
are used, the original matrix will be reconstructed and an unmodified
vector-space model is the basis for further processing. If fewer dimensions than
available non-zero singular values are used, the original vector space is
approximated. Thereby, relevant structure information inherent in the original
matrix is captured, reducing noise and variability in word usage. Several methods
to determine the optimal number of singular values to be used have been
proposed. Wild et al. (2005) report a new method that calculates the number
via a share of the cumulated singular values, with shares between 30% and 50%
showing the best results.
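One way to read the share-based method is sketched below in base R: keep the smallest number of leading singular values whose sum reaches the requested share of the total. The helper name dims_by_share and the singular values are hypothetical, and the exact rule used by the package may differ in detail.

```r
# Hypothetical helper: smallest k such that the first k singular values
# account for at least `share` of the sum of all singular values.
dims_by_share <- function(s, share = 0.5) {
  which(cumsum(s) / sum(s) >= share)[1]
}

s <- c(10, 6, 3, 1)   # assumed singular values, in descending order

k_30 <- dims_by_share(s, share = 0.3)
k_50 <- dims_by_share(s, share = 0.5)
k_90 <- dims_by_share(s, share = 0.9)
```

With these values the cumulated shares are 0.5, 0.8, 0.95 and 1, so shares of 30% and 50% both select a single dimension, while a share of 90% selects three.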
How the similarity of document or term vectors is measured forms another
class of influencing parameters. Both the similarity measure chosen
and the similarity measurement method affect the outcomes. Various correlation
measures have been applied in LSA. Among others, these comprise the
simple crossproduct, the Pearson correlation (and the nearly identical cosine
measure), and Spearman's Rho. The measurement method can, for example,
simply be a vector to vector comparison or the average correlation of a vector
with a particular vector set.
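The measures and measurement methods mentioned above can be compared on a small example in base R. The vectors a and b and the vector set gold are assumed data; cosine_sim is a helper defined here for illustration.

```r
# Cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Assumed vectors, e.g. two columns of a textmatrix.
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 9)

sim_cosine   <- cosine_sim(a, b)
sim_pearson  <- cor(a, b)                       # Pearson is the default method
sim_spearman <- cor(a, b, method = "spearman")  # rank-based correlation

# The Pearson correlation equals the cosine of the mean-centred vectors.
sim_centred <- cosine_sim(a - mean(a), b - mean(b))

# Measurement method: compare one vector with a set of vectors.
gold <- cbind(g1 = c(1, 2, 3, 5), g2 = c(4, 3, 2, 1))
cors     <- apply(gold, 2, function(g) cor(a, g))
best_hit <- max(cors)    # vector-to-vector comparison with the best match
avg_cor  <- mean(cors)   # average correlation with the vector set
```

The identity between Pearson and the centred cosine explains why the two measures are described as nearly identical: they differ only in whether the vectors are centred first.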
3 The lsa Package for R
In order to facilitate the use of LSA, a package for the statistical language
and environment R has been implemented by Wild (2005). The package is
open-source and available via CRAN, the Comprehensive R Archive Network.
A higher-level abstraction is introduced to ease the application of LSA.
Five core methods perform the direct LSA steps. With textmatrix(), a document
base can be read in from a specified directory. The documents are converted
to a textmatrix (i.e., document-term matrix, see above) object, which
holds terms in rows and documents in columns, so that each cell contains the
frequency of a particular term in a particular document. Alternatively, pseudo
documents can be created with query() from a given text string. The output
in this case is also a textmatrix, albeit with only one column (the query).
By calling lsa() on a textmatrix, a latent semantic space is constructed,
using the singular value decomposition as specified in Section 1. The three
truncated matrices from the SVD are returned as a list object. A latent semantic
space can be converted back to a textmatrix object with as.textmatrix().
The returned textmatrix has the same terms and documents, but with
modified frequencies that now reflect inherent semantic relations not explicit
in the original input textmatrix.
Additionally, the package contains several tuning options for the core routines
and various support methods which help in setting the influencing parameters.
Some examples are given below; for additional options see Wild (2005).
Considering text preprocessing, textmatrix() offers several argument options.
Two stop-word lists are provided with the package, one for German language
texts (370 terms) and one for English (424 terms), which can be used
to filter terms. Additionally, a controlled vocabulary can be specified; its sort
order will be preserved. Support for Porter's Snowball stemmer is provided
through interaction with the Rstem package (Lang (2004)). Furthermore, a
lower boundary for word lengths and minimum document frequencies can be
specified via an optional switch.
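To make the preprocessing options more tangible, the following base R sketch builds a small textmatrix from in-memory strings, applying a stop-word filter and a minimum word length. It only mimics what textmatrix() is described as doing; the documents, the tiny stop-word list and the helper tokenise are assumptions, and the package itself reads files from a directory instead.

```r
# Assumed toy documents (the package would read these from files).
docs <- c(d1 = "The cat sat on the mat",
          d2 = "The dog chased the cat",
          d3 = "A dog is a loyal animal")

stop_words <- c("the", "a", "on", "is")   # tiny hypothetical stop-word list
min_len    <- 2                           # lower boundary for word length

# Lexical analysis: lower-case, split on non-letters, filter short and stop words.
tokenise <- function(txt) {
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  words <- words[nchar(words) >= min_len]
  words[!(words %in% stop_words)]
}

tokens <- lapply(docs, tokenise)
vocab  <- sort(unique(unlist(tokens)))    # could be replaced by a controlled vocabulary

# Count term frequencies per document: terms in rows, documents in columns.
tm <- vapply(tokens,
             function(w) as.integer(table(factor(w, levels = vocab))),
             integer(length(vocab)))
rownames(tm) <- vocab
```

Passing a fixed vocab instead of deriving it from the tokens corresponds to using a controlled vocabulary: the rows then appear in exactly the order given, which is what allows new documents to be aligned with an existing textmatrix.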
Methods for term weighting include the local weightings (lw) raw, log, and
binary, and the global weightings (gw) normalisation, two versions of the
inverse document frequency (idf), and entropy in both the original Shannon
version and a slightly modified, more popular version (Wild (2005)).
Various methods for finding a useful number of dimensions are offered in
the package. A fixed number of values can be directly assigned as an argument
in the core routine. The same applies to the common practice of using a
fixed fraction of singular values, e.g., 1/50th or 1/30th. Several support methods
are offered to automatically identify a reasonable number of dimensions:
a percentage of the cumulated values (e.g., 50%); equalling the number of
documents with a share of the cumulated values; dropping all values below
1.0 (the so-called 'Kaiser Criterion'); and finally the pure vector model with
all available values (Wild (2005)).
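Three of these heuristics are simple enough to sketch directly in base R. The singular values below are assumed, the rounding used for the fixed fraction is one possible choice, and none of this is meant to reproduce the package's support methods exactly.

```r
# Assumed singular values in descending order: 6.0, 5.9, ..., 0.1.
s <- (60:1) / 10

# Kaiser criterion: drop all singular values below 1.0.
k_kaiser <- sum(s >= 1)

# Fixed fraction of the available singular values, here 1/30 (at least one).
k_fraction <- max(1, round(length(s) / 30))

# Pure vector model: keep all available values.
k_raw <- length(s)
```

For these values the Kaiser criterion keeps 51 dimensions, the 1/30 fraction keeps 2, and the pure vector model keeps all 60, which shows how strongly the heuristics can disagree on the same space.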
4 Demonstrations
In the following section, two examples will be given on how LSA can be applied
in practice. The first case illustrates how LSA may be used to automatically
score free-text essays in an educational assessment setting. Typically, if
conducted by teachers, essays written by students are marked through careful
reading and evaluation along specific criteria, among others their content.
'Essay' thereby refers to a test item which requires a response composed by
the examinee, usually in the form of one or more sentences, of a nature that
no single response or pattern of responses can be listed as correct (Stalnaker
(1951)).
Fig. 2. LSA Process for both examples. [Flow diagram: generic documents and domain-specific documents are converted to a textmatrix, from which the latent semantic space is constructed; test essays and gold-standard essays are converted to a textmatrix and folded in; the resulting document vectors or term vectors are then compared.]
When emulating human understanding with LSA, first a latent semantic
space needs to be trained from domain-specific and generic background
documents. Generic texts thereby add a reasonably heterogeneous amount of
general vocabulary, whereas the domain-specific texts provide the professional
vocabulary. The document collection is therefore converted into a textmatrix
object (see Figure 2, Step 1). Based on this textmatrix, a latent semantic space
is constructed in Step 2. Ideally, this space is an optimal configuration of factors
calculated from the training documents and is able to evaluate content
similarity. To keep the essays to be tested and a collection of best-practice
examples (so-called 'gold-standard essays') from influencing this space, they
are folded in after the SVD. In Step A they are converted into a textmatrix
applying the vocabulary and term order from the textmatrix generated in Step
1. In Step B they are folded into this existing latent space (see Section 1).
As a very simple scoring method, the Pearson Correlation between the test
essays and the gold-standard essays can be used for scoring as indicated in
Step C. A high correlation equals a high score. See Listing 1 for the R code.
Listing 1. Essay Scoring with LSA
    library("lsa")   # load package

    # load training texts
    trm = textmatrix("trainingtexts/")
    trm = lw_bintf(trm) * gw_idf(trm)   # weighting
    space = lsa(trm)                    # create LSA space

    # fold-in test and gold standard essays
    tem = textmatrix("essays/", vocabulary=rownames(trm))
    tem = lw_bintf(tem) * gw_idf(tem)   # weighting
    tem_red = fold_in(tem, space)

    # score essay against gold standard
    cor(tem_red[,"gold.txt"], tem_red[,"E1.txt"])   # 0.7
The second case illustrates how a space changes behavior when both the
corpus size of the document collection and the number of dimensions are varied.
This example can be used for experiments investigating the two driving
parameters 'corpus size' and 'optimal number of dimensions'.
Fig. 3. Highly frequent terms 'eu' vs. 'oesterreich' (Pearson). [Plot; axes: dims (200 to 600), ndocs (200 to 600), cor (0.0 to 1.0).]
Fig. 4. Highly frequent terms 'jahr' vs. 'wien' (Pearson). [Plot; axes: dims (200 to 600), ndocs (200 to 600), cor (0.0 to 1.0).]
Therefore, as can be seen in Figure 2, a latent semantic space is con-
structed from a document collection by converting a randomised full sample
of the available documents to a textmatrix in Step 1 (see Listing 2, Lines 4-5)
and by applying the lsa() method in Step 2 (see Line 16). Step 3 (Lines 17-18)
converts the space to textmatrix format and measures the similarity between
two terms. By varying corpus size (Line 9 and 10-13 for sanitising) and
dimensionality (Line 15), behavior changes of the space can be investigated.
Figure 3 and Figure 4 show visualisations of this behavior data: the terms
of Figure 3 were considered to be highly associated and thus were expected
to be very similar in their correlations. Evidence for this can be derived from
the chart when comparing with Figure 4, visualising the similarities from a
term pair considered to be unrelated ('jahr' = 'year', 'wien' = 'vienna'). In
fact, the base level of the correlations of the first, highly associated term pair
is visibly higher than that of the second, unrelated term pair. Moreover, at
the turning point of the cor-dims curves, the correlation levels have an even
increased distance, which already stabilises for a comparatively small number
of documents.
Listing 2. The Geometry of Meaning
 1  tm = textmatrix("texts/", stopwords=stopwords_de)  # assumes library("lsa") is loaded
 2
 3  # randomize document order
 4  rndsample = sample(1:ncol(tm))
 5  sm = tm[, rndsample]
 6
 7  # measure term-term similarities
 8  s = NULL
 9  for (i in (2:ncol(sm))) {
10    # filter out unused terms
11    if (any(rowSums(sm[, 1:i]) == 0)) {
12      m = sm[-(which(rowSums(sm[, 1:i]) == 0)), 1:i]
13    } else { m = sm[, 1:i] }   # keep only the first i documents in both branches
14    # increase dims
15    for (d in 2:i) {
16      space = lsa(m, dims=d)
17      redm = as.textmatrix(space)
18      s = c(s, cor(redm["jahr", ], redm["wien", ]))
19    }
20  }
5 Evaluating Algorithm Effectiveness
Evaluating the effectiveness of LSA, especially with changing parameter settings,
is dependent on the application area targeted. Within an information
retrieval setting, the same results may lead to a different interpretation than
in an essay scoring setting. One evaluation option is to externally validate by
comparing machine behavior to human behavior (see Figure 5). For the essay
scoring example, the authors have evaluated machine against human scores,
finding a man-machine correlation (Spearman's Rho) of up to .75, significant
at a level below .001 in nine exams tested. In comparison, human-to-human


