Some experiments in the generation of word and document associations

Gerard Salton

Conference Proceedings


AFIPS Conference Proceedings - 1962 Fall Joint Computer Conference, AFIPS 1962 (1962) 234-250

DOI: 10.1145/1461518.1461544


Abstract

The solution of most problems in automatic information dissemination and retrieval is dependent on the availability of methods for the automatic analysis of information content. In most proposed automatic systems, this analysis is based on a counting procedure which uses the frequency of occurrence of certain words or word classes to generate sets of index terms, to prepare automatic abstracts or extracts, to determine certain word groupings, and to extend or modify in various ways sets of terms originally given. Unfortunately, it is not possible to perform completely effective subject analyses solely by frequency counting techniques. Two automatic methods are presented to aid in an effective subject analysis. The first makes use of a simplified form of syntactic analysis to determine associations between words in a text, and the second uses bibliographic citations to classify documents into subject areas. Neither method requires extensive dictionaries or tables of the type normally used for automatic classification schemes; instead, information is extracted from certain function words, from suffixes provided with many words in the language, and from bibliographic citations already available with most documents. Specifically, the syntactic analysis makes use of a small dictionary of a few hundred function words such as prepositions, conjunctions, articles, and certain nouns. Word suffixes are then isolated, and a suffix table is used to obtain additional grammatical indicators. A type of predictive analysis is then used to assign syntactic function indicators to all words in a sentence by matching predicted syntactic structures against the available grammatical information for the various words. If no grammatical information is available, the most likely prediction is used to classify the given word. The syntactic function indicators are used to group words into phrases of certain types, and phrases into clauses, and to determine certain word associations.
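The lookup order described above (function-word dictionary, then suffix table, then the most likely prediction) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the tiny `FUNCTION_WORDS`, `SUFFIX_TABLE`, and `PREDICTED_NEXT` tables, the tag names, and the function `tag_sentence` are hypothetical and are not taken from the paper, and the "predictive analysis" is reduced to a one-step guess based on the previous tag.

```python
# Illustrative tables only (assumptions); the paper's dictionary holds a few
# hundred function words and its suffix table is not reproduced here.
FUNCTION_WORDS = {
    "the": "ARTICLE", "a": "ARTICLE", "an": "ARTICLE",
    "of": "PREPOSITION", "in": "PREPOSITION", "by": "PREPOSITION",
    "and": "CONJUNCTION", "or": "CONJUNCTION",
}

# Checked in order, so longer suffixes are listed before shorter ones.
SUFFIX_TABLE = [
    ("ation", "NOUN"), ("ment", "NOUN"), ("ness", "NOUN"),
    ("ing", "VERB_FORM"), ("ed", "VERB_FORM"),
    ("ly", "ADVERB"), ("al", "ADJECTIVE"), ("ic", "ADJECTIVE"),
]

# Hypothetical one-step prediction: the class expected after a given class.
PREDICTED_NEXT = {
    "ARTICLE": "NOUN",
    "PREPOSITION": "NOUN",
    "ADJECTIVE": "NOUN",
    "ADVERB": "VERB_FORM",
}
DEFAULT_TAG = "NOUN"  # assumed fallback when no prediction applies


def tag_sentence(sentence):
    """Assign a coarse syntactic indicator to each word of a sentence."""
    tags = []
    previous = None
    for token in sentence.lower().split():
        word = token.strip(".,;:!?")
        if not word:
            continue
        # 1. Small function-word dictionary.
        tag = FUNCTION_WORDS.get(word)
        # 2. Suffix table, skipping words that are barely longer than the suffix.
        if tag is None:
            tag = next(
                (label for suffix, label in SUFFIX_TABLE
                 if word.endswith(suffix) and len(word) > len(suffix) + 1),
                None,
            )
        # 3. No grammatical information: fall back to the predicted class.
        if tag is None:
            tag = PREDICTED_NEXT.get(previous, DEFAULT_TAG)
        tags.append((word, tag))
        previous = tag
    return tags


tagged = tag_sentence("The analysis of automatic classification")
```

In this sketch "analysis" matches neither table, so it receives the class predicted after an article, while "automatic" and "classification" are resolved by their suffixes.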
Experimental evidence indicates that the error rate is not substantially higher than that found in other more complicated syntactic analysis programs which require full syntactic word dictionaries. The citation matching program uses bibliographic citations to determine document similarities. A similarity coefficient is first calculated for all document pairs as a function of the number of overlapping citations between them. A second similarity coefficient is then derived, this time using the number of overlapping index terms as the criterion. The index terms may be generated by hand, or may be derived by means of word frequency analyses. Finally, similarity coefficients derived from overlapping citations are compared with those derived from overlapping index terms. The coefficients, computed for a sample document collection, are analyzed to verify the hypothesis that when a closeness exists in the subject matter of certain documents, as reflected by overlapping index terms, there exists a corresponding closeness in the citation sets. It is found that the computed similarity coefficients are much larger than those obtained by assuming a random assignment of citations and index terms. Suggestions are made for using citation sets as an aid to the automatic generation of index terms.
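The abstract does not state the formula for the similarity coefficient, only that it is a function of the number of overlapping citations (or index terms). As a hedged illustration, the sketch below assumes a cosine-style overlap measure for sets, |A ∩ B| / sqrt(|A| · |B|), and applies it to both citation sets and index-term sets for every document pair. The function names and the three sample documents are hypothetical and are not data from the paper.

```python
from itertools import combinations
from math import sqrt


def overlap_similarity(a, b):
    """Cosine-style overlap of two sets (an assumed choice of coefficient)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def pairwise_similarities(doc_sets):
    """Return {(doc1, doc2): coefficient} for every unordered document pair."""
    return {
        (d1, d2): overlap_similarity(doc_sets[d1], doc_sets[d2])
        for d1, d2 in combinations(sorted(doc_sets), 2)
    }


# Hypothetical sample collection: citation sets and index-term sets per document.
citations = {
    "D1": {"r1", "r2", "r3"},
    "D2": {"r2", "r3", "r4"},
    "D3": {"r9"},
}
index_terms = {
    "D1": {"retrieval", "indexing"},
    "D2": {"retrieval", "indexing"},
    "D3": {"syntax"},
}

citation_sim = pairwise_similarities(citations)
term_sim = pairwise_similarities(index_terms)

# Side-by-side comparison of the two coefficients for each pair.
for pair in citation_sim:
    print(pair, round(citation_sim[pair], 3), round(term_sim[pair], 3))
```

In this toy collection the pair that shares index terms (D1, D2) also shares citations, which is the kind of correspondence the hypothesis predicts; a real test would compare the observed coefficients against those expected under random assignment.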

Cite

Salton, G. (1962). Some experiments in the generation of word and document associations. In AFIPS Conference Proceedings - 1962 Fall Joint Computer Conference, AFIPS 1962 (pp. 234–250). Association for Computing Machinery, Inc. https://doi.org/10.1145/1461518.1461544
