Latent semantic analysis

 

Latent semantic analysis (LSA) is a technique in information retrieval invented in 1990 [1]. It is sometimes called latent semantic indexing (LSI).

LSA is a preprocessing step, applied before documents are classified or searched, whose purpose is to make both tasks easier. It is meant to address two fundamental problems in natural language processing: synonymy and polysemy. In synonymy, different writers use different words to describe the same idea. A person issuing a query in a search engine may therefore use a different word from the one that appears in a relevant document, and fail to retrieve that document. In polysemy, the same word has multiple meanings, so a searcher can get unwanted documents that use the word in another sense.

LSA starts with a document-term matrix, a sparse matrix whose rows correspond to documents and whose columns correspond to terms (typically stemmed words that appear in the documents). The entries of the matrix are typically tf-idf weights: each entry is proportional to the number of times the term appears in the document, with rare terms upweighted to reflect their relative importance.
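As a minimal sketch of such a matrix, the following Python example builds tf-idf weights for a tiny, hypothetical corpus. It assumes whitespace tokenization with no stemming and the common weighting count × log(N / df), which is only one of several tf-idf variants; the function name `tfidf_matrix` and the sample documents are illustrative and not taken from any particular library.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, weights); weights[i][j] is the tf-idf of term j in document i.

    Assumes whitespace tokenization without stemming, and idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count scaled by idf, so terms found in few documents weigh more.
        weights.append([counts[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, weights

# Hypothetical three-document corpus.
docs = ["car engine repair", "automobile engine repair", "banana bread recipe"]
vocab, weights = tfidf_matrix(docs)
```

In this corpus "car" occurs in one document and "engine" in two, so "car" receives the larger weight in the first row. A term that occurs in every document would receive weight zero under this particular variant.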

LSA then finds a low-rank approximation to the document-term matrix, through the use of singular value decomposition (SVD). In LSA, this SVD is truncated, so that each document and each term is represented by a vector of much lower dimensionality than the total number of words in the vocabulary. When a user issues a query, the query is mapped into this low-dimensional space and compared with the documents in that same space.
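The truncation and the query mapping can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the corpus is a hypothetical toy matrix of raw counts (not tf-idf), k = 2 is chosen arbitrarily, and the query is mapped by projecting its term vector onto the k leading right singular vectors, which is one of several conventions in use.

```python
import numpy as np

# Hypothetical toy corpus: rows are documents, columns are terms (raw counts).
terms = ["car", "automobile", "engine", "banana", "bread"]
X = np.array([
    [1, 0, 1, 0, 0],   # doc 0: car engine
    [0, 1, 1, 0, 0],   # doc 1: automobile engine
    [0, 0, 0, 1, 1],   # doc 2: banana bread
    [0, 0, 0, 2, 1],   # doc 3: banana banana bread
], dtype=float)

def truncated_svd(X, k):
    """Return the k leading singular vectors and singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

k = 2
U_k, s_k, Vt_k = truncated_svd(X, k)

# Rank-k approximation of the document-term matrix.
X_k = (U_k * s_k) @ Vt_k

# Documents in the k-dimensional space (equal to X @ Vt_k.T).
doc_vecs = U_k * s_k

# Map the one-word query "car" into the same space.
query = np.zeros(len(terms))
query[terms.index("car")] = 1.0
query_vec = query @ Vt_k.T

raw_scores = [cosine(query, row) for row in X]           # term-space comparison
lsa_scores = [cosine(query_vec, d) for d in doc_vecs]    # low-dimensional comparison
```

In the original term space the query has zero similarity to doc 1, which never mentions "car". In the two-dimensional space its cosine similarity to doc 1 is close to 1, because "car" and "automobile" both co-occur with "engine", while its similarity to the two food documents stays close to 0.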

Because LSA uses a low-dimensional representation for terms and documents, it cannot simply record which terms occur in which documents; it is forced to capture broader patterns of co-occurrence, which are taken to reflect meaning. Thus, documents and terms with similar meanings lie close together in the low-dimensional space. This can mitigate polysemy (by using more than one word in the query to disambiguate in the low-dimensional space) and synonymy (because synonymous words map to similar positions in the low-dimensional space).
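The effect on synonymy can be illustrated by comparing term vectors before and after truncation. The sketch below assumes a hypothetical toy matrix of raw counts and k = 2, and represents each term by its row of the right singular vectors scaled by the singular values, which is one common choice. It does not attempt to demonstrate the polysemy case.

```python
import numpy as np

# Hypothetical toy corpus: rows are documents, columns are terms (raw counts).
terms = ["car", "automobile", "engine", "banana", "bread"]
X = np.array([
    [1, 0, 1, 0, 0],   # car engine
    [0, 1, 1, 0, 0],   # automobile engine
    [0, 0, 0, 1, 1],   # banana bread
    [0, 0, 0, 2, 1],   # banana banana bread
], dtype=float)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

car = terms.index("car")
automobile = terms.index("automobile")
banana = terms.index("banana")

# In the raw matrix each term is a column; "car" and "automobile" never
# appear in the same document, so their columns are orthogonal.
raw_similarity = cosine(X[:, car], X[:, automobile])

# Term vectors in the k-dimensional space: one row per term.
k = 2
_, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = Vt[:k, :].T * s[:k]

lsa_similarity = cosine(term_vecs[car], term_vecs[automobile])
lsa_unrelated = cosine(term_vecs[car], term_vecs[banana])
```

Here `raw_similarity` is 0, while `lsa_similarity` is close to 1: the two words share the context "engine", so the truncated decomposition places them at the same position. The unrelated pair "car" and "banana" remains close to orthogonal.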

LSA has come under criticism because its implied probabilistic model does not match the observed data. LSA assumes that words and documents form a joint Gaussian model. However, a Gaussian model can generate negative values, and a document cannot contain a negative number of occurrences of a word. A newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA. Nevertheless, LSA remains a standard algorithm in information retrieval.
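One concrete symptom of this mismatch is that the low-rank reconstruction of a matrix of counts can contain negative entries. The following sketch uses a small, hypothetical 3 × 3 count matrix chosen so that the result is easy to check by hand; it illustrates the objection only and is not a statement about any particular corpus.

```python
import numpy as np

# Hypothetical document-term counts (rows: documents, columns: terms).
X = np.array([
    [2, 1, 0],
    [1, 2, 1],
    [0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X)

# Rank-2 approximation: keep the two largest singular values.
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The original counts are all non-negative, but the reconstruction is not.
has_negative = bool((X_k < 0).any())
```

For this matrix the reconstructed entry for the third term in the first document is about -0.146, even though every original count is zero or positive. A multinomial model cannot assign such a value to a count.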

External links and references

  • the first place to start with LSA
  • Introduction to Latent Semantic Analysis, by T. K. Landauer, P. W. Foltz, & D. Laham, Discourse Processes, 25, 259-284 (1998).
  • Indexing by Latent Semantic Analysis, by S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Journal of the American Society for Information Science, 41(6), 391-407 (1990).
  • Probabilistic Latent Semantic Analysis, by T. Hofmann, Proc. Uncertainty in Artificial Intelligence, (1999)



    This article is from Wikipedia. All text is available
    under the terms of the GNU Free Documentation License.