
Latent Semantic Analysis





APPLICATIONS

Applications of LSA include mitigating two fundamental problems of natural-language search, synonymy and polysemy:
  • In synonymy, different writers use different words to describe the same idea. Thus, a person issuing a query in a search engine may use a different word than appears in a document, and may not retrieve the document.

  • In polysemy, the same word can have multiple meanings, so a searcher can get unwanted documents that use the word in one of its other meanings.
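
Both problems are easy to reproduce with literal keyword matching. The following is a minimal sketch; the toy documents and the `keyword_search` helper are hypothetical and exist only for illustration.

```python
# Hypothetical toy corpus, invented for illustration only.
docs = {
    "d1": "the automobile dealer sold a used sedan",
    "d2": "my car needs new tires",
    "d3": "the river bank flooded",
    "d4": "the bank approved the loan",
}

def keyword_search(query, docs):
    """Return the ids of documents that contain the query word verbatim."""
    return sorted(doc_id for doc_id, text in docs.items() if query in text.split())

# Synonymy: d1 is about cars too, but it says "automobile", so it is missed.
synonymy_hits = keyword_search("car", docs)

# Polysemy: "bank" matches both the river sense and the financial sense.
polysemy_hits = keyword_search("bank", docs)
```

Here `synonymy_hits` contains only `d2`, while `polysemy_hits` contains both `d3` and `d4`, although a searcher usually has only one of the two senses in mind.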



OCCURRENCE MATRIX

LSA uses a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to documents and whose columns correspond to terms, typically stemmed words that appear in the documents. A typical example of the weighting of the elements of the matrix is tf-idf: each element of the matrix is proportional to the number of times the term appears in the document, with rare terms upweighted to reflect their relative importance.
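
As a rough illustration, the matrix can be built in a few lines. This sketch assumes one common tf-idf variant (raw term count multiplied by log(N / document frequency)); the toy documents are invented, no stemming is applied, and the matrix is stored as dense lists for readability rather than in a sparse format.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization, no stemming.
docs = [
    "car truck car",
    "truck road",
    "flower garden flower",
]

tokenized = [d.split() for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})  # column order
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_row(doc):
    """One matrix row: term count times log(N / df) for every vocabulary term."""
    counts = Counter(doc)
    return [counts[term] * math.log(n_docs / df[term]) for term in vocab]

# Rows correspond to documents, columns to terms, as in the text above.
matrix = [tfidf_row(doc) for doc in tokenized]
```

In the first row, each occurrence of "car" (which appears in one document) receives a larger weight than the occurrence of "truck" (which appears in two), which is the upweighting of rare terms described above.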

This matrix is also common to standard semantic models, though it is not necessarily expressed explicitly as a matrix, since the mathematical properties of matrices are not always used.


RANK LOWERING

After the construction of the occurrence matrix, LSA finds a low-rank approximation to the term-document matrix. There can be various reasons for this approximation:

  • The original term-document matrix is presumed too large for the computing resources; from this point of view, the approximated matrix is interpreted as an "approximation" (a "least and necessary evil").

  • The original term-document matrix is presumed "noisy": for instance, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a "de-noisified" matrix (a better matrix than the original).

  • The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document, generally a much larger set due to synonymy.


The consequence of the rank lowering is that some dimensions get "merged":

  • {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}


This mitigates synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share this sense. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.
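
The merging effect can be seen on a very small example. The sketch below uses NumPy and an invented 6-document, 3-term count matrix (rows are documents, columns are the terms car, truck and flower); the data and variable names are assumptions made for illustration, not part of any reference implementation.

```python
import numpy as np

# Rows = documents, columns = terms (car, truck, flower). Toy counts.
A = np.array([
    [1, 1, 0],   # car truck
    [1, 1, 0],   # car truck
    [1, 0, 0],   # car only
    [0, 1, 0],   # truck only
    [0, 0, 1],   # flower
    [0, 0, 1],   # flower
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # number of dimensions kept
doc_coords = U[:, :k] * s[:k]        # each row: a document in the reduced space

merged = np.abs(Vt[0])               # weights of car, truck, flower in the top dimension

sim_before = cosine(A[2], A[3])                      # "car only" vs "truck only", original space
sim_after = cosine(doc_coords[2], doc_coords[3])     # same pair, reduced space
sim_flower = cosine(doc_coords[2], doc_coords[4])    # "car only" vs a flower document
```

In the original space the "car only" and "truck only" documents share no term, so `sim_before` is 0. Because car and truck co-occur in other documents, the strongest dimension weights them equally (`merged` is roughly 0.7071, 0.7071, 0), and after keeping two dimensions `sim_after` is approximately 1 while `sim_flower` stays near 0.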


IMPLEMENTATION

Concretely, the rank reduction is often achieved through a singular value decomposition (SVD): the set of all the terms is then represented by a vector space of lower dimensionality than the total number of terms in the vocabulary.
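
A minimal sketch of this step, assuming NumPy's dense `numpy.linalg.svd` and a small made-up matrix (the helper name `low_rank_approximation` is hypothetical): the k largest singular values and their vectors are kept, and the rest are discarded.

```python
import numpy as np

def low_rank_approximation(A, k):
    """Return the rank-k truncated-SVD approximation of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

# Toy matrix standing in for a (weighted) term-document matrix.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 3.0],
])

A_2, singular_values = low_rank_approximation(A, k=2)

# Frobenius-norm error of the approximation.
error = np.linalg.norm(A - A_2)
```

The Frobenius error of the truncated SVD equals the square root of the sum of the squared discarded singular values, so `error` can be checked against `singular_values[2:]`. For matrices of realistic size a sparse or iterative solver would normally be used instead of a dense decomposition.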

The SVD is typically computed using large-matrix methods (for example, Lanczos methods) but may also be computed incrementally and with greatly reduced resources via a neural-network-like approach, which does not require the large, full-rank matrix to be held in memory.


LIMITATIONS OF LSA

LSA has a number of drawbacks:

  • The resulting dimensions might be difficult to interpret. For instance, in

      {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}

    the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to

      {(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)}

    will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.



