Tf-idf




The tf–idf weight (term frequency–inverse document frequency) is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection, or corpus. The importance increases in proportion to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool for scoring and ranking a document's relevance to a user query.

The ''term frequency'' in a given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents, which may have a higher raw count for a term regardless of the actual importance of that term in the document. The normalized value gives a measure of the importance of the term t_{i} within the particular document:

\mathrm{tf_i} = \frac{n_i}{\sum_k n_k}

where n_i is the number of occurrences of the considered term in the document, and the denominator is the total number of occurrences of all terms in that document.
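
The normalized term frequency can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequencies`, the whitespace tokenization, and the sample sentence are assumptions made for the example and are not part of the definition above.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return tf_i = n_i / sum_k n_k for every distinct term in one document.

    `tokens` is the document as a list of terms; how the text is split
    into terms is left to the caller (an assumption of this sketch).
    """
    counts = Counter(tokens)        # n_i: occurrences of each term
    total = sum(counts.values())    # sum_k n_k: occurrences of all terms
    if total == 0:
        return {}                   # empty document: no terms to weight
    return {term: n / total for term, n in counts.items()}

# Hypothetical six-word document, split on whitespace.
doc = "the cat sat on the mat".split()
tf = term_frequencies(doc)
print(tf["the"], tf["cat"])
```

In this sample "the" occurs twice among six terms and "cat" once, so their term frequencies are 2/6 and 1/6. Because each count is divided by the document length, the values for one document always sum to 1.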

The ''inverse document frequency'' is a measure of the general importance of the term. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
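
As a rough sketch, the inverse document frequency described above could be computed as follows. The function name `inverse_document_frequency`, the representation of each document as a set of terms, the natural logarithm (the text does not fix a base), and the error raised for a term found in no document are all assumptions made for illustration.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / d), where N is the number of documents and d is the
    number of documents containing `term`.

    `documents` is a list of sets of terms (an assumed representation).
    """
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # The quotient is undefined here; raising is this sketch's choice.
        raise ValueError(f"term {term!r} does not occur in any document")
    # Natural logarithm, chosen only because the base is unspecified.
    return math.log(len(documents) / containing)

# Hypothetical four-document corpus.
corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"a", "bird", "sang"},
]
idf_the = inverse_document_frequency("the", corpus)    # log(4 / 3)
idf_bird = inverse_document_frequency("bird", corpus)  # log(4 / 1)
print(idf_the, idf_bird)
```

Here "the" appears in three of the four documents and receives a much smaller value than "bird", which appears in only one. This is the offsetting effect of corpus frequency: terms that occur in most documents are weighted down.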