
TFIDF

TFIDF (Term Frequency, Inverse Document Frequency) is a statistical technique for evaluating how important a word is to a document within a collection, or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how common the word is across all documents in the collection. Search engines often use TFIDF to find the documents most relevant to a user's query.
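As a rough illustration of the ranking use mentioned above, the following Python sketch scores three made-up documents for a one-word query. It assumes one simple scoring formula (term frequency divided by document frequency) and naive whitespace tokenization. The `docs` collection and the `tfidf_scores` helper are hypothetical names introduced only for this example, not part of any search engine.

```python
from collections import Counter

# Hypothetical three-document collection, made up for illustration.
docs = {
    "farm": "the cow ate grass while the other cow slept",
    "city": "the bus was late and the train was full",
    "zoo": "the zoo has a lion a zebra and a cow",
}


def tfidf_scores(word, docs):
    """Score each document for `word` as term frequency / document frequency."""
    # Naive tokenization: split each document on whitespace.
    tokens = {name: text.split() for name, text in docs.items()}
    # Number of documents that contain the word at least once.
    containing = sum(1 for words in tokens.values() if word in words)
    if containing == 0:
        # The word appears nowhere, so every document scores zero.
        return {name: 0.0 for name in docs}
    df = containing / len(docs)
    return {
        name: (Counter(words)[word] / len(words)) / df
        for name, words in tokens.items()
    }


scores = tfidf_scores("cow", docs)
# Highest score first: the document where the word is most prominent.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['farm', 'zoo', 'city']
```

The "farm" document ranks first because cow makes up a larger share of its words, while the "city" document, which never mentions the word, scores zero.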

There are many different formulas for calculating TFIDF. In a simple version, the term frequency (TF) is the number of times the word appears in a document divided by the total number of words in that document. If a document contains 100 words and the word cow appears 3 times, the term frequency of cow in that document is 0.03 (3/100).

One way of calculating document frequency (DF) is to divide the number of documents that contain the word by the total number of documents in the collection. If cow appears in 1,000 documents out of 10,000,000, its document frequency is 0.0001 (1,000/10,000,000).

The TFIDF score is then the term frequency divided by the document frequency, which is the same as multiplying the term frequency by the inverse document frequency. In this example, the TFIDF score for cow in the document is 300 (0.03/0.0001). A common alternative takes the logarithm (for example, log2) of the inverse document frequency before multiplying it by the term frequency, which dampens the effect of very rare words; here that gives 0.03 × log2(10,000) ≈ 0.40.
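The arithmetic in the cow example can be checked with a short Python sketch. The function names are hypothetical, chosen only for this illustration. The log variant assumes the logarithm is applied to the inverse document frequency (1/DF), which is the usual convention, rather than to the document frequency itself.

```python
import math


def term_frequency(term_count, total_words):
    """Occurrences of the word divided by the total words in the document."""
    return term_count / total_words


def document_frequency(docs_with_term, total_docs):
    """Fraction of documents in the collection that contain the word."""
    return docs_with_term / total_docs


def tfidf_ratio(tf, df):
    """Simple variant from the text: term frequency divided by document frequency."""
    return tf / df


def tfidf_log2(tf, df):
    """Assumed log variant: term frequency times log2 of the inverse document frequency."""
    return tf * math.log2(1 / df)


tf = term_frequency(3, 100)                 # 3 occurrences in a 100-word document
df = document_frequency(1_000, 10_000_000)  # 1,000 of 10,000,000 documents
print(round(tfidf_ratio(tf, df), 6))        # 300.0
print(round(tfidf_log2(tf, df), 4))         # 0.3986
```

The two variants produce scores on very different scales, so scores are only comparable when every document is scored with the same formula.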

External links

  • Term Weighting Approaches in Automatic Text Retrieval


    This article is from Wikipedia. All text is available
    under the terms of the GNU Free Documentation License.
    © 2008 Chamas Enterprises Inc.