Tuesday, August 4, 2009

Thoughts on TF-IDF

In information-theoretic terms, the information content of an event is proportional to the logarithm of the reciprocal of the probability of the event. For example, consider an event with three possible outcomes, with probabilities 1/8, 1/4 and 5/8 respectively. Which outcome will surprise you the most? Obviously the first one. In this modern era of information explosion, quality information is wealth. The more surprising the information, the higher its quality. The lower the probability, the greater the surprise. But doubling the surprise may not double the information content. Claude Shannon, in 1948, came up with a definition of information content in terms of surprise. He argued that information content grows logarithmically with the surprise factor: to double the information content, you have to square the surprise. An outcome with probability 1/4 carries 2 bits of information, while one with probability 1/16 carries 4. This formulation is a generalization of the number of bits needed to encode a number to be transmitted over a wire. For example, if there are four equally likely outcomes, they can be labelled 0-3 and only two bits are needed to represent them.
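A quick Python sketch makes this concrete (the function name here is mine, purely for illustration):

    import math

    def information_content(p):
        # Bits of information carried by an event with probability p:
        # log2 of the reciprocal of the probability.
        return math.log2(1.0 / p)

    for p in (1/8, 1/4, 5/8):
        print(f"p = {p:.3f} -> {information_content(p):.2f} bits")
    # p = 0.125 -> 3.00 bits, p = 0.250 -> 2.00 bits, p = 0.625 -> 0.68 bits:
    # the rarest outcome is the most informative.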

In text mining, statistical approaches treat a text as a bag of words. Typically each word is associated with a weight that measures the importance of the word in the given text. Now let's ponder the importance of a word in a text. A word is really important in a text if it occurs only in that text; it is not at all important if it occurs in every text in the corpus. In general, as the number of documents in which a word appears increases, the weight of the word in any single document decreases. Ignoring the number of occurrences of a word within a text, the ratio of the number of documents in which the word appears to the total number of documents in the corpus is a measure of the probability of the word. Now, how surprising is the presence of a word in a document? And what about its information content?
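Before answering, here is that probability computed over a toy corpus (the corpus and names below are made up for illustration):

    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "information is wealth",
    ]
    documents = [set(text.split()) for text in corpus]

    def doc_probability(word):
        # Fraction of documents containing the word at least once.
        containing = sum(1 for doc in documents if word in doc)
        return containing / len(documents)

    print(doc_probability("the"))     # 2/3 -- common, so unsurprising
    print(doc_probability("wealth"))  # 1/3 -- rarer, so more surprising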

The Inverse Document Frequency (IDF) of a word in a corpus is defined as the logarithm of the ratio between the total number of documents in the corpus and the number of documents containing the word. Ah, it is exactly the same as the information content contributed by the word to the document!
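Continuing the toy sketch above, IDF is just the information content of that document probability:

    def idf(word):
        # log of (total documents / documents containing the word),
        # i.e. the information content of seeing the word in a document.
        return math.log2(1.0 / doc_probability(word))

    print(idf("the"))     # log2(3/2), about 0.58 bits
    print(idf("wealth"))  # log2(3),   about 1.58 bits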

Now, what about the total information content of the document? Every term contributes information equal to its IDF. If there are 10 terms in the document, the total information content is the sum of all these contributions. What if some terms repeat? Just multiply the contribution by the number of occurrences. Well, it doesn't look elegant to say that repeating a word doubles its contribution, though that is how this theory works. In short, the contribution of a term to the information content of the document is the product of the term's count in the document and the IDF of the term.
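In the same sketch, the total information content of a document can be tallied like this:

    from collections import Counter

    def total_information(text):
        # Sum of count(term) * IDF(term) over the distinct terms.
        counts = Counter(text.split())
        return sum(count * idf(term) for term, count in counts.items())

    print(total_information("the cat sat on the mat"))  # about 6.51 bits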

Now, let's define information content in an alternative way. As before, every term contributes information equal to its IDF. But instead of summing all the term contributions, let's take the average: the information content of a document with 10 terms would be the average of the individual term contributions. We can easily verify that the contribution of a term to the document's information content is then the product of the term frequency, TF, and IDF, where term frequency is the ratio of the term's count to the document length. This quantity is what is referred to as the TF-IDF weight of the term in the document. It is a measure of the share of information the term provides to the document.
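Finishing the sketch, the TF-IDF weights fall out directly, and summing them recovers the average contribution per term:

    def tf_idf_weights(text):
        # TF-IDF per distinct term: (count / document length) * IDF.
        tokens = text.split()
        counts = Counter(tokens)
        return {term: (count / len(tokens)) * idf(term)
                for term, count in counts.items()}

    weights = tf_idf_weights("the cat sat on the mat")
    print(sum(weights.values()))  # equals total_information(...) / 6

So each weight really is the term's share of the document's information content.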

Friday, May 8, 2009

Not so elegant Google result

Google's search algorithm suite went totally wrong here. Not only did it fetch a completely irrelevant page, it ranked that page first. When I first issued the query, the page was ranked third or fourth. The second time around, the page moved up for some reason and became first!! Any explanation?