Word Frequency

Jun 25

I think using the frequency of a word in a document as a measure of its importance is a weak idea. For example, if the most frequent word in one of my topics is "or", the chances of it also being frequent in the other topics are pretty high, because it is simply a very common word in the English language.
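To make the problem concrete, here is a minimal sketch of ranking words by raw count. The `topics` dictionary and the `top_word` helper are made up for illustration and are not my real data or code.

```python
from collections import Counter

# Hypothetical toy topics, not the real dataset.
topics = {
    "physics": "mass or energy or relativity or momentum",
    "chemistry": "acid or base or salt or bond",
}

def top_word(text):
    """Return the most frequent word in the text by raw count."""
    return Counter(text.lower().split()).most_common(1)[0][0]

# The same filler word ends up on top for every topic.
top_words = {name: top_word(text) for name, text in topics.items()}
print(top_words)
```

Both topics get "or" as their most important word, so raw frequency says nothing about what makes a topic distinct.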

So this particular word doesn't represent my topic at all, which makes it useless as a keyword. I need a way to take into account a word's occurrences across all the topics, not just one, when calculating its importance.

There's an algorithm called Term Frequency-Inverse Document Frequency (TF-IDF) which calculates a score for a word by taking into account how often it occurs in one topic and in how many of the other topics it also appears. This is exactly what I needed. Using this algorithm, a word such as "or" would get a low score, say 5, while a specific word such as "relativity" would get a high score, say 85.
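Here is a rough sketch of how such a score could be computed, assuming the common variant tf = count / total words in the topic and idf = log(number of topics / number of topics containing the word). The `topics` data and the `tf_idf` function are hypothetical, and the numbers come out on a different scale than the 5 and 85 above, since the scale depends on the variant and on any normalization applied afterwards.

```python
import math
from collections import Counter

# Hypothetical toy topics, not the real dataset.
topics = {
    "physics": "mass or energy or relativity or momentum",
    "chemistry": "acid or base or salt or bond",
    "biology": "cell or gene or protein or enzyme",
}

def tf_idf(topics):
    """Return {topic: {word: score}} using tf * log(N / df)."""
    tokenized = {name: text.lower().split() for name, text in topics.items()}
    n_topics = len(tokenized)

    # Document frequency: in how many topics does each word appear?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_topics / df[word])
            for word, count in counts.items()
        }
    return scores

scores = tf_idf(topics)
# "or" appears in every topic, so its idf is log(1) = 0.
print(scores["physics"]["or"], scores["physics"]["relativity"])
```

With this variant, a word that shows up in every topic scores exactly zero no matter how often it occurs, while a word unique to one topic keeps a positive score.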

Beautiful Algorithm! Will implement this as soon as possible.

Well, maybe it was too early to chase after accuracy. What I had made before worked pretty well, surprisingly well in fact, and I thought that was just the baseline accuracy for the model. Since then, every iteration of the model has gone down in accuracy, from 80% to 30%.

The system works on Computer Science and I bet it will work properly on Mathematics too. But I will do some more experimentation on Physics and Chemistry as they have more diverse and interrelated questions.
