Skewed Predictions
Here's what I want to do today: build a program that computes the word frequencies of every chapter and stores them in a NumPy array, so they can be loaded later for testing. I will then load all of them into a program, feed it a past paper question, and see how well it predicts the topic. Let's do this. I will come back later.
Update 1
Nope! Storing the dictionary as a NumPy array is a bad idea because it doesn't work: NumPy is meant to store arrays, not dictionaries. So the better idea is to store it as a JSON file, which should give a smaller file and can also be loaded from other languages.
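For reference, here is a minimal sketch of what the JSON approach could look like. The function names (`word_frequencies`, `save_frequencies`, `load_frequencies`) and the simple regex tokenizer are my own assumptions for illustration, not the actual program.

```python
import json
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word appears in a chapter's text."""
    # Assumption: a word is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(words))


def save_frequencies(chapters, path):
    """Write {chapter name: {word: count}} to a JSON file and return it."""
    table = {name: word_frequencies(text) for name, text in chapters.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(table, f)
    return table


def load_frequencies(path):
    """Read the frequency table back as plain nested dictionaries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Since the table is just nested dictionaries of strings and integers, it comes back from `json.load` in the same shape it was saved in.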
Update 2
I saw something weird while examining one of the questions. The question had a lot of references to electronics, but what it actually asked about was speed and acceleration. The program predicted that it came from the electronics chapter rather than the speed chapter. Interestingly, the score for speed was only a little lower than the score for electronics. I think what happened is that one of the words in the question had a crazy high frequency in the electronics chapter, which skewed the final result, even though more of the question's words were found in the speed chapter than in the electronics chapter. I will change the program so that it also takes into account how many of the question's words occur in each chapter.
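One way this change could look, as a rough sketch: rank chapters first by how many distinct question words they contain, and only use the summed frequency as a tie-breaker. The ranking rule, the function names, and the numbers in the example are assumptions made up for illustration; the real program may combine the two signals differently.

```python
def chapter_scores(question_words, table):
    """Return {chapter: (matched_word_count, frequency_sum)} for one question."""
    unique_words = set(question_words)
    scores = {}
    for chapter, freq in table.items():
        matched = [w for w in unique_words if w in freq]
        scores[chapter] = (len(matched), sum(freq[w] for w in matched))
    return scores


def predict_chapter(question_words, table):
    """Pick the chapter with the most matched words; frequency breaks ties."""
    scores = chapter_scores(question_words, table)
    # Tuples compare element by element, so the match count dominates.
    return max(scores, key=scores.get)


# Made-up frequencies: "circuit" alone outweighs all the speed words combined.
table = {
    "electronics": {"circuit": 60, "voltage": 30},
    "speed": {"speed": 20, "acceleration": 15, "car": 10, "time": 12},
}
question = ["a", "car", "in", "a", "circuit", "diagram",
            "speed", "acceleration", "time"]
prediction = predict_chapter(question, table)
```

With these numbers, a pure frequency sum gives electronics 60 against 57 for speed, but speed matches four question words to one for electronics, so `prediction` is `"speed"`.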
The system works. Well, not perfectly, but well enough for the first phase.
It also crashes a lot because of encoding issues with the PDF processor, which I do not know how to solve at the moment. Maybe I will find something in the future.