Failing signs of OCR
Parser is a library I have built for splitting A-level question papers and reattaching portions of a paper according to their question number. It works as expected on the multiple-choice question paper that I initially used to build the whole thing, but it fails on other question papers, which is weird since every question paper uses the same format created by Cambridge International Examination.
I initially thought the whole problem was with the detection of question numbers, but it looks more like a major issue with the reattachment module, which is pretty hacked together in my opinion. It's not really designed to scale, but I didn't expect it to be this bad.
I am considering rewriting the entire reattachment module and testing it on a variety of question papers. Let's see how it goes.
Okay, so I was wrong: the problem is in fact with the detection module (which detects the question numbers on a page), not with the reattachment module. I am still looking into why the program is overlapping multiple questions. Turns out the hacked-together thing actually works better than the third-party library itself.
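To make the "overlapping" symptom concrete, here is a minimal sketch of how reattachment by question number can be pictured. It is not Parser's actual code: `group_by_question` and the `(detected_number, content)` segment format are assumptions made up for illustration. It shows why a single missed or misread question number is enough to merge two questions, even when the grouping logic itself is fine.

```python
from collections import defaultdict


def group_by_question(segments):
    """Group page segments under the most recently detected question number.

    `segments` is a list of (detected_number, content) pairs in reading order.
    detected_number is None when no question number was read on that segment.
    (Hypothetical format, not the one Parser really uses.)
    """
    groups = defaultdict(list)
    current = None
    for detected, content in segments:
        if detected is not None:
            current = detected  # a new question starts here
        if current is None:
            continue  # material before the first question, e.g. a cover page
        groups[current].append(content)
    return dict(groups)


# Every question number is read correctly.
clean = [
    (None, "cover page"),
    (1, "Q1 stem"),
    (None, "Q1 diagram"),
    (2, "Q2 stem"),
    (None, "Q2 part (b)"),
]
grouped_clean = group_by_question(clean)

# The number "2" is missed, so question 2 is glued onto question 1.
missed = [(1, "Q1 stem"), (None, "Q2 stem"), (None, "Q2 part (b)")]
grouped_missed = group_by_question(missed)
```

In the second run everything ends up under question 1, which looks like a reattachment bug from the outside even though the number reading is at fault.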
Update 1
Maybe throwing in the Keras MNIST AI to classify question numbers might do the trick here; it would probably do better than the OCR library I am using right now. But first, I need to confirm that the main problem really is with the OCR and nothing else. Otherwise, I will be wasting time on an AI that produces exactly the same output as the current setup.
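One way to check this is to take the crops the detection module already produces, label a handful by hand, and run only the OCR step on them. The sketch below assumes that setup. `evaluate_ocr`, `fake_ocr`, and the crop file names are all hypothetical; `fake_ocr` stands in for whatever call the real OCR library exposes.

```python
def evaluate_ocr(recognize, labelled_crops):
    """Return (accuracy, failures) for an OCR callable on hand-labelled crops.

    `recognize` takes a crop and returns the text it read.
    `labelled_crops` is a list of (crop, expected_text) pairs.
    """
    failures = []
    for crop, expected in labelled_crops:
        got = recognize(crop)
        if got != expected:
            failures.append((crop, expected, got))
    total = len(labelled_crops)
    accuracy = (total - len(failures)) / total if total else 0.0
    return accuracy, failures


def fake_ocr(crop):
    """Stand-in for the real OCR call; returns canned, partly wrong output."""
    canned = {"crop_01.png": "1", "crop_02.png": "2", "crop_17.png": "1?"}
    return canned.get(crop, "")


labelled = [
    ("crop_01.png", "1"),
    ("crop_02.png", "2"),
    ("crop_17.png", "17"),
    ("crop_30.png", "30"),
]
accuracy, failures = evaluate_ocr(fake_ocr, labelled)
```

If the crops themselves look right but the failure list is long, the OCR step is the culprit. If the crops are already wrong, the problem sits earlier in the pipeline and a new classifier would not help.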
Yep, I am pretty sure the problem is with the OCR. The detection module is working well, but the OCR library is messing up. So I am thinking of building an AI that does it for me, or perhaps digging out my old code and hacking together a simple one.
Update 2
This problem is pretty simple. The Keras documentation already has a tutorial for building a simple AI that classifies digits, which I can use to check whether my detected image has a number in it or not. That way, I don't have to use OCR. It might slow down the program, but in the long run this is much more reliable than the OCR library I am using right now.
There is a problem: we cannot use the MNIST dataset, which would have made our job a whole lot easier, because its digits only range from 0 to 9 and we need the upper bound to be at least 100. I guess we can create a dataset for ourselves, but that is going to take a lot of time.
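One possible shortcut for building such a dataset, sketched here under loud assumptions, is to paste single-digit images side by side so that every number from 1 to 100 gets at least one sample. This is not something the post has tried. `compose_number_image`, `build_dataset`, and the `digit_glyphs` mapping (digit to a list of equally tall 2D pixel grids) are hypothetical names. Tiny 2x2 grids stand in for real digit crops.

```python
import random


def compose_number_image(number, digit_glyphs, rng=random):
    """Build one image of `number` by placing digit glyphs side by side.

    `digit_glyphs` maps each digit 0-9 to a list of 2D pixel grids
    (lists of rows) that all share the same height.
    """
    rows = None
    for ch in str(number):
        glyph = rng.choice(digit_glyphs[int(ch)])
        if rows is None:
            rows = [list(row) for row in glyph]  # copy the first glyph
        else:
            for row, glyph_row in zip(rows, glyph):
                row.extend(glyph_row)  # append the next digit to the right
    return rows


def build_dataset(digit_glyphs, upper=100, samples_per_number=1, seed=0):
    """Return a list of (image, label) pairs for labels 1..upper."""
    rng = random.Random(seed)
    dataset = []
    for number in range(1, upper + 1):
        for _ in range(samples_per_number):
            image = compose_number_image(number, digit_glyphs, rng)
            dataset.append((image, number))
    return dataset


# Toy 2x2 glyphs filled with the digit's own value, one glyph per digit.
toy_glyphs = {d: [[[d, d], [d, d]]] for d in range(10)}
dataset = build_dataset(toy_glyphs)
```

The composed images get wider with each extra digit, so they would still need padding or resizing to a fixed shape before training. MNIST digits are also handwritten, so glyphs cropped from real question papers would probably match the printed numbers better.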
Well, I do not see any easier solution than this. I might have to spend another day building the AI. I could not find any resource on the internet that could lead me to the dataset I want. So I am going to embark upon building my own, again.