Encoding Issues

Since the beginning of this project, I have had trouble with the text encoding of PDF files, which turns some words into gibberish. I think I've finally found a fix: a replacement library, pdfminer, which does a better job than PyPDF2. Although PyPDF2 is popular in the community, it is riddled with errors for which the creator has no answers. So the best path, for now, is to use pdfminer.

PyPDF2 does not work.

Not only does it load PDF files with the wrong encoding, which can't be fixed and so produces gibberish, but its output format is also inconsistent from one PDF to another.

My system's main job is to classify questions into their topical categories, for which I need a PDF processing library, but PyPDF2 has caused me all sorts of trouble. I've finally found its successor (for now): pdfminer. Although this library is not very easy to use, has plenty of bugs, and isn't well documented, it can at least load PDF files with accurate characters, which is the main goal here. It is very accurate but slower than PyPDF2. To be honest, speed isn't a priority here, so I am okay with that.
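For reference, here is a minimal sketch of what loading a PDF with pdfminer can look like. It assumes the pdfminer.six fork and its high-level `extract_pages` API, which may not be the exact version or setup used in this project. The names `iter_page_text` and `printable_ratio` are made up for illustration, and the gibberish check is only a rough heuristic, not a real encoding test.

```python
from typing import Iterator, Tuple


def iter_page_text(pdf_path: str) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page index, extracted text) for each page of a PDF.

    Assumes pdfminer.six is installed (pip install pdfminer.six).
    """
    # Imported lazily so printable_ratio() below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for index, layout in enumerate(extract_pages(pdf_path), start=1):
        # Keep only layout elements that carry text (boxes and lines).
        chunks = [el.get_text() for el in layout if isinstance(el, LTTextContainer)]
        yield index, "".join(chunks)


def printable_ratio(text: str) -> float:
    """Hypothetical helper: share of characters that are printable or whitespace.

    A low value hints that the extracted text may be garbled.
    """
    if not text:
        return 1.0
    ok = sum(1 for ch in text if ch.isprintable() or ch.isspace())
    return ok / len(text)
```

Calling `iter_page_text("some_file.pdf")` (the path is a placeholder) gives one text string per page. Note that the index is the page's position in the file, not the page number printed on the page, so it does not solve the page-number problem by itself.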

The main task for me right now is to match the contents to their respective page numbers. While pulling out chapter names is the easier part, extracting page numbers is a whole new problem. There is no direct way to get them. I do have an idea for approaching this problem, which I will try tomorrow. I will explain it later, because I am sleep deprived right now: I have been working on a school assignment since the morning and I am tired as hell. My eyes are half closed and my vision is half blurred while typing this, so I am going to bed. Good night!
