I recently did a project where I OCR'd a very rare book, one I could only find in the Library of Congress, so I could read it on my Kindle.
Tesseract was amazingly powerful and accurate, but it seemed to struggle if the page was warped or tilted even a little. I had to preprocess the images heavily to try to dewarp the natural spine curvature, and even then it could only get about 99% accuracy. That sounds like a lot, but consider a book where every 100th letter is wrong. I basically flagged the errors on my Kindle as I went along and manually corrected them later.
I guess the point of this comment is that, in my experience, Tesseract.js is probably going to need an accompanying PageDewarp.js for it to be of use scanning books. Not everyone has access to a right angle scanner or can slice the spine and get perfectly straight high-res scans.
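To put the 99% figure above in perspective, here is a rough back-of-the-envelope sketch. The book size (300 pages at about 1,800 characters per page) is a made-up assumption for illustration, not a measurement from my project, and `expected_errors` is just a hypothetical helper name.

```python
def expected_errors(total_chars: int, accuracy: float) -> int:
    """Expected number of wrong characters at a given per-character accuracy."""
    return round(total_chars * (1.0 - accuracy))


# Hypothetical book size, chosen only for illustration.
pages = 300
chars_per_page = 1800
total_chars = pages * chars_per_page

# Errors to fix across the whole book, and on a single page.
print(expected_errors(total_chars, 0.99))
print(expected_errors(chars_per_page, 0.99))
```

Under those assumptions that works out to a handful of corrections on every single page, which is why 99% still meant a lot of manual cleanup.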
That's very interesting given that Tesseract uses Leptonica. I'm not sure if they use it for dewarping but all of my little projects with Leptonica really worked well. Dewarping, binarizing, extracting individual elements etc.
Maybe I wasn't using Tesseract to its fullest potential, but I had a really hard time getting it to do accurate OCR on warped pages. Straight pages worked perfectly.
Also, the last time I checked, Tesseract liked stuff to be 200-300 dpi. (You don’t have to scan it at that resolution, but it helps if you scale it up.)
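For the scale-up step, a minimal sketch of the arithmetic, assuming you know (or can guess) the DPI the page was scanned at. `upscale_size` is a hypothetical helper, and the 300 dpi default is just the upper end of the range mentioned above. It only computes the new pixel dimensions; the actual resize would be done with whatever image library you already use.

```python
def upscale_size(width: int, height: int, source_dpi: float,
                 target_dpi: float = 300) -> tuple[int, int]:
    """Pixel dimensions needed to bring a scan up to target_dpi.

    Never scales down: a scan already at or above target_dpi is left as-is.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = max(1.0, target_dpi / source_dpi)
    return round(width * scale), round(height * scale)


# A 1240x1754 pixel page scanned at 150 dpi needs to be doubled in each dimension.
print(upscale_size(1240, 1754, source_dpi=150))
```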
Years ago, I dug up a couple of papers on the spine thing, but never got around to implementing it. I think you can estimate the curvature based on the shadow and dewarp.
I was just scanning some recipes to save typing, so it wasn’t really worth the effort.
I ended up using this guy's tool[1] for dewarping, which worked pretty well. The tool was more a prototype than anything, but it was enough to finish my project.