In case it's not clear, Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005. [1]
As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).
This (Tesseract.js) is a WASM port of the project by a separate group of people.
I investigated using this port a couple of years ago, but as you can see from the demo, it's fairly slow to initialize and run, so I never found a practical use for client-side OCR over server-side, but I still think it's tremendously cool.
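For anyone curious, using it is only a few lines. A minimal sketch with the v5-style API and a hypothetical image URL (older versions needed explicit loadLanguage/initialize calls):

    import { createWorker } from 'tesseract.js';

    // Spinning up the worker is the slow part: it fetches and compiles
    // the WASM binary and downloads the language data.
    const worker = await createWorker('eng');
    // 'https://example.com/scan.png' is a placeholder image URL.
    const { data } = await worker.recognize('https://example.com/scan.png');
    console.log(data.text);
    await worker.terminate();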
In case anyone's interested (shameless plug), because I do a lot of academic research that involves tons of copying from webpages, PDFs and screenshots and pasting into notes documents, I created a tool at https://pastemagic.com that helps selectively remove rich-text formatting, remove line breaks, and run OCR on screenshots and camera photos. Setting up Tesseract on my server and creating a simple HTTP endpoint for it took less than an hour, and for free I had OCR as powerful as Google's. Pretty cool, I thought.
Disclaimer: I was the original author of Tesseract.js, though all the hard work nowadays is done by Jerome Wu. If you're interested in supporting the project, consider backing the OpenCollective (https://opencollective.com/tesseractjs).
All of that is to say: I've learned a decent amount of fun OCR trivia over the past few years.
Firstly, the engine that powers Google Cloud Vision is almost certainly an entirely independent code base from Tesseract, built on neural networks. In fact, the most recent major version of Tesseract (4.0) rewrote the core of the engine around bidirectional LSTMs, bringing it closer to the modern OCR pipelines that systems like GCV use.
The original Tesseract algorithm dates back to a previous AI spring: the 80s, when neural networks were cool (before they were uncool, and then subsequently cool again). The core of the original algorithm involved fitting polygons to character shapes in order to generate features that could be matched by a kind of rudimentary neural network.
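Both engines still ship in 4.x, incidentally, so you can compare them yourself via the --oem flag (0 = legacy, 1 = LSTM). A quick sketch, assuming the tesseract CLI is on PATH and your traineddata bundle still includes the legacy models:

    import { execFile } from 'node:child_process';

    // Run the same image through the legacy and the LSTM engine.
    for (const oem of ['0', '1']) {
      execFile('tesseract', ['sample.png', 'stdout', '--oem', oem],
        (err, stdout) => {
          if (err) throw err;
          console.log(`--oem ${oem}:\n${stdout}`);
        });
    }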
One of the primary authors of Tesseract is Ray Smith (at Google), who gave a presentation a few years ago about the history of OCR, though I can't quite find a link to it at the moment.
OCR actually predates electronic computers. In 1929, someone invented a machine that would take a piece of paper, shine a bright light on a single letter, and pass the image of the letter through a carousel of letter masks so that it hit an (effectively single-pixel) photosensor. When the carousel's mask was in alignment with the printed letter, the drop in brightness registered that that particular letter had been seen!
OCR was used by the US Postal Service for sorting mail as early as 1965, but it wasn't until 1976 that any system could reasonably handle more than a handful of hard-coded fonts (fun fact: this omni-font OCR was invented by Ray Kurzweil, the "Singularity is Near" guy).
There are open-source versions of everything done within a GCP API call, but it takes multiple machines and lots of training data to build a model as fast and accurate as GCP's, and cloud computing is relatively new compared to OCR.
There are? Can you give a list of pointers or what to look for?
I was looking for an OCR that can do license plates while the car is moving, for a hobby project. The image quality is less than perfect, the lighting is never very good, and as the camera is mounted on my side window, all plates have a perspective transformation applied (e.g., the topline and baseline are essentially never parallel).
Tesseract fails miserably. Trying to help it along, I have not found a good open-source project that will consistently binarize color pictures to black-and-white; sometimes there's a shadow on the plates that foils all simple attempts.
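(For reference, the most common of those simple attempts is adaptive thresholding, which copes with uneven lighting better than a global threshold. A rough sketch with OpenCV.js; the canvas IDs are placeholders and the block size/offset are guesses you'd tune per camera.)

    declare const cv: any; // global exposed by the opencv.js script

    // Grayscale, then threshold each pixel against its local neighborhood
    // so a shadow across the plate doesn't wipe out half the characters.
    const src = cv.imread('plateCanvas');
    const gray = new cv.Mat();
    cv.cvtColor(src, gray, cv.COLOR_RGBA2GRAY);
    const bw = new cv.Mat();
    cv.adaptiveThreshold(gray, bw, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C,
                         cv.THRESH_BINARY, 31, 15);
    cv.imshow('outCanvas', bw);
    src.delete(); gray.delete(); bw.delete();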
And yet, GCV needs no parameters and seems to do this perfectly on the images I've tried.
So, assuming I'm willing to put in the time: how do I build my own GCV, even if it's just for the hobby use case of reading license plates (and, as the next stage, reading house numbers, which GCV does reasonably well even though it's a much, much harder problem)?
Training the model would be computationally intensive, but deploying it with TensorFlow.js and predicting on a single data point in the browser shouldn't be, right?
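Roughly, the in-browser side is just this (a sketch; the model URL and input size are hypothetical):

    import * as tf from '@tensorflow/tfjs';

    // Load a model previously converted with the tfjs converter.
    const model = await tf.loadLayersModel('https://example.com/model/model.json');

    // One frame from a canvas: resized, normalized, batched.
    const canvas = document.querySelector('canvas')!;
    const input = tf.tidy(() =>
      tf.image.resizeBilinear(tf.browser.fromPixels(canvas), [320, 320])
        .toFloat()
        .div(255)
        .expandDims(0));

    const prediction = model.predict(input) as tf.Tensor;
    prediction.print();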
There are ML models that are so computationally intensive that they can't reasonably run on the edge. AI accelerator chips obviously help move the line, but AI accelerators benefit the cloud, too. Furthermore, models can be tens to hundreds of megabytes in size. Okay for the cloud, not okay for wasm running in the browser.
> As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).
Tesseract is acceptable only if the text is neatly laid out, in more-or-less straight parallel lines, or at the very least a consistent orientation that's close enough to straight horizontal lines.
Google Cloud Vision, however, can read any orientation, any font, through perspective distortion, and does not need the different text blobs in the image to be consistent in any way. It is superior in every way to plain Tesseract (and if it is Tesseract after preprocessing, the magic is in that preprocessing more than it is in Tesseract).
I would actually be very surprised to hear GCV uses Tesseract; and if they don't, why would they use something inferior for other products?
Oh very interesting. I'd verified the output was identical a couple of years ago, and that Keep and Docs in production were using the 4.0 beta release at the time. But if Cloud OCR is better, makes sense they would have switched since then.
Tesseract 4.0 has a brand-new neural engine that totally supersedes the earlier engine, however -- I wonder if there's any relation between that and Cloud OCR?
It's probably a detection neural net (such as Faster R-CNN) for putting bounding boxes around words, complicated by the fact that it can predict polygons in any orientation, followed by an LSTM-CRF layer for text transcription. It's a good generalist OCR but often has sub-par results for specific types of input; it tends to miss single letters surrounded by whitespace, for example.
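In outline, that kind of two-stage pipeline looks like this (a sketch; Detector, Recognizer, and cropAndDeskew are hypothetical stand-ins for real models and image ops):

    // Stage 1 proposes (possibly rotated) word boxes; stage 2 transcribes
    // each cropped, deskewed word image.
    interface Box { x: number; y: number; w: number; h: number; angle: number }
    interface Detector { detect(image: ImageData): Promise<Box[]> }
    interface Recognizer { transcribe(crop: ImageData): Promise<string> }
    declare function cropAndDeskew(image: ImageData, box: Box): ImageData;

    async function ocr(image: ImageData, det: Detector, rec: Recognizer) {
      const boxes = await det.detect(image);
      const words: { box: Box; text: string }[] = [];
      for (const box of boxes) {
        words.push({ box, text: await rec.transcribe(cropAndDeskew(image, box)) });
      }
      return words;
    }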
I've been working on an editor on top of ProseMirror to support saving web content in the form of rich text and predefined schemas. Given that you have academic research in this area of web OCR, what's the current literature or tools on saving web content using both html and visual cues from the rendered html? For example, both <figure><img><figcaption> and <div><img><p> visually look like captioned images, but are represented differently in html. Is there a way to parse that into a simple [figure, [img], [figcaption]] schema?
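In other words, I'm after the kind of heuristic normalization sketched below (a rough illustration only; the selectors and the caption-length cutoff are guesses):

    // Treat any <figure> or <div> whose direct children include an <img>
    // and a short text block as a captioned image, whatever the markup.
    type Figure = { type: 'figure'; src: string; caption: string };

    function extractFigures(root: ParentNode): Figure[] {
      const out: Figure[] = [];
      for (const el of root.querySelectorAll('figure, div')) {
        const img = el.querySelector<HTMLImageElement>(':scope > img');
        const cap = el.querySelector(':scope > figcaption, :scope > p');
        const text = cap?.textContent?.trim();
        if (img && text && text.length < 300) {
          out.push({ type: 'figure', src: img.src, caption: text });
        }
      }
      return out;
    }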
Oh sure, on my webserver I just wrote the POSTed image to a temp file, called the command-line utility itself from within my code, and captured the stdout to send back. The command-line utility initializes very quickly, so the performance was fine.
The only semi-tricky bits were parsing stderr if anything went wrong (to distinguish warnings from actual errors), and the fact that Tesseract doesn't respect the JPEG orientation bit (a big problem with iPhone camera images), so I check that and manually rotate the JPEG first if necessary (you get gibberish otherwise).
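A minimal version of the whole handler, assuming Express and the stock tesseract CLI on PATH (for the orientation problem, one approach is to run the image through ImageMagick's -auto-orient first; omitted here):

    import { execFile } from 'node:child_process';
    import { writeFile } from 'node:fs/promises';
    import express from 'express';

    const app = express();
    app.use(express.raw({ type: 'image/*', limit: '20mb' }));

    app.post('/ocr', async (req, res) => {
      const tmp = `/tmp/ocr-${Date.now()}.jpg`;
      await writeFile(tmp, req.body);
      // "stdout" as the output base makes tesseract print to stdout.
      execFile('tesseract', [tmp, 'stdout'], (err, stdout, stderr) => {
        if (err) return res.status(500).send(stderr); // real errors, not warnings
        res.type('text/plain').send(stdout);
      });
    });

    app.listen(3000);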
pastemagic is actually a really cool way to read Wikipedia articles. The link density of the usual Wikipedia article is very high and I find it very distracting when some significant portion of the text is in a different color. Combined with the nice typesetting in the output, it makes for a very pleasant reading experience.
The tesseract CLI (and so, I'm sure, the library as well) will give you hOCR output, which is an HTML format that gives you the text with bounding boxes around paragraphs, lines, and individual words.
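For example, "tesseract page.png out hocr" writes out.hocr, and each word span carries its box in a title attribute ("bbox x0 y0 x1 y1 ..."), which is easy enough to pull out (a sketch; a real parser should use an HTML parser rather than a regex):

    import { readFileSync } from 'node:fs';

    // Extract word-level bounding boxes from a Tesseract hOCR file.
    const hocr = readFileSync('out.hocr', 'utf8');
    const word = /<span class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)</g;
    for (const m of hocr.matchAll(word)) {
      const [, x0, y0, x1, y1, text] = m;
      console.log(text.trim(), { x0: +x0, y0: +y0, x1: +x1, y1: +y1 });
    }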
It is open source and runs on Java. You can also extract the areas of interest in the PDF and run it via the command line [1]. You can get more details if required on my blog [2].
I think the Project Naptha extension by the folks that wrote this library will do that, no?
https://projectnaptha.com/
Not sure if it only reads at those coordinates vs. OCRing the whole thing (for example if you were legally prohibited from OCRing content outside a certain coordinate space), but it is selectable.
You possibly have one installed. Mine comes with my desktop (Xfce), and gives me a GUI and a CLI to take screenshots of the full desktop, any window, or a particular area defined by crosshairs.
There's a very popular and minimalist CLI called scrot that I think would be ideal... well, scratch that: I did a search and our question has already been asked and answered:
If I remember correctly, I did it with the ImageMagick "import" command. I found I had to add a wide white border, as Tesseract got confused near the edges of the image (this was over 10 years ago, though).
I'd prefer something I can install locally (it doesn't need to be open source). I'm trying to extract text from a PDF at a certain position; the PDF is indeed text, not an image, so OCR isn't strictly needed.
The goal is to draw a box using a GUI, then use those coordinates to extract text from several homogeneous pages.
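For the extraction step itself, poppler's pdftotext can already crop to a rectangle per page (-x/-y/-W/-H), so the GUI only needs to produce the numbers. A sketch, with example coordinates:

    import { execFile } from 'node:child_process';

    // Pull text from the same box on pages 1-20; the coordinates come
    // from whatever box the GUI reported.
    const box = { x: 72, y: 540, w: 300, h: 60 };
    execFile('pdftotext',
      ['-f', '1', '-l', '20',
       '-x', String(box.x), '-y', String(box.y),
       '-W', String(box.w), '-H', String(box.h),
       'input.pdf', '-'],   // "-" writes the text to stdout
      (err, stdout) => {
        if (err) throw err;
        console.log(stdout);
      });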
I also have a different goal of trying to interpret structure of a PDF that has visual structure (headers, sections and subsections all numbered). But that seems to lend itself to some sort of text parsing.
> Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005.
Fun fact: it actually started as a US national defence initiative in the last big AI hype bubble in the 1980s.
While it isn't in the wiki, probably for intellectual property reasons, I'm almost positive Cuneiform originated as the Soviet version of the same thing (I have evidence in notebooks somewhere): OCR was something needed for "AI" of the 1980s in the Soviet system as well.
Had Cuneiform been further developed by Microsoft ... or Apple (in contrast to Tesseract at Google), it would have continued to be the official OCR of an opposing world-historical system.
[1] https://github.com/tesseract-ocr/tesseract