In case it's not clear, Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005. [1]
As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).
This (Tesseract.js) is a WASM port of the project by a separate group of people.
I investigated using this port a couple of years ago, but as you can see from the demo, it's fairly slow to initialize and run, so I never found a practical use for client-side OCR over server-side, but I still think it's tremendously cool.
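For anyone curious, using it is only a few lines. A minimal sketch with the v5-style API and a hypothetical image URL (older versions needed explicit loadLanguage/initialize calls):

    import { createWorker } from 'tesseract.js';

    // Spinning up the worker is the slow part: it fetches and compiles
    // the WASM binary and downloads the language data.
    const worker = await createWorker('eng');
    // 'https://example.com/scan.png' is a placeholder image URL.
    const { data } = await worker.recognize('https://example.com/scan.png');
    console.log(data.text);
    await worker.terminate();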
In case anyone's interested (shameless plug), because I do a lot of academic research that involves tons of copying from webpages, PDFs and screenshots and pasting into notes documents, I created a tool at https://pastemagic.com that helps selectively remove rich-text formatting, remove line breaks, and run OCR on screenshots and camera photos. Setting up Tesseract on my server and creating a simple HTTP endpoint for it took less than an hour, and for free I had OCR as powerful as Google's. Pretty cool, I thought.
Disclaimer: I was the original author of Tesseract.js, though all the hard work nowadays is done by Jerome Wu. If you're interested in supporting the project, consider backing the OpenCollective (https://opencollective.com/tesseractjs).
All of that is to say: I've learned a decent amount of fun OCR trivia over the past few years.
Firstly, the engine that powers Google Cloud Vision is almost certainly an entirely independent code base from Tesseract, built on neural networks. In fact, the most recent major version of Tesseract (4.0) rewrote the core of the engine around bidirectional LSTMs, bringing it closer to the modern OCR pipelines that systems like GCV use.
The original Tesseract algorithm dates back to a previous AI spring: the 80s, when neural networks were cool (before they were uncool, and then subsequently cool again). The core of the original algorithm involved fitting polygons to character shapes in order to generate features that could be matched by a kind of rudimentary neural network.
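Both engines still ship in 4.x, incidentally, so you can compare them yourself via the --oem flag (0 = legacy, 1 = LSTM). A quick sketch, assuming the tesseract CLI is on PATH and your traineddata bundle still includes the legacy models:

    import { execFile } from 'node:child_process';

    // Run the same image through the legacy and the LSTM engine.
    for (const oem of ['0', '1']) {
      execFile('tesseract', ['sample.png', 'stdout', '--oem', oem],
        (err, stdout) => {
          if (err) throw err;
          console.log(`--oem ${oem}:\n${stdout}`);
        });
    }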
One of the primary authors of Tesseract is Ray Smith (at Google), who gave a presentation a few years ago about the history of OCR, though I can't quite find a link to it at the moment.
OCR actually predates electronic computers. In 1929, someone invented a machine that would take a piece of paper, shine a bright light on a single letter, and pass the image of the letter through a carousel of letter masks so that it hit an (effectively single-pixel) photosensor. When the carousel's mask was in alignment with the printed letter, the drop in brightness registered that that particular letter had been seen!
OCR was used by the US Postal Service for sorting mail as early as 1965, but it wasn't until 1976 that any system could reasonably handle more than a handful of hard-coded fonts (fun fact: this omni-font OCR was invented by Ray Kurzweil, the "Singularity is Near" guy).
There are open-source versions of everything done within a GCP API call, but it takes multiple machines and lots of training data to build a model as fast and accurate as GCP's, and cloud computing is relatively new compared to OCR.
There are? Can you give a list of pointers or what to look for?
I was looking for an OCR that can do license plates while the car is moving, for a hobby project. The image quality is less than perfect, the lighting is never very good, and as the camera is mounted on my side window, all plates have a perspective transformation applied (e.g., the topline and baseline are essentially never parallel).
Tesseract fails miserably. Trying to help it along, I have not found a good open-source project that will consistently binarize color pictures to black-and-white; sometimes there's a shadow on the plates that foils all simple attempts.
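(For reference, the most common of those simple attempts is adaptive thresholding, which copes with uneven lighting better than a global threshold. A rough sketch with OpenCV.js; the canvas IDs are placeholders and the block size/offset are guesses you'd tune per camera.)

    declare const cv: any; // global exposed by the opencv.js script

    // Grayscale, then threshold each pixel against its local neighborhood
    // so a shadow across the plate doesn't wipe out half the characters.
    const src = cv.imread('plateCanvas');
    const gray = new cv.Mat();
    cv.cvtColor(src, gray, cv.COLOR_RGBA2GRAY);
    const bw = new cv.Mat();
    cv.adaptiveThreshold(gray, bw, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C,
                         cv.THRESH_BINARY, 31, 15);
    cv.imshow('outCanvas', bw);
    src.delete(); gray.delete(); bw.delete();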
And yet, GCV needs no parameters and seems to do this perfectly on the images I've tried.
So, assuming I'm willing to put in the time: how do I build my own GCV, even if it's just for the hobby use case of reading license plates (and, as the next stage, reading house numbers, which GCV does reasonably well even though it's a much, much harder problem)?
Training the model would be computationally intensive, but deploying it with TensorFlow.js and predicting on a single data point in the browser shouldn't be, right?
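Roughly, the in-browser side is just this (a sketch; the model URL and input size are hypothetical):

    import * as tf from '@tensorflow/tfjs';

    // Load a model previously converted with the tfjs converter.
    const model = await tf.loadLayersModel('https://example.com/model/model.json');

    // One frame from a canvas: resized, normalized, batched.
    const canvas = document.querySelector('canvas')!;
    const input = tf.tidy(() =>
      tf.image.resizeBilinear(tf.browser.fromPixels(canvas), [320, 320])
        .toFloat()
        .div(255)
        .expandDims(0));

    const prediction = model.predict(input) as tf.Tensor;
    prediction.print();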
There are ML models that are so computationally intensive that they can't reasonably run on the edge. AI accelerator chips obviously help move the line, but AI accelerators benefit the cloud, too. Furthermore, models can be tens to hundreds of megabytes in size. Okay for the cloud, not okay for wasm running in the browser.
> As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).
Tesseract is acceptable only if the text is neatly laid out, in more-or-less straight parallel lines, or at the very least a consistent orientation that's close enough to straight horizontal lines.
Google Cloud Vision, however, can read any orientation, any font, through perspective distortion, and does not need the different text blobs in the image to be consistent in any way. It is superior in every way to plain Tesseract (and if it is Tesseract after preprocessing, the magic is in that preprocessing more than it is in Tesseract).
I would actually be very surprised to hear GCV uses Tesseract; and if they don't, why would they use something inferior for other products?
Oh very interesting. I'd verified the output was identical a couple of years ago, and that Keep and Docs in production were using the 4.0 beta release at the time. But if Cloud OCR is better, makes sense they would have switched since then.
Tesseract 4.0 has a brand-new neural engine that totally supersedes the earlier engine, however -- I wonder if there's any relation between that and Cloud OCR?
It's probably a detection neural net (such as Faster R-CNN) for putting bounding boxes around words, complicated by the fact that it can predict polygons in any orientation, followed by an LSTM-CRF layer for text transcription. It's a good generalist OCR but often has sub-par results for specific types of input; it tends to miss single letters surrounded by whitespace, for example.
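In outline, that kind of two-stage pipeline looks like this (a sketch; Detector, Recognizer, and cropAndDeskew are hypothetical stand-ins for real models and image ops):

    // Stage 1 proposes (possibly rotated) word boxes; stage 2 transcribes
    // each cropped, deskewed word image.
    interface Box { x: number; y: number; w: number; h: number; angle: number }
    interface Detector { detect(image: ImageData): Promise<Box[]> }
    interface Recognizer { transcribe(crop: ImageData): Promise<string> }
    declare function cropAndDeskew(image: ImageData, box: Box): ImageData;

    async function ocr(image: ImageData, det: Detector, rec: Recognizer) {
      const boxes = await det.detect(image);
      const words: { box: Box; text: string }[] = [];
      for (const box of boxes) {
        words.push({ box, text: await rec.transcribe(cropAndDeskew(image, box)) });
      }
      return words;
    }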
I've been working on an editor on top of ProseMirror to support saving web content in the form of rich text and predefined schemas. Given that you have academic research in this area of web OCR, what's the current literature or tools on saving web content using both html and visual cues from the rendered html? For example, both <figure><img><figcaption> and <div><img><p> visually look like captioned images, but are represented differently in html. Is there a way to parse that into a simple [figure, [img], [figcaption]] schema?
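In other words, I'm after the kind of heuristic normalization sketched below (a rough illustration only; the selectors and the caption-length cutoff are guesses):

    // Treat any <figure> or <div> whose direct children include an <img>
    // and a short text block as a captioned image, whatever the markup.
    type Figure = { type: 'figure'; src: string; caption: string };

    function extractFigures(root: ParentNode): Figure[] {
      const out: Figure[] = [];
      for (const el of root.querySelectorAll('figure, div')) {
        const img = el.querySelector<HTMLImageElement>(':scope > img');
        const cap = el.querySelector(':scope > figcaption, :scope > p');
        const text = cap?.textContent?.trim();
        if (img && text && text.length < 300) {
          out.push({ type: 'figure', src: img.src, caption: text });
        }
      }
      return out;
    }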
Oh sure, on my webserver I just wrote the POSTed image to a temp file, called the command-line utility itself from within my code, and captured the stdout to send back. The command-line utility initializes very quickly, so the performance was fine.
The only semi-tricky bits were parsing stderr if anything went wrong (to distinguish warnings from actual errors), and the fact that Tesseract doesn't respect the JPEG orientation bit (a big problem with iPhone camera images), so I check that and manually rotate the JPEG first if necessary (you get gibberish otherwise).
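A minimal version of the whole handler, assuming Express and the stock tesseract CLI on PATH (for the orientation problem, one approach is to run the image through ImageMagick's -auto-orient first; omitted here):

    import { execFile } from 'node:child_process';
    import { writeFile } from 'node:fs/promises';
    import express from 'express';

    const app = express();
    app.use(express.raw({ type: 'image/*', limit: '20mb' }));

    app.post('/ocr', async (req, res) => {
      const tmp = `/tmp/ocr-${Date.now()}.jpg`;
      await writeFile(tmp, req.body);
      // "stdout" as the output base makes tesseract print to stdout.
      execFile('tesseract', [tmp, 'stdout'], (err, stdout, stderr) => {
        if (err) return res.status(500).send(stderr); // real errors, not warnings
        res.type('text/plain').send(stdout);
      });
    });

    app.listen(3000);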
pastemagic is actually a really cool way to read Wikipedia articles. The link density of the usual Wikipedia article is very high and I find it very distracting when some significant portion of the text is in a different color. Combined with the nice typesetting in the output, it makes for a very pleasant reading experience.
The tesseract CLI (and so, I'm sure, the library as well) will give you hOCR output, which is an HTML format that gives you the text with bounding boxes around paragraphs, lines, and individual words.
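For example, "tesseract page.png out hocr" writes out.hocr, and each word span carries its box in a title attribute ("bbox x0 y0 x1 y1 ..."), which is easy enough to pull out (a sketch; a real parser should use an HTML parser rather than a regex):

    import { readFileSync } from 'node:fs';

    // Extract word-level bounding boxes from a Tesseract hOCR file.
    const hocr = readFileSync('out.hocr', 'utf8');
    const word = /<span class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)</g;
    for (const m of hocr.matchAll(word)) {
      const [, x0, y0, x1, y1, text] = m;
      console.log(text.trim(), { x0: +x0, y0: +y0, x1: +x1, y1: +y1 });
    }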
It is open source and runs on Java. You can also extract the areas of interest in the PDF and run it via the command line [1]. You can get more details if required on my blog [2].
I think the Project Naptha extension by the folks that wrote this library will do that, no?
https://projectnaptha.com/
Not sure if it only reads at those coordinates vs. OCRing the whole thing (for example if you were legally prohibited from OCRing content outside a certain coordinate space), but it is selectable.
You possibly have one installed. Mine comes with my desktop (Xfce), and gives me a GUI and a CLI to take screenshots of the full desktop, any window, or a particular area defined by crosshairs.
There's a very popular and minimalist CLI called scrot that I think would be ideal... well, scratch that: I did a search and our question has already been asked and answered:
If I remember correctly, I did it with the ImageMagick "import" command. I found I had to add a wide white border, as Tesseract got confused near the edges of the image (this was over 10 years ago, though).
I'd prefer something I can install locally (it doesn't need to be open source). I'm trying to extract text from a PDF at a certain position; the PDF is indeed text, not an image, so OCR isn't strictly needed.
The goal is to draw a box using a GUI, then use those coordinates to extract text from several homogeneous pages.
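For the extraction step itself, poppler's pdftotext can already crop to a rectangle per page (-x/-y/-W/-H), so the GUI only needs to produce the numbers. A sketch, with example coordinates:

    import { execFile } from 'node:child_process';

    // Pull text from the same box on pages 1-20; the coordinates come
    // from whatever box the GUI reported.
    const box = { x: 72, y: 540, w: 300, h: 60 };
    execFile('pdftotext',
      ['-f', '1', '-l', '20',
       '-x', String(box.x), '-y', String(box.y),
       '-W', String(box.w), '-H', String(box.h),
       'input.pdf', '-'],   // "-" writes the text to stdout
      (err, stdout) => {
        if (err) throw err;
        console.log(stdout);
      });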
I also have a different goal of trying to interpret structure of a PDF that has visual structure (headers, sections and subsections all numbered). But that seems to lend itself to some sort of text parsing.
> Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005.
Fun fact: it actually started as a US national defence initiative in the last big AI hype bubble in the 1980s.
While it isn't in the wiki, probably for intellectual property reasons, I'm almost positive Cuneiform originated as the Soviet version of the same thing (I have evidence in notebooks somewhere): OCR was something needed for "AI" of the 1980s in the Soviet system as well.
Had Cuneiform been further developed by Microsoft ... or Apple (in contrast to Tesseract at Google), it would have continued to be the official OCR of an opposing world-historical system.
[1] https://github.com/tesseract-ocr/tesseract