10 years ago I used Abbyy Finereader to add an OCR text layer to about 500,000 p...

pault · on Nov 9, 2017

I had to research the commercial OCR market recently for a client project. It's abysmal. Most of the solutions are horrendously overpriced, Windows-only enterprise packages, and the ones that I was able to try out bordered on unusable (bad results and terrible UI). Since OCR is basically "hello world" for tensorflow, I don't understand why these incumbents haven't been wiped off the map.

staticautomatic · on Nov 9, 2017

Lots of reasons they haven't been wiped off the map.

1. Full page OCR is trivial these days. Anyone who's just doing full-page OCR has no business using one of these commercial offerings. The commercial stuff is really for extraction of structured info from unstructured and semi-structured documents. ABBYY and Nuance are the only ones with products that can handle it. There are alternatives for simple capture tasks (like DocParser), but not for complex ones.

2. ABBYY and Nuance have a lot of IP in the space on lock.

3. The market for complex data extraction (at least the kind of stuff I do) may not actually be big enough for smaller players to bother pursuing (most new players are going after tasks of intermediate difficulty, like invoice and receipt capture).

4. There's still a need for on-prem style solutions for people doing huge volumes (hundreds of thousands of pages a year), whereas most of the new market entrants are cloud only.

5. I haven't used Nuance's stuff much but ABBYY's products are actually incredibly robust under the hood-- stuff you wouldn't want to build yourself unless you absolutely had to. FlexiCapture will do 95%+ accuracy out of the box (on a character by character basis) with no human verification.
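
For anyone wondering what "accuracy on a character by character basis" could mean in practice: one common way to measure it is edit distance against a ground-truth transcription. This is only a sketch under that assumption; the thread doesn't say how ABBYY measures the 95% figure, and the function names here are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """1 - (edit distance / ground-truth length), clamped to [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

With this definition, reading "Invoice 2017" as "lnvoice 2O17" is two character errors out of twelve, so roughly 83% accuracy.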

derefr · on Nov 9, 2017

> Anyone who's just doing full-page OCR has no business using one of these commercial offerings.

So what are you supposed to use for full-page OCR?

rb2018 · on Nov 10, 2017

As far as hosted solutions go, the best are Google Cloud Vision, Azure OCR and OCR.space.

You can compare all three here: https://ocr.space/compare-ocr-software

ocrcustomserver · on Nov 10, 2017

For an opensource solution that uses Tesseract, check out ocrmypdf: https://github.com/jbarlow83/OCRmyPDF
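
A rough sketch of driving it from Python, assuming the `ocrmypdf` command is installed and on PATH. The helper names are hypothetical, and `--language` and `--deskew` are options I believe the CLI accepts; check `ocrmypdf --help` for your version.

```python
import shutil
import subprocess
from typing import List


def build_ocrmypdf_command(src: str, dst: str, language: str = "eng",
                           deskew: bool = True) -> List[str]:
    """Assemble the argv for an ocrmypdf run (nothing is executed here)."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        cmd.append("--deskew")  # straighten slightly rotated scans first
    cmd += [src, dst]
    return cmd


def add_text_layer(src: str, dst: str) -> None:
    """Run ocrmypdf if it is installed; raise otherwise."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst), check=True)
```

Calling `add_text_layer("scan.pdf", "scan_ocr.pdf")` would then write a copy of the scan with a text layer added.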

ocrcustomserver · on Nov 10, 2017

I wouldn't say that full page OCR is trivial. Using an opensource solution (99% based on Tesseract) is going to get you ok-ish results if your input is relatively clean (no complex layout, scanned documents from a flatbed scanner, standard fonts) and you don't care about speed. If you care about recognition accuracy then Tesseract isn't going to cut it (at least not without some serious effort).

Replying to points 1 and 3: For smaller players and/or complex tasks you can always implement your own custom parser.

I'm doing work as a contractor in this space.

staticautomatic · on Nov 10, 2017

I agree with you that Tesseract isn't great out of the box, but if you aren't doing huge volumes, there are plenty of cloud options available.

Respectfully, I disagree about this being a parsing issue. The whole reason so-called "zonal ocr" exists is because of the challenges of reliably inferring the structure of a document at parsing time. Yes, there are some kinds of documents where parsing logic alone will suffice, but for more complex tasks you need what ABBYY and Nuance are selling.

ocrcustomserver · on Nov 12, 2017

Just to be sure that we're talking about the same thing, by "custom parser" I meant implementing your own barebones "zonal OCR" functionality with just the features that are needed for the specific problem.

I think it boils down to the needs of each individual application.

Some cases have a lot of templates and need the "automatic fuzzy matching" functionality and the extra bells and whistles.

But smaller players often deal with just a handful of relatively simple templates where FlexiCapture would just be overkill (not to mention a couple of other problems that I'm covering at the end of the post). This is of course not an easy task because you need someone who can design and implement an end-to-end system that possibly involves image processing and "zonal OCR", includes an OCR engine, and also performs reliable text extraction from images/PDFs (extracting text from PDFs is tricky). It's way easier for a non-developer to think about what rulesets/logic to apply without having to think about the image processing/OCR bits. I think that is one of the main selling points of FlexiCapture. It abstracts the OCR bits so that the system designer can think about the problem itself, design a spec and think about the logic. Do you need deskewing of documents? Click a button and you get deskewing.

Which brings me to the second point. The products sold by ABBYY/Nuance are meant to be used by integrators (no programming needed other than the occasional VB.net script), not image processing specialists/developers. In my (biased) opinion, it makes more sense for some businesses to go the custom route instead of investing in FlexiCapture.

There is also FlexiCapture Engine that is meant for developers. This has the same problems as the other offerings by ABBYY (I don't know about Nuance but I suspect it's the same):

  - expensive
  - vendor lock-in
  - ridiculous extra costs for things like "cloud/VM license", exporting to PDF, etc.
  - limits on how many pages you can process per year or in total (complex licensing schemes)
  - ABBYY really wants to sell you their own cluster/cloud management services which is all proprietary
  - limited flexibility in implementing distributed services, costs that add up fast, you have to be trained in their own stack

Can you provide an example where you think that a custom solution would not work? I'm curious.

staticautomatic · on Nov 12, 2017

First of all, why don't you shoot me an email at info@jurymatic.com and we can talk further. In a nutshell, it would have been way more expensive and difficult for us to roll our own than even the high cost of a FlexiCapture license. But here's a reasonably complete explanation of the build vs buy analysis we did.

1. FlexiCapture makes pre-processing incredibly painless and training-free. Beyond the usual binarization stuff, we extensively use the built in auto-rotation and cleanup (skew, noise, speckle, etc.).

2. Templating is really the big win for FlexiCapture. I have not seen anything else with a template GUI that comes close to being as usable, robust, or simple. That's really important to us because we build a LOT of templates. I have a really hard time imagining having to code them.

3. FlexiCapture's template engine is super strong for the kinds of documents we work with, which is mainly complex repeating groups with nested structures. It's also really good at handling both photos of documents (e.g. mobile) and scans. One thing it also offers that I haven't seen elsewhere in a turnkey product or existing platform is the ability to define zones in purely relative terms without absolute positions. I don't know about Nuance but I've not seen any other template GUI that will allow you to spec something like "look for either this word or a two-line string containing these words in the upper left quadrant of the document."
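
To make the "purely relative" idea concrete, here is a minimal sketch of what such a rule might look like if you had to code it by hand. It is not how FlexiCapture works internally; the `Word` type and both functions are hypothetical, it assumes the OCR engine gives you word boxes in pixels, and it only handles the single-word anchor case (not the two-line string).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge, pixels
    y: float  # top edge, pixels (origin at the top-left of the page)
    w: float
    h: float

    @property
    def cx(self) -> float:
        return self.x + self.w / 2

    @property
    def cy(self) -> float:
        return self.y + self.h / 2


def find_anchor(words: Iterable[Word], page_w: float, page_h: float,
                labels: Iterable[str]) -> Optional[Word]:
    """First word in the upper-left quadrant whose text matches a label."""
    wanted = {label.lower() for label in labels}
    for word in words:
        in_quadrant = word.cx < page_w / 2 and word.cy < page_h / 2
        if in_quadrant and word.text.lower().rstrip(":") in wanted:
            return word
    return None


def words_right_of(anchor: Word, words: Iterable[Word]) -> List[Word]:
    """Words on the anchor's line, to its right, ordered left to right."""
    same_line = [w for w in words
                 if w is not anchor
                 and abs(w.cy - anchor.cy) <= anchor.h / 2
                 and w.x >= anchor.x + anchor.w]
    return sorted(same_line, key=lambda w: w.x)
```

The field value is located relative to wherever the anchor happens to be, so nothing depends on absolute positions. Multiply that by nested repeating groups and many templates and it becomes clear why a GUI for this is attractive.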

4. There's a dearth of zonal OCR frameworks. Outside of ABBYY's and Nuance's SDKs, the only one I'm even aware of is OpenKM, and I don't write Java. The FlexiCapture Engine SDK is a terrible beast. The documentation is horrible, it's Windows only, and it's all COM objects.

computerex · on Nov 9, 2017

OCR is not "hello world" for tensorflow. The entire article is devoted to actually developing the OCR pipeline. They ported the model over to tensorflow as a secondary thing.

wwarner · on Nov 9, 2017

Character recognition might be "hello world" but full page OCR definitely is not. When I read the article and think about all the obstacles Dropbox overcame, it looks like a pretty huge achievement. I can tell you, it would take me a lot longer than 8 months to pull it off.

Drdrdrq · on Nov 9, 2017

I am certain they had a team on this, I just wonder what its size was. Quite an achievement!

sidlls · on Nov 9, 2017

OCR for really basic documents might be "hello world", but as usual real world practical applications have all sorts of interesting things that make this a harder problem than it seems.

amelius · on Nov 9, 2017

> Since OCR is basically "hello world" for tensorflow [...]

In that case ... it would be nice if Tensorflow had OCR as one of its examples/tutorials.

acdha · on Nov 9, 2017

> Since OCR is basically "hello world" for tensorflow, I don't understand why these incumbents haven't been wiped off the map.

Do you have an example of anyone working on that? I think those packages have a lot of inertia because you can buy a product with an API, documentation, etc. Something which says “Learn TensorFlow & enough AI to be dangerous” is going to be hard to get in the door at many large organizations so this might be an area where a solid open-source project could have a significant impact.

slap_shot · on Nov 9, 2017

The parent post is not implying that companies adopt Tensorflow themselves and solve this problem. The post is implying that a competitor come in to the market using the newer, better technology, and replace them with a better product (with an API, documentation, etc).

The issue here is that industries become stagnant when it appears difficult/costly to introduce a new product in that market. You usually need a significant technical and/or social change to occur before incumbents can be displaced. In this case, it was a technical change.

The other part of the issue is that investors want "0 to 1" ideas to invest in, and founders often chase those ideas. In reality, we need more "1.0 to 1.5" products, especially in crowded markets.

acdha · on Nov 9, 2017

I think your last point is correct: there's a huge gap between the kind of “hello world” demo the original poster was talking about and a product-level tool which has reasonable training data and performance across a non-trivial range of documents.

I think that getting there is a lot more work than OP anticipated.

pault · on Nov 10, 2017

OP here; yes, you've summed it up perfectly.

phren0logy · on Nov 9, 2017

I recently looked at a number of packages, and the best option in my opinion is ABBYY Finereader Pro for Mac. The Mac version can easily be automated with Hazel or Automator, which adds features that cost $400+ more on the PC. The results are worlds better than Acrobat.

philipkglass · on Nov 9, 2017

When I did my big project I was using the cheap standard edition of Finereader on (virtualized) Windows, and I built an automated pipeline for it with a combination of AutoHotkey, pdftk, Multivalent pdf tools, and Python.

I too tried the Acrobat OCR first and found it to be vastly inferior. One surprise was that (at the time at least) the OCR-by-Acrobat files were not only much less accurate in the text layer, but the file size also bloated up surprisingly.

ocrcustomserver · on Nov 10, 2017

Some thoughts (adding to staticautomatic's post):

1. There is no dataset/competition like ImageNet for OCR.

2. Most people/conferences/universities are going after natural images and "computer vision" problems. OCR is its own animal and while it shares some concepts with computer vision it's not the same thing.

3. A lot of IP, knowledge and talent locked up in a handful of very old companies doing this for a long time. ABBYY is for OCR what Google + Facebook are for deep learning (maybe more).

4. OCR is kind of a niche, a lot of knowledge is not available to many people outside of a few insiders (ABBYY/Nuance, universities, research labs, OCR conferences). I'm sure Google uses it a lot internally (e.g. Google Street View numbers etc.).

5. The incumbents don't just do OCR. They do preprocessing (computer vision/image processing) + OCR + NLP.

6. Hard to find data. ABBYY Finereader supports 190 languages. Collecting this data is no easy task.

I'm probably missing other reasons as well, but this is just off the top of my head.

That being said, I'm sure that there's going to be a lot of progress in the OCR + deep learning space soon.

rb2018 · on Nov 10, 2017

>Since OCR is basically "hello world" for tensorflow,

This is a strong statement. Do you have any links to code or documentation that implements high-quality OCR with tensorflow?

pault · on Nov 10, 2017

Not high quality. But since MNIST is typically used as a baseline reference, I was surprised that more advanced character recognition wasn't a staple in the field. From the tensorflow introduction[0]: "When one learns how to program, there's a tradition that the first thing you do is print 'Hello World.' Just like programming has Hello World, machine learning has MNIST."

I realize that this is a result of my lack of knowledge about the field, but the ML hype train makes it easy to overestimate the capabilities of deep learning as a layperson.

[0] https://www.tensorflow.org/get_started/mnist/beginners

bflesch · on Nov 9, 2017

Because, for example, ABBYY replaces the image with a PDF with invisible text overlays. I think it is a bit more complicated once you account for the formatting of documents.

milesokeefe · on Nov 9, 2017

Abbyy has had that ability for a longer period of time so it may be more accurate, but this is something tesseract supports[1]. In fact I'd say most OCR systems support it with varying degrees of accuracy.

[1]:https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i...

RandomBookmarks · on Nov 10, 2017

The ability to create searchable PDFs is very useful and convenient. But creating searchable PDFs does not require a deep understanding of the document format (like column detection etc). You just place the words at the right coordinates of their bounding boxes. You can test this for example here: https://ocr.space - select the option to create a searchable PDF. It works even for the most complex documents.
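
A minimal sketch of the "place the words at the right coordinates" step, assuming the OCR engine reports bounding boxes in image pixels with the origin at the top-left and that you know the scan DPI. The function name is hypothetical; the only real work is scaling to PDF points (72 per inch) and flipping the vertical axis, since PDF pages have their origin at the bottom-left.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # default PDF user-space unit


def pixel_box_to_pdf(box: Tuple[float, float, float, float],
                     image_height_px: float, dpi: float
                     ) -> Tuple[float, float, float, float]:
    """Map (left, top, right, bottom) in pixels to (x, y, width, height) in points.

    Image coordinates start at the top-left; PDF coordinates start at the
    bottom-left, so the vertical axis is flipped.
    """
    left, top, right, bottom = box
    scale = POINTS_PER_INCH / dpi
    x = left * scale
    y = (image_height_px - bottom) * scale  # bottom edge of the box, from page bottom
    return (x, y, (right - left) * scale, (bottom - top) * scale)
```

Each recognized word would then be drawn at `(x, y)`, scaled to fit the box, using the PDF invisible text rendering mode, with the original scan as the page image underneath.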

Now, creating a Word document from a scan is a different beast because it requires layout analysis. This is where Abbyy with its long experience still has a good lead.

pault · on Nov 9, 2017

My issue was that I needed one that exposes an API for doing OCR on pictures taken from mobile devices. It was really difficult to find non-desktop packages.

staticautomatic · on Nov 9, 2017

Out of curiosity, what did you need beyond something like OpenCV/Metal + Tesseract?

pault · on Nov 9, 2017

The client needed an OCR solution for supplier invoices with a variety of layouts and a combination of printed and hand-written characters, and didn't have the budget for a bespoke solution. To be fair, it's a very hard problem, I was just surprised that given all the much hyped recent advancements in deep learning for computer vision, most of the solutions in the market seem to be running on decades old technology.

ocrcustomserver · on Nov 10, 2017

Well it is more complex than it appears.

Extracting data from documents requires a solution which uses OCR but is a different product (e.g. ABBYY FlexiCapture).

This is most commonly referred to as zonal OCR and comes with the added functionality of handling multiple templates, defining zones/fields, specifying special rules for fields, verification process for manual inspection (e.g. triggered when the image receives a low confidence recognition score) etc. This is different and more complex than a product that does full page OCR (e.g. ABBYY Finereader).
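
Stripped down to the bare minimum, the zone/field plus verification idea could look something like the sketch below. Everything in it is an assumption for illustration: the template, the field names, the 0.80 threshold, and an OCR engine that returns word centers as page fractions along with a confidence score.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OcrWord:
    text: str
    x: float      # center, as a fraction of page width (0..1)
    y: float      # center, as a fraction of page height (0..1)
    conf: float   # engine confidence, 0..1


# Hypothetical template: field name -> (left, top, right, bottom) as page fractions.
TEMPLATE: Dict[str, Tuple[float, float, float, float]] = {
    "invoice_number": (0.60, 0.05, 0.95, 0.12),
    "total": (0.60, 0.85, 0.95, 0.92),
}


def extract_fields(words: List[OcrWord], template=TEMPLATE,
                   min_conf: float = 0.80) -> Dict[str, dict]:
    """Collect the words inside each zone and flag low-confidence fields."""
    result = {}
    for field, (left, top, right, bottom) in template.items():
        hits = [w for w in words
                if left <= w.x <= right and top <= w.y <= bottom]
        hits.sort(key=lambda w: w.x)  # reading order within a single-line zone
        worst = min((w.conf for w in hits), default=0.0)
        result[field] = {
            "value": " ".join(w.text for w in hits),
            "needs_review": worst < min_conf,  # empty zones are flagged too
        }
    return result
```

Fields with `needs_review` set would go to a human verification queue. The commercial products layer template matching, per-field rules and a verification UI on top of this.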

Handwritten OCR is a whole different story. The products that do zonal OCR will fail to recognize handwritten text, unless it's in boxes (PDF forms). I'm working on a prototype that can handle handwritten text outside of boxes too.

criddell · on Nov 9, 2017

Do you know what Evernote uses? Although I don't really use it anymore, when I did, I was astonished at how well it found and indexed text.

ocrcustomserver · on Nov 10, 2017

They use multiple OCR engines. Some developed in-house and some proprietary ones. A blog post mentions I.R.I.S. as one of the proprietary ones.

They don't offer OCR publicly, instead, they generate a list of possible candidates to match against search queries (fuzzy matching).

From one of the blog posts: "Employing multiple reco engines is important not only because they specialize on different types of text and languages, but also as this allows to implement ‘voting’ mechanism — analyzing word alternatives created by diverse engines for the same word allows for better suppression of false recognitions and giving more confidence to consensus variants."
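
A toy version of such a voting step, to show the idea. This is a guess at the general shape, not Evernote's implementation: the function name, the one-vote-per-engine rule and the `min_votes` threshold are all assumptions.

```python
from collections import Counter
from typing import List, Optional


def vote(alternatives: List[List[str]], min_votes: int = 2) -> Optional[str]:
    """Pick the reading most engines agree on, or None without a consensus.

    alternatives[i] holds the candidate readings engine i produced for the
    same word; each engine gets one vote per distinct candidate.
    """
    tally = Counter()
    for engine_candidates in alternatives:
        tally.update({c.lower() for c in engine_candidates})
    if not tally:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    word, votes = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
    return word if votes >= min_votes else None
```

For example, if three engines return `["modern", "modem"]`, `["modern"]` and `["rnodern", "modern"]`, the consensus is "modern" and the one-off misreadings are suppressed.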

More: http://blog.evernote.com/tech/2013/07/18/how-evernotes-image...

http://blog.evernote.com/tech/2011/11/01/even-grittier-detai...

http://blog.evernote.com/tech/2011/09/30/evernote-indexing-s...

jhayward · on Nov 9, 2017

If I recall, they don't do traditional OCR, rather they build a term probability database that lets them do matching on search terms. Both Evernote and ABBYY seem to have ties to a specific Russian research group.

romuloab42 · on Nov 9, 2017

I know you are talking about Desktop OCR, but I had a very good experience with Google Vision API. On my previous gig we were trying to automate receipt scanning, and it gave very good results with no previous image manipulation whatsoever (meaning, no special camera alignment, lighting condition, rotation, skewing, etc).

It was not perfect, but I was very impressed with the quality, speed, and price.

ocrcustomserver · on Nov 10, 2017

ABBYY has dominated the field for many years (decades really) and still outperforms every solution out there. OmniPage by Nuance is probably the second best.

Preprocessing the images (OCR pipeline) is very important for OCR. For generic scanned PDF documents Finereader does a pretty good job.

There is a lot of stuff going on in an OCR engine. Layout analysis, dewarping, binarization, deskewing, despeckling (and others) and then there's the OCR itself. With Tesseract you have to do a lot of things yourself; you have to provide it with a clean image. The commercial packages do that for you automatically. ABBYY and other solutions also use NLP to augment/check the OCR results from a semantic analysis perspective.
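
To give a feel for just one of those preprocessing steps, here is a small pure-Python sketch of binarization using Otsu's method, a classic global thresholding technique. I'm not claiming this is what ABBYY or Tesseract use; real pipelines often need adaptive (local) thresholding for unevenly lit photos. It assumes 8-bit grayscale pixels given as a flat sequence.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map pixels above the Otsu threshold to 1 (paper) and the rest to 0 (ink)."""
    t = otsu_threshold(pixels)
    return [1 if p > t else 0 for p in pixels]
```

Deskewing, dewarping and despeckling are each their own problem on top of this, which is the point: the engine is only one stage of the pipeline.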

Also, there is no "one size fits all" OCR. It is highly specific to the nature of the application. Consider the following use cases:

  - scanned PDF document
  - scanned document with a non-standard font (e.g. Fraktur script in a historic book)
  - photo of scanned document acquired with a mobile phone's camera
  - passport OCR (MRZ)
  - credit card OCR
  - text appearing in natural image (e.g. store sign)

These are all "OCR projects" but they require very different approaches. You cannot just throw any input image at an OCR engine and expect it to work. It often requires a mix of computer vision/image processing, machine learning and OCR engine.

There is a growing number of papers using deep learning that get submitted to ICDAR (the premier OCR conference) and the other OCR conferences. One of the problems is the lack of a universal dataset/competition like ImageNet. The SmartDoc competition (documents captured from smartphones) was cancelled this year due to an insufficient number of participants.

If anyone is doing work with OCR + deep learning, I'd love to discuss!

tensor · on Nov 9, 2017

The unreleased v4 version of tesseract has a new engine based on state of the art deep learning techniques. In some initial poking at it, it looks like it might blow the existing commercial solutions away in terms of quality.

tobltobs · on Nov 9, 2017

In my humble experience it is much better than v3 but still not as good as ABBYY.

tensor · on Nov 9, 2017

Can you elaborate on this a bit? It still has some edge case quality issues, but I haven't seen anything where ABBYY does better so far. That said, the default app is missing a bunch of preprocessing that you have to add (e.g. page deskew, flipping, etc).

ktta · on Nov 10, 2017

Do you have an idea of when it is going to be released? Or this is only going to be a Google internal thing?

wwarner · on Nov 10, 2017

it's in the master branch, it just hasn't been tagged for release yet.

ktta · on Nov 10, 2017

Oh, thanks. I had taken a look at the number of commits, which didn't show many after the last release, so I thought it hadn't been merged yet.

https://github.com/tesseract-ocr/tesseract/graphs/commit-act...

ocrcustomserver · on Nov 10, 2017

It uses LSTM for the line recognizer.

cptskippy · on Nov 9, 2017

> I played around with the open source Tesseract OCR, but it was way behind commercial desktop packages in accuracy/usability. It lagged especially in dealing with different document layouts.

How long ago did you use it? We use it to OCR photos of paper checks taken via smartphone and pull contact information out. It works pretty well.

philipkglass · on Nov 9, 2017

It was 4+ years ago. I don't remember exactly when, but the project was still hosted on Google Code at the time.

I was trying to OCR scanned scientific publications with multi-column layout and figures. I didn't expect anything to get all the captions right, or handle specialized notation, but it was important to recognize all the ordinary English words on the page and to have the text flow in the right order.

gebruikersnaam · on Nov 9, 2017

The alpha version (4.0) is significantly better than the stable version (3.05).

redindian · on Nov 10, 2017

I have a similar need. Are there any OSS projects that do this?

Snap a picture of a fixed layout form from smartphone & extract TYPED data

ocrcustomserver · on Nov 10, 2017

If you're looking for full page OCR, check out ocrmypdf (uses Tesseract).

If you want to extract data out of documents/forms then you need to develop your own solution (I'm doing work in this area) or use expensive packages like ABBYY FlexiCapture.

Images taken with a smartphone (compared to scanned documents) are going to make the problem harder.

frik · on Nov 10, 2017

CuneiForm was a close competitor to Abbyy that got open sourced; both were Russian companies. CuneiForm OCR is certainly better than Tesseract OCR. The open source Windows version is feature complete and very similar to Abbyy, though it compiles only with VS 6. Some converted it to a Linux application, but most advanced features never got ported, and the great native UI isn't available either, unfortunately. It would be great if a community would re-activate this open source OCR gem, though some Russian-speaking devs are needed to translate the comments. (Tesseract OCR is based on a 1995 HP OCR application and is further behind than CuneiForm OCR.) https://en.m.wikipedia.org/wiki/CuneiForm_(software)