So Preview opens a file, which is apparently valid per Preview (Preview handles ...

dkonofalski · on Dec 16, 2020

If this is a bug with Preview, then that's really, really bad since Preview is a bedrock of macOS and has been for years.

However... it sounds like the issue is that FineReader is storing the OCR'd text in the metadata in a way that's not part of the official PDF spec. So, it sounds like Preview is able to open the file by ignoring that metadata and then, upon save, is storing the metadata back, as normal, which then corrupts the OCR data. This reminds me a lot of when people would store metadata like this in MP3 files to include things like album art and booklets. Normal MP3 players would ignore it as just metadata or bogus data, but opening it in an audio editor would do this same thing.

I'm not sure who the "blame" lies with in this case because ABBYY FineReader probably is writing this stuff in a non-supported way, but Preview really should just ignore it rather than trying to correct it. It's very likely that the OCR text, post-save, is actually bits from the document itself rather than from the metadata.

zepto · on Dec 16, 2020

“The idea that this behavior in a PDF reader can be excused because the software that generated the PDF was not approved for the operating system the viewer is running on per the vendor.”

Nobody is saying that. The suggestion is that the software that generates the PDF produces corrupt documents.

The fact that the vendor of that software doesn’t approve it for Big Sur suggests that they might be aware that there are problems.

jcrawfordor · on Dec 16, 2020

I worded it that way for two reasons, one of which is admittedly speculative:

1) It seems highly unlikely that ABBYY relies on some changed OS behavior in generating PDFs that leads to it producing PDFs that are malformed in such a way that is only revealed when they are rewritten by Preview. Behavior in Preview is by far a more likely cause of the problem. Generally the thing that changed is what broke...

2) To the user, this looks 100% like a problem in Preview no matter what's happening, and it's Apple's responsibility to not do this kind of thing to users. The PDF opens properly the first time, so de facto it is "valid" as determined by the product that later corrupts it. As I said, handling questionably valid PDFs is part and parcel of writing PDF software, and failure to handle a PDF that otherwise renders correctly looks like a bug on your part... especially when it otherwise renders correctly in your own software.

zepto · on Dec 16, 2020

I don’t really see why #1 is highly unlikely. PDF is very complex, and it’s easily possible that a generator could have a bug.

‘Generally the thing that changed is what broke’

This has never been true in software engineering. Changed code reveals bugs which need to be fixed elsewhere all the time.

2. Yes, to the end user it looks like a problem with Preview.

No, the fact that it opens at all doesn’t make it de facto “valid”.

Yes, handling questionable PDFs is part of writing PDF-handling software.

No, that doesn’t mean that all PDF handling software must or can feasibly handle any and all corruption.

The very fact that there are many kinds of questionably valid PDFs out there proves the point. Handling the intersection of all the invalid PDFs is impossible.

Rendering correctly has nothing to do with this. PDFs have many attributes which are not rendered.

It really is on the document creator to produce a valid document in the first place.

It’s certainly on ABBYY to have tested this months ago and either fixed it, or publicized it.

dkonofalski · on Dec 16, 2020

>The PDF opens properly the first time, so de facto it is "valid" as determined by the product that later corrupts it.

While I agree with your main point that, to the user, this looks like a problem with Preview, I think it's actually because Preview is doing something beneficial to open the file which is to ignore "bogus" data. Preview, from what I can tell, ignores additional data that it doesn't expect specifically to allow for opening PDF files where the document data is fine but the metadata is corrupted. The fact that it opens it means nothing since the file would not be able to be opened at all otherwise and Preview is actually doing the user a favor by salvaging it. Once Preview "fixes" it, though, it looks like the OCR pointer is still there but the metadata that contains the plain-text content is not so it's pointing to binary document data instead. Another option is that Preview has "fixed" a font and the mapping is no longer correct which, while I can't really see the text in the image, would be obvious as you'd see words that map to the same corruptions. In either case, Preview's behavior is "correct" and the fact that it was less strict before does not mean that it's now broken - the source PDF is still what was "broken".

interestica · on Dec 17, 2020

I think a big issue is that 'Preview' is used as just a pre-view. The metaphor is that you haven't actually opened/viewed the file yet 'for real'. Yet, Preview has morphed into a program that makes changes to files even without any kind of save dialogue.

dkonofalski · on Dec 17, 2020

Are you confusing Quick Look with Preview? Preview is not just for PDFs and allows you to view, annotate, and edit lots of document types...

interestica · on Dec 18, 2020

No, really. Preview. Yes, it can open everything (I don't know where I suggested that it is only for PDFs?)

But it doesn't act like other editors. There's no save dialogue when you, for instance, rotate a PDF. The term 'Preview' in other programs, like when using a scanner or other text editors, is a non-destructive type of viewer. Just a viewer. "Preview before you make changes."

But Preview on OS X changes that.

smarx007 · on Dec 16, 2020

This is ridiculous. When I read that FineReader is not supported on Big Sur, I understand it as I cannot run FineReader to _produce_ PDFs on Big Sur. But I expect Big Sur not to trash PDF files that were produced on Catalina, for example.

By the way, I was forced to purchase PDF Expert a few years ago because of all kinds of problems with Preview (blurry text, weird bugs with annotations, removed ability to print PDFs with annotations, and the list goes on).