okular and ocr: the mystery

linnit

Fri Apr 02, 2010 6:17 am
I just downloaded a bunch of free public domain books from archive.org. They're available as PDF files that contain scanned images of the original books. I seem to remember trying to select, copy and paste sections of text from those images a while back using Okular, and I was only presented with image-related options on the context menu:

Image
Copy to clipboard
Copy to file

After trying again today to select and copy sections of text from those PDFs using Okular, I am now presented with text options on the context menu. These are still scanned images, so the only way I could be copying text from these PDFs is if Okular is performing OCR on the selected area. So what exactly is happening? Does Okular use an external OCR application to process selected areas?

You can test this with the following PDFs. For example, try selecting some text on page 50 in both files and then copying it.

http://www.archive.org/download/occultj ... 00lowe.pdf

http://www.archive.org/download/etidorh ... 00lloy.pdf
plcl

Re: okular and ocr: the mystery

Fri Apr 02, 2010 8:55 am
Even more magical: search for some word, like "Tokyo", in the book "Occult Japan".

The explanation is that the PDF document contains both the OCR'd text and the scanned image in two layers. This format allows indexing the text contents of a facsimile.

https://www.luratech.com/products/docum ... essor.html
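To make the two-layer idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `SAMPLE_PAGE` is a hand-written, already decompressed page content stream (not taken from either archive.org file), and `describe_layers` is a hypothetical helper name. In PDFs of this kind the OCR text is commonly placed with text rendering mode 3 (`3 Tr`), which makes it invisible, so the reader sees the scan while selection and search work on the hidden text. Real files usually compress their streams and use more varied operators, so this would not work on them unchanged.

```python
import re

# Hypothetical, hand-written page content stream, already decompressed.
# First line paints the scanned page image; second line places OCR text
# with text rendering mode 3 ("3 Tr"), which draws nothing visible.
SAMPLE_PAGE = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (Occult Japan) Tj ET\n"
)


def describe_layers(content: bytes) -> dict:
    """Report which layers a decoded content stream appears to contain."""
    # "/Name Do" paints an external object; on scanned pages that is
    # usually the page image (it can also be a form object).
    has_image = re.search(rb"/\w+\s+Do\b", content) is not None
    # "(string) Tj" shows a text string; only this simple form is handled.
    runs = re.findall(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj\b", content)
    # "3 Tr" selects the invisible text rendering mode.
    invisible = re.search(rb"\b3\s+Tr\b", content) is not None
    return {
        "image": has_image,
        "text": [run.decode("latin-1") for run in runs],
        "invisible_text": invisible,
    }


print(describe_layers(SAMPLE_PAGE))
```

A page that holds only the scan, with no `BT ... ET` text block, would come back with an empty `text` list, which matches the case where the context menu offers only image options.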
john_hudson

Re: okular and ocr: the mystery

Mon Apr 05, 2010 10:25 am
It all depends on the original scanner software. UNESCO has scanned some of its early documents using the Acrobat Capture 3 plugin, which performs OCR at the same time and passes the text to the PDF in the same way as OpenOffice or pdfTeX do, creating PDFs from which the text can be extracted. Google Books, by contrast, simply scans the page image, so the PDF contains a series of images rather than any text.

NB: UNESCO warns that the OCR is not infallible and that there are occasional errors in the PDFs, but not enough to cause problems.

What Okular can do depends entirely on how the PDF was originally created, whether as a series of images or as text. If you look in Properties it will normally tell you.


John Hudson, proud to be a member of KDE forums since 2008-Oct.

