Registered Member
|
Hi all,
I'm trying to convert a text-based PDF (one with embedded text) into an editable form. I've read online that KWord used to be able to do this, but from what I've read and seen (I just installed KWord 2.1.2 on Ubuntu 10.04), this functionality was dropped in KOffice 2. Can someone confirm this? Does anyone know whether this function is being brought back or worked on? KWord appears to be the only program that could do this, so removing support seems like a crazy idea from a USP point of view! Thanks in advance! |
Moderator
|
Yes, it was removed from KWord. But there is now a PDF importer for Karbon.
|
Manager
|
There's also an extension for OpenOffice: http://extensions.services.openoffice.o ... /pdfimport
Of course, it may or may not work well. |
Registered Member
|
Wow, thanks for the fast response! Unfortunately, the new importer isn't much use to me. I have just tried it and tried exporting to every supported file format, but they are all drawing formats. The most promising is the OpenDocument Drawing format, which is essentially XML. But each character (yes, character!) is a separate drawing object. So I would need to write a script to group adjacent characters into paragraphs, and then convert the object type to a document text box or something similar. Does anyone know a good way of doing this? |
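The grouping step described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it assumes each character has already been pulled out of the drawing file as a hypothetical `(x, y, char)` tuple (parsing the .odg XML itself is not shown), and the tolerance values `y_tol` and `gap` are made-up defaults that would need tuning for a real document.

```python
from typing import List, Tuple

# Hypothetical representation of one character object: (x, y, character).
Glyph = Tuple[float, float, str]


def group_glyphs_into_lines(
    glyphs: List[Glyph], y_tol: float = 2.0, gap: float = 4.0
) -> List[str]:
    """Group single-character objects into text lines.

    Glyphs whose y position is within y_tol of the first glyph of the
    current line are treated as the same line. Within a line, a space is
    inserted wherever the x distance between neighbours exceeds gap.
    """
    rows = []  # each entry: [y of first glyph in the row, list of glyphs]
    for glyph in sorted(glyphs, key=lambda g: (g[1], g[0])):
        if rows and abs(glyph[1] - rows[-1][0]) <= y_tol:
            rows[-1][1].append(glyph)
        else:
            rows.append([glyph[1], [glyph]])

    lines = []
    for _, row in rows:
        row.sort(key=lambda g: g[0])  # left to right
        text = row[0][2]
        for prev, cur in zip(row, row[1:]):
            if cur[0] - prev[0] > gap:
                text += " "  # wide gap: assume a word boundary
            text += cur[2]
        lines.append(text)
    return lines


if __name__ == "__main__":
    sample = [(0, 0, "H"), (1, 0, "i"), (10, 0, "y"), (11, 0, "o"),
              (0, 12, "o"), (1, 12, "k")]
    print(group_glyphs_into_lines(sample))
```

Merging the resulting lines into paragraphs (for example, by looking at the vertical distance between consecutive lines) would be a further step on top of this.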
Registered Member
|
Wow again, thanks for the further info! FYI, I just tried the OO import extension, and it worked much better than the Karbon one in that it produced single-line sentences instead of individual characters. The only promising option OO offers is to save as XHTML. This actually writes a text-based HTML file which is editable! The XHTML renders great in IE but not in Firefox. I suspect I can just copy and paste it into a document to edit. This is all a bit of a pain, though. Can't someone just write an XSLT script to convert odg to odt? Or, even better, just add a converter to OO? Thanks all! |
Manager
|
There are other possibilities out there:
http://en.wikipedia.org/wiki/Pdftotext
http://pdfedit.petricek.net/en/index.html
And some are web-based (some require an email address):
http://www.convertpdftoword.net/ (I tried this one; it worked for me)
http://www.freepdfconvert.com/
http://www.zamzar.com/ |
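For the pdftotext route, a small wrapper might look like the sketch below. It assumes the `pdftotext` command-line tool is installed and on the PATH; the function names and the file names in the usage note are hypothetical and only there for illustration. The `-layout` option asks pdftotext to try to preserve the original physical layout.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str,
                      keep_layout: bool = True) -> List[str]:
    """Build the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path: str, txt_path: str,
                 keep_layout: bool = True) -> str:
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext fails, and
    FileNotFoundError if the tool is not installed.
    """
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout),
                   check=True)
    # Assumes UTF-8 output; undecodable bytes are replaced, not fatal.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Usage would be something like `extract_text("input.pdf", "output.txt")`. This only works for PDFs that actually contain embedded text, not scanned images.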
Moderator
|
Okular can also export PDF to text (File → Export As...).
|
Registered Member
|
Assuming the text is embedded in the PDF as text, Okular works fine for extracting it. You can also extract images by selecting them and saving them. The only problem you may encounter is that high-quality text may contain ligatures. As they are legitimate Unicode characters, they will be saved with the file, but they can cause problems with spell-checkers and with some programs which don't recognise ligatures as valid characters.
You can extract text which is embedded in a graphic by running it through OCR software.
John Hudson, proud to be a member of KDE forums since 2008-Oct.
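If the ligatures mentioned above do trip up a spell-checker, one possible workaround is to expand them after extraction. The sketch below is an assumption-laden example rather than anything Okular does itself: it uses Python's standard `unicodedata` module and only touches the Latin ligature range U+FB00 to U+FB06 (ff, fi, fl, ffi, ffl and the two st forms), leaving all other characters alone. The function name is hypothetical.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace Latin ligature characters with their plain letters.

    Only code points U+FB00..U+FB06 are expanded, using their Unicode
    compatibility decomposition; everything else is left unchanged.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


if __name__ == "__main__":
    # "\ufb01" is the fi ligature, "\ufb03" is the ffi ligature.
    print(expand_ligatures("\ufb01nal o\ufb03ce"))  # prints "final office"
```

Normalising the whole text with NFKC would also work, but it changes other compatibility characters too, which may not be wanted.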
|