
make review/annotation data indexable by strigi/nepomuk

mutlu
Registered Member
Edit: I changed the title to reflect the status of this wish. In a discussion with Okular's maintainer it turned out that ideas 1 and 2 are already implemented. Moreover, the wish for a KDE backup solution has its own thread in this forum. What it comes down to is thus mainly points 5 and 6: the requests to integrate Okular's annotation data with strigi/nepomuk so that searches return the document on which the comments have been made.

I have a huge library of journal articles and books in PDF format and I use Okular to annotate them. (For those who don't know this feature: Okular -> Tools -> Review (or simply hit F6).)

The annotations are stored in XML files in $KDEHOME/share/apps/okular/docdata, one XML file for each document. This leads to a number of problems with rather simple solutions:

1.) If I move a PDF document, the data is not associated with it any more. Thus, the document url recorded in the XML file should automatically be updated to the new location.

2.) If I want to move a document to a different device, I lose the annotation data. For this purpose, I propose that it should be possible to generate archives (e.g. TAR files) from within Okular which can be imported by Okular on a different machine. During the import, the document url would of course be changed accordingly.

3.) Okular should offer an easy means to backup all XML files in $KDEHOME/share/apps/okular/docdata. I currently do this with a little bash script, but desktop environment users should not have to do this.

4.) The $KDEHOME/share/apps/okular/docdata folder can grow quite large if you work a lot with PDFs. There should be a means of recognizing whether (a) an associated document still exists and (b) whether any metadata is associated (apart from the mere recording of which page of the document was open the last time it was used). Based on these criteria, it should be able to clean up the folder.
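
A minimal sketch of the kind of backup script mentioned in point 3, written in Python rather than bash. It only illustrates the idea and is not part of Okular: the function names, the `~/.kde` fallback used when `$KDEHOME` is unset, and the archive file name are all assumptions made for this example.

```python
import os
import tarfile
import time
from pathlib import Path


def docdata_dir(kdehome=None):
    """Return the Okular docdata folder below $KDEHOME.

    Assumption: if $KDEHOME is not set, ~/.kde is used as a fallback.
    """
    base = Path(kdehome or os.environ.get("KDEHOME", Path.home() / ".kde"))
    return base / "share" / "apps" / "okular" / "docdata"


def backup_docdata(source, dest_dir):
    """Pack every *.xml file in `source` into a timestamped .tar.gz in `dest_dir`."""
    source, dest_dir = Path(source), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical archive name; any naming scheme would do.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"okular-docdata-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for xml_file in sorted(source.glob("*.xml")):
            # Store files flat, without the full $KDEHOME path.
            tar.add(xml_file, arcname=xml_file.name)
    return archive
```

Calling `backup_docdata(docdata_dir(), Path.home() / "okular-backups")` would write one archive per run. The files are copied as they are, so nothing here depends on their internal layout.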

Moreover, I think there is a huge potential to integrate two of the pillars of KDE with Okular:

5.) Strigi/nepomuk should index the XML files and recognize that this is metadata that belongs to the PDF files linked in them. If I search for a comment I made using Okular, I would like the linked PDF to show up.

6.) The metadata could be stored in Akonadi. This would bring Nepomuk indexing and backup possibilities to Okular metadata.


P.S. I want to thank the Okular developers. This is an amazing piece of software and I am highly indebted to you. It makes it possible for me to do my work using Linux F/OSS and not 'rely' on proprietary software. You rock!

Last edited by mutlu on Wed Apr 01, 2009 3:40 pm, edited 1 time in total.
pinotree
KDE Developer
mutlu wrote:1.) If I move a PDF document, the data is not associated with it any more. Thus, the document url recorded in the XML file should automatically be updated to the new location.

Not true.
Currently, as long as it has the same name and size, annotations are preserved, so they are lost only if the file is renamed or actually changes, but not on plain move.

2.) If I want to move a document to a different device, I lose the annotation data. For this purpose, I propose that it should be possible to generate archives (e.g. TAR files) from within Okular which can be imported by Okular on a different machine. During the import, the document url would of course be changed accordingly.

File -> Export as -> Document archive. The result can be opened "transparently" by Okular again.

3.) Okular should offer an easy means to backup all XML files in $KDEHOME/share/apps/okular/docdata. I currently do this with a little bash script, but desktop environment users should not have to do this.

What is there is purely *internal*, so you should not care about the specific format of what is stored there.
If you want to back up your KDE data, just back up $KDEHOME/share/apps as a whole; for single documents, see my reply to point 2), or simply back up that folder as a whole.

4.) The $KDEHOME/share/apps/okular/docdata folder can grow quite large if you work a lot with PDFs. There should be a means of recognizing whether (a) an associated document still exists and (b) whether any metadata is associated (apart from the mere recording of which page of the document was open the last time it was used). Based on these criteria, it should be able to clean up the folder.

There's a wish reported for the former "privacy" kcontrol module to do that; in KDE 4 that would be the Sweeper application. (Sorry, bugzilla does not load here right now, so I cannot check for sure.)

5.) Strigi/nepomuk should index the XML files and recognize that this is metadata that belongs to the PDF files linked in them. If I search for a comment I made using Okular, I would like the linked PDF to show up.

6.) The metadata could be stored in Akonadi. This would bring Nepomuk indexing and backup possibilities to Okular metadata.

Given that the format used internally is well... private, we will not expose that to indexers of any kind.


Pino Toscano
mutlu
Thanks for the insanely quick reply!

pinotree wrote:
mutlu wrote:1.) If I move a PDF document, the data is not associated with it any more. Thus, the document url recorded in the XML file should automatically be updated to the new location.

Not true.
Currently, as long as it has the same name and size, annotations are preserved, so they are lost only if the file is renamed or actually changes, but not on plain move.

2.) If I want to move a document to a different device, I lose the annotation data. For this purpose, I propose that it should be possible to generate archives (e.g. TAR files) from within Okular which can be imported by Okular on a different machine. During the import, the document url would of course be changed accordingly.

File -> Export as -> Document archive. The result can be opened "transparently" by Okular again.

You are right, when I moved files, I had also OCR'ed and thus changed them. I also wasn't aware of the export function. This is pure bliss!

pinotree wrote:
mutlu wrote:3.) Okular should offer an easy means to backup all XML files in $KDEHOME/share/apps/okular/docdata. I currently do this with a little bash script, but desktop environment users should not have to do this.

What is there is purely *internal*, so you should not care about the specific format of what is stored there.
If you want to back up your KDE data, just back up $KDEHOME/share/apps as a whole; for single documents, see my reply to point 2), or simply back up that folder as a whole.

Yes, I have scripted something like this. I guess this would be moot anyway given that there is a popular wish for a general KDE backup system.

pinotree wrote:
mutlu wrote:4.) The $KDEHOME/share/apps/okular/docdata folder can grow quite large if you work a lot with PDFs. There should be a means of recognizing whether (a) an associated document still exists and (b) whether any metadata is associated (apart from the mere recording of which page of the document was open the last time it was used). Based on these criteria, it should be able to clean up the folder.

There's a wish reported for the former "privacy" kcontrol module to do that; in KDE 4 that would be the Sweeper application. (Sorry, bugzilla does not load here right now, so I cannot check for sure.)

You probably mean https://bugs.kde.org/show_bug.cgi?id=130496 This doesn't really do what I imagine, though, since it would simply delete all data. Given the minuscule size of these files, it isn't really of high priority, though. :)

pinotree wrote:
mutlu wrote:5.) Strigi/nepomuk should index the XML files and recognize that this is metadata that belongs to the PDF files linked in them. If I search for a comment I made using Okular, I would like the linked PDF to show up.

6.) The metadata could be stored in Akonadi. This would bring Nepomuk indexing and backup possibilities to Okular metadata.

Given that the format used internally is well... private, we will not expose that to indexers of any kind.

I don't know if I understood you correctly. It seems to me that you are saying that, given that the XML files are located in $KDEHOME, they should not be exposed to crawlers. Many search clients already do this, however. Beagle, for example, indexes Kmail's directory. Strigi will do so from 4.3 onward, too. That it doesn't is considered a bug by Sebastian Trüg. See http://www.mail-archive.com/nepomuk-kde ... 00257.html

To make indexing the Okular data meaningful, however, a (not yet existing) strigi analyzer would have to relate the XML data to the linked url.

Thanks for your response.
pinotree
mutlu wrote:
pinotree wrote:
mutlu wrote:4.) The $KDEHOME/share/apps/okular/docdata folder can grow quite large if you work a lot with PDFs. There should be a means of recognizing whether (a) an associated document still exists and (b) whether any metadata is associated (apart from the mere recording of which page of the document was open the last time it was used). Based on these criteria, it should be able to clean up the folder.

There's a wish reported for the former "privacy" kcontrol module to do that; in KDE 4 that would be the Sweeper application. (Sorry, bugzilla does not load here right now, so I cannot check for sure.)

You probably mean https://bugs.kde.org/show_bug.cgi?id=130496 This doesn't really do what I imagine, though, since it would simply delete all data. Given the minuscule size of these files, it isn't really of high priority, though. :)

The wish is not just "remove blindly", but it could mean the cleaning function added to Sweeper would be configurable based on parameters like:
- whether the document still exists (as pointed to by the URL in the XML)
- the XML is more than __ [days/months] old
- etc
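
A rough sketch of how the two criteria listed above could be combined, as a dry run that only reports candidates and deletes nothing. This is not Sweeper code. It assumes the root element of each XML file carries a `url` attribute holding a local path, which is an assumption about an internal format that may change; the function names and the 180-day default are hypothetical too.

```python
import time
import xml.etree.ElementTree as ET
from pathlib import Path


def linked_document(xml_path):
    """Return the document path recorded in the XML, or None if it cannot be read.

    Assumption: the root element has a `url` attribute with a local path.
    """
    try:
        root = ET.parse(xml_path).getroot()
    except (ET.ParseError, OSError):
        return None
    url = root.get("url")
    return Path(url) if url else None


def stale_docdata(folder, max_age_days=180, now=None):
    """List XML files whose document is gone and which are older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for xml_path in sorted(Path(folder).glob("*.xml")):
        doc = linked_document(xml_path)
        if doc is None:
            # Unreadable or unexpected layout: leave the file alone.
            continue
        document_missing = not doc.exists()
        older_than_cutoff = xml_path.stat().st_mtime < cutoff
        if document_missing and older_than_cutoff:
            stale.append(xml_path)
    return stale
```

Returning a list instead of deleting keeps the decision with the user, which matches the point that the cleanup should be configurable rather than blind.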

pinotree wrote:
mutlu wrote:5.) Strigi/nepomuk should index the XML files and recognize that this is metadata that belongs to the PDF files linked in them. If I search for a comment I made using Okular, I would like the linked PDF to show up.

6.) The metadata could be stored in Akonadi. This would bring Nepomuk indexing and backup possibilities to Okular metadata.

Given that the format used internally is well... private, we will not expose that to indexers of any kind.

I don't know if I understood you correctly. It seems to me that you are saying that, given that the XML files are located in $KDEHOME, they should not be exposed to crawlers.

Not really. My position is that the format Okular uses to store its internal information is internal.
This means it can change at any time and in any way you can think of. It also means external applications should not read these files, let alone change them!

mutlu wrote:To make indexing the Okular data meaningful, however, a (not yet existing) strigi analyzer would have to relate the XML data to the linked url.

Given the above, a Strigi analyzer could become useless whenever the internal format changes. Given that Strigi is not a KDE application and its releases are not tied to KDE's at all, you can imagine this might not be the best idea.


Pino Toscano
mutlu
pinotree wrote:
mutlu wrote:
pinotree wrote:
mutlu wrote:5.) Strigi/nepomuk should index the XML files and recognize that this is metadata that belongs to the PDF files linked in them. If I search for a comment I made using Okular, I would like the linked PDF to show up.

6.) The metadata could be stored in Akonadi. This would bring Nepomuk indexing and backup possibilities to Okular metadata.

Given that the format used internally is well... private, we will not expose that to indexers of any kind.

I don't know if I understood you correctly. It seems to me that you are saying that, given that the XML files are located in $KDEHOME, they should not be exposed to crawlers.

Not really. My position is that the format Okular uses to store its internal information is internal.
This means it can change at any time and in any way you can think of. It also means external applications should not read these files, let alone change them!

mutlu wrote:To make indexing the Okular data meaningful, however, a (not yet existing) strigi analyzer would have to relate the XML data to the linked url.

Given the above, a Strigi analyzer could become useless whenever the internal format changes. Given that Strigi is not a KDE application and its releases are not tied to KDE's at all, you can imagine this might not be the best idea.

OK, I see your point. However, I think you can imagine the vast increase in usefulness if all comments made to a PDF document were actually searchable. As the maintainer, you are surely in a better position to judge whether it is possible to declare at least parts of the Okular XML stable, for example the "documentInfo" string and the content of the string "base contents" in annotations of type 1. If this is not possible, could this information maybe be pushed into nepomuk?
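
To make concrete what an indexer would need from these files, here is a small Python sketch that pulls out the document url and the comment texts. The sample XML is a guess at the layout built only from the names mentioned above ("documentInfo", annotations of type 1, "base contents"); the real files are internal to Okular and may look different or change at any time, so every element and attribute name here should be read as an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not copied from a real Okular file.
SAMPLE = """<documentInfo url="/home/user/papers/example.pdf">
 <pageList>
  <page number="0">
   <annotationList>
    <annotation type="1">
     <base contents="Check this derivation" author="user"/>
    </annotation>
   </annotationList>
  </page>
 </pageList>
</documentInfo>"""


def extract_comments(xml_text):
    """Return (document url, list of comment texts) from one docdata XML string."""
    root = ET.fromstring(xml_text)
    comments = [
        base.get("contents")
        for annotation in root.iter("annotation")
        if annotation.get("type") == "1"       # text annotations, as assumed above
        for base in annotation.findall("base")
        if base.get("contents")                # skip annotations without text
    ]
    return root.get("url"), comments
```

An analyzer along these lines would hand the comment texts to the search index as metadata of the file named by the url, which is exactly the part that breaks if the layout changes.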

Edit: Zwabel thinks along the same lines: http://zwabel.wordpress.com/2009/03/29/ ... formation/
As does Jos: http://www.kdedevelopers.org/node/3923

Last edited by mutlu on Mon Mar 30, 2009 11:30 pm, edited 1 time in total.

