
Is PDF Indexing Working?!

CyberAngel
Registered Member
Posts
49
Karma
0

Is PDF Indexing Working?!

Thu Feb 23, 2012 3:40 pm
Hello,

Is PDF indexing working in KDE 4.8.0?

I am on Kubuntu 11.10 (64-bit) with KDE 4.8.0, and searches return no PDFs even though I know some of them contain what I am searching for.

My File indexer is idle with 69,151 indexed files.
google01103
Manager
Posts
6668
Karma
25

Re: Is PDF Indexing Working?!

Thu Feb 23, 2012 5:15 pm
Since 4.7.3, Nepomuk is supposed to handle PDFs: http://trueg.wordpress.com/2011/11/02/k ... y-release/

Do you have pdftotext installed?
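
A quick way to answer that from a terminal is sketched below. It only checks whether a pdftotext executable is on the PATH; have_cmd is a helper name made up for this example, and the check cannot tell you whether Nepomuk actually relies on pdftotext in a given setup.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd pdftotext; then
  echo "pdftotext found: $(command -v pdftotext)"
else
  echo "pdftotext not found on PATH"
fi
```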


OpenSuse Leap 42.1 x64, Plasma 5.x

Hans
Administrator
Posts
3304
Karma
24

Re: Is PDF Indexing Working?!

Thu Feb 23, 2012 5:23 pm
Just a guess: you could try forcing a reindex of a folder to see if that works. http://trueg.wordpress.com/2011/12/05/m ... s-is-easy/


Ignacio Serantes
Registered Member
Posts
453
Karma
1

Re: Is PDF Indexing Working?!

Thu Feb 23, 2012 5:50 pm
As additional information, there are two command-line utilities for checking whether a file is indexed by the Strigi analyzers:
  • rdfindexer
  • xmlindexer

Run either of these utilities with your problematic PDF as a parameter; the output is the information that gets added to Nepomuk.

If you don't detect any problems with the indexing, try an alternative way to query your database, such as Nepoogle. Try the following query in Nepoogle:

url:"my pdf file.pdf"

If there is a result, then the problem is in the query system and not in the indexers.
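
The analyzer check described above can be scripted roughly as follows. This is only a sketch: analyzer_finds_word is a made-up helper, it assumes the analyzer (xmlindexer or rdfindexer) accepts the file path as its only argument and writes what it extracted to standard output, and the path and word in the example comment are placeholders.

```shell
# Hypothetical helper: run an analyzer on a file and report whether a
# known word shows up anywhere in what the analyzer printed.
# $1 = analyzer command, $2 = file, $3 = word (matched literally, case-insensitive)
analyzer_finds_word() {
  "$1" "$2" 2>/dev/null | grep -qiF -- "$3"
}

# Example (placeholder path and word):
#   analyzer_finds_word xmlindexer "$HOME/Folder/SubFolder/MyFile.pdf" someword \
#     && echo "word was extracted" || echo "word not in analyzer output"
```

If the word is missing from the analyzer output, the text was probably never extracted, which points at the indexing side rather than the query side.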


Ignacio Serantes, proud to be a member of KDE forums since 2008-Nov.
CyberAngel
Registered Member
Posts
49
Karma
0

Re: Is PDF Indexing Working?!

Thu Feb 23, 2012 8:49 pm
@google01103: pdftotext is installed.
@Hans: I forced a reindex and saw that the files missing from the results got reindexed.

It is probably a mixed problem: both the indexers and the queries...

xmlindexer and rdfindexer both work fine with the files I am testing.

I have a folder structure like this:

Code:
~/Folder/SubFolder/MyFile.tex <- contains LaTeX code
~/Folder/SubFolder/MyFile.pdf <- contains the compiled PDF
~/Folder/SubFolder/MyFile-copy_longname.pdf <- same file as MyFile.pdf, just with a different, longer name


- If I open Dolphin, go to "~/Folder/SubFolder" and search the contents of the files in the current folder for a word, the only result I get is MyFile.tex (even though the word is contained in both PDF files as well).

- If I open Dolphin, go to "~/Folder" and search the contents of the files in the current folder for the same word, I get MyFile.tex and MyFile.pdf but NOT MyFile-copy_longname.pdf.

- If I now search from "~/Folder" for a different word that I know is contained in the .tex and PDF files, I might get only MyFile.tex and none of the PDF files!

Generally, it looks like plain text files are indexed correctly, but PDFs are not.

./nepoogle url:"filename.pdf" or nepoogle url:"filename.*" returns "MyFile.tex" and "MyFile.pdf" but never "MyFile-copy_longname.pdf".
So my assumption is that "MyFile-copy_longname.pdf" is not indexed at all, even after the reindexing :(

Strange (buggy) behaviour...
Ignacio Serantes
Registered Member
Posts
453
Karma
1

Re: Is PDF Indexing Working?!

Thu Feb 23, 2012 9:29 pm
CyberAngel wrote:
./nepoogle url:"filename.pdf" or nepoogle url:"filename.*" will return the name of the "MyFile.tex" and "MyFile.pdf" but never the name of the "MyFile-copy_longname.pdf".
So my assumption is that "MyFile-copy_longname.pdf" is not indexed at all even after the reindexing :(

Strange (buggy) behaviour...
Dolphin search has not worked on my openSUSE installations since KDE 4.6.0, so you may be experiencing the same problem. KRunner works, but its results are limited, so if you want a reliable search with unlimited results, the best solution at this time is Nepoogle. Nepoogle has its own query engine and uses the Nepomuk Search API too.

If Nepoogle finds the files, then those files are indexed; if Nepoogle does not find them, then they are not indexed. I tried your example and copied a PDF file twice, once with a long name and once with a short name, and Nepoogle found all three files.

Keep in mind that Nepomuk synchronization is sometimes not instant, so you can try the following test:

1) Open the console.
2) Execute the following commands:
  • nepomukindexer "full path to your pdf file not indexed"
  • nepoogle url:"name of the file"

nepomukindexer forces Nepomuk to index/reindex the file, so it's important that you use the full path.
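
The two commands above can be wrapped so that the full path is always supplied. This is a minimal sketch, assuming nepomukindexer and nepoogle are both on the PATH; abs_path and reindex_and_query are names invented for the example.

```shell
# abs_path: turn a possibly relative path into an absolute one
# (plain string handling; it does not resolve symlinks or "..").
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$PWD" "$1" ;;
  esac
}

# reindex_and_query: run the two commands from the steps above, in order.
# Assumes nepomukindexer and nepoogle are on PATH.
reindex_and_query() {
  file=$(abs_path "$1")
  nepomukindexer "$file"
  nepoogle "url:$(basename "$file")"
}

# Example (placeholder file name):
#   reindex_and_query MyFile-copy_longname.pdf
```

Since synchronization may lag, an empty result right after indexing is not conclusive; repeating the query a little later is worth a try.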


CyberAngel
Registered Member
Posts
49
Karma
0

Re: Is PDF Indexing Working?!

Thu Feb 23, 2012 11:17 pm
I removed the Nepomuk database, so nepoogle could not find any results.
Then I started adding files one by one using nepomukindexer.
What I realized is that very few PDF files can actually be indexed.

Then I used the find command to index only some PDFs, recursively under one directory.
Total number of PDF files: 696.
Total number of indexed PDF files: 121.

Code:
find ~/mydir -name \*.pdf | wc -l
696


Code:
find ~/mydir -name \*.pdf -exec nepomukindexer {} \;


Code:
./nepoogle url:".*mydir.*.pdf"
Querying Nepomuk

....
....
121 records found in 0.047853 seconds.
--
Powered by nepoogle v0.8 (2012-02-01)
bcooksley
Administrator
Posts
19765
Karma
87

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 1:10 am
Can you please try to ask Nepomuk to manually index one of the PDFs which failed to be automatically indexed?

Can you tell if there is a particular pattern among the PDF files which could not be indexed?


KDE Sysadmin
CyberAngel
Registered Member
Posts
49
Karma
0

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 1:19 am
bcooksley wrote:
Can you please try to ask Nepomuk to manually index one of the PDF's which failed to be automatically indexed?

Can you tell if there is a particular pattern among the PDF files which could not be indexed?


Shall I try it with the nepomukindexer command and get the output if any?
Is there another way to ask Nepomuk to manually index files?
bcooksley
Administrator
Posts
19765
Karma
87

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 1:21 am
The Nepomuk Indexer command is the one I was referring to there, yes.


CyberAngel
Registered Member
Posts
49
Karma
0

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 2:03 am
Some files give blank output and exit status 0 (does that mean everything went all right?).
I have even tried copying those files to my home directory (so that there are no spaces in the path) and indexing them there.

Other files give an "Error in parsing: Keyword endstream not found." message, but exit status 0 as well!

Others give this kind of error, but again with an exit status of 0:
Error: Z_DATA_ERROR while inflating stream.
Error in parsing:

I get still other errors for some of the PDFs (the exit status of the nepomukindexer command is always 0, though, which it shouldn't be), but even some that produce no error at all still can't be indexed.

I ran "echo $?" immediately after the "nepomukindexer /home/user/fullpath/file.pdf" command to get the exit status.
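
Because the exit status is reported as 0 even when errors are printed, it can help to record the output of each run next to the status when looking for a pattern among the failing files. The following is a rough sketch under a few assumptions: index_pdfs_with_log is a made-up name, the indexer is assumed to take the file path as its only argument, and stdout and stderr are captured together because it is not known which one carries the error messages.

```shell
# index_pdfs_with_log: run an indexer on every PDF under a directory and
# record exit status, path and first line of output, tab-separated.
# $1 = directory, $2 = log file, $3 = indexer command (default: nepomukindexer)
# File names containing newlines are not handled.
index_pdfs_with_log() {
  dir=$1
  log=$2
  indexer=${3:-nepomukindexer}
  : > "$log"
  find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
    # </dev/null keeps the indexer from consuming the list of file names.
    out=$("$indexer" "$pdf" 2>&1 </dev/null) && status=0 || status=$?
    first=$(printf '%s\n' "$out" | head -n 1)
    printf '%s\t%s\t%s\n' "$status" "$pdf" "$first" >> "$log"
  done
}

# Example (placeholder paths):
#   index_pdfs_with_log ~/mydir /tmp/index.log
#   awk -F'\t' '$3 != ""' /tmp/index.log   # files that printed something
```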

One PDF that cannot be indexed, yet produces no error output from the nepomukindexer command, is the MLN manual, available from SourceForge at http://mln.sourceforge.net/doc/mln-manual.pdf; the version I have is at http://dl.dropbox.com/u/3397346/mln-manual.pdf.
bcooksley
Administrator
Posts
19765
Karma
87

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 2:10 am
OK, it appears that the PDF handling in Strigi needs some fixes to work with all PDF files. Could you please file a bug at bugs.kde.org about that, attaching the PDF files which fail to index if possible?


CyberAngel
Registered Member
Posts
49
Karma
0

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 2:12 am
bcooksley wrote:
Ok, it appears that the PDF handling in Strigi needs some fixes to work with all PDF files. Could you please file a bug at bugs.kde.org regarding that - attaching the PDF files which fail to index if possible.


Sure!
I'll do that now.
CyberAngel
Registered Member
Posts
49
Karma
0
Ignacio Serantes
Registered Member
Posts
453
Karma
1

Re: Is PDF Indexing Working?!

Fri Feb 24, 2012 8:36 am
CyberAngel wrote:
KDE Bug opened here
https://bugs.kde.org/show_bug.cgi?id=294727
In general, if you run nepomukindexer on a file and nepoogle then cannot find that file, the indexing process failed.

As I wrote, xmlindexer and rdfindexer are good tools for detecting problems.

I tried the PDF you attached to the bug report. It is true that nepomukindexer reports an error parsing the file, but the file is indexed and visible to nepoogle on my system.

On the other hand, after fixing the PDF file with pdftk-gui, there is no error when indexing it.
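
pdftk-gui is a graphical front end. Assuming the command-line pdftk is installed, a comparable repair can be sketched as below: passing a file through pdftk with no other operation asks it to repair a damaged cross-reference table and stream lengths where possible. repair_pdf and the file names are made up for the example, and whether this is exactly the fix pdftk-gui applied is an assumption.

```shell
# repair_pdf: rewrite a PDF through pdftk, which repairs what it can.
# $1 = damaged input PDF, $2 = path for the repaired copy
repair_pdf() {
  pdftk "$1" output "$2"
}

# Example (placeholder file names):
#   repair_pdf mln-manual.pdf mln-manual-fixed.pdf
```

The repaired copy can then be passed to nepomukindexer with its full path to see whether the parsing error goes away.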

On openSUSE, these are the packages and versions I have installed:
  • kdegraphics-strigi-analyzer - 4.8.0-24.1
  • kdesdk4-strigi - 4.8.0-225.1
  • libstrigi0 - 0.7.6-65.3
  • libstrigi0-32bit - 0.7.6-65.3
  • strigi - 0.7.6-65.3
  • strigi-devel - 0.7.6-65.3

