word lists - strigi? nepomuk?

happy_heyoka


Tue Jul 17, 2012 2:51 pm
I have an idea for an application to automatically categorise and tag documents based on their contents.
To do this I need a frequency distribution of the words in the document.
I have played around with the Nepomuk examples and have a few clues about the tagging and RDF storage.
I can't find much information on a per-document word list, though: nepsak and nepoogle don't appear to show it, so maybe it isn't stored in Virtuoso?
Is there a word list stored (e.g. an inverted vector index)? How does the full-text search in Dolphin do its thing?
Do I need to produce this list myself using libstreamanalyzer? I'd prefer not to do a second indexing pass.
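
For anyone unfamiliar with the term, an inverted index maps each word to the documents that contain it. Here is a minimal Python sketch of the idea only; it is not how Virtuoso or Strigi implement their indexes, and the `documents` dict and file names are made up for illustration:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document names that contain it.

    `documents` is a hypothetical dict of {name: plain text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


# Made-up sample documents, for illustration only.
docs = {
    "a.txt": "Nepomuk stores tags",
    "b.txt": "Strigi extracts text and Nepomuk stores it",
}
index = build_inverted_index(docs)
print(sorted(index["nepomuk"]))  # ['a.txt', 'b.txt']
print(sorted(index["strigi"]))   # ['b.txt']
```

A structure like this answers "which documents contain word X", which is what full-text search needs. It does not by itself give per-document word counts.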
bcooksley

Re: word lists - strigi? nepomuk?

Wed Jul 18, 2012 9:24 am
Given the type of question this is, you may want to ask the Nepomuk developers themselves directly as to how you might accomplish this. Please send an email to nepomuk@kde.org.


KDE Sysadmin
happy_heyoka

Re: word lists - strigi? nepomuk?

Tue Aug 07, 2012 1:36 pm
bcooksley wrote: Given the type of question this is, you may want to ask the Nepomuk developers themselves directly as to how you might accomplish this. Please send an email to nepomuk@kde.org.

Firstly, for completeness: that address is a mailing list, and you can subscribe here:
https://mail.kde.org/mailman/listinfo/nepomuk
Note that this list is pretty heavy on traffic about maintaining and shipping the Nepomuk infrastructure.

I got several responses (thanks). Basically, the word lists are internal to Virtuoso (the database that holds the data and runs the semantic content queries for Nepomuk).

The closest thing to what I am after is that the Nepomuk property nie:plainTextContent contains the text extracted from a file; I'm going to have to post-process that to get what I need.
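
Roughly, the post-processing step could look like the Python sketch below. It assumes the extracted text is already in hand as a string; fetching nie:plainTextContent from Nepomuk is not shown, and the function name and sample text are placeholders of mine:

```python
import re
from collections import Counter


def word_frequencies(plain_text):
    """Return a Counter of lowercase word tokens found in plain_text."""
    words = re.findall(r"\w+", plain_text.lower())
    return Counter(words)


# Placeholder for text that would come from nie:plainTextContent;
# retrieving it from Nepomuk is outside the scope of this sketch.
sample = "The cat sat on the mat. The mat was flat."
freq = word_frequencies(sample)
print(freq.most_common(2))  # [('the', 3), ('mat', 2)]
```

For real categorisation you would probably also want to drop stop words and maybe stem the tokens, but that depends on the tagging scheme.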

Jörg Ehrichs also posted a link to the source of an app that runs some queries against Nepomuk for files:
http://blog.6bytesmore.com/2011/12/resource-browser.html

