Registered Member
I have an idea for an application to automatically categorise and tag documents based on their contents.
To do this I need a frequency distribution of the words in each document. I have played around with the Nepomuk examples and have a few clues about the tagging and RDF storage. I can't find much information on a per-document word list, though; nepsak and nepoogle don't appear to show one, so maybe it's not stored in Virtuoso? Is a word list stored somewhere (e.g. an inverted vector index)? How does the full-text search in Dolphin do its thing? Do I need to produce this list myself using libstreamanalyzer? I'd prefer not to do a second indexing pass.
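For what it's worth, the frequency-distribution step itself is straightforward once you have the plain text of a document. A minimal sketch in Python, assuming the text has already been extracted (the tokenisation regex is a simplification, not what any Nepomuk component actually does):

```python
import re
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Lowercase the text, split it into words, and count occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Example: the two most common words and their counts.
freq = word_frequencies("The cat sat on the mat. The mat was flat.")
print(freq.most_common(2))  # → [('the', 3), ('mat', 2)]
```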
Administrator
Given the type of question this is, you may want to ask the Nepomuk developers directly how you might accomplish this. Please send an email to nepomuk@kde.org.
KDE Sysadmin
Registered Member
Firstly, for completeness: that address is a mailing list, and you can subscribe here: https://mail.kde.org/mailman/listinfo/nepomuk. Note that the list carries fairly heavy traffic about maintaining and shipping the Nepomuk infrastructure. I got several responses (thanks!), and the short answer is that the word lists are internal to Virtuoso (the database that stores the semantic content and answers queries for Nepomuk). The closest thing to what I am after is the Nepomuk property nie:plainTextContent, which holds the text extracted from a file; I'm going to have to post-process that to get what I need. Jörg Ehrichs also posted a URL to the source of an app that runs some queries against Nepomuk for files: http://blog.6bytesmore.com/2011/12/resource-browser.html
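To illustrate the approach above, here is a hedged sketch: build a SPARQL query for a file's nie:plainTextContent, then post-process the returned text into a word list. The nie: namespace URI is the real NIE ontology one, but the file URL is just a placeholder, and actually executing the query against Virtuoso (e.g. via Soprano or the Nepomuk query service) is not shown here:

```python
import re
from collections import Counter

# Real namespace of the NEPOMUK Information Element (NIE) ontology.
NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"

def plain_text_query(file_url: str) -> str:
    """Build a SPARQL query selecting the extracted text of one file."""
    return (
        f"PREFIX nie: <{NIE}> "
        f"SELECT ?text WHERE {{ <{file_url}> nie:plainTextContent ?text . }}"
    )

def post_process(text: str) -> list:
    """Turn the extracted plain text into a (word, count) list, most frequent first."""
    return Counter(re.findall(r"\w+", text.lower())).most_common()

# Hypothetical file URL, for illustration only.
query = plain_text_query("file:///home/user/doc.txt")
```

Running `query` through whatever SPARQL client you use would yield the text that `post_process` then reduces to the frequency list, avoiding a second indexing pass over the file itself.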