This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Proposal for a bibliographic system

drdanz
KDE Developer
For bibliographic metadata there is this ontology: http://bibliontology.com/; it should be possible to import it into Nepomuk.
I don't think it is feasible to extract the metadata from the PDF itself, but it should be possible to import it from a .bib file (or another structured format), and perhaps also from the net.

Once the metadata is in Nepomuk, it would be easy to write an exporter that fetches data from Nepomuk and creates (for example) a .bib file.


So, my idea is that we don't need KBibTeX, but rather:

1. Database format: Bibliontology + Nepomuk
2. A frontend to import/add metadata
3. A frontend for searching/exporting data (possibly a library, so it can be included in several apps)

Of course, using Nepomuk means that everything needs to be rewritten, but in the long run it will bring several benefits: metadata sharing, reasoning, complex SPARQL queries that go beyond just the content of the file, and the fact that every application will have access to the metadata.
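The exporter step is easy to sketch. This is only an illustration (the `to_bibtex` helper and the field names are invented for the example, not part of any proposed API); the record dictionary stands in for whatever a Nepomuk query would return:

```python
def to_bibtex(key, fields, entry_type="article"):
    """Render one citation record as a BibTeX entry string (sketch only)."""
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in sorted(fields.items()):
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)

# A record as it might come back from a metadata query:
entry = to_bibtex("nosonovsky2008", {
    "author": "Nosonovsky, Michael and Bhushan, Bharat",
    "journal": "Journal of Physics: Condensed Matter",
    "year": "2008",
})
```

Real BibTeX output needs escaping of special characters and per-entry-type required fields, but the shape of the exporter stays this simple.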
staalmannen
Registered Member
I think one very important aspect of a bibliography tool is collaborative writing (which, at least in my field of research, is quite extensive).

The problem is illustrated on this KOffice wiki page:
http://wiki.koffice.org/index.php?title=Bibliography

At my lab I am the only one using Linux, which means that I have to take on the lion's share of the interoperability work, since everyone else is sitting with MS Word and EndNote. The Bibus + OpenOffice combo works OK, and EndNote references can easily be imported into and exported from Bibus, but I would love to have a real KDE solution for this. Bibus is really the only thing keeping me from switching to KOffice - sad but true...
zanoi
Registered Member
I have to say, the paper management system described by TheBlackCat is an amazing idea. I never thought about using Akonadi for papers, but as far as I can see that's exactly the kind of problem that Akonadi is supposed to solve.

I think for the near future the most important thing would be to get a working KDE4 version of KBibTeX, though.
Chopstick
Registered Member
I also think TheBlackCat's proposal is great! This is really exactly what we need. I've been thinking about using Strigi to index papers (full-text), but I hadn't thought of the Akonadi integration with a KMail-like interface - that is really elegant!
I would be all for that idea! The only concern I have at the moment is the usability and efficiency of Strigi, and I don't know how far along Akonadi is yet. I'm still using KDE 4.4, but maybe the improvements in KDE 4.5 are sufficient to make it usable?
A while ago I tried Mendeley, which is not open source (but many of my colleagues use it). In my opinion it is not much of an improvement over KBib (which I generally find better than kBibTeX, mostly because the online search never worked for me in kBibTeX), and exporting to BibTeX is very cumbersome.
If KBibTeX were ported to KDE4, I would continue using it for a while, until TheBlackCat's idea becomes reality (which I am sure will happen at some point).

Chopstick
Tuukka
Registered Member
I also really like TheBlackCat's idea for the bibliography management system. Akonadi seems to be a quickly developing project that will most likely be adopted by many kinds of KDE (and other) programs, giving it even more momentum. It really makes sense to exploit the work on features and stability that has gone and will go into Akonadi. Writing our own, home-brewed solution would be an unnecessary burden (think of maintainability!). And it works the other way too: if Akonadi is used by a bibliography management project, that will help it gain features and bugfixes that are useful to others as well. Regarding the maturity of Akonadi and Strigi: by the time something like this reaches anything like a finished state, I'm quite confident they'll be far along in terms of stability and features.

I personally keep my papers as PDF files organized in a directory tree. I would be quite happy just to have Strigi fetch metadata for my PDF files -- the lack of metadata (i.e. the citation) prevents me from migrating to any more sophisticated system. Even just PDF files with the citation attached to them via Nepomuk would be an improvement.

Well, the moral of my story is that the Strigi part is quite important for migrating from a file library to a proper database. To those who don't like the idea of Strigi indexing all your files: you can restrict it to just certain folders!
thijsdetweede
Registered Member
The reason I'd advocate at least keeping the file library alongside the proper DB is portability. Being dependent on a specific database is very much against the LaTeX philosophy.

I often share my bibtex file amongst various workplaces, co-authors, and have to send it in to journals together with my article. The advantage of a system like KBibtex or Jabref is that I know that my bibliography.bib is always up to date. Furthermore, I can use any editor I like, so Kile/Kate/Vim/whatever does not have to be adjusted.
TheBlackCat
Registered Member
That is another advantage of Akonadi: it can be made to use different file backends. So if someone wants to use BibTeX, it would be able to keep the data synced with that. If someone wants to use CSV, they can do that. You would be able to use multiple different backends simultaneously, or none at all; sync them automatically or manually; import and export them with little or no additional work for the programmers; and store them on a network. BibTeX is good for people who want to use BibTeX, but it is not the only popular reference format.


Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
-NASA in 1965
thijsdetweede
Registered Member
That's true, I guess. And in no way do I want to be BibTeX-obstructive. It is just a crucial feature for me - the one that keeps me on a non-KDE4 app as we speak.

Would it be a bad idea to first make sure that KBibtex/KPapers is working the way it should, and then start porting the backend to nepomuk/akonadi?
Tuukka
Registered Member
Some discussion about technical issues:

1) Retrieving metadata from PDF files. I think it is not a good idea to try to parse all possible information from PDFs, because it would be error-prone and a lot of work. Just the minimal information needed to identify a paper would suffice. If there is a DOI string this is straightforward; otherwise one most likely needs per-journal citation extractors, which would pick out the journal name and year, volume, and number (or title). All of this could be done with a Strigi StreamAnalyzer, but I'd suggest a separate tool using the Poppler library, for example.
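The DOI-scanning part of this is small enough to sketch. This assumes the PDF text has already been extracted (e.g. with Poppler); the regex is a heuristic that covers common DOIs, not the full grammar:

```python
import re

# Heuristic DOI pattern: "10.<registrant>/<suffix>".
# Covers typical journal DOIs; not the complete specification.
DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')

def find_doi(text):
    """Return the first DOI-like string in extracted PDF text, or None."""
    m = DOI_RE.search(text)
    # Strip trailing punctuation that often follows a DOI in running text.
    return m.group(1).rstrip('.,;') if m else None

doi = find_doi("... doi:10.1088/0953-8984/20/22/225009 Printed in the UK ...")
```

A per-journal extractor would be a fallback of the same shape: a list of regexes tried in order until one matches.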

2) Given some piece of information, retrieving the rest of the citation. If we have a DOI, we can use dx.doi.org to find the publisher's page for the article. With journal, year, volume, and number information, one can use Google Scholar or PubMed to search for the publisher's site (which requires parsing the search engine's HTML output), or write per-journal web address generators.

When the publisher's site is found, the citation (and full-text link) needs to be parsed from the HTML. I think the best solution would be to use the Zotero JavaScript site translators, even though that also involves porting the Zotero API functions; IMO, writing our own translators would be too much dull work. A short-term solution, however, might be parsing the metadata tags that many publishers add to their pages. These are sometimes incomplete, though.
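The metadata-tag approach can be sketched with Python's stdlib parser. This is only an illustration: the `citation_*` names follow the Highwire convention many publishers use, but real pages vary and, as said, are often incomplete:

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name = d.get("name", "")
        if name.startswith("citation_"):
            # Repeated tags (e.g. one per author) accumulate in a list.
            self.meta.setdefault(name, []).append(d.get("content", ""))

parser = CitationMetaParser()
parser.feed('<head>'
            '<meta name="citation_title" content="Roughness-induced superhydrophobicity">'
            '<meta name="citation_author" content="Nosonovsky, Michael">'
            '<meta name="citation_author" content="Bhushan, Bharat">'
            '</head>')
```

The point is that this needs no per-journal code at all, which is why it works as a stopgap until something like the Zotero translators is available.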

3) A final point about how PDF files should be handled: I think it is important to give the user control over his/her PDF archive on the local hard disk. I believe the best solution is not to incorporate the PDFs into the database, but to attach the citation metadata to them with Nepomuk and fetch them on demand using a Nepomuk search. If searching the local disk fails, the system would try to download the full text from the publisher's site. There could be an option to automatically save the PDF to a user-defined location.

This kind of functionality (fetching citations and finding PDFs) does not depend on the way the database is handled and could be used in KBibTex, for example.
TheBlackCat
Registered Member
Tuukka wrote: 3) A final point about how PDF files should be handled: I think it is important to give the user control over his/her PDF archive on the local hard disk. I believe the best solution is not to incorporate the PDFs into the database, but to attach the citation metadata to them with Nepomuk and fetch them on demand using a Nepomuk search. If searching the local disk fails, the system would try to download the full text from the publisher's site. There could be an option to automatically save the PDF to a user-defined location.

Yes, this is exactly how Akonadi works, and one of the reasons I suggested it. It stores the data in normal files, but uses Nepomuk to keep track of the files and to store information about them and links between them.


Tuukka
Registered Member
I wrote a small python script to demonstrate basic citation fetching functionality. It scans a PDF file for a DOI string and uses dx.doi.org to access the publisher site. It then parses the metadata tags of the webpage and prints some citation info. DOI extraction usually works if a DOI string is present in the PDF but the citation fetching does not always work. However, at least IOP journals seem to cooperate quite nicely. The output is something like:

DOI:10.1088/0953-8984/20/22/225009
Authors: Nosonovsky, Michael ; Bhushan, Bharat
Journal: Journal of Physics: Condensed Matter
Date: 2008-06-04
Volume: 20
First page: 225009
Title: Roughness-induced superhydrophobicity: a way to design non-adhesive surfaces
Publisher: IOP Publishing

The fetching is quite slow because the server response can take some seconds. Therefore, indexing a large PDF library takes time. The next step would be attaching the metadata to the PDF files with Nepomuk. However, a new Nepomuk ontology would be needed (based on bibliontology?).

If someone wants to look at the script, it is here: http://www.tkk.fi/u/tyverho/citation.py. If you want to run it, you need the poppler-qt4 Python bindings, which have to be compiled by hand: http://code.google.com/p/python-poppler-qt4/. For me, it required some tweaking of the dirs in configure.py.
thijsdetweede
Registered Member
Looks good. My feeling is that the server-lookup time shouldn't matter too much, since it is not the user's CPU time, and it could probably be parallelized anyway. If not, maybe the mass-import function of CrossRef can be used; and if not, then it can't be helped anyway.
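The parallelization is cheap to sketch, since the lookups are network-bound rather than CPU-bound. Here `fetch_citation` is a stand-in for the real dx.doi.org lookup, which is not reproduced here:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_citation(doi):
    # Placeholder for the real lookup, which would hit dx.doi.org and
    # block on network I/O for a second or two per DOI.
    return {"doi": doi}

def fetch_all(dois, workers=8):
    """Run the lookups concurrently. Because almost all the time is spent
    waiting on the server, threads overlap the latency nearly for free,
    and pool.map preserves the input order of the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_citation, dois))

results = fetch_all(["10.1000/a", "10.1000/b", "10.1000/c"])
```

With eight workers, indexing a large library should take roughly an eighth of the serial time, assuming the servers tolerate a few concurrent requests.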
ad_267
Registered Member
One thing I think is important is to include support for annotations, which I assume would come for free with the Okular KPart. But it would also be nice to have notes associated with each reference, which could be a small paragraph or a larger text document rather than annotations on the PDF. These notes could possibly include support for LaTeX math.

Maybe this is a silly question, but with Akonadi, would it be simple to include references without any associated document or with multiple documents, eg for books with each chapter in a separate PDF?

I think a new application would need to be written, rather than updating KBibTeX. KBibTeX is really a BibTeX file manager, but I think it would be more useful to have a reference manager that syncs collections of references with BibTeX files.

This is how Mendeley works, which is what I currently use for reference management. It's great, but it has some problems, and being closed source it's difficult to get them addressed. The main problems I have with it are mediocre LaTeX/BibTeX support, no support for long/short journal names, viewing being limited to PDF files, and the difficulty of syncing PDF files between machines unless you pay for extra space on their servers. You can't set a file base path and sync file locations between machines, so I had to write a Python script that updates the Mendeley database to get around this.
TheBlackCat
Registered Member
ad_267 wrote:One thing I think is important is to include support for annotations, which I assume would come for free with the Okular KPart. But it would also be nice to have notes associated with each reference, which could be a small paragraph or a larger text document rather than annotations on the PDF. These notes could possibly include support for LaTeX math.

That would be a comment, which Nepomuk seems to support natively.

ad_267 wrote:Maybe this is a silly question, but with Akonadi, would it be simple to include references without any associated document or with multiple documents, eg for books with each chapter in a separate PDF?

This would be essential.


meyerm
Registered Member
ad_267 wrote:One thing I think is important is to include support for annotations, which I assume would come for free with the Okular KPart.


While there is potential, the current annotations in Okular are not that good. A more basic problem is that the authors want to stay independent of file formats (which is an understandable goal!), so annotating directly into a PDF is not possible.

