Proposal for a bibliographic system

ad_267
meyerm wrote:And a more basic problem with it is that the authors want to stay independent of file formats (which is an understandable goal!). So annotating directly into a PDF is not possible.


I think we'd want to keep the ability to annotate any file format (I have some DjVu files for example), but an annotated PDF should also be able to be exported to share with others.
Tuukka
ad_267 wrote:Maybe this is a silly question, but with Akonadi, would it be simple to include references without any associated document or with multiple documents, eg for books with each chapter in a separate PDF?


As TheBlackCat said, yes. The Akonadi collection would be a representation of a bibliography file such as a Bibtex or BibLatex file (the code responsible for keeping the collection and the file in sync is called the backend, and there can be backends for different file formats). In my opinion, the best way to handle PDF files would be for the bibliography manager to query Nepomuk for PDF files with metadata corresponding to the citation. This way, the files would not be part of the collection as such.

ad_267 wrote:One thing I think is important is to include support for annotations, which I assume would come for free with the Okular KPart. But it would also be nice to have notes associated with each reference, which could be a small paragraph or a larger text document rather than annotations on the PDF. These notes could possibly include support for LaTeX math.


ad_267 wrote:I think we'd want to keep the ability to annotate any file format (I have some DjVu files for example), but an annotated PDF should also be able to be exported to share with others.


Poppler (which Okular uses) does not support annotations currently, so exporting them is not possible at the moment.

Adding notes to the reference is no problem if the backend has support for comments (for example Biblatex has an "annotation" field). If the file format doesn't support comments, they can probably be included in the collection somehow but cannot be exported.

ad_267 wrote:I think a new application would need to be written, rather than updating KBibTeX. KBibTeX is really a BibTeX file manager, but I think it would be more useful to have a reference manager that syncs collections of references with BibTeX files.


KBibTeX for KDE4 is actually not as tied to Bibtex as the name suggests. The import filters pass the data to the program as a generic dictionary (key-value pairs) which can in principle contain any keys. The UI is probably designed for Bibtex, but as most bibliographies have a similar format I think it could be adapted for other backends, too. At least the import and export filters (which are separated from the rest of the code) could definitely be reused.
TheBlackCat
Tuukka wrote:Poppler (which Okular uses) does not support annotations currently, so exporting them is not possible at the moment.

From what I have heard the git version of poppler, or maybe even a release version by this point, supports annotations. See here and here. Okular, though, does not support this feature currently (although evince does).


Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
-NASA in 1965
Tuukka
I have evaluated TheBlackCat's ideas a bit and give my comments here.

TheBlackCat wrote:As I already pointed out in the other thread, strigi should be able to automatically detect and extract bibliographic information from every journal article on your hard drive. So all a user should have to do is save the article anywhere they want, strigi should then be able to detect it, extract the information, then store it to the database for easy retrieval later. So the okular integration is unnecessary.

As for extracting the information from the files, zotero can already do this.


Zotero looks for a DOI string in the PDF files (using poppler) and uses dx.doi.org to get the publisher's page. If a DOI string is not found, it takes some random string from the article and tries to find the publisher's page with Google Scholar. It doesn't really extract the data from the articles but from the web pages. We could do the same. Strigi indexes the full text content of PDF files and searching for DOI strings would be fast. The bibliography manager application could have a tool "Find citations from all PDF files on your disk".
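
A minimal sketch of the DOI step described above, assuming the full text of a PDF is already available as a plain string (for example from an indexer). The regular expression is an approximation of the usual "10.<registrant>/<suffix>" shape and the function names are made up for illustration; only the dx.doi.org resolver comes from the discussion.

```python
import re

# Approximate DOI shape: "10." + 4-9 digits + "/" + suffix without spaces.
# This is an assumption for illustration, not an official DOI grammar.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return DOI-like strings found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


def resolver_url(doi):
    """Build the dx.doi.org URL that redirects to the publisher's page."""
    return "http://dx.doi.org/" + doi


sample = "See doi:10.1000/xyz123. Also 10.1000/xyz123 and (10.1234/abc.def-5)."
for doi in find_dois(sample):
    print(resolver_url(doi))
```

Stripping a trailing ")" is a heuristic; a DOI whose suffix really ends in a parenthesis would be truncated, so a real tool would need a smarter rule.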

TheBlackCat wrote:For a paper manager, this is how I envision it:

* Strigi is used for searching for papers already on the hard drive, extracting citations from papers based on layout and/or DOI, and for full-text indexing the document's content. It would also scan the paper's own references in order to keep track of connections between papers and make it easier to search for related papers.


It is not realistic to have strigi do this, but strigi indexes the full text content of files and that is useful (see above).

TheBlackCat wrote: * Akonadi is used to store and retrieve papers. Authors and journals are similar to contacts and papers are similar to emails, so the change necessary to implement this should be minimal. This also allows easy development of alternative front-ends and sharing papers over a network or storing them on a remote computer.


Using Akonadi would be a good idea. Bibliography files would be collections and the entries would be the items of the collection. Different backends (Bibtex, Biblatex, Tellico XML etc.) could use the same class representation for a bibliography entry, allowing the bibliography manager to support different formats natively.
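
To illustrate the idea of one class representation shared by several backends, here is a rough sketch in plain Python. The Entry class, the two serializer functions and the XML layout are all hypothetical; in particular the XML is not the Tellico schema and nothing here uses the Akonadi API.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Entry:
    """Format-neutral bibliography entry (hypothetical, for illustration)."""
    key: str
    entry_type: str
    fields: dict = field(default_factory=dict)


def to_bibtex(entry):
    """Serialize an Entry as BibTeX-style text."""
    lines = ["@%s{%s," % (entry.entry_type, entry.key)]
    for name, value in sorted(entry.fields.items()):
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


def to_xml(entry):
    """Serialize an Entry as a made-up XML layout (not the Tellico schema)."""
    root = ET.Element("entry", {"id": entry.key, "type": entry.entry_type})
    for name, value in sorted(entry.fields.items()):
        ET.SubElement(root, "field", {"name": name}).text = value
    return ET.tostring(root, encoding="unicode")


paper = Entry("doe2010", "article", {"title": "A Title", "year": "2010"})
print(to_bibtex(paper))
print(to_xml(paper))
```

The application only ever sees Entry objects, so adding a format would mean adding one more pair of read/write functions rather than touching the manager itself.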

I don't believe it'd be a good idea to try to incorporate the PDF full text files to the collection, but instead keep them somewhere on the disk and fetch them with Nepomuk on demand.

TheBlackCat wrote: * Akonadi is also used for retrieving citations from online indexes, like pubmed, both on-demand and automatically checking for new articles that fit certain criteria.


Akonadi cannot be used for online indexes because an Akonadi backend must provide a retrieveItems() function that gives a list of all items in the collections. With PubMed for example that is not possible.

TheBlackCat wrote: * Okular kpart is used for displaying papers


Yes. And as pointed out, the annotation capabilities of Okular could be used.

TheBlackCat wrote: * Konqueror/rekonq has integrated system similar to Zotero for retrieving papers and citations from web sites and storing them on the local drive


Good idea. I think this could consist of two parts. (1) A kind of library for retrieving citation info from a given URL (with a plugin system for specialized extractors for each publisher, like Zotero). (2) A KPart::Plugin for Konqueror that uses the library to look for interesting pages.

The library would not only be used by the plugin but also by the tool that scans for DOI strings from PDF files and looks for citation info from the publisher's page with dx.doi.org.
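
The library part could be organised roughly like this: a registry maps a site's host name to an extractor function, and both the browser plugin and the DOI scanner call the same entry point. This is only a sketch; the decorator, the example host and the returned fields are invented for illustration and do not describe Zotero's translators.

```python
from urllib.parse import urlparse

# Registry of extractor functions, keyed by host name.
_EXTRACTORS = {}


def extractor(*hosts):
    """Decorator that registers a function as the extractor for some hosts."""
    def register(func):
        for host in hosts:
            _EXTRACTORS[host] = func
        return func
    return register


@extractor("journal.example.org")
def example_publisher(url, html):
    # Hypothetical plugin: a real one would parse the page's HTML here.
    return {"url": url, "publisher": "Example Publisher"}


def extract_citation(url, html):
    """Dispatch to the extractor registered for the URL's host, if any."""
    func = _EXTRACTORS.get(urlparse(url).hostname)
    if func is None:
        return None  # no specialised extractor for this site
    return func(url, html)


print(extract_citation("http://journal.example.org/article/1", "<html></html>"))
print(extract_citation("http://unknown.example.com/", ""))
```

Returning None for unknown sites lets the caller fall back to something else, for instance a generic search.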

TheBlackCat wrote: * A kio slave is available to find and work with citations. This would ideally seamlessly integrate local searching with searching on article databases, and would make use of the strigi search interfaces.


Well, I'm not sure how useful a KIO slave would be. Perhaps it could be an alternate way to search and browse your bibliographies.

TheBlackCat wrote: * A plugin in koffice is used to format and integrate papers into office documents (a similar plugin could be used for openoffice or even MS word). The documents in a paper would be recorded by the akonadi backend, so you could easily retrieve a list of articles from a particular document you wrote.


I'm not sure what you mean with "integrate papers into office documents"... Do you mean add citations to a paper you are writing? That'd be useful, I suppose. Although I prefer using Latex :) .

TheBlackCat wrote: * A program similar to kmail is used to find, organize, and display papers. It would include saved searches, an okular part to view articles, and an advanced search interface similar to the dolphin Facets search.


Right.

TheBlackCat wrote: * A bibliography layout designer, accessible from both koffice and the main program, with GHNS integration for easily sharing layouts.


I'm not sure what you mean. Do you mean the layout of how the bibliography appears in the bibliography viewer or how your citations appear at the end of your office documents?
TheBlackCat
Tuukka wrote:Zotero looks for a DOI string in the PDF files (using poppler) and uses dx.doi.org to get the publisher's page. If a DOI string is not found, it takes some random string from the article and tries to find the publisher's page with Google Scholar. It doesn't really extract the data from the articles but from the web pages. We could do the same. Strigi indexes the full text content of PDF files and searching for DOI strings would be fast.

It seems you are correct, but as you said both approaches would be possible.

Tuukka wrote:The bibliography manager application could have a tool "Find citations from all PDF files on your disk".

I would think this could be done automatically. If strigi finds a DOI string when scanning PDFs it would automatically attempt to retrieve the data for it and add it to the document's metadata. I suppose the button would still be needed for the google scholar phrase search.

Tuukka wrote:
TheBlackCat wrote:For a paper manager, this is how I envision it:

* Strigi is used for searching for papers already on the hard drive, extracting citations from papers based on layout and/or DOI, and for full-text indexing the document's content. It would also scan the paper's own references in order to keep track of connections between papers and make it easier to search for related papers.


It is not realistic to have strigi do this, but strigi indexes the full text content of files and that is useful (see above).

Which wouldn't be realistic? Using the layout, or recording the citations in the paper? I see how using the layout might not be feasible, but keeping track of citations is essential (and one of the big benefits of nepomuk is that it keeps track of relationships between data, so I would think this would be natural for it).

Tuukka wrote:I don't believe it'd be a good idea to try to incorporate the PDF full text files to the collection, but instead keep them somewhere on the disk and fetch them with Nepomuk on demand.

I agree.

Tuukka wrote:Akonadi cannot be used for online indexes because an Akonadi backend must provide a retrieveItems() function that gives a list of all items in the collections. With PubMed for example that is not possible.

Why couldn't it be told to, for instance, retrieve the items 0 through 100 of a particular search? I would think this sort of thing would be essential, for instance for handling rss feeds (you aren't going to retrieve every single blog post from a blog with thousands or even tens of thousands of posts).

Tuukka wrote:Well, I'm not sure how useful a KIO slave would be. Perhaps it could be an alternate way to search and browse your bibliographies.

Yes, that was my intention.

Tuukka wrote:I'm not sure what you mean with "integrate papers into office documents"... Do you mean add citations to a paper you are writing? That'd be useful, I suppose. Although I prefer using Latex :) .

Yes, that is what I mean. A lot of people don't know latex, and to them having all these citations would be next to useless if there was no easy way to add the references to their papers. This is an essential feature.

Tuukka wrote:I'm not sure what you mean. Do you mean the layout of how the bibliography appears in the bibliography viewer or how your citations appear at the end of your office documents?

The layout of the bibliography in the paper. Journals have their own bibliography and inline reference formats they require. This would have a database of reference formats (probably using GHNS for easy sharing), as well as a WYSIWYG interface for designing new ones from scratch. These could then be directly incorporated into the paper or be used by the latex engine.
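
As a very rough sketch of what such a database of reference formats might hold, each layout could be a named template applied to an entry's fields. The two style names and templates below are invented for illustration and do not correspond to any journal's real requirements.

```python
# Hypothetical reference layouts, expressed as Python format strings.
STYLES = {
    "author-year": "{author} ({year}). {title}. {journal} {volume}, {pages}.",
    "numeric": "[{number}] {author}, {journal} {volume}, {pages} ({year}).",
}


def format_reference(entry, style, number=1):
    """Render one bibliography entry (a dict of fields) in the given style."""
    return STYLES[style].format(number=number, **entry)


entry = {
    "author": "Doe, J.",
    "year": "2010",
    "title": "A Title",
    "journal": "J. Ex.",
    "volume": "1",
    "pages": "1-10",
}
print(format_reference(entry, "author-year"))
print(format_reference(entry, "numeric", number=3))
```

A layout designer would then only need to produce new template strings, which are easy to share.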


Tuukka
TheBlackCat wrote:
Tuukka wrote:Zotero looks for a DOI string in the PDF files (using poppler) and uses dx.doi.org to get the publisher's page. If a DOI string is not found, it takes some random string from the article and tries to find the publisher's page with Google Scholar. It doesn't really extract the data from the articles but from the web pages. We could do the same. Strigi indexes the full text content of PDF files and searching for DOI strings would be fast.

It seems you are correct, but as you said both approaches would be possible.

Tuukka wrote:The bibliography manager application could have a tool "Find citations from all PDF files on your disk".

I would think this could be done automatically. If strigi finds a DOI string when scanning PDFs it would automatically attempt to retrieve the data for it and add it to the document's metadata. I suppose the button would still be needed for the google scholar phrase search.

Tuukka wrote:
TheBlackCat wrote:For a paper manager, this is how I envision it:

* Strigi is used for searching for papers already on the hard drive, extracting citations from papers based on layout and/or DOI, and for full-text indexing the document's content. It would also scan the paper's own references in order to keep track of connections between papers and make it easier to search for related papers.


It is not realistic to have strigi do this, but strigi indexes the full text content of files and that is useful (see above).

Which wouldn't be realistic? Using the layout, or recording the citations in the paper? I see how using the layout might not be feasible, but keeping track of citations is essential (and one of the big benefits of nepomuk is that it keeps track of relationships between data, so I would think this would be natural for it).


The thing is that indexing complicated files such as PDFs is normally done with "endanalyzers", and you can have only one of those per file at a time. By "not realistic" I meant that modifying the built-in PDF analyzer is not feasible. Admittedly, one could add a second analyzer if it is a "streamthroughanalyzer", but then the whole file would have to be loaded into memory. A file size limit would then be needed, and that's not optimal.

Publishers usually seem to list the paper's citations on the article's web page, so I'd prefer extracting them from there.

Even if we don't modify strigi we can have the bibliography application query nepomuk for new PDFs on the disk and have it automatically fetch metadata for them.

TheBlackCat wrote:
Tuukka wrote:I don't believe it'd be a good idea to try to incorporate the PDF full text files to the collection, but instead keep them somewhere on the disk and fetch them with Nepomuk on demand.

I agree.

Tuukka wrote:Akonadi cannot be used for online indexes because an Akonadi backend must provide a retrieveItems() function that gives a list of all items in the collections. With PubMed for example that is not possible.

Why couldn't it be told to, for instance, retrieve the items 0 through 100 of a particular search? I would think this sort of thing would be essential, for instance for handling rss feeds (you aren't going to retrieve every single blog post from a blog with thousands or even tens of thousands of posts).


That's just not how Akonadi is designed. It's for personal information management, and that normally involves finite size collections.

TheBlackCat wrote:
Tuukka wrote:Well, I'm not sure how useful a KIO slave would be. Perhaps it could be an alternate way to search and browse your bibliographies.

Yes, that was my intention.

Tuukka wrote:I'm not sure what you mean with "integrate papers into office documents"... Do you mean add citations to a paper you are writing? That'd be useful, I suppose. Although I prefer using Latex :) .

Yes, that is what I mean. A lot of people don't know latex, and to them having all these citations would be next to useless if there was no easy way to add the references to their papers. This is an essential feature.

Tuukka wrote:I'm not sure what you mean. Do you mean the layout of how the bibliography appears in the bibliography viewer or how your citations appear at the end of your office documents?

The layout of the bibliography in the paper. Journals have their own bibliography and inline reference formats they require. This would have a database of reference formats (probably using GHNS for easy sharing), as well as a WYSIWYG interface for designing new ones from scratch. These could then be directly incorporated into the paper or be used by the latex engine.


Ok. I don't know much about WYSIWYG editors.
roger_lf
Sorry for bumping an old topic, but how is this project going? Is someone working on what was discussed here?

A bib management system as described here would be awesome. I would love to see this come to life.
TheBlackCat
As far as I am aware no one is working on this or is planning on working on this, unfortunately.


Tuukka
TheBlackCat wrote:As far as I am aware no one is working on this or is planning on working on this, unfortunately.


Actually, this is not entirely true :). I did write some code a couple of months ago and managed to demonstrate some of the concepts we were discussing. Unfortunately, I then got busier and abandoned the project for the time being. My plan was to come up with something usable in one way or another to attract other people to join the effort, but I ended up leaving the code in a somewhat messy state.

Let's see what my code does:

Akonadi resource and serializer
Akonadi resources convert data between file backends (e.g. a bibtex or biblatex file) and a class representation that can be used in applications. I had to decide what this class representation would be and ended up using Bibliographic Ontology, because I wanted to stay as independent from a specific file format (such as biblatex) as possible. So the resource offers the data basically as RDF triples using the Bibo properties. There is no perfect 1-1 correspondence between Bibo and Bibtex or Biblatex, so there will be limitations in Biblatex support, but I think this approach is better than using one file format as the native data format...
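
For readers unfamiliar with the triple representation, the sketch below turns a BibTeX-like dictionary into (subject, predicate, object) tuples using a few Bibo and Dublin Core terms. The mapping tables are a small illustrative assumption and are not the mapping used in the code described here; the function also reports fields it cannot map, which is where the missing 1-1 correspondence shows up.

```python
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Illustrative mappings only; a real backend would cover many more fields.
TYPE_MAP = {"article": BIBO + "Article", "book": BIBO + "Book"}
FIELD_MAP = {
    "title": DCTERMS + "title",
    "doi": BIBO + "doi",
    "volume": BIBO + "volume",
    "pages": BIBO + "pages",
}


def to_triples(subject, entry_type, fields):
    """Return (triples, unmapped_field_names) for one bibliography entry."""
    triples = []
    rdf_class = TYPE_MAP.get(entry_type)
    if rdf_class is not None:
        triples.append((subject, RDF_TYPE, rdf_class))
    unmapped = []
    for name, value in fields.items():
        predicate = FIELD_MAP.get(name)
        if predicate is None:
            unmapped.append(name)  # no counterpart in this small mapping
        else:
            triples.append((subject, predicate, value))
    return triples, unmapped


triples, unmapped = to_triples(
    "urn:example:doe2010",
    "article",
    {"title": "A Title", "doi": "10.1000/xyz123", "howpublished": "online"},
)
for triple in triples:
    print(triple)
print("unmapped:", unmapped)
```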

In its present state the Bibtex/Biblatex backend is more or less working, but support for more fields should be added, the conversion from Bibtex syntax to UTF should be improved, and there are bugs. Also, so far the resource doesn't react to changes in the backend.
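
To show what the conversion from Bibtex syntax involves, here is a small sketch that rewrites a handful of LaTeX accent commands as Unicode characters. It is not the converter from the code described here; it handles only five accents on plain ASCII letters and ignores everything else (dotless i, ligatures, math mode and so on).

```python
import re
import unicodedata

# LaTeX accent command -> Unicode combining character.
COMBINING = {
    '"': "\u0308",  # diaeresis
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}

# The three spellings handled here, written out one per alternative.
ACCENT = re.compile(r"""
      \{ \\ (["'`^~]) ([A-Za-z]) \}     # braces around the whole command
    | \\ (["'`^~]) \{ ([A-Za-z]) \}     # braces around the letter only
    | \\ (["'`^~]) ([A-Za-z])           # no braces
""", re.VERBOSE)


def _replace(match):
    # Exactly one alternative matched, so two groups are not None.
    accent, letter = [g for g in match.groups() if g is not None]
    # NFC composes letter + combining mark into a single code point.
    return unicodedata.normalize("NFC", letter + COMBINING[accent])


def latex_to_unicode(text):
    """Replace simple LaTeX accent commands with Unicode characters."""
    return ACCENT.sub(_replace, text)


print(latex_to_unicode('M{\\"u}ller, Caf\\\'e, Erd\\"{o}s'))
```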

Konqueror Plugin and citation extractor
In the present state, the Konqueror plugin is able to extract citation metadata from Science, Nature, ACS, APS, AIP, Wiley, Springer and Elsevier journals and save it to Akonadi. The plugin itself is very simple and most of the code is in a library called WebExtractor, which uses WebKit and Python plugins to parse the publication websites. Each publisher has its own plugin. Compared to Zotero translator plugins, which can be very confusing, my Python scripts are very simple, straightforward and quick to write. The library is independent from the browser so it could in principle be added as a plugin to any browser.

The plugin still doesn't save the actual PDF file. The saving itself would be simple to implement but the PDF would not be useful without attached Nepomuk metadata which would make it easy to connect a citation and the corresponding PDF file.

Which brings me to the main problem: there is no bibliographic ontology in Nepomuk. It is possible to install Bibo to Nepomuk and use it but it would not be a long term solution. The ontology is used to represent the data internally so if Nepomuk later gets something like Bibo and we'd want to switch to it, all the code would have to be modified. The right thing to do would be to contact the Nepomuk team and propose a bibliographic ontology, but that wouldn't make sense right now because the development of my project is stalled.

GUI for managing the citations
I have some ideas for it but I never actually planned to write one because it'd be too much work. I thought maybe someone else would do it. However, I wrote a widget to view the entries of a bibliography. Maybe when Qt5 comes with the capability to use QML for desktop apps the effort of writing a GUI becomes smaller.

Final words
I managed to keep the number of lines of code rather small, so this could be a nice starting point for anyone who wants to try his/her hand at making a bibliography management system. Though, there are no comments in the code and things could use some cleaning up... If someone manages to make some progress with it, I would most likely join to help :). I don't have an online source repository but the code is available from me by asking...
jmaspons
There is news! It seems that Joerg is implementing something similar to what is described in this topic. Take a look at http://joerg-weblog.blogspot.com/2012/0 ... ow-me.html

Joan
jmaspons
More news where Jörg meets Tuukka: http://joerg-weblog.blogspot.com/2012/0 ... ction.html

Good work guys!
easonawu
I think BibLaTeX supports unicode.

 