publication format - JATS conversion and alternatives #3

Open
LGro opened this issue Jul 15, 2017 · 3 comments

Comments


LGro commented Jul 15, 2017

Currently the goal seems to be handling/displaying only JATS files. Accordingly, the main challenge for publishers that do not offer JATS/XML sources for publications is to convert them.

I checked out pandoc-jats, which does not seem to work too well, especially with formulas. Does anyone have experience with different parametrisations that might work better?
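For comparison, here is a minimal sketch of driving a LaTeX-to-JATS conversion from Python. It assumes a recent pandoc that ships a built-in `jats` output format, which is not the same thing as the separate pandoc-jats writer tested above. The function names and file names are made up for illustration.

```python
import shutil
import subprocess


def pandoc_jats_command(source, target):
    """Build the pandoc command line for a LaTeX -> JATS conversion.

    Assumption: the installed pandoc has a built-in "jats" writer.
    """
    return [
        "pandoc", source,
        "--from", "latex",
        "--to", "jats",
        "--standalone",          # emit a complete document, not a fragment
        "--output", target,
    ]


def convert(source, target):
    """Run the conversion; raises if pandoc is missing or exits non-zero."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_jats_command(source, target), check=True)
```

Calling `convert("paper.tex", "paper.xml")` would then write the JATS file next to the source. How well formulas survive still has to be checked per paper.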

LaTeXML produces an XML file ten times bigger than the pandoc-jats output, and my local lens reader never finishes loading it.

Tralics by the INRIA people does not seem to work out of the box.

The arXMLiv project apparently attempts to convert all of arXiv to machine-readable content. I just contacted one of the project members to find out the current project status. Since the main PhD student of this project is now working for LaTeXML, I guess it's worth looking into that.

Observing that the mass conversion of LaTeX to XML is something people spend PhDs on, we might want to consider alternatives in case we can't come up with a well-working conversion pipeline for ScienceFair.
I would suggest valuing good coverage of the scientific literature over covering only those articles that are available in JATS. What do you think about this, and would you be open to also incorporating, at least as a start, PDF/PostScript versions of publications, displayed with a PDF reader (like pdf.js) integrated into ScienceFair?

I hope this fits in the strategy repo rather than the specific sciencefair-land/sciencefair-datasource-arxiv.

Author

LGro commented Jul 16, 2017

I heard back from the arXMLiv people. The project is still running, but progress seems to have slowed down because of interference from other projects.
They also pointed me to arXiv's licensing regulations:

Note: Most articles submitted to arXiv are submitted with the default arXiv license which grants arXiv a perpetual, non-exclusive license to distribute the article but does not assign copyright to arXiv, nor grant arXiv the right to grant any specific rights to others. We are thus unable to grant others the right to distribute arXiv articles. If you build indexes or tools based on the full-text you must link back to arXiv for downloads. A small fraction of submissions are made with other licenses and this information is available in the OAI-PMH metadata.

So for the articles under the arXiv default license it seems impossible to mirror them in any form, be it PDF or XML. Do I understand correctly that this torpedoes our idea of an arXiv datasource in its current form?
However, there are arXiv mirrors run or approved by the Cornell Library, so maybe we can get them on board to create a datasource themselves. It would be interesting to find out whether dat's distributed hosting architecture conflicts with their licensing, or whether it is sufficient that they hold the key that controls the arXiv datasource.
Do we already have contacts at arXiv or the Cornell Library?
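Since the note says the license is exposed in the OAI-PMH metadata, a datasource could in principle filter records by license before mirroring anything. The sketch below parses a hand-written sample, not real harvested data, in what I understand to be the `arXiv` metadata format. The `<records>` wrapper, the placeholder IDs, and the policy in `may_mirror` (treat only Creative Commons URLs as mirrorable) are assumptions for illustration and would need checking against the actual license terms.

```python
import xml.etree.ElementTree as ET

# Namespace of the arXiv-specific OAI-PMH metadata format.
ARXIV_NS = {"arxiv": "http://arxiv.org/OAI/arXiv/"}

# Hand-written sample records (IDs are placeholders, not real papers).
SAMPLE_RECORDS = """<records>
  <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
    <id>0000.00001</id>
    <license>http://arxiv.org/licenses/nonexclusive-distrib/1.0/</license>
  </arXiv>
  <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
    <id>0000.00002</id>
    <license>http://creativecommons.org/licenses/by/4.0/</license>
  </arXiv>
  <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
    <id>0000.00003</id>
  </arXiv>
</records>"""


def may_mirror(license_url):
    """Hypothetical, conservative policy: only Creative Commons URLs pass.

    Records without a license element are treated as not mirrorable.
    """
    return license_url is not None and "creativecommons.org/" in license_url


def classify(xml_text):
    """Map each record ID to True/False according to may_mirror()."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.findall("arxiv:arXiv", ARXIV_NS):
        record_id = record.findtext("arxiv:id", namespaces=ARXIV_NS)
        license_url = record.findtext("arxiv:license", namespaces=ARXIV_NS)
        result[record_id] = may_mirror(license_url)
    return result


if __name__ == "__main__":
    for record_id, ok in classify(SAMPLE_RECORDS).items():
        print(record_id, "mirror" if ok else "link back to arXiv only")
```

Records that fail the check could still be indexed with a link back to arXiv, which is what the note asks for.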

Member

blahah commented Jul 16, 2017

@LGro you are correct that at the moment, the only target format is JATS. There is a lot to gain if we keep JATS as the first-class supported format imo, even if we eventually provide some second-class support for other formats.

Much of this will be easier to explain once I have finished my first draft of the docs in this repo. But here's a brief attempt at a summary.

JATS gives us semantic markup in a format many publishers are already (at least in principle) committed to. If we get medium-to-high-quality JATS we can have:

  • Flexible, stylable, customisable and extendable article display - right now this is a default display but by v2.0 people will be able to customise it to their liking.
  • Broad support for text and data-mining - converging around a single format allows us to build a layered ecosystem of modular tools for analysing papers, and this in turn allows all kinds of cool things in addition to the simple ability for users to perform analysis of their literature collections.
  • Annotation built-in. JATS supports annotating nodes arbitrarily. This allows us to have annotation datasources, for example (off the top of my head):
    • turning each mention of a chemical name into a link that expands a panel in the reader with the molecular formula + 3D model, physical properties, or links to protocols using the chemical.
    • allowing users to produce their own annotations (individually or in distributed groups), so that for example a lab could share their comments on specific claims, results or figures in a paper in a distributed way.
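To make the chemical-name example concrete, here is a minimal sketch of one way such an annotation pass might look. It wraps matching terms in JATS `<named-content>` elements, which a reader could then turn into links or panels. The function name, the `content-type` value, and the term list are assumptions, and the sketch only touches the text before a paragraph's first child element.

```python
import re
import xml.etree.ElementTree as ET


def annotate_terms(paragraph, terms, content_type="chemical"):
    """Wrap whole-word occurrences of `terms` in <named-content> elements.

    Limitation: only paragraph.text (the text before the first child
    element) is scanned; tails of existing children are left untouched.
    """
    if not terms:
        return paragraph
    text = paragraph.text or ""
    # Longest terms first so that longer names win over their prefixes.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")

    paragraph.text = None
    last = None       # most recently inserted <named-content> node
    position = 0      # end of the previous match in `text`
    index = 0         # child index at which to insert the next node
    for match in pattern.finditer(text):
        chunk = text[position:match.start()]
        if last is None:
            paragraph.text = chunk
        else:
            last.tail = chunk
        node = ET.Element("named-content", {"content-type": content_type})
        node.text = match.group(0)
        paragraph.insert(index, node)
        index += 1
        last = node
        position = match.end()

    rest = text[position:]
    if last is None:
        paragraph.text = rest
    else:
        last.tail = rest
    return paragraph


if __name__ == "__main__":
    p = ET.fromstring("<p>We dissolved ethanol in water.</p>")
    print(ET.tostring(annotate_terms(p, ["ethanol"]), encoding="unicode"))
```

Keeping the annotation as markup in the document, rather than as offsets stored elsewhere, is what would let separate annotation datasources layer on top of the same article.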

If we get poor-quality JATS, we can firstly automatically convert it to the best we can get (and gradually improve this over time), and secondly have an interface that allows users to collectively improve the underlying JATS by annotating the papers.

And if we succeed in building a strong user community, we will be producing an ecosystem of tools for working with JATS (building, automatic and manual improving, displaying, etc.), as well as providing an incentive for publishers to standardise around the format.

On the subject of producing JATS - LaTeXML produces a different XML format that needs converting to JATS, but in my experiments it converts quite nicely and can easily be cleaned up.

pandoc-jats is basic right now, but again, we can both improve it ourselves and help attract people interested in helping improve it.

For the arXiv license issue - indeed most articles are not open access, but they are available for data mining, and I'm certain we can find a working solution for this particular case.

Member

blahah commented Jul 16, 2017

On the subject of PDF.js, I think we should not serve PDFs directly but parse them into a DOM through PDF.js and have a standard way to transliterate that into JATS. If we do this, people can improve the documents and processes over time, whereas if we keep PDF there is little opportunity to drive progress in the user experience, the tools, or publishers' behaviour.

There are a lot of tools out there that solve parts of the problem, and I have been meaning to put my experiments with them into a repo so we can iterate on a pipeline.
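As a rough illustration of the transliteration step, the sketch below turns a flat list of classified text blocks into a bare JATS skeleton. The `(kind, text)` block format is a hypothetical stand-in for whatever a PDF.js-based parser would emit, and the output is not validated against the JATS DTD.

```python
import xml.etree.ElementTree as ET


def blocks_to_jats(blocks):
    """Build a minimal JATS <article> from (kind, text) tuples.

    `kind` is one of "title", "heading" or "paragraph"; this block format
    is an assumption, not something PDF.js produces by itself.
    """
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(meta, "title-group")
    body = ET.SubElement(article, "body")

    section = None  # current <sec>, if a heading has been seen
    for kind, text in blocks:
        if kind == "title":
            ET.SubElement(title_group, "article-title").text = text
        elif kind == "heading":
            section = ET.SubElement(body, "sec")
            ET.SubElement(section, "title").text = text
        elif kind == "paragraph":
            # Paragraphs before the first heading go directly into <body>.
            parent = section if section is not None else body
            ET.SubElement(parent, "p").text = text
        else:
            raise ValueError("unknown block kind: %r" % (kind,))
    return article


if __name__ == "__main__":
    demo = [
        ("title", "An example article"),
        ("heading", "Introduction"),
        ("paragraph", "First paragraph of the introduction."),
    ]
    print(ET.tostring(blocks_to_jats(demo), encoding="unicode"))
```

The hard part in practice is the classification that produces the blocks. Once that exists, the mapping to JATS is mechanical and can be improved incrementally.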
