publication format - JATS conversion and alternatives #3
Comments
I heard back from the arXMLiv people. The project is still running, but progress seems to have slowed because of competing projects.
So for the articles under the arXiv default license it seems impossible to mirror them in any form, be it PDF or XML. Do I understand correctly that this torpedoes our idea of an arXiv datasource in its current form?
@LGro you are correct that at the moment, the only target format is JATS. There is a lot to gain if we keep JATS as the first-class supported format imo, even if we eventually provide some second-class support for other formats. Much of this will be easier to explain once I have finished my first draft of the docs in this repo. But here's a brief attempt at a summary. JATS gives us semantic markup in a format many publishers are already (at least in principle) committed to. If we get medium-to-high-quality JATS we can have:
If we get poor-quality JATS, we can firstly automatically convert it to the best we can get (and gradually improve this over time), and secondly have an interface that allows users to collectively improve the underlying JATS by annotating the papers. And if we succeed in building a strong user community, we will be producing an ecosystem of tools for working with JATS (building, automatically and manually improving, displaying, etc.), as well as providing an incentive for publishers to standardise around the format. On the subject of producing JATS: LaTeXML produces its own XML, which needs converting to JATS, but in my experiments it converts quite nicely and can easily be cleaned up.
For the arXiv license issue: indeed most articles are not open access, but they are available for data mining, and I'm certain we can find a working solution for this particular case.
On the subject of PDF.js, I think we should not serve PDFs directly but parse them into a DOM through PDF.js and have a standard way to translate that into JATS. If we do this, people can improve the documents and processes over time, whereas if we keep PDF there is little opportunity to drive progress in the user experience, the tools, or publishers' behaviour. There are a lot of tools out there that solve parts of the problem, and I have been meaning to put my experiments with them into a repo so we can iterate on a pipeline.
Currently the goal seems to be handling/displaying only JATS files. Accordingly, for publishers that do not offer JATS/XML sources, the main challenge is converting their publications.
I checked out pandoc-jats, which does not seem to work too well, especially with formulas. Does someone have more experience with different parametrisations that might work?
LaTeXML produces an XML file ten times bigger than the pandoc-jats output, and my local Lens reader never finishes loading it.
Tralics by the INRIA people does not seem to work out of the box.
The arXMLiv project apparently attempts to convert all of arXiv to machine-readable content. I just contacted one of the project members to see what the current project status is. Since the main PhD student of this project is now working on LaTeXML, I guess it's worth looking into that.
Observing that the mass conversion of LaTeX to XML is something people spend PhDs on, we might want to consider alternatives in case we can't come up with a conversion pipeline that works well for ScienceFair.
I would suggest valuing good coverage of the scientific literature out there over covering only those articles that are available in JATS. What do you think about this, and would you be open to also incorporating, at least as a start, PDF/PostScript versions of publications, displayed with a PDF reader (like pdf.js) integrated into ScienceFair?
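To make the fallback idea concrete, here is a minimal sketch of how a reader could pick a display component per article. Everything in it is an assumption for illustration: the renderer names, the format labels, and the `choose_renderer` function are hypothetical and not part of ScienceFair. It only shows the preference order suggested above, with JATS first and PDF/PostScript as a fallback.

```python
from typing import Iterable, Optional

# Hypothetical renderer names; ScienceFair's real components may differ.
RENDERER_BY_FORMAT = {
    "jats": "lens",   # semantic XML, shown in a Lens-style reader
    "pdf": "pdfjs",   # fallback: rendered by a PDF viewer such as pdf.js
    "ps": "pdfjs",    # assumes PostScript is converted to PDF beforehand
}

# Preference order: JATS stays first-class, PDF/PostScript are fallbacks.
PREFERENCE = ("jats", "pdf", "ps")


def choose_renderer(available_formats: Iterable[str]) -> Optional[str]:
    """Return the renderer for the most preferred available format, or None."""
    available = {fmt.lower() for fmt in available_formats}
    for fmt in PREFERENCE:
        if fmt in available:
            return RENDERER_BY_FORMAT[fmt]
    return None
```

With this ordering, an article that has both a JATS and a PDF version would still open in the JATS reader, and PDF-only articles would remain viewable instead of being excluded.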
I hope this fits in the strategy repo rather than the specific sciencefair-land/sciencefair-datasource-arxiv.