RDFa ordering not preserved on duplicated properties #116
Comments
fwiw, extruct is using https://github.com/RDFLib/rdflib, not pyRdfa. This could be related (not 100% sure): RDFLib/rdflib#538.
What would be the implications of using a SET instead of a LIST for the RDFa values?
@croqaz the order is important for values. Pages often annotate several images, and the most important one (the one with the highest resolution, or the most relevant) usually comes first. If a set is used, the order is lost. Instead, it would be better to get consistent ordering by returning values in the order they appear in the page. That's unfortunately not the case for RDFa.
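To illustrate the point about sets, here is a minimal sketch in plain Python (the file names are made up for the example and are not taken from any real page). A `set` gives no guarantee about iteration order, whereas `dict.fromkeys` removes duplicates while keeping the first occurrence of each value in its original position:

```python
# Hypothetical values, listed in the order they appear in a page.
values = ["large.jpg", "medium.jpg", "large.jpg", "small.jpg"]

# A set removes duplicates but does not promise any particular order.
as_set = set(values)

# dict keys keep insertion order (Python 3.7+), so this removes
# duplicates while preserving the order of first appearance.
unique_in_order = list(dict.fromkeys(values))

print(unique_in_order)
```

So duplicates can be dropped without giving up the "first value is the most important" convention.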
Thank you for the explanation @ivanprado 👍
Hi, I would like to work on this. It would be great if you could assign me the issue. My idea is that once the parsed output is returned from rdflib in rdfa.py, we can use the document object to fix the order in extruct itself. I have started working on it and can raise a pull request soon.
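A rough sketch of that idea, for illustration only: this is not the code from the eventual PR, and it does not use rdflib or extruct. It assumes the values are carried in `content` attributes of elements with a `property` attribute (as with `og:image` meta tags), and the names `PropertyOrderCollector` and `reorder_values`, as well as the example URLs, are hypothetical. It uses only the standard library's `html.parser` to find the document order, then sorts the unordered values by it:

```python
from html.parser import HTMLParser


class PropertyOrderCollector(HTMLParser):
    """Record (property, content) pairs in the order they appear in the page."""

    def __init__(self):
        super().__init__()
        self.seen = []  # (property, content) tuples in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs and "content" in attrs:
            self.seen.append((attrs["property"], attrs["content"]))


def reorder_values(html, prop, values):
    """Sort `values` by the first appearance of each value for `prop` in `html`."""
    collector = PropertyOrderCollector()
    collector.feed(html)
    position = {}
    for index, (name, content) in enumerate(collector.seen):
        if name == prop:
            position.setdefault(content, index)
    # sorted() is stable, so values not found in the page keep their
    # relative order and end up after all the values that were found.
    return sorted(values, key=lambda v: position.get(v, len(collector.seen)))


html = """
<html><head>
<meta property="og:image" content="https://example.com/large.jpg">
<meta property="og:image" content="https://example.com/medium.jpg">
<meta property="og:image" content="https://example.com/small.jpg">
</head></html>
"""

# Simulates a parser returning the repeated property in arbitrary order.
shuffled = [
    "https://example.com/small.jpg",
    "https://example.com/large.jpg",
    "https://example.com/medium.jpg",
]
ordered = reorder_values(html, "og:image", shuffled)
print(ordered)
```

A real fix would also need to handle values taken from `href`/`src` attributes or element text, which this sketch ignores.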
Hi @adityas114. Nice if you can work on it 👍. In order to contribute, just create a PR with the fix. There is no need to assign the issue.
I fix the order in the json-ld string returned by rdflib by checking the correct order in the document object.
The test case 'test_rdfa_not_preserving_order' should not call extruct.extract for only RDFa, but for all formats, since the expected json string also has all formats.
Removed 'xfail' tag from 'test_rdfa_not_preserving_order' test case as issue scrapinghub#116 is fixed.
@ivanprado I have created a PR (#139) for this issue. Please have a look and let me know if it requires any changes.
Updated to test issue scrapinghub#116
When a property is repeated (e.g. on a page with multiple images annotated as og:image), RDFa returns it as a list but does not preserve the order. Preserving order is important because the first image is usually the most important one. An example of a page where this happens: https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/
It seems difficult to solve in extruct, as the problem appears to be in the PyRdfa library; it even happens in the online service: https://www.w3.org/2012/pyRdfa/Overview.html#distill_by_uri+with_options
Related to #115 (I created an xfail test for that in this PR).