RDFa ordering not preserved on duplicated properties #116

Closed
ivanprado opened this issue Jun 7, 2019 · 7 comments · Fixed by #139

@ivanprado
Contributor

When a property is repeated (e.g. on a page with multiple images annotated as og:image), the RDFa extractor returns the values as a list but does not preserve their order. Preserving order matters because the first image is usually the most important one. An example of a page where this happens:

https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/

It seems difficult to solve in extruct, as the problem appears to be in the PyRdfa library; it even happens in the online service: https://www.w3.org/2012/pyRdfa/Overview.html#distill_by_uri+with_options

Related to #115 (I created an xfail test for that in this PR)

@kmike
Member

kmike commented Jun 8, 2019

fwiw, extruct is using https://github.com/RDFLib/rdflib, not pyRdfa. This could be related (not 100% sure): RDFLib/rdflib#538.

@croqaz
Member

croqaz commented Jul 16, 2019

What would be the implications of using a SET instead of a LIST for the RDFa values?
E.g. the values would be converted to sets of frozen dicts:

"http://ogp.me/ns#image": [
                {
                    "@value": "http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg"
                },
                {
                    "@value": "http://images.sk-static.com/SECONDARY_IMAGE.jpg"
                }
            ],

@ivanprado
Contributor Author

@croqaz the order is important for values. Pages often annotate several images, and the most important one (the one with the highest resolution or the most relevant) usually comes first. If a set is used, the order is lost. Instead, it would be better to get consistent ordering by returning values in the order they appear in the page. Unfortunately, that's not the case for RDFa.
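To illustrate the point about sets: if the goal of a set were only to drop repeated values, the same effect can be had without giving up order. The snippet below is a minimal sketch, not extruct code; `dedupe_preserving_order` is a hypothetical helper and the URLs are placeholders.

```python
def dedupe_preserving_order(values):
    """Drop repeated JSON-LD value objects while keeping first-seen order."""
    seen = set()
    ordered = []
    for value in values:
        # A dict is not hashable, so use a frozenset of its items as the key.
        key = frozenset(value.items())
        if key not in seen:
            seen.add(key)
            ordered.append(value)
    return ordered


# Placeholder data shaped like the og:image values discussed above.
images = [
    {"@value": "http://example.com/primary.jpg"},
    {"@value": "http://example.com/secondary.jpg"},
    {"@value": "http://example.com/primary.jpg"},
]
unique_images = dedupe_preserving_order(images)
```

A plain `set` of frozen dicts would also remove the repeat, but its iteration order is not tied to the page, so the "first image" could no longer be identified.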

@croqaz
Member

croqaz commented Jul 16, 2019

Thank you for the explanation @ivanprado 👍

@adityas114
Contributor

Hi, I would like to work on this. It would be great if you can assign me the issue.

I am thinking that, once the parsed output is returned from rdflib in rdfa.py, we can use the document object to fix the order in extruct itself. I have started working on it and can raise a pull request soon.
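A rough sketch of that idea, for readers following along: collect the values of each property in page order, then sort the parser's output to match. This is an illustration under assumptions, not the code from the eventual PR. It uses the standard library `html.parser` rather than the document object extruct works with, it matches on the raw `property` attribute without expanding prefixes such as `og:`, and `PropertyOrderCollector` and `reorder_values` are hypothetical names.

```python
from html.parser import HTMLParser


class PropertyOrderCollector(HTMLParser):
    """Record `content` values of elements with a `property` attribute, in page order."""

    def __init__(self):
        super().__init__()
        self.order = {}  # property name -> list of content values as they appear

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop, content = attrs.get("property"), attrs.get("content")
        if prop is not None and content is not None:
            self.order.setdefault(prop, []).append(content)


def reorder_values(values, page_order):
    """Sort JSON-LD value objects by where their @value first appears in the page."""
    rank = {}
    for index, content in enumerate(page_order):
        rank.setdefault(content, index)
    # Values missing from the page go last; sorted() is stable, so they keep
    # their relative order.
    return sorted(values, key=lambda v: rank.get(v.get("@value"), len(page_order)))


# Placeholder page with two og:image annotations.
html = """<html><head>
<meta property="og:image" content="http://example.com/first.jpg">
<meta property="og:image" content="http://example.com/second.jpg">
</head></html>"""

collector = PropertyOrderCollector()
collector.feed(html)

# Stand-in for parser output that came back in the wrong order.
parsed = [
    {"@value": "http://example.com/second.jpg"},
    {"@value": "http://example.com/first.jpg"},
]
fixed = reorder_values(parsed, collector.order["og:image"])
```

A real fix would also need to map the expanded property IRI (e.g. `http://ogp.me/ns#image`) back to the prefixed attribute in the page, which this sketch leaves out.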

@ivanprado
Contributor Author

Hi @adityas114. It would be great if you could work on it 👍. To contribute, just create a PR with the fix. There is no need to assign the issue.

adityas114 added a commit to adityas114/extruct that referenced this issue May 26, 2020
I fix the order in the json-ld string returned by rdflib by checking the correct order in the document object.
adityas114 added a commit to adityas114/extruct that referenced this issue May 26, 2020
The test case 'test_rdfa_not_preserving_order' should not call extruct.extract for only RDFa, but for all formats, since the expected json string also has all formats.
adityas114 added a commit to adityas114/extruct that referenced this issue May 26, 2020
Removed 'xfail' tag from 'test_rdfa_not_preserving_order' test case as issue scrapinghub#116 is fixed.
@adityas114
Contributor

@ivanprado I have created a PR (#139) for this issue. Please have a look and let me know if it requires any changes.

adityas114 added a commit to adityas114/extruct that referenced this issue May 27, 2020
adityas114 added a commit to adityas114/extruct that referenced this issue May 27, 2020
lopuhin added a commit that referenced this issue Jul 9, 2020