Skip to content

Fix for scrapinghub#116 #139

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 28 commits into from
Jul 9, 2020
Merged

Fix for scrapinghub#116 #139

merged 28 commits into from
Jul 9, 2020

Conversation

adityas114
Copy link
Contributor

@adityas114 adityas114 commented May 26, 2020

Fixes #116. I fixed the order in rdfa.py by checking the correct order in the document object. I also fixed test 'test_rdfa_not_preserving_order'.

I fix the order in the json-ld string returned by rdflib by checking the correct order in the document object.
The test case 'test_rdfa_not_preserving_order' should not call extruct.extract for only RDFa, but for all formats, since the expected json string also has all formats.
Removed 'xfail' tag from 'tagtest_rdfa_not_preserving_order' test case as issue scrapinghub#116 is fixed.
I fix the order in the json-ld string returned by rdflib by checking the correct order in the document object.
@codecov
Copy link

codecov bot commented May 26, 2020

Codecov Report

Merging #139 into master will increase coverage by 1.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #139      +/-   ##
==========================================
+ Coverage   88.06%   89.15%   +1.09%     
==========================================
  Files          11       12       +1     
  Lines         486      535      +49     
  Branches      108      121      +13     
==========================================
+ Hits          428      477      +49     
  Misses         52       52              
  Partials        6        6              
Impacted Files Coverage Δ
extruct/rdfa.py 100.00% <100.00%> (ø)
extruct/__init__.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a64ce58...78c76a3. Read the comment docs.

@adityas114 adityas114 closed this May 26, 2020
@adityas114 adityas114 changed the title Fix for issue #116 Fix for scrapinghub#116 May 26, 2020
@adityas114 adityas114 reopened this May 26, 2020
@ivanprado
Copy link
Contributor

Thank you @adityas114 for your contribution! I'll have a look in the following days. I already have a quick look and have some comments:

  • I wonder if the approach is the correct. I mean, the proposed code is patching somehow the result from the source library (RDFlib or pyRDFa, not sure. See RDFa ordering not preserved on duplicated properties #116 (comment)). A better fix would be to fix the original library. But also seems that extucts pinned the rdflib version probably because newer versions introduces some incompatibilities. So then maybe patching is the right thing to do. Not sure. I'll see the code during the next days and probably I'll have it clearer.
  • I see that the code is iterating several times over the tree. I wonder how this is impacting the performance. Could you measure the impact on the performance of your changes? Also, wonder if speed could be improved if you use xpath expressions in some places instead of transversing the tree.
  • There are some parts of your code that remain untested. See the codecov report: https://codecov.io/gh/scrapinghub/extruct/pull/139/diff?src=pr&el=tree#diff-ZXh0cnVjdC9yZGZhLnB5. Tests covering them will be required.

@lopuhin
Copy link
Member

lopuhin commented May 27, 2020

But also seems that extucts pinned the rdflib version probably because newer versions introduces some incompatibilities. So then maybe patching is the right thing to do.

#135 is also related

@adityas114
Copy link
Contributor Author

Thank you @ivanprado for you comments.

  • I agree that ideally it should be fixed in the original library, but I believe that maybe we can patch it in extruct itself for the time being.
  • Yes xpath may improve performance, but I am not familiar with it so I didn't use it for now. I'll look into it.
  • Okay I'll add more tests where required.

Let me know if there is any more feedback 😄

Opengraph output is not correct when I updated the 'elysianfields.html' file.
Opengraph output is not correct when I updated the 'elysianfields.html' file.
Optimised performance of 'fixOrder' function by using xpath.
Fixed minor bugs
@adityas114
Copy link
Contributor Author

@ivanprado

I have added tests to improve code coverage, and I have used xpath to improve efficiency.

For testing, I edited 'elysianfields.html', however after changing two other tests fail. From what I understand, it seems that there is an issue with the implementation of open graph protocol. I have added xfail tags for those 2 tests.

Let me know what you think 😄

Copy link
Contributor

@ivanprado ivanprado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adityas114 for the changes! After a first review I think the overall approach makes sense to me 👍 . The principles should be:

  • Only altering the final order if possible, but never the content. (I think this is the followed approach)
  • In case of failure in the ordering process, fallback to the original content (this is also covered, but I would prefer to have the new code wrapped in a try/except section that fallbacks to default in case of error)
  • Performance should not be considerable affected. I did some tests and this seems to be case, but I would prefer if you also do some tests: run some pages n times with/without the fix and compare the times.

Also, I left some comments

@ivanprado
Copy link
Contributor

Something also remaining:

  • Documenting clearly that this is a hack to fix the ordering and that it should be disabled once PyRDFA fixes itself.
  • Adding more tests with corner cases (e.g. bad namespace)

@ivanprado
Copy link
Contributor

Hi @adityas114 .

Firstly, thank you again for collaborating with this fix. It is very appreciated 👍

I left some additional comments.

Copy of the original file
Copy of the original file
"test_rdfa_not_preserving_order" uses the "elysianfields_1.html" file now
Removing xfail tag from "test_uopengraph"
Add bad property to test for corner case
Optimized sort, fix for corner case (bad properties)
@adityas114
Copy link
Contributor Author

adityas114 commented Jun 17, 2020

Hi @ivanprado .

I added some changes, please check them out. I believe I have incorporated all your suggestions - let me know if I missed something 😄

Also, I did a test to compare the performance. I called the extract method 100 times, with the 'elysianfields_1.html' file as parameter. Without patch, it ran in 13.070286s, and with the patch it ran in 13.093408s. Clearly, the performance is only marginally affected.

Let me know if you have more suggestions, I'd be happy to help 👍 😄

Copy link
Contributor

@ivanprado ivanprado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adityas114 thank you again for these changes. I really appreciate. I think we are pretty close to having the PR ready to be merged, all my comments were addressed. I found a few new issues that you can find in the comments.

@adityas114
Copy link
Contributor Author

I have made all the suggested changes. Please review 😄

Copy link
Contributor

@ivanprado ivanprado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adityas114 It looks great 👏 ! Thank you very much for contributing 😄

I leave the issue open a while just in case somebody ones to review (/cc @Gallaecio @lopuhin ) and then I'll merge it.

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @adityas114 , approach looks good overall 👍 Left some questions/suggestions below.

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and fixes an important problem, thanks @adityas114 👍

@lopuhin lopuhin merged commit bd49a5f into scrapinghub:master Jul 9, 2020
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RDFa ordering not preserved on duplicated properties
3 participants