Fix for scrapinghub#116 #139
I fixed the order of the JSON-LD string returned by rdflib by checking it against the correct order in the document object.
The test case 'test_rdfa_not_preserving_order' should not call extruct.extract for only RDFa, but for all formats, since the expected json string also has all formats.
Removed the 'xfail' tag from the 'test_rdfa_not_preserving_order' test case, as issue scrapinghub#116 is fixed.
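A minimal sketch of the idea described above, not the code in this PR: when a graph library returns the values of a repeated property in arbitrary order, they can be re-sorted by where they first appear in the parsed document. The helper name `sort_by_document_order`, the use of the standard library's `xml.etree.ElementTree` (extruct itself works on lxml trees), and the sample markup are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def sort_by_document_order(values, root, prop):
    """Return `values` reordered to match the order in which elements
    carrying property=`prop` appear in the document.

    Hypothetical helper: only the order changes, never the content.
    Values not found in the document are placed last, in their original
    relative order.
    """
    # Collect the `content` attribute of matching elements in document order.
    in_doc = [
        el.get("content")
        for el in root.iter()
        if el.get("property") == prop
    ]
    position = {}
    for idx, content in enumerate(in_doc):
        position.setdefault(content, idx)  # first occurrence wins
    # sorted() is stable, so unknown values keep their relative order.
    return sorted(values, key=lambda v: position.get(v, len(in_doc)))


# Assumed sample markup, well-formed so the stdlib XML parser accepts it.
html = """
<html><head>
  <meta property="og:image" content="a.png"/>
  <meta property="og:image" content="b.png"/>
  <meta property="og:image" content="c.png"/>
</head></html>
""".strip()

root = ET.fromstring(html)
shuffled = ["c.png", "a.png", "b.png"]  # e.g. the order a graph library returned
ordered = sort_by_document_order(shuffled, root, "og:image")
```

Because the sort key only looks up positions, the extracted values themselves are never modified.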
Codecov Report
@@            Coverage Diff             @@
##           master     #139      +/-   ##
==========================================
+ Coverage   88.06%   89.15%   +1.09%
==========================================
  Files          11       12       +1
  Lines         486      535      +49
  Branches      108      121      +13
==========================================
+ Hits          428      477      +49
  Misses         52       52
  Partials        6        6
Thank you @adityas114 for your contribution! I'll have a look in the following days. I already had a quick look and have some comments:
#135 is also related.
Thank you @ivanprado for your comments.
Let me know if there is any more feedback 😄
Open Graph output is not correct after updating the 'elysianfields.html' file.
Updated to test issue scrapinghub#116
Optimised performance of 'fixOrder' function by using xpath.
Fixed minor bugs
I have added tests to improve code coverage, and I have used XPath to improve efficiency. For testing, I edited 'elysianfields.html'; however, after that change two other tests fail. From what I understand, there seems to be an issue with the implementation of the Open Graph protocol. I have added xfail tags to those two tests. Let me know what you think 😄
Thanks @adityas114 for the changes! After a first review, I think the overall approach makes sense 👍. The principles should be:
- Only alter the final order where possible, never the content. (I think this is the approach followed.)
- If the ordering process fails, fall back to the original content. (This is also covered, but I would prefer to have the new code wrapped in a try/except block that falls back to the default in case of error.)
- Performance should not be considerably affected. I did some tests and this seems to be the case, but I would prefer that you also run some tests: process some pages n times with and without the fix and compare the times.
Also, I left some comments.
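The fallback principle in the list above can be sketched as follows. This is an illustration under assumptions, not the code that was merged: `reorder_or_fallback` and the `reorder` callable are hypothetical names, and the broad `except Exception` simply mirrors the request to fall back to the default output on any error.

```python
import logging

logger = logging.getLogger(__name__)


def reorder_or_fallback(extracted, reorder):
    """Apply `reorder` to `extracted`; on any failure, log the error and
    return `extracted` unchanged so the original content is never lost.

    Hypothetical wrapper, named for illustration only.
    """
    try:
        return reorder(extracted)
    except Exception:
        logger.exception("Reordering failed; returning the original output")
        return extracted


def broken_reorder(_data):
    # Stand-in for an ordering step that hits an unexpected document shape.
    raise ValueError("unexpected structure")


original = [{"og:image": ["c.png", "a.png"]}]
result = reorder_or_fallback(original, broken_reorder)  # falls back to `original`
```

With this shape, a bug in the ordering step degrades to the pre-fix behaviour instead of breaking extraction.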
Something also remaining:
Hi @adityas114. Firstly, thank you again for collaborating on this fix. It is much appreciated 👍 I left some additional comments.
Copy of the original file
"test_rdfa_not_preserving_order" uses the "elysianfields_1.html" file now
Removing xfail tag from "test_uopengraph"
Add bad property to test for corner case
Optimized sort, fix for corner case (bad properties)
Hi @ivanprado. I added some changes, please check them out. I believe I have incorporated all your suggestions; let me know if I missed something 😄 I also ran a test to compare performance: I called the extract method 100 times with the 'elysianfields_1.html' file as the parameter. Without the patch it ran in 13.070286 s, and with the patch it ran in 13.093408 s, a difference of about 0.02 s (roughly 0.2%), so performance is only marginally affected. Let me know if you have more suggestions, I'd be happy to help 👍 😄
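A comparison like the one described above could be scripted roughly as follows. This is a sketch, not the script that produced the reported numbers: `time_calls` and `relative_overhead` are hypothetical helpers, and the function under test is passed in rather than assumed to be any particular extruct call.

```python
import time


def time_calls(func, arg, n=100):
    """Return the wall-clock seconds taken to call `func(arg)` `n` times."""
    start = time.perf_counter()
    for _ in range(n):
        func(arg)
    return time.perf_counter() - start


def relative_overhead(baseline_s, patched_s):
    """Fractional slowdown of the patched run relative to the baseline."""
    return (patched_s - baseline_s) / baseline_s


# Applying the second helper to the two timings reported in the comment above.
overhead = relative_overhead(13.070286, 13.093408)
print(f"relative overhead: {overhead:.2%}")
```

A single pair of runs is noisy, so repeating the measurement a few times and comparing the minimum or median of each set gives a steadier picture.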
@adityas114 thank you again for these changes, I really appreciate it. I think we are pretty close to having the PR ready to be merged; all my comments were addressed. I found a few new issues, which you can find in the comments.
I have made all the suggested changes. Please review 😄
@adityas114 It looks great 👏 ! Thank you very much for contributing 😄
I'll leave the issue open for a while in case somebody wants to review (/cc @Gallaecio @lopuhin) and then I'll merge it.
Thanks for the PR @adityas114 , approach looks good overall 👍 Left some questions/suggestions below.
Looks good and fixes an important problem, thanks @adityas114 👍
Fixes #116. I fixed the order in rdfa.py by checking the correct order in the document object. I also fixed test 'test_rdfa_not_preserving_order'.