Extract hyperlinks from wayback machine #501
Comments
Thanks for the issue @yxzhu16. I looked at the source of the page, and here's the HTML where the missing link comes from: <a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a> Is it possible our |
Hi,
The base URI is required to resolve relative URLs using |
Set baseUri to `src` instead of `base` when extracting links, and delete the `base` parameter.

The issue occurred because relative links cannot be extracted by `link.attr("abs:href")` when baseUri is not set. Looking through the code, the `base` parameter is never provided anywhere `ExtractLinks` is called, so the default value `""` is always used and baseUri is never set. However, `baseUri` is required to extract relative links.

* resolves #501
* update tests
* remove unnecessary test results comment

Co-authored-by: Kai Zhong <kaizhchn@hotmail.com>
Co-authored-by: nruest <ruestn@gmail.com>
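To illustrate why the base URI matters, here is a minimal sketch in Python using only the standard library. It is not the toolkit's code: `LinkCollector`, `extract_links`, and the `use_base` flag are hypothetical names made up for this example, and `urllib.parse.urljoin` stands in for the resolution that `abs:href` performs. To mirror the behaviour described in this thread, the sketch drops any link that cannot be resolved to an absolute URL, which is an assumption about how the missing links were lost.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from <a href> tags."""

    def __init__(self, base_uri=""):
        super().__init__()
        self.base_uri = base_uri
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Resolve the href against the base URI (empty base = no resolution).
            absolute = urljoin(self.base_uri, self._href)
            # Keep only links that ended up absolute; a relative href with
            # no base URI is dropped, as in the bug reported here.
            if urlparse(absolute).scheme:
                self.links.append((absolute, "".join(self._text).strip()))
            self._href = None


def extract_links(src, html, use_base=True):
    """Hypothetical helper: use `src` as the base URI when use_base is True."""
    collector = LinkCollector(src if use_base else "")
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    src = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"
    html = '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522">Five Tyler Perry movies</a>'
    print(extract_links(src, html, use_base=False))  # relative link is lost
    print(extract_links(src, html, use_base=True))   # link resolved against src
```

With `use_base=False` the relative Wayback link is discarded, while passing the page's own URL as the base lets it resolve to an absolute URL, which is the change the commit above describes.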
Describe the bug
When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, a Wayback Machine capture, only a few links show up and most of the useful links are missing.
To Reproduce
Steps to reproduce the behavior (e.g.):
Expected behavior
Around 100 hyperlinks should show up, and should at least include https://www.bbc.com/pidgin/world-54001522.
Screenshots
Environment information