Extract hyperlinks from wayback machine #501

Closed
yxzhu16 opened this issue Oct 5, 2020 · 3 comments

Comments

@yxzhu16
Contributor

yxzhu16 commented Oct 5, 2020

Describe the bug
When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, a page captured by the Wayback Machine, only a few links show up and most of the useful links are missing.

To Reproduce
Steps to reproduce the behavior (e.g.):

  1. Load a WARC crawled from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin
  2. Extract links
  3. Not all of the hyperlinks are showing up

Expected behavior
Around 100 hyperlinks should show up, and should at least include https://www.bbc.com/pidgin/world-54001522.

Screenshots
[Screenshot: Screen Shot 2020-10-05 at 12 54 56 PM]

Environment information

  • AUT version: 0.80.1-SNAPSHOT
  • OS: MacOS 10.15.6
  • Java version: Java 11
  • Apache Spark version: 3.0.1
  • Apache Spark w/aut: --jars
  • Apache Spark command used to run AUT: run with jupyter notebook
@ianmilligan1
Member

Thanks for the issue @yxzhu16.

I looked at the source of the page, and here's the HTML where the missing link comes from:

<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>

Is it possible that our use of jsoup in ExtractLinks isn't picking up those rewritten links because they're non-standard?

@schmika

schmika commented Oct 6, 2020

Hi,
I recently came across the same issue, and I think it's because the link uses a relative URL rather than an absolute one.
In the AUT Scala code, ExtractLinks accepts 3 parameters:

* @param src the src link
* @param html the content from which links are to be extracted
* @param base an optional base URI

The base URI is required to resolve relative URLs using link.attr("abs:href"). So I think you have to specify a base URI to be able to extract all links.
At the moment, however, the Python UDF extract_links only expects 2 parameters, if I understand the code correctly. It may be necessary to adapt the Python UDF to include the base parameter.
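
To illustrate why the base URI matters, here is a minimal sketch using only the Python standard library. It is not AUT's `ExtractLinks` and it does not call jsoup; the names `LinkCollector`, `extract_links_sketch`, `PAGE_URL` and `SNIPPET` are hypothetical. It imitates the behaviour described above by dropping any `href` that cannot be resolved to an absolute URL, and it uses the anchor tag quoted earlier in this thread as input.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# The archived page from this issue, and the anchor tag quoted earlier in the thread.
PAGE_URL = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"
SNIPPET = (
    '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" '
    'class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">'
    "Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>"
)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <a href> values resolved against base_uri."""

    def __init__(self, base_uri=""):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve the href against the base URI; with an empty base,
        # a relative href stays relative.
        resolved = urljoin(self.base_uri, href)
        parsed = urlparse(resolved)
        # Assumption: mimic an "absolute URLs only" extractor by skipping
        # anything that still lacks a scheme and host after resolution.
        if parsed.scheme and parsed.netloc:
            self.links.append(resolved)


def extract_links_sketch(html, base_uri=""):
    """Return absolute link targets found in html (illustrative only)."""
    collector = LinkCollector(base_uri)
    collector.feed(html)
    return collector.links


without_base = extract_links_sketch(SNIPPET)          # no base URI supplied
with_base = extract_links_sketch(SNIPPET, PAGE_URL)   # page URL used as base URI
```

In this sketch `without_base` comes back empty because the relative link cannot be made absolute, while `with_base` contains the full Wayback Machine URL for the article. That matches the symptom reported in this issue.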

@ianmilligan1
Member

Fantastic stuff, @yxzhu16 – thanks so much for the pull request (and for the info on this too @schmika – much appreciated).

ruebot added a commit that referenced this issue Jan 18, 2021
Set baseUri to `src` instead of `base` when extracting links, and delete the `base` parameter.

The issue occurred because relative links cannot be resolved by `link.attr("abs:href")` when baseUri is not set.
Looking through the code, the `base` parameter is never provided anywhere `ExtractLinks` is called, so the default value `""` is always used and baseUri is never set. However, a `baseUri` is required to extract relative links.

* resolves #501 
* update tests
* remove unnecessary test results comment

Co-authored-by: Kai Zhong <kaizhchn@hotmail.com>
Co-authored-by: nruest <ruestn@gmail.com>