Spider that crawls pages through ajax "click" #257

Open
suntong opened this issue Jan 2, 2017 · 8 comments

Comments

@suntong

suntong commented Jan 2, 2017

The first parameter of artoo.ajaxSpider is,

urlList array | function : the list of urls to request through ajax or, alternatively, a function taking as arguments the index of the iteration and the data of the last request, and returning either the desired url or false to break the spider.

This means that artoo's ajaxSpider only follows urls that are either given in advance or calculated.
However, is it possible for artoo's ajaxSpider to,

  1. Find the "Next"-page url from the first page, then
  2. "Click" and follow that url onto the following pages?
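
As an illustration of the function form of urlList quoted above, here is a minimal sketch that calculates each page's url from the iteration index. It only covers the "calculated" case, not the "click" case asked about. The `p` query parameter, the helper names `pageUrl` and `makeUrlList`, the five-page cutoff, and the index starting at 0 are all assumptions for illustration, not something confirmed in this thread.

```javascript
// Hypothetical helper: build the url of result page `page` (1-based).
// The `p` query parameter is an assumption about how the site paginates.
function pageUrl(baseUrl, page) {
  var sep = baseUrl.indexOf('?') === -1 ? '?' : '&';
  return baseUrl + sep + 'p=' + page;
}

// urlList in function form: takes the iteration index and returns
// either the next url or false to break the spider.
// Assumes the index starts at 0.
function makeUrlList(baseUrl, maxPages) {
  return function(i) {
    return i < maxPages ? pageUrl(baseUrl, i + 1) : false;
  };
}

// Only runs where artoo is loaded (e.g. through the bookmarklet).
if (typeof artoo !== 'undefined') {
  artoo.ajaxSpider(makeUrlList('https://example.com/search?q=x', 5), {
    throttle: 500,
    done: function(data) {
      artoo.log.debug('Finished retrieving data.');
    }
  });
}
```

Returning false from the function is what stops the spider, so the page limit lives in the url function here rather than in the `limit` option.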

The reason I'm asking is that github code search only works in the browser. Nothing else works. I.e., if you click/paste the following url, you will only get the "We could not perform this search" error.

https://github.com/search?utf8=%E2%9C%93&q=%22github.com%2Fgoadesign%2Fgoa%2Fdesign%2Fapidsl%22+language%3Ago&type=Code&ref=searchresults

However, if you do github code search in browser, searching for

"github.com/goadesign/goa/design/apidsl" language:go

Then click on the 2nd choice on the left, "Code", and you will get "We’ve found 294 code results" and a list of all the hits. If you compare the urls, you will find that this "working" url is exactly the same as the one above. Try different search terms, and try pasting the "working" url into a new browser window five minutes later; you will find that the "working" url no longer works.

This is why I need artoo's ajaxSpider to click and follow that "Next" url.
Is this possible? Thanks!

@suntong
Author

suntong commented Jan 2, 2017

Here is the script that I prepared to make it easier for you to get started,

var definition = {
  iterator: 'div.code-list > div.code-list-item',
  data: {
    FullName: {sel: 'p > a:nth-child(1)'}, // extract Text!
    FileName: {sel: 'p > a:nth-child(2)'}, // extract Text!
    UpdatedAt: {sel: 'span.updated-at > relative-time', attr: 'datetime'},
    Language: {sel: 'span.language'} // extract Text!
  }
};

artoo.ajaxSpider(
  function(i) {
    // click and follow url of "div.pagination a.next_page" -> "href"
  }, {
    scrape: definition,
    limit: 9,
    concat: true,
    throttle: 500,
    done: function(data) {
      artoo.log.debug('Finished retrieving data. Downloading...');
      artoo.saveCsv(data);
    }
  }
);

@Yomguithereal
Member

Hello @suntong. The ajaxSpider can only use ajax and therefore retrieves only static HTML. What you need is either to find the correct way to query github (by spoofing the user-agent, etc., to make the site believe you are a regular user), or to use more complex solutions such as PhantomJS, or to build a Chrome extension that acts as an automaton. You can alternatively try sandcrawler to do so. I can try to help you but unfortunately cannot do so before next week.

Alternatively, this might be a stupid question but isn't the Github API able to give you the information you need? I expect them to have pretty good scraping & crawling defenses.

@suntong
Author

suntong commented Jan 5, 2017

Thank you Guillaume for your reply.

isn't the Github API able to give you the information you need?

I thought so, but having failed to get it, and having double-checked that I did nothing wrong, I asked Github, and here is what they replied:

(now) it's not possible to perform global code searches via the API as mentioned in this blog post:

https://developer.github.com/changes/2013-10-18-new-code-search-requirements/

We'd like to allow global code searches via the API in the future, but I can't promise when it will happen. So in the meantime, code search must be scoped to a user, organization, or repository e.g. scoping to user:james2doyle:

As for,

You can alternatively try sandcrawler to do so. I can try to help you but cannot do so before next week unfortunately.

That'd be really appreciated. I've confirmed that Scrapinghub Portia isn't able to handle it, even with the JavaScript feature turned on. Now I've confirmed that artoo can't handle it either. I.e., I've exhausted all the tools I know. If I can't make sandcrawler work, I won't be sure whether that is my own limitation or whether it is simply impossible. I'll be inclined to believe the latter. So if you can draw a conclusion that it is impossible, then it'd be the last nail I need.

Don't worry about time. As long as you keep it in mind and find some time to look into it later.
Thanks
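
For reference, the scoped search that Github's reply describes could be built along these lines. This is only a sketch: the helper name `scopedCodeSearchUrl` and the sample arguments are hypothetical, and it only builds the request url for the API's code search endpoint without sending anything or dealing with authentication.

```javascript
// Hypothetical helper: build a code-search url for the GitHub API,
// scoped to a single user as the reply above requires.
function scopedCodeSearchUrl(term, language, user) {
  // Same query syntax as the web search box, plus the user: qualifier.
  var q = '"' + term + '" language:' + language + ' user:' + user;
  return 'https://api.github.com/search/code?q=' + encodeURIComponent(q);
}

// Sample call; 'someone' is a placeholder account name.
var exampleUrl = scopedCodeSearchUrl(
  'github.com/goadesign/goa/design/apidsl', 'go', 'someone'
);
```

Because every query has to name a user, organization, or repository, this cannot replace the global search done in the browser; it would have to be repeated once per scope.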

@Yomguithereal
Member

Can I ask you what you are spoofing when performing your HTTP calls?

@suntong
Author

suntong commented Jan 5, 2017

Sorry, I don't quite understand the question -- are you talking about Scrapinghub, or artoo, or searching in the browser, or ...?

  • When searching in the browser, I didn't do anything special.
  • When visiting the url, I just clicked on the above link in the browser.

@Yomguithereal
Member

Using any tool really. What are the headers you send? Do you handle cookies, session etc.?

@suntong
Author

suntong commented Jan 8, 2017

No I didn't do anything special, other than visiting them in the browser.

@Yomguithereal
Member

So you either need to watch the HTTP queries made so that you can replicate them as closely as you can, or you can use artoo to build an automaton that will:

  1. Click on the next button
  2. Wait for the results to be rendered
  3. Collect the data
  4. Loop until finished
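
The four steps above can be sketched as a small loop run from the browser console. This is a rough outline under stated assumptions, not a tested solution for github: the function name `crawl` and its option names are hypothetical, the selectors are the ones from the script earlier in this thread, the fixed 2-second wait is a crude stand-in for real render detection, and it assumes that clicking "Next" updates the page in place rather than reloading it (a full reload would stop the script).

```javascript
// Generic click / wait / collect loop. The callbacks are injected so that
// the loop itself does not depend on any particular page.
function crawl(options) {
  var rows = [];
  function step(page) {
    // 3. Collect the data that is currently rendered.
    rows = rows.concat(options.collect());
    // 4. Finish at the page limit or when no "Next" button is left.
    //    Otherwise clickNext() performs step 1 and returns true.
    if (page >= options.maxPages || !options.clickNext()) {
      return Promise.resolve(rows);
    }
    // 2. Wait for the results to be rendered, then loop.
    return options.waitForRender().then(function() {
      return step(page + 1);
    });
  }
  return step(1);
}

// Browser-only wiring; skipped where there is no DOM.
if (typeof document !== 'undefined') {
  crawl({
    maxPages: 9,
    collect: function() {
      var items = document.querySelectorAll('div.code-list > div.code-list-item');
      return Array.prototype.map.call(items, function(item) {
        return item.textContent.trim();
      });
    },
    clickNext: function() {
      var next = document.querySelector('div.pagination a.next_page');
      if (!next) return false;
      next.click();
      return true;
    },
    waitForRender: function() {
      return new Promise(function(resolve) { setTimeout(resolve, 2000); });
    }
  }).then(function(rows) {
    console.log(rows.length + ' items collected');
  });
}
```

The `collect` callback is where a scraping call with the definition object from earlier in the thread would go; plain DOM calls are used here only to keep the sketch free of assumptions about artoo's API.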
