Spider that crawls pages through ajax "click" #257
Here is the script that I prepared to make it easier for you to get started,
Hello @suntong. The ... Alternatively, this might be a stupid question, but isn't the GitHub API able to give you the information you need? I expect them to have pretty good scraping and crawling defenses.
Thank you Guillaume for your reply.
I thought so too, but after failing to get it and double-checking that I had done nothing wrong, I wrote to GitHub, and here is what they replied:
As for,
That'd be really appreciated. I've confirmed that Scrapinghub Portia isn't able to handle it, even with the JavaScript feature turned on. Now I've confirmed that ... Don't worry about time, as long as you keep it in mind and find some time to look into it later.
Can I ask you what you are spoofing when performing your HTTP calls?
Sorry, I don't quite understand the question -- are you talking about Scrapinghub, or artoo, or doing it in the browser, or ...?
Using any tool, really. What are the headers you send? Do you handle cookies, session etc.?
No, I didn't do anything special, other than visiting them in the browser.
So you either need to watch the HTTP requests being made so that you can replicate them as closely as possible, or you can use artoo to build an automaton that will:
The first parameter of artoo.ajaxSpider is,
This means that artoo's ajaxSpider only follows URLs that are either given in advance or computed in some way.
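To illustrate what "computed" URLs could look like, here is a minimal sketch of a URL iterator: a function that receives the request index and the previously retrieved data, and returns the next URL or `false` to stop. The helper name `makePageIterator`, the `p` page parameter, the zero-based index, and the `example.com` URL are all assumptions made for illustration; the sketch does not call artoo itself, so check artoo's documentation for the exact iterator contract before plugging it in.

```javascript
// Hypothetical helper (not part of artoo): builds an iterator with the
// shape (index, previousData) => url | false.
function makePageIterator(baseUrl, maxPages) {
  return function (i, previousData) {
    // Stop once the page budget is exhausted.
    if (i >= maxPages) return false;

    // Compute the URL of page i + 1 by setting a page query parameter.
    // The parameter name 'p' is an assumption; adapt it to the target site.
    const url = new URL(baseUrl);
    url.searchParams.set('p', String(i + 1));
    return url.toString();
  };
}

// Usage: drive the iterator by hand to see which URLs it would produce.
const iterator = makePageIterator('https://example.com/search?q=goa&type=Code', 3);
const urls = [];
for (let i = 0, next; (next = iterator(i, null)) !== false; i++) {
  urls.push(next);
}
console.log(urls);
```

This only covers the case where the next URL can be derived from a counter. It does not help when the URL is only valid inside a live browser session.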
However, is it possible for artoo's ajaxSpider to,
The reason I'm asking is that GitHub code search only works in the browser. Nothing else works. That is, if you click or paste the following URL, you will only get the "We could not perform this search" error.
https://github.com/search?utf8=%E2%9C%93&q=%22github.heygears.com%2Fgoadesign%2Fgoa%2Fdesign%2Fapidsl%22+language%3Ago&type=Code&ref=searchresults
However, if you do github code search in browser, searching for
Then click the second choice on the left, "Code", and you will get "We’ve found 294 code results" and a list of all the hits. If you compare the URLs, you will find that this "working" URL is exactly the same as the one above. Try different search terms, and try pasting the "working" URL into a new browser window five minutes later: you will find that it no longer works.
This is why I need artoo's ajaxSpider to click and follow that "Next" url.
Is this possible? Thanks!
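To make the "follow the Next link" idea concrete, here is a rough sketch of the step that would have to run on each retrieved page: find the "Next" anchor in the HTML and resolve its `href` against the current page URL. The helper name `extractNextUrl`, the assumption that the pagination link carries `rel="next"`, and the sample markup are all hypothetical. Inside a browser, a DOM or jQuery selector would be more robust than the regular expressions used here.

```javascript
// Hypothetical helper: returns the absolute URL of the "Next" page,
// or false when the page has no such link.
function extractNextUrl(html, pageUrl) {
  // Collect every opening <a ...> tag in the markup.
  const anchors = html.match(/<a\b[^>]*>/gi) || [];

  for (const tag of anchors) {
    // Assumption: the pagination link is marked with rel="next".
    if (!/\brel\s*=\s*["']next["']/i.test(tag)) continue;

    const href = tag.match(/\bhref\s*=\s*["']([^"']+)["']/i);
    if (!href) continue;

    // Decode HTML-escaped ampersands, then resolve relative to the page URL.
    return new URL(href[1].replace(/&amp;/g, '&'), pageUrl).toString();
  }
  return false;
}

// Usage with made-up markup:
const sampleHtml =
  '<div class="pagination">' +
  '<a class="next_page" rel="next" href="/search?p=2&amp;q=goa&amp;type=Code">Next</a>' +
  '</div>';
const nextUrl = extractNextUrl(sampleHtml, 'https://example.com/search?q=goa&type=Code');
console.log(nextUrl);
```

A function of this shape could serve as the body of a spider's URL iterator, but it only finds the link. Whether requesting that URL succeeds still depends on the session state described above.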