
This code crawls Wikipedia articles one at a time, following the first link on each page, until it encounters a duplicate.

husainshaikh895/wikipedia-web-crawler

wikipedia-web-crawler

This code crawls Wikipedia articles one at a time, following the first link on each page, until it reaches the Language article. I had heard that if you keep clicking the first link in every Wikipedia article, you end up on the Philosophy page, but when I ran my program and looked at the counts, the endpoint turned out to be the Language article.
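The walk described above can be sketched without a browser. This is a minimal illustration, not the repository's actual code: `follow_first_links`, the `first_link` callback, and the toy link map are hypothetical names introduced here, and the links in the map are made up rather than taken from Wikipedia.

```python
from typing import Callable, List, Optional, Tuple


def follow_first_links(
    start: str,
    first_link: Callable[[str], Optional[str]],
    target: str = "Language",
    max_steps: int = 1000,
) -> Tuple[List[str], str]:
    """Follow first links from `start` and report why the walk stopped."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None and len(path) < max_steps:
        if current in seen:
            # The page was already visited, so the walk has entered a loop.
            return path, "duplicate"
        path.append(current)
        seen.add(current)
        if current == target:
            return path, "target"
        current = first_link(current)
    reason = "dead end" if current is None else "step limit"
    return path, reason


# Toy link map (invented for illustration): page title -> title of its first link.
TOY_LINKS = {
    "Cat": "Mammal",
    "Mammal": "Animal",
    "Animal": "Language",
    "Loop A": "Loop B",
    "Loop B": "Loop A",
}

if __name__ == "__main__":
    print(follow_first_links("Cat", TOY_LINKS.get))
    print(follow_first_links("Loop A", TOY_LINKS.get))
```

Passing the link lookup in as a callback keeps the stopping logic separate from how a page's first link is obtained, so the same loop could be driven by a browser or by a saved link map.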

Version 1.0

Wherever you start, there is a high probability that you will encounter the Language article while crawling. Does that mean this article has the most incoming links, or is it something earlier in the chain?

We'll try to find out in future increments.

Features:

  • Uses a real web browser for automation (Apple Safari)
  • Shows the pages traversed in the terminal and keeps the list temporarily (until the program terminates)
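One way the two features above could fit together is sketched below, assuming the browser is driven through Selenium (the source does not say which automation library is used). The helper names `open_safari` and `visit`, and the CSS selector for the first article link, are assumptions made for this example. Safari typically also needs "Allow Remote Automation" enabled in its Develop menu before a WebDriver session can start.

```python
from typing import List, Optional

# Assumed selector for links inside the article body; the real crawler may use another.
FIRST_LINK_CSS = "#mw-content-text p a[href*='/wiki/']"


def open_safari():
    """Start a Safari session through Selenium (imported lazily; needs `pip install selenium`)."""
    from selenium import webdriver
    return webdriver.Safari()


def visit(driver, url: str, history: List[str]) -> Optional[str]:
    """Load `url`, print and record its title, and return the first link found."""
    driver.get(url)
    history.append(driver.title)  # kept in memory only, lost when the program exits
    print(f"{len(history):3d}. {driver.title}")
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    for link in driver.find_elements("css selector", FIRST_LINK_CSS):
        href = link.get_attribute("href")
        if href:
            return href
    return None
```

A session would start with `driver = open_safari()`, call `visit` repeatedly with the URL it returned last, and finish with `driver.quit()`.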

Future ideas:

  • Count how often each page appears
  • Crawl a thousand times, find possible interconnections, and rank them
  • Find the most interconnected article
  • Support for other browsers
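The counting and ranking ideas could look roughly like the sketch below. It is not part of the current version: `walk`, `rank_pages`, and the toy link map are hypothetical, and the map is invented data standing in for real crawl results.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def walk(start: str, first_links: Dict[str, str], max_steps: int = 1000) -> List[str]:
    """Follow first links from `start` until a page repeats or no link exists."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None and current not in seen and len(path) < max_steps:
        path.append(current)
        seen.add(current)
        current = first_links.get(current)
    return path


def rank_pages(starts: Iterable[str], first_links: Dict[str, str]) -> List[Tuple[str, int]]:
    """Count how many crawls pass through each page, most frequent first."""
    counts: Counter = Counter()
    for start in starts:
        counts.update(walk(start, first_links))
    return counts.most_common()


# Toy link map (invented for illustration): page title -> title of its first link.
TOY_LINKS = {
    "Cat": "Animal",
    "Dog": "Animal",
    "Animal": "Language",
    "Language": "Communication",
    "Communication": "Language",
}

if __name__ == "__main__":
    for title, count in rank_pages(["Cat", "Dog", "Animal"], TOY_LINKS):
        print(f"{count:3d}  {title}")
```

With real data, the starting pages would come from many separate crawls, and pages with the highest counts would be candidates for the most interconnected article.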

Currently only supports Safari.

Copyright:

Husain AKbar Shaikh
