Submission of Team MA-217164 for the Web Crawler competition organised by TechFest, IIT Bombay
We had to develop a web crawler that identifies the following key components:
- SSL certificate compliance – check that all hyperlinks on the site use https:// and verify that the site's SSL certificate is valid (a minimal sketch appears after this list).
- Cookie checker – scan the cookies set by the website and check for cookie consent verification links.
- ADA compliance
- Alt text in all images.
- Color contrast for the site as per w3.org guidelines.
- Accessibility check of the site markup for null tab index values.
- Each check can be run individually, and a combined script executes all of the tasks together.
- We have used Streamlit to render the results in a web interface instead of printing them in the terminal. The SSL certificate details (if enabled), the cookies present, the verification attribute, information about null tab index values, and the image tags without alt text are displayed at a local URL.
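As a rough illustration of the SSL compliance check mentioned above, the sketch below uses the `ssl` and `socket` modules from the dependency list to fetch and inspect a site's certificate. The function name `check_ssl_certificate` and the fields it reports are assumptions for illustration, not the submission's actual implementation.

```python
import ssl
import socket
import time
from urllib.parse import urlparse

def check_ssl_certificate(url, timeout=5):
    """Connect to the host behind `url` and report basic certificate validity.

    Hypothetical helper for illustration; the submission's scripts may be
    structured differently.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https":
        # A non-https hyperlink already fails the compliance check.
        return {"url": url, "https": False, "valid": False}

    host = parsed.hostname
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, 443), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()

    # Compare the certificate's expiry date against the current time
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return {
        "url": url,
        "https": True,
        "valid": expires > time.time(),
        "issuer": dict(pair[0] for pair in cert["issuer"]),
        "expires": cert["notAfter"],
    }

if __name__ == "__main__":
    print(check_ssl_certificate("https://github.com"))
```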
The following Python packages and modules are used:
- ssl
- socket
- prettytable
- streamlit
- beautifulsoup
- requests
- urllib
- Clone the repository
git clone https://github.com/Rajarshi1001/webCrawler.git
- Install the requirements
pip install -r requirements.txt
py -m pip install streamlit
Specify the URL with the --link option while executing the script.
This script displays the SSL details, the verification attribute and details of the cookies used by the website, the img tags without alt text, and the null tab index information (e.g. https://github.com).
py -m streamlit run script.py -- --link https://github.com
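For context, the kind of scan the script performs can be approximated with `requests` and `beautifulsoup` from the dependency list. The snippet below is only a sketch of such a scan; the function and variable names are illustrative and not taken from the repository.

```python
import requests
from bs4 import BeautifulSoup

def scan_page(url):
    """Illustrative scan: cookies, img tags missing alt text, null tab index."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Cookies set by the server on this response
    cookies = {c.name: c.value for c in response.cookies}

    # <img> tags with a missing or empty alt attribute
    missing_alt = [img for img in soup.find_all("img") if not img.get("alt")]

    # Elements whose tabindex attribute is empty or set to -1 ("null tab index")
    bad_tabindex = [tag for tag in soup.find_all(attrs={"tabindex": True})
                    if tag["tabindex"].strip() in ("", "-1")]

    return {
        "cookies": cookies,
        "images_without_alt": len(missing_alt),
        "null_tab_index_elements": len(bad_tabindex),
    }

if __name__ == "__main__":
    print(scan_page("https://github.com"))
```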
First, change into the Scrapy project directory (yes, the folder name appears twice):
cd .\webCralTF\webCralTF\
then run
scrapy crawl spidey
Now change to the directory containing colContr.py and run
python colContr.py
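The color-contrast check is based on the w3.org (WCAG) definition of contrast ratio. As a minimal sketch of that calculation (not the internals of colContr.py itself), relative luminance and the contrast ratio can be computed as follows:

```python
def relative_luminance(rgb):
    """Relative luminance of an sRGB color per the WCAG definition.

    `rgb` is a tuple of 0-255 channel values.
    """
    def channel(c):
        c = c / 255.0
        # Linearize the sRGB channel value
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio between a foreground and a background color."""
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# Black text on a white background gives the maximum ratio of 21:1;
# WCAG AA requires at least 4.5:1 for normal-size text.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # 21.0
```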