The aim of this project was to train a classifier to distinguish URLs of phishing sites from ordinary URLs. The resulting model is used in a JSON API built with Flask, which accepts an arbitrary URL as input and responds with the generated features and predictions for that URL. The Flask app also serves a GUI webpage that visualizes the features of the given URL as percentiles of the overall phishing vs non-phishing data. The Flask app and Postgres database both run within Docker.
The app can be accessed at https://watermelon.calculist.io.
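
As a rough illustration of how a client might query the prediction endpoint (/predict, served by flask_app.py), the sketch below builds a GET request URL. The query parameter name `url` and the helper `build_predict_url` are assumptions made for this example; the actual parameter name is not stated here, so check flask_app.py before relying on it.

```python
from urllib.parse import urlencode

# Base address of the deployed app, as given above.
BASE_URL = "https://watermelon.calculist.io"


def build_predict_url(target_url, base_url=BASE_URL):
    """Build a GET request URL for the /predict endpoint.

    ASSUMPTION: the endpoint reads the URL to classify from a query
    parameter named "url". That name is hypothetical.
    """
    # urlencode percent-escapes the target so its own "?", "&" and "/"
    # characters do not get mixed up with the outer query string.
    return f"{base_url}/predict?{urlencode({'url': target_url})}"


if __name__ == "__main__":
    print(build_predict_url("http://example.com/login?x=1"))
```

The resulting string can be passed to any HTTP client; the response is expected to be JSON containing the generated features and predictions.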
- Phishing URLs: https://www.phishtank.com/developer_info.php
- Random words list (used as search terms): https://randomwordgenerator.com/
- Bing Search API: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/
- Web scraping
- data/bing_search_results.csv was generated by Bing_random_word_search.ipynb.
- data/phishtank2018-05-02_verified_online.csv.gz was downloaded from https://www.phishtank.com/developer_info.php.
- data/random_words.txt was generated and copy/pasted from https://randomwordgenerator.com/ and is used in Bing_random_word_search.ipynb.
- data/scraped_urls.txt was generated by URL_Scraping.ipynb.
This directory was intended to contain a sequence of .sql files that update the database schema. See the run_migrations method in project3data.py for how migrations are executed.
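
The general idea of applying a sequence of .sql files can be sketched as follows. This is not the project's run_migrations method: it uses Python's built-in sqlite3 as a stand-in for Postgres so the example is self-contained, and the function name `run_migrations_sketch` and the bookkeeping table `applied_migrations` are hypothetical.

```python
import sqlite3
from pathlib import Path


def run_migrations_sketch(conn, migrations_dir):
    """Apply each .sql file in migrations_dir once, in filename order.

    Returns the list of filenames applied during this call.
    """
    # Hypothetical bookkeeping table recording which files already ran.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (filename TEXT PRIMARY KEY)"
    )
    applied = {
        row[0] for row in conn.execute("SELECT filename FROM applied_migrations")
    }

    newly_applied = []
    # Sorting by name means prefixes like 001_, 002_ define the order.
    for path in sorted(Path(migrations_dir).glob("*.sql")):
        if path.name in applied:
            continue
        conn.executescript(path.read_text())
        conn.execute(
            "INSERT INTO applied_migrations (filename) VALUES (?)", (path.name,)
        )
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied
```

Recording applied filenames makes the function safe to call on every startup, since files that have already run are skipped.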
- static/js/main.js contains the main UI code.
- static/js/percentile_chart.js contains the code for loading the percentile data and ranking a given URL.
- static/js/utils.js contains some helper functions.
- static/index.html is the base HTML of the UI.
- static/nonphishing_percentiles.csv is data used in the UI.
- static/phishing_percentiles.csv is also data used in the UI.
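
The ranking step performed in the UI can be illustrated in Python, the language used for the rest of the project. This is a sketch under assumptions: the layout of the percentile CSV files is not described here, so the example assumes that each feature has a sorted list of cutoff values, one per evenly spaced percentile. The function name `percentile_rank` and the sample numbers are made up for illustration.

```python
from bisect import bisect_right


def percentile_rank(value, cutoffs):
    """Return the highest percentile whose cutoff is <= value.

    cutoffs must be sorted ascending; entry i is assumed to hold the
    feature value at percentile (i + 1) * 100 / len(cutoffs).
    Returns 0 when value is below every cutoff.
    """
    # bisect_right counts how many cutoffs are <= value.
    count = bisect_right(cutoffs, value)
    return count * 100 // len(cutoffs)


if __name__ == "__main__":
    # Made-up cutoffs for a single feature at the 25th, 50th, 75th and
    # 100th percentiles of each class.
    phishing_cutoffs = [30, 45, 70, 200]
    nonphishing_cutoffs = [15, 22, 35, 120]
    feature_value = 50
    print(percentile_rank(feature_value, phishing_cutoffs))
    print(percentile_rank(feature_value, nonphishing_cutoffs))
```

Ranking the same feature value against both tables shows where a URL falls within each class, which is the comparison the percentile chart visualizes.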
To make code reuse easier, I've opted to rely mainly on plain Python files instead of Jupyter notebooks.
- cc2ld.py contains a dictionary of sets of second level domains, keyed on country code. This was needed in order to parse domains correctly. For example, www.amazon.com has a registered domain of amazon.com, directly under the top level domain com, but www.amazon.co.uk has a registered domain of amazon.co.uk, under the second level domain co.uk.
- flask_app.py is the Flask app. It contains two GET routes: the root HTML page / and the prediction API endpoint /predict.
- project3data.py contains methods for getting the training data and running database migrations. I ended up not relying on the database as I had intended, but I have left the code in place for reference.
- project3models.py contains methods for getting and training the models.
- project3sql_helpers.py contains methods for connecting to both local and remote databases, with remote connections going through an SSH tunnel. The last two methods are decorators, used by project3data.py to inject SQL sessions into the database migration methods.
- project3utils.py contains methods for generating the features for a given URL.
- project3viz.py contains a single method for plotting the ROC curve.
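
To show why a country-code lookup is needed for domain parsing, here is a minimal sketch. The dictionary below is a small hypothetical excerpt in the shape described for cc2ld.py (country code mapped to a set of second level labels); the name `CC2LD_EXAMPLE` and the function `registered_domain` are assumptions for this example, not the project's actual code.

```python
# Hypothetical excerpt: country code -> set of second level domain labels.
# The real cc2ld.py is assumed to cover many more country codes.
CC2LD_EXAMPLE = {
    "uk": {"co", "ac", "gov", "org"},
    "au": {"com", "net", "org", "edu"},
}


def registered_domain(hostname, cc2ld=CC2LD_EXAMPLE):
    """Return the registered domain of a hostname.

    Keeps three labels when the last two form a known second level
    domain (e.g. co.uk), otherwise keeps the last two labels.
    """
    labels = hostname.lower().strip(".").split(".")
    if len(labels) < 2:
        return ".".join(labels)

    tld, second = labels[-1], labels[-2]
    # e.g. tld == "uk" and second == "co" means the suffix is co.uk,
    # so the registered name is one label further to the left.
    if len(labels) >= 3 and second in cc2ld.get(tld, set()):
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


if __name__ == "__main__":
    print(registered_domain("www.amazon.com"))
    print(registered_domain("www.amazon.co.uk"))
```

Without the lookup, a naive "last two labels" rule would report co.uk as the domain of www.amazon.co.uk, which would distort any feature derived from the domain or subdomain.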
- Bing_random_word_search.ipynb contains the code that generated data/bing_search_results.csv.
- Phishing_URLs_EDA.ipynb contains some EDA and model feature importance inspection.
- URL_Scraping.ipynb contains the code that generated data/scraped_urls.txt.
These files are examples of hidden files that are listed in .gitignore because they contain private information such as passwords and API keys.
- secrets.py.example contains passwords, API keys, etc. used by Bing_random_word_search.ipynb and project3sql_helpers.py.
- .env.example contains environment variables used by Docker.