Project 3 Overview

The aim of this project was to train a classifier to distinguish URLs of phishing sites from ordinary URLs. The resulting model is used in a JSON API built with Flask, which accepts an arbitrary URL as input and responds with the generated features and predictions for that URL. The Flask app also serves a GUI webpage that visualizes the features of the given URL as percentiles of the overall phishing vs non-phishing data. The Flask app and Postgres database both run within Docker.
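As a rough illustration of the percentile view, the sketch below ranks one feature value against a reference sample from each class. The function name `percentile_rank`, the choice of URL length as the feature, and the sample values are all assumptions made for this example; they are not taken from the project's code or data.

```python
from bisect import bisect_right


def percentile_rank(value, reference_values):
    """Return the percentage of reference values that are <= value."""
    ordered = sorted(reference_values)
    if not ordered:
        raise ValueError("reference_values must not be empty")
    # bisect_right counts how many sorted values are <= value.
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Made-up reference samples for a hypothetical "URL length" feature.
phishing_lengths = [54, 61, 75, 88, 120]
ordinary_lengths = [18, 22, 25, 31, 40]

url = "http://example.com/login/verify/account"
url_length = len(url)

print("vs phishing:", percentile_rank(url_length, phishing_lengths))
print("vs ordinary:", percentile_rank(url_length, ordinary_lengths))
```

Showing the same value against both distributions makes it easy to see which class a URL resembles for that feature.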

The app can be accessed at https://watermelon.calculist.io.

app screenshot

Data Sources

Files

data/

db_migrations/

This directory was intended to contain a sequence of .sql files that update the database schema. The run_migrations method in project3data.py shows how migrations are executed.
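A minimal sketch of the general idea, not the actual run_migrations code: apply every .sql file in a directory in filename order. It uses an in-memory SQLite connection as a stand-in for Postgres so it runs without a database server, and the helper name `apply_sql_migrations` is hypothetical.

```python
import sqlite3
from pathlib import Path


def apply_sql_migrations(connection, migrations_dir):
    """Execute every .sql file in migrations_dir in filename order.

    Hypothetical helper; assumes filenames sort in the intended order,
    e.g. 001_create_urls.sql, 002_add_label.sql.
    """
    applied = []
    for path in sorted(Path(migrations_dir).glob("*.sql")):
        # executescript runs all statements contained in the file.
        connection.executescript(path.read_text())
        applied.append(path.name)
    connection.commit()
    return applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
    print(apply_sql_migrations(conn, "db_migrations"))
```

A real migration runner would usually also record which files have already been applied so that each one runs only once.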

static/

*.py

To make code reuse easier, I've opted to rely mainly on plain Python files instead of Jupyter notebooks.

  • cc2ld.py contains a dictionary of sets of second-level domains, keyed on country code. This is needed to parse domains correctly. For example, www.amazon.com has a registered domain of amazon.com, directly under the top-level domain com, but www.amazon.co.uk has a registered domain of amazon.co.uk, because co.uk is a second-level domain under the country code uk.
  • flask_app.py is the Flask app. It contains two GET routes: the root HTML page / and the prediction API endpoint /predict.
  • project3data.py contains methods for getting the training data and running database migrations. I ended up not relying on the database as I had intended, but I have left the code in place for reference.
  • project3models.py contains methods for getting and training the models.
  • project3sql_helpers.py contains methods for connecting to both local and remote databases, with remote connections going through an SSH tunnel. The last two methods are decorators, used by project3data.py to inject SQL sessions into the database migration methods.
  • project3utils.py contains methods for generating the features for a given URL.
  • project3viz.py contains a single method for plotting the ROC curve.
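The domain parsing that cc2ld.py supports can be sketched as follows. This is an illustration under stated assumptions: the dictionary here is a tiny hypothetical excerpt, its exact shape in cc2ld.py may differ, and `registered_domain` is a name chosen for this example rather than a function from project3utils.py.

```python
# Hypothetical excerpt: country code -> set of second-level labels.
# The real dictionary in cc2ld.py is larger and may be shaped differently.
CC_SECOND_LEVEL = {
    "uk": {"co", "ac", "gov", "org"},
    "au": {"com", "net", "org", "edu"},
}


def registered_domain(hostname):
    """Return the registered domain of a hostname, e.g. amazon.co.uk."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return ".".join(labels)
    country_code, second_level = labels[-1], labels[-2]
    # The suffix is two labels long (e.g. co.uk) when the second-to-last
    # label is a known second-level domain for that country code.
    known = CC_SECOND_LEVEL.get(country_code, set())
    suffix_len = 2 if second_level in known else 1
    # Keep the suffix plus the one label in front of it.
    return ".".join(labels[-(suffix_len + 1):])


for host in ("www.amazon.com", "www.amazon.co.uk"):
    print(host, "->", registered_domain(host))
```

Without the lookup, a naive "last two labels" rule would return co.uk for the second hostname, which is why the country-code table matters.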

*.ipynb

*.example

These files are examples of hidden files that are listed in .gitignore because they contain private information such as passwords and API keys.

Inspiration