This repository is for the analysis and modeling done with the malicious and benign websites dataset. Below you will find an overview of the data, code, and results.
Over 70% of all system intrusion breaches involve malware, and 32% of all malware is distributed via the web. According to IBM's Cost of a Data Breach Report, the average cost of a data breach is 4.24 million dollars. Given the rise of remote work due to COVID-19, developing more efficient detection systems is imperative.
I chose the models for their potential to handle an imbalanced dataset; the dataset I used consisted of 12.1% malicious URLs and 87.9% benign. I chose models that can adjust prior probabilities, change class weights, or have a tunable cost parameter. I also tried downsampling and upsampling while tuning the parameters with cross-validation.
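As a minimal sketch of what those knobs look like in scikit-learn (the specific values below are illustrative placeholders, not the tuned settings used in this project):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Adjustable prior probabilities (GaussianNB), class weights (LogisticRegression),
# and the cost parameter C (SVC). Values here are examples only.
nb = GaussianNB(priors=[0.5, 0.5])
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
svc = SVC(class_weight={0: 1, 1: 5}, C=1.0)
```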
During my exploratory data analysis, I cleaned and feature engineered the data in the following ways:
- Fixed `CHARSET` to be uniformly uppercase and set "None" to NaN
- Set values of "b", "0", and "None" in the date columns `WHOIS_REGDATE` and `WHOIS_UPDATED_DATE` to NaN
- Parsed date columns which contained different date formats
- Created a new time feature by separating time from date
- Created five new date features using datetime to extract day of year, day of month, day of week, weekend, and working day (see the sketch after this list)
- Separated the `SERVER` column into a binary matrix for each type of server using regular expressions
- Removed server versions and non-server entries
- Set "None" and "[U'GB'; U'UK']" in the `WHOIS_COUNTRY` column to NaN and "UK", respectively
- Standardized the `WHOIS_COUNTRY` column to country abbreviations from a mix of country abbreviations and full names using a dictionary
- Set "None", "P", "Not Applicable", random symbols, countries, email addresses, and "Other" in `WHOIS_STATE` to NaN
- Standardized the `WHOIS_STATE` column to state abbreviations from a mix of state abbreviations, full names, and zip codes using a dictionary
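The snippet below is a minimal sketch of the date parsing and date-feature steps with pandas, assuming the raw CSV path and the derived feature names (e.g., `REG_DAY_OF_YEAR`) as placeholders rather than the exact names used in the project code.

```python
import pandas as pd

df = pd.read_csv("data/dataset.csv")  # hypothetical path to the raw data

# Parse a WHOIS date column; invalid entries ("b", "0", "None") become NaT.
# With mixed formats, pandas >= 2.0 may need format="mixed" or per-format passes.
reg = pd.to_datetime(df["WHOIS_REGDATE"], errors="coerce", dayfirst=True)

# Five derived date features
df["REG_DAY_OF_YEAR"] = reg.dt.dayofyear
df["REG_DAY_OF_MONTH"] = reg.dt.day
df["REG_DAY_OF_WEEK"] = reg.dt.dayofweek            # Monday=0 ... Sunday=6
df["REG_WEEKEND"] = (reg.dt.dayofweek >= 5).astype(int)
df["REG_WORKING_DAY"] = (reg.dt.dayofweek < 5).astype(int)
```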
Figures 1 through 4 show some of the highlights from my exploratory data analysis.
I split the data using stratified sampling on the malicious/benign categorical variable (i.e., Type), with the training set getting 2/3 of the data and the test set getting 1/3. I then created a pipeline that uses SimpleImputer and OneHotEncoder for categorical variables, and SimpleImputer and StandardScaler for numerical variables. For one subset of categorical variables (country and state), I imputed with the constant "UNAVAILABLE" because malicious websites might deliberately hide their country and state, so the missingness itself may be informative. For the rest of the categorical variables, I imputed with the most frequent value. For the numerical variables, I imputed with the feature's median, which is less influenced by outliers than the mean; the exploratory data analysis revealed that many of the numerical variables were right skewed. In the future, I may return to the imputation part of this project and experiment with different imputation strategies, selecting among them by cross-validation score.
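A minimal sketch of this preprocessing pipeline with scikit-learn is shown below; the column groupings are assumptions based on the dataset description, not the exact lists used in the project scripts.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column groupings (illustrative only)
whois_cols = ["WHOIS_COUNTRY", "WHOIS_STATE"]
other_cat_cols = ["CHARSET", "SERVER"]
num_cols = ["URL_LENGTH", "NUMBER_SPECIAL_CHARACTERS", "CONTENT_LENGTH"]

preprocessor = ColumnTransformer([
    # Country/state: missingness may itself be a signal, so impute a constant
    ("whois", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="UNAVAILABLE")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), whois_cols),
    # Remaining categoricals: impute with the most frequent level
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), other_cat_cols),
    # Numericals: median imputation (robust to right-skewed features), then scaling
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
])
```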
I chose models that might be effective against the imbalanced dataset: models that can adjust prior probabilities, change class weights, or tune a cost parameter. Within that group, I chose three non-flexible models (GaussianNB, LinearDiscriminantAnalysis, and LogisticRegression) and three flexible models (RandomForestClassifier, SVC, and XGBClassifier). I expected the best-performing model to come from the flexible group and planned to use its performance as the benchmark for the simpler models; if any of the simpler models performed comparably, I could prefer it for its easier interpretability. To evaluate the models I used five-fold cross-validation, and I tried three different training tactics: upsampling, downsampling, and no sampling. For upsampling and downsampling, within each of the five folds I split the data into a training set and a validation set and then resampled only the training set, either upsampling the minority class (malicious websites) or downsampling the majority class (benign websites). These three scenarios (upsampling, downsampling, and no sampling) are implemented in three separate py scripts. The sampling methods were deployed to help counteract the imbalanced dataset.
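The sketch below outlines the upsampling variant under a few assumptions: `X` and `y` are NumPy arrays, the label 1 marks malicious sites, and resampling happens only on the training portion of each fold. The actual scripts wrap this in pipelines and hyperparameter tuning, so treat this as an outline of the idea rather than the project code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample
from sklearn.metrics import f1_score

def cv_f1_with_upsampling(model, X, y, n_splits=5, seed=0):
    """Five-fold CV where only the training folds are upsampled,
    so each validation fold keeps the original class balance."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        # Upsample the minority (malicious) class to match the majority count
        minority = y_tr == 1
        X_min, y_min = resample(X_tr[minority], y_tr[minority],
                                replace=True,
                                n_samples=int((~minority).sum()),
                                random_state=seed)
        X_bal = np.vstack([X_tr[~minority], X_min])
        y_bal = np.concatenate([y_tr[~minority], y_min])

        model.fit(X_bal, y_bal)
        scores.append(f1_score(y_val, model.predict(X_val)))
    return float(np.mean(scores))
```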
The models were evaluated using several metrics, but primarily with F1. Accuracy is not a good metric by itself due to the imbalanced dataset: guessing that every website is benign would yield an accuracy of 87.9%. Recall is important because a false negative means exposure to a malicious website, but guessing that every site is malicious would yield perfect recall with an accuracy of only 12.1% (and blocking every website would be very annoying to the user). Precision alone can be gamed in the opposite direction: a model that flags only the most obviously malicious sites can achieve high precision while missing most of the malicious class. ROC AUC is not a good metric for imbalanced datasets because a model can have a high AUC but low precision; area under the precision-recall curve is more suitable. A balance between precision and recall is needed to keep both false positives and false negatives down, so F1 was chosen as the scoring metric. The models (with their best parameters) were evaluated on the test set, and recall, precision, accuracy, F1, ROC AUC, and precision-recall AUC were calculated. The results are shown in Figures 5 through 10.
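As a sketch of the test-set evaluation (assuming a fitted `best_model` and the `X_test`/`y_test` split above as placeholder names; `average_precision_score` is used here as one common way to summarize the precision-recall curve):

```python
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_pred = best_model.predict(X_test)
# Probability of the malicious class; SVC would need probability=True
# or decision_function scores instead.
y_score = best_model.predict_proba(X_test)[:, 1]

print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_score):.3f}")
print(f"PR AUC:    {average_precision_score(y_test, y_score):.3f}")
```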
Using F1 scores, the best flexible model was the upsampled SVC and the best non-flexible model was the upsampled LogisticRegression. Their metrics are shown in the table below.
Model | Recall | Precision | F1 | Accuracy | ROC AUC | Precision-Recall AUC |
---|---|---|---|---|---|---|
LogisticRegression | 0.789 | 0.848 | 0.818 | 0.957 | 0.882 | 0.882 |
SVC | 0.775 | 0.948 | 0.853 | 0.968 | 0.898 | 0.898 |
Below is the ROC curve plot comparing the LogisticRegression and SVC models.
I am currently building an anomaly detection system using this dataset. The py file is code/anomaly_detection_system.py.
- Malicious and benign websites dataset
- A stacking model using URL and HTML features for phishing webpage detection
- Malicious URL Detection Based on Associative Classification
- Intelligent phishing url detection using association rule mining
- Applied Predictive Modeling by Max Kuhn and Kjell Johnson
- Upsampling and Downsampling Imbalanced Data in Python
- ROC Curves and Precision-Recall Curves for Imbalanced Classification