Here we develop a brand-new framework, called Wide and ResDNN, for general structured-data classification tasks such as CTR prediction and recommender systems. The model extends the DNN part of the Wide & Deep model with arbitrary connections between layers, including connection modes similar to ResNet and DenseNet, which are widely used in computer vision.
This work is inspired by the Wide & Deep model, as well as ResNet and DenseNet. The wide model is able to memorize interactions within data that have a large number of features, but it cannot generalize these learned interactions to new data. The deep model generalizes well but is unable to learn exceptions within the data. The Wide & Deep model combines the two and is able to generalize while learning exceptions.
The code is based on the TensorFlow wide and deep tutorial and uses the high-level `tf.estimator.Estimator` API.
We use the Kaggle Criteo and Avazu datasets as examples.
- Python 2.7
- TensorFlow >= 1.10
- NumPy
- pyyaml
Kaggle Criteo Dataset (Display Advertising Challenge)
- train.csv - The training set consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo. Positive (clicked) and negative (non-clicked) examples have both been subsampled at different rates in order to reduce the dataset size. The examples are ordered chronologically.
- test.csv - The test set is computed in the same way as the training set, but for events on the day following the training period. Note: the labels for test.csv are not released, so we randomly split train.csv into train, dev, and test sets.
- Label - Target variable that indicates if an ad was clicked (1) or not (0).
- I1-I13 - A total of 13 columns of integer features (mostly count features).
- C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
The semantics of the features are undisclosed. When a value is missing, the field is empty.
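For reference, a minimal `tf.data` input pipeline for this layout could look like the sketch below. The column names and CSV details (header row, comma delimiter) are assumptions about the Kaggle files, and the repo's actual reader may differ:

```python
import tensorflow as tf

# Assumed Criteo layout: label, 13 integer features I1-I13, 26 hashed
# categorical features C1-C26; missing values appear as empty fields.
_COLUMNS = (['label'] + ['I%d' % i for i in range(1, 14)]
            + ['C%d' % i for i in range(1, 27)])
_DEFAULTS = [[0]] + [[0.0]] * 13 + [['']] * 26   # defaults fill empty fields

def input_fn(csv_path, batch_size=256, shuffle=True):
    """Minimal tf.data pipeline; adjust delimiter/header to the actual file."""
    def _parse(line):
        cols = tf.decode_csv(line, record_defaults=_DEFAULTS)
        features = dict(zip(_COLUMNS, cols))
        label = features.pop('label')
        return features, label

    dataset = tf.data.TextLineDataset(csv_path).skip(1)  # skip the header row
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    return dataset.map(_parse).batch(batch_size)
```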
Kaggle Avazu Dataset (Click-Through Rate Prediction)
- train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
- test - Test set. 1 day of ads for testing your model predictions. Note: the labels for the test file are not released, so we randomly split the train file into train, dev, and test sets.
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
- C1 -- anonymized categorical variable
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C14-C21 -- anonymized categorical variables
`wide_resdnn` is a simple but powerful variant of `wide_deep`; the main difference is the connection mode of the deep part (DNN).
We hope to figure out the best kind of skip connections for large-scale sparse-data tasks. Here we provide five shortcut patterns (arbitrary connections are also supported) and two aggregation methods, as follows (see the code sketch after these lists):
shortcut:
- `normal`: a plain DNN with no skip connections.
- `first_dense`: add skip connections from the input layer to every hidden layer.
- `last_dense`: add skip connections from all previous layers to the last layer.
- `dense`: add skip connections between all layers, similar to DenseNet.
- `resnet`: add skip connections between adjacent layers, similar to ResNet.
aggregation:
- `sum`: sum the connected outputs; can only be used when the connected layers have the same size.
- `concat`: concatenate the outputs of the connected layers.
cd conf
vim feature.yaml
vim train.yaml
...
You can run the code locally as follows:
python train.py
python test.py
Run TensorBoard to inspect the graph and training progress in detail.
tensorboard --logdir=./model/wide_deep
For simplicity, we do not use cross features, since they are highly dataset dependent.
We only do some basic feature engineering for generalization:
continuous features are standard-normalized before being fed to the model,
and for categorical features we set `hash_bucket_size` according to the number of distinct values.
Categorical features are embedded for the deep part, and continuous features are not discretized for the wide part.
See `conf/*/train.yaml` for the specific parameter settings.
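As an illustration, the feature columns described above might be set up roughly as follows. The column names, normalization statistics, bucket size, and embedding dimension are placeholder assumptions (the real values come from `conf/feature.yaml`), and the final estimator shown is the canned baseline from the TensorFlow tutorial, not the `wide_resdnn` model itself:

```python
import tensorflow as tf

# Placeholder statistics and sizes; the real values come from conf/feature.yaml.
def standard_normalizer(mean, std):
    return lambda x: (x - mean) / std

# Continuous feature: standard normalization, fed to the deep part.
i1 = tf.feature_column.numeric_column(
    'I1', normalizer_fn=standard_normalizer(mean=3.5, std=9.4))

# Categorical feature: hash bucket sized by its number of distinct values,
# used directly by the wide (linear) part and embedded for the deep part.
c1 = tf.feature_column.categorical_column_with_hash_bucket(
    'C1', hash_bucket_size=1000)
c1_embed = tf.feature_column.embedding_column(c1, dimension=16)

wide_columns = [c1]            # no cross features, no discretized numerics
deep_columns = [i1, c1_embed]

# The baseline wide_deep model can then be built with the canned estimator
# from the TensorFlow tutorial (wide_resdnn replaces its DNN part).
model = tf.estimator.DNNLinearCombinedClassifier(
    model_dir='./model/wide_deep',
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[256, 256, 256])
```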
On the Criteo dataset, we first evaluate the base model `wide_deep` to choose the best network architecture.
network | 1024-1024-1024 | 512-512-512 | 256-256-256 | 128-128-128 | 64-64-64 |
---|---|---|---|---|---|
auc | 0.7763 | 0.7762 | 0.7798 | 0.7776 | |
logloss | 0.4700 | 0.4709 | 0.4672 | 0.4687 | |
From these results we found that the `256-256-256` architecture works best; we also found that dropout decreases performance.
Then, we evaluate `wide_deep` with different numbers of layers, using 256 hidden units per layer.
layers | 2 | 3 | 5 | 7 | 9 |
---|---|---|---|---|---|
auc | 0.7826 | 0.7808 | 0.7783 | 0.7719 | 0.7654 |
logloss | 0.4649 | 0.4662 | 0.4680 | 0.4728 | 0.4805 |
From these results we found that performance degrades as the network becomes deeper.
Then, we evaluate our `wide_resdnn` model with the different shortcut and aggregation modes, using three hidden layers of 256 or 64 units.
model | hidden size 256 (auc / logloss) | hidden size 64 (auc / logloss) |
---|---|---|
wide_deep | 0.7798 / 0.4672 | |
first_dense/concat | 0.7816 / 0.4661 | 0.7851 / 0.4629 |
first_dense/sum | 0.7843 / 0.4636 | 0.7850 / 0.4630 |
last_dense/concat | 0.7767 / 0.4764 | 0.7836 / 0.4646 |
last_dense/sum | 0.7840 / 0.4636 | 0.7839 / 0.4640 |
dense/concat | 0.7435 / 0.8494 | 0.7662 / 0.5197 |
dense/sum | 0.7839 / 0.4640 | 0.7821 / 0.4652 |
resnet/concat | 0.7708 / 0.5023 | 0.7849 / 0.4633 |
resnet/sum | 0.7841 / 0.4637 | 0.7858 / 0.4627 |
We found that `sum` is consistently better than `concat` for `256-256-256`, all four shortcut modes yield similar results, and our `wide_resdnn` is significantly better than `wide_deep`.
Then, we evaluate `multi-resdnn`.
network | 128-128-128,128-128-128 | 64-64-64,64-64 |
---|---|---|
shortcut | first_dense,last_dense | resnet, first_dense |
aggregation | sum,sum | sum, sum |
auc | 0.7862 | 0.7857 |
logloss | 0.4623 | 0.4625 |
We found that `multi-resdnn` gives a small further improvement.
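Assuming `multi-resdnn` denotes several ResDNN branches run in parallel (each with its own hidden sizes, shortcut, and aggregation, as the comma-separated settings above suggest) whose outputs are combined before the final logits, a rough sketch could look like the following; it reuses the `resdnn` function sketched earlier, and the combination by concatenation is also an assumption:

```python
import tensorflow as tf

def multi_resdnn(inputs, branch_configs):
    """Parallel ResDNN branches; `resdnn` is the function sketched earlier.

    branch_configs: list of (hidden_units, shortcut, aggregation) tuples, e.g.
    [([128, 128, 128], 'first_dense', 'sum'),
     ([128, 128, 128], 'last_dense', 'sum')].
    """
    branch_outputs = []
    for i, (hidden_units, shortcut, aggregation) in enumerate(branch_configs):
        with tf.variable_scope('resdnn_%d' % i):
            branch_outputs.append(
                resdnn(inputs, hidden_units, shortcut, aggregation))
    # Combine the branches (here by concatenation) before the logits layer.
    combined = tf.concat(branch_outputs, axis=-1)
    return tf.layers.dense(combined, 1, activation=None)   # logits
```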
Finally, to evaluate model variance, we run each model 10 times and compute AUC and logloss statistics. The network setting is `256-256-256`.
run | wide_deep (auc / logloss) | wide_resdnn (auc / logloss) |
---|---|---|
1 | 0.7808 0.4662 | 0.7843 0.4636 |
2 | 0.7798 0.4672 | 0.7838 0.4652 |
3 | 0.7783 0.4685 | 0.7842 0.4640 |
4 | 0.7828 0.4653 | 0.7818 0.4670 |
5 | 0.7767 0.4695 | 0.7841 0.4638 |
6 | 0.7826 0.4651 | 0.7823 0.4653 |
7 | 0.7783 0.4685 | 0.7831 0.4648 |
8 | 0.7767 0.4699 | 0.7841 0.4638 |
9 | 0.7775 0.4689 | 0.7827 0.4654 |
10 | 0.7821 0.4655 | 0.7831 0.4647 |
mean | 0.7796 0.4675 | 0.7834 0.4648 |
std | 0.0023 0.0017 | 0.0009 0.0010 |
We found that `wide_resdnn` is significantly better than `wide_deep` and has lower variance.
On the Avazu dataset, we first evaluate the base model `wide_deep` to choose the best network architecture.
network | 512-512-512 | 256-256-256 | 128-128-128 | 64-64-64 |
---|---|---|---|---|
auc | 0.7504 | 0.7505 | 0.7504 | 0.7505 |
logloss | 0.3966 | 0.3966 | 0.3967 | 0.3966 |
We found that the hidden size has little influence on performance.
Then, we evaluate our `wide_resdnn` model with the different shortcut and aggregation modes, using a fixed `64-64-64` architecture.
model | auc / logloss |
---|---|
wide_deep | 0.7505 0.3966 |
first_dense/concat | 0.7517 0.3955 |
first_dense/sum | 0.7533 0.3948 |
last_dense/concat | 0.7528 0.3949 |
last_dense/sum | 0.7527 0.3951 |
dense/concat | 0.7528 0.3949 |
dense/sum | 0.7544 0.3943 |
resnet/concat | 0.7525 0.3951 |
resnet/sum | 0.7537 0.3943 |