Session length prediction using sequence to sequence models

Overview

[Figure: Data Pipeline]

The session length for a user is computed as depicted below:

[Figure: Session length computation]
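As a rough illustration of this computation, below is a minimal pandas sketch that assumes a session ends when the gap between consecutive plays exceeds a fixed threshold and that a session spans from its first play to the end of its last track; the 30-minute threshold and column names are illustrative assumptions, not necessarily the project's actual values.

```python
import pandas as pd

SESSION_GAP = pd.Timedelta(minutes=30)  # assumed gap threshold, not the project's actual value

def session_lengths(plays: pd.DataFrame) -> pd.Series:
    """Per-session lengths (seconds) for one user's listening history.

    Assumes `plays` has a `timestamp` column (UTC datetimes) and a
    `duration` column (track length in seconds).
    """
    plays = plays.sort_values("timestamp")
    gap = plays["timestamp"].diff()
    # A new session starts whenever the gap since the previous play exceeds the threshold.
    session_id = (gap > SESSION_GAP).cumsum()
    return plays.groupby(session_id).apply(
        lambda s: (s["timestamp"].iloc[-1] - s["timestamp"].iloc[0]).total_seconds()
        + s["duration"].iloc[-1]  # include the final track so one-play sessions are non-zero
    )
```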

Initial data processing (data_processing/):

Contains code for data exploration as well as preprocessing

  • STEP 1 : Execute "1. get_unique_tracks.ipynb" to get unique track information across all users

  • STEP 2 : Execute "2. get_track_metadata.ipynb" to get track duration and genres using the lastFM TrackInfo API (see the API sketch after this list)

  • STEP 3 : Execute "3. create_denormalized_users.ipynb" to combine profile, track, and session data to create a denormalized view for each user.

     - Output files: user_dir/{user_id}/*.csv
    
     - Output contains the following per user:
     	- user_id     : User ID
     	- timestamp   : Current timestamp in UTC
     	- artist_name : Name of the artist
     	- track_name  : Name of the track being listened to
     	- gender      : User gender (m,f,null)
     	- age         : Current age of the user (age is computed as age at registration + diff in years between registered date and timestamp)
     	- country     : User's country
     	- registered  : Date registered
     	- duration    : Track duration in seconds
     	- genre       : List of genres associated with the track.
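
For STEP 2 above, the metadata lookup relies on the Last.fm track.getInfo endpoint. A minimal sketch of such a call is shown below; the API key is a placeholder and the notebook's actual request and error-handling logic may differ (Last.fm reports duration in milliseconds, so it is converted to seconds here to match the column above).

```python
import requests

API_URL = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_LASTFM_API_KEY"  # placeholder, not a real key

def get_track_metadata(artist: str, track: str) -> dict:
    """Fetch track duration (seconds) and top tags ("genres") from Last.fm."""
    params = {
        "method": "track.getInfo",
        "api_key": API_KEY,
        "artist": artist,
        "track": track,
        "format": "json",
    }
    info = requests.get(API_URL, params=params, timeout=10).json().get("track", {})
    duration_ms = int(info.get("duration") or 0)
    tags = [t["name"] for t in info.get("toptags", {}).get("tag", [])]
    return {"duration": duration_ms / 1000.0, "genre": tags}

# Example usage (hypothetical track):
# get_track_metadata("Radiohead", "Karma Police")
```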
    

Baseline model (baseline/)

Code used to get the baseline sequence to sequence model performance

  • STEP 1 : Create summarized session details per user. Execute "0. create_data_utility.ipynb", "0. create_model_utility.ipynb" and "1. create_summarized_user_session.ipynb" (a sketch of the session summarization appears after these steps).

         - Output files: summary_dir/{type}/{user_id}.csv
         	- type = train/test/validate
    
     	- Columns per user:
     		- timestamp               : Milliseconds since epoch
     		- user_id                 : Integer representation of user_id
     		- session_id              : Unique sequence per user session
     		- gender                  : 1 for male, 0 for female, -1 otherwise
     		- age                     : integer
     		- country                 : integer representation of country
     		- registered              : Milliseconds since epoch (or a number for UNK)
     		- previous_session_length : Length of previous session in seconds
     		- average_session_length  : Average session length of previous sessions for the user in seconds
     		- current_session_length  : Current session length in seconds
    
  • STEP 2 : To train and test, execute "2. Train_and_test_model.ipynb". Use the following hyperparameters to test various models (a sketch of how they might map onto a Keras model appears after this list):

    • GRU : Set "model_lstm" to False
    • LSTM : Set "model_lstm" to True and "layered" to False
    • Layered LSTM : Set "model_lstm" to True, "layered" to True and "no_layers" to 1 (note: the code supports only 2 layers, not more, so set this value to 1)
    • Add dropout : Set "dropout" to an array of dropout values.
    • Additional hyperparameters:
      • "train_file" : Path to training data
      • "test_file" : Path to test data
      • "validation_file" : Path to validation data
      • "loss_func" : Loss function to be used in the Keras model
      • "optimizer" : Optimizer to be used
      • "hidden_dim" : Number of hidden dimensions in the network
      • "Batch_size" : Batch size for training
      • "epochs" : Number of training epochs
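
For STEP 1 above, a minimal pandas sketch of how the previous_session_length and average_session_length features in this summary could be derived from per-session lengths; column names follow the table above, but the notebooks' actual logic may differ.

```python
import pandas as pd

def summarize_sessions(sessions: pd.DataFrame) -> pd.DataFrame:
    """Add history-based session-length features for one user.

    Assumes one row per session with `timestamp` (session start, ms since
    epoch) and `current_session_length` (seconds) columns.
    """
    sessions = sessions.sort_values("timestamp").reset_index(drop=True)
    sessions["session_id"] = range(1, len(sessions) + 1)
    # Length of the immediately preceding session (0 for the first session).
    sessions["previous_session_length"] = (
        sessions["current_session_length"].shift(1).fillna(0)
    )
    # Running mean over all earlier sessions, excluding the current one.
    sessions["average_session_length"] = (
        sessions["current_session_length"].expanding().mean().shift(1).fillna(0)
    )
    return sessions
```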
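
For STEP 2 above, a minimal sketch of how the "model_lstm", "layered", "no_layers", "dropout", "hidden_dim", "loss_func" and "optimizer" settings might map onto a Keras model; the input shape and the single-value regression head are assumptions for illustration, not the notebook's exact architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(model_lstm=True, layered=False, no_layers=1, dropout=(0.0,),
                hidden_dim=64, seq_len=10, n_features=8,
                loss_func="mae", optimizer="adam"):
    """Assemble a recurrent regressor for session length (assumed shapes)."""
    rnn = layers.LSTM if model_lstm else layers.GRU
    model = keras.Sequential()
    model.add(keras.Input(shape=(seq_len, n_features)))
    if model_lstm and layered and no_layers == 1:
        # "Layered LSTM": one extra stacked recurrent layer (two in total).
        model.add(rnn(hidden_dim, return_sequences=True, dropout=dropout[0]))
    model.add(rnn(hidden_dim, dropout=dropout[-1]))
    model.add(layers.Dense(1))  # predicted current_session_length in seconds
    model.compile(loss=loss_func, optimizer=optimizer)
    return model

# e.g. the GRU baseline: build_model(model_lstm=False)
```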

Results

[Figure: Baseline Results]

NOTE: We see similar results (0.87) for GRU and layered LSTMs with/without dropout.

Sequence model with clusters (models_with_clusters/)

Overview

[Figure: Cluster Pipeline]

Steps

  • STEP 1: Build session data for analysis. Execute the following to build the data: 0. create_data_utility.ipynb, 1. build_complete_vocab.ipynb, 2. build_session_data.ipynb

     - Output files: final_dir/{type}/{user_id}.csv
     	- type = train/test/validate
    
     - Columns per user:
     	- user_id                 : Integer representation of user_id
     	- current_timestamp       : Milliseconds since epoch
     	- start_timestamp         : Start time of the current session in milliseconds since epoch
     	- session_id              : Unique sequence per user session
     	- previous_session_length : Length of previous session in seconds
     	- average_session_length  : Average session length of previous sessions for the user in seconds
     	- gender                  : 1 for male, 0 for female, -1 otherwise
     	- age                     : integer
     	- country                 : integer representation of country
     	- registered              : Milliseconds since epoch (or a number for UNK)
     	- track_duration          : Track length in seconds
     	- times_played            : Number of times track is played in one session
     	- artist_name             : Integer representation of artist name
     	- track_name              : Integer representation of track name
     	- session_length          : Current session length in seconds
    
  • STEP 2: Build a user profile for cluster analysis. Execute "3. build_user_profiles.ipynb" to generate user profiles (a pandas aggregation sketch appears after these steps):

     - Output files: final_dir/user_profile_cluster.csv
     - Columns:
     	- user_id                 : Integer representation of user_id
     	- gender                  : 1 for male, 0 for female, -1 otherwise
     	- age                     : integer
     	- country                 : integer representation of country
     	- registered              : Milliseconds since epoch (or a number for UNK)
     	- top_artist              : Artist with highest occurrence count across sessions for the user in training data
     	- top_track               : Track with highest occurrence count across sessions for the user in training data
     	- total_sessions          : Total number of sessions for the user in training data
     	- average_session_length  : Average session length for the user in seconds in training data
     	- max_session_length      : Max session length for the user in seconds in training data
     	- median_session_length   : Median session length for the user in seconds in training data
     	- total_session_rows      : Total number of session records present in training data for the user
    
  • STEP 3: Cluster analysis. Run cluster analysis to determine what kind of clustering to use (refer to 4. cluster_analysis.ipynb and 5. create_model_utility.ipynb; an illustrative scikit-learn sketch appears after these steps)

    • Use util.plot_cluster_elbow() to determine the best number of clusters to use. Refer to the Elbow Method for more details.
    • Use util.plot_clusters() to visualize clusters based on 2 dimensions.
    • Use util.silhouette_analysis() to visualize a silhouette plot and analyze clusters based on the plot. Refer to the Silhouette Method for more details.
    • Use util.get_baseline_mae() to get the baseline scores either in a standardized or raw form.
  • STEP 4: Train and test models with clustering (refer to 5. create_model_utility.ipynb, 6. train_and_test_model.ipynb). Use the following hyperparameters to test various models (a sketch of how the clustering settings might be applied appears after this list):

    • clusters : Number of clusters to use
    • Spectral Clustering : To use spectral clustering set "use_spectral_clustering" to True
    • KMeans Clustering : To use KMeans, set "use_spectral_clustering" to False
    • cluster dimensions : Use "cluster_columns" to specify the column numbers (as a tuple) to be used to determine the clusters.
    • Standardize data : To standardize the data, set "standardize" to True, else set it to False
    • mixed standardization: Setting "mix_std" to True will standardize data if there are fewer than 200 users in a cluster; otherwise non-standardized data will be used.
    • Refer to Baseline model STEP 2 for details on other hyperparameters.
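
For STEP 2 above, a minimal pandas sketch of how the per-user profile columns could be aggregated from the training sessions; the exact logic in "3. build_user_profiles.ipynb" may differ.

```python
import pandas as pd

def build_user_profile(user_sessions: pd.DataFrame) -> dict:
    """Aggregate one user's training session rows into a single profile row.

    Assumes the per-user session data described in STEP 1, with one row per
    (session, track) and a `session_length` column in seconds.
    """
    per_session = user_sessions.groupby("session_id")["session_length"].max()
    return {
        "user_id": user_sessions["user_id"].iloc[0],
        "top_artist": user_sessions["artist_name"].mode().iloc[0],
        "top_track": user_sessions["track_name"].mode().iloc[0],
        "total_sessions": per_session.size,
        "average_session_length": per_session.mean(),
        "max_session_length": per_session.max(),
        "median_session_length": per_session.median(),
        "total_session_rows": len(user_sessions),
    }
```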
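
For STEP 3 above, the util.* helpers live in "5. create_model_utility.ipynb"; purely as an illustration of the underlying elbow and silhouette checks, a minimal scikit-learn sketch (the cluster range and standardization choice here are arbitrary assumptions).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def elbow_and_silhouette(profiles, k_range=range(2, 11)):
    """Return {k: (inertia, silhouette)} for KMeans over a range of cluster counts."""
    X = StandardScaler().fit_transform(profiles)
    scores = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    return scores
```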
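
For STEP 4 above, a minimal sketch of how the "clusters", "use_spectral_clustering", "cluster_columns", "standardize" and "mix_std" settings might be applied with scikit-learn; the 200-user threshold comes from the description above, everything else is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.preprocessing import StandardScaler

def assign_clusters(profiles, clusters=4, use_spectral_clustering=True,
                    cluster_columns=(1, 2), standardize=True):
    """Cluster user profiles on the selected columns; returns a label per user."""
    X = np.asarray(profiles, dtype=float)[:, list(cluster_columns)]
    if standardize:
        X = StandardScaler().fit_transform(X)
    algo = (SpectralClustering(n_clusters=clusters, random_state=0)
            if use_spectral_clustering
            else KMeans(n_clusters=clusters, n_init=10, random_state=0))
    return algo.fit_predict(X)

def maybe_standardize(features, n_users_in_cluster, mix_std=True):
    """Mixed standardization: standardize only clusters with fewer than 200 users."""
    if mix_std and n_users_in_cluster < 200:
        return StandardScaler().fit_transform(features)
    return features
```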

Results

Below is a depiction of the results obtained over the timeline of this project.

[Figure: Results]

As seen above, a simple LSTM model with spectral clustering and mixed standardization seems to perform the best for the data at hand.

Dependencies