Session length prediction using sequence to sequence models

Overview

The session length for a user is computed as depicted below:

Initial data processing (data_processing/):

Contains code for data exploration and preprocessing.

STEP 1 : Execute "1. get_unique_tracks.ipynb" to get unique track information across all users
STEP 2 : Execute "2. get_track_metadata.ipynb" to get track duration and genres using the lastFM TrackInfo API

STEP 3 : Execute "3. create_denormalized_users.ipynb" to combine profile, track and session data into a denormalized view for each user.

 - Output files: user_dir/{user_id}/*.csv

 - Output contains the following per user:
 	- user_id     : User ID
 	- timestamp   : Current timestamp in UTC
 	- artist_name : Name of the artist
 	- track_name  : Name of the track being listened to
 	- gender      : User gender (m,f,null)
 	- age         : Current age of the user (age at registration plus the number of years between the registration date and the timestamp)
 	- country     : User's country
 	- registered  : Date registered
 	- duration    : Track duration in seconds
 	- genre       : List of Genres associated to the track.
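
The age rule described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name `current_age`, the use of whole elapsed years, and the 365.25-day year are all assumptions.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length


def current_age(age_at_registration, registered, timestamp):
    """Return the user's age at `timestamp`, or None if the registered age is unknown."""
    if age_at_registration is None:
        return None
    elapsed_years = (timestamp - registered).total_seconds() / SECONDS_PER_YEAR
    # Assumption: only whole years elapsed since registration are added.
    return age_at_registration + int(elapsed_years)


registered = datetime(2006, 1, 15, tzinfo=timezone.utc)
listened = datetime(2009, 3, 1, tzinfo=timezone.utc)
print(current_age(25, registered, listened))
```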

Baseline model (baseline/)

Code used to get the baseline sequence to sequence model performance

STEP 1 : Create summarized session details per user. Execute "0. create_data_utility.ipynb", "0. create_model_utility.ipynb" and "1. create_summarized_user_session.ipynb".

 - Output files: summary_dir/{type}/{user_id}.csv
 	- type = train, test, or validate

 - Columns per user:
 		- timestamp               : Milliseconds since epoch
 		- user_id                 : Integer representation of user_id
 		- session_id              : Unique sequence per user session
 		- gender                  : 1 for male, 0 for female, -1 otherwise
 		- age                     : integer
 		- country                 : integer representation of country
 		- registered              : Milliseconds since epoch (or a number for UNK)
 		- previous_session_length : Length of previous session in seconds
 		- average_session_length  : Average session length of previous sessions for the user in seconds
 		- current_session_length  : Current session length in seconds
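
As a rough sketch of how the two history columns relate to the current session length, the snippet below derives them from a user's ordered session lengths. The function name `add_history_features` and the placeholder value used for a user's first session are assumptions; the notebooks may handle the first session differently.

```python
def add_history_features(session_lengths, unknown=0.0):
    """Return (previous, average_of_previous, current) for each session, in order."""
    rows = []
    total = 0.0  # running sum of all sessions seen before the current one
    for index, current in enumerate(session_lengths):
        # Assumption: a user's first session has no history, so use a placeholder.
        previous = session_lengths[index - 1] if index > 0 else unknown
        average = total / index if index > 0 else unknown
        rows.append((previous, average, current))
        total += current
    return rows


for row in add_history_features([100, 200, 600]):
    print(row)
```

The average excludes the current session, so the target value never leaks into its own input features.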

STEP 2 : To train and test, execute "2. Train_and_test_model.ipynb". Use the following hyperparameters to test various models:
- GRU : Set "model_lstm" to False
- LSTM : Set "model_lstm" to True and "layered" to False
- Layered LSTM : Set "model_lstm" to True, "layered" to True and "no_layers" to 1 (Note: the code supports only 2 layers, not more, so set the value to 1)
- Add dropout : Set "dropout" to an array of values.
- Additional hyperparameters:
  - "train_file" : Path to training data
  - "test_file" : Path to test data
  - "validation_file" : Path to validation data
  - "loss_func" : Loss function to be used in the Keras model
  - "optimizer" : Optimizer to be used
  - "hidden_dim" : Number of hidden dimensions in the network
  - "Batch_size" : batch size for training
  - "epochs" : Number of training epochs
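
To make the flag combinations above concrete, here is a hedged sketch of a settings dictionary and a small helper that reports which model variant the flags select. Only the key names come from the list above; every value, the file paths, and the helper `model_variant` are hypothetical and not part of the notebook.

```python
# Hypothetical values; only the key names are taken from the README.
params = {
    "model_lstm": True,
    "layered": True,
    "no_layers": 1,
    "dropout": [0.2],
    "train_file": "path/to/train",
    "test_file": "path/to/test",
    "validation_file": "path/to/validate",
    "loss_func": "mean_absolute_error",
    "optimizer": "adam",
    "hidden_dim": 64,
    "Batch_size": 32,
    "epochs": 10,
}


def model_variant(settings):
    """Map the boolean flags to the model names used in the README."""
    if not settings["model_lstm"]:
        return "GRU"
    if settings.get("layered"):
        # The README states that only two layers are supported ("no_layers" = 1).
        if settings.get("no_layers") != 1:
            raise ValueError('"no_layers" must be 1 for the layered LSTM')
        return "Layered LSTM"
    return "LSTM"


print(model_variant(params))
```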

Results

NOTE: We see similar results (0.87) for the GRU and layered LSTM models, with or without dropout.

Sequence model with clusters (models_with_clusters/)

Overview

Steps

STEP 1: Build session data for analysis. Execute the following to build the data : 0. create_data_utility.ipynb, 1. build_complete_vocab.ipynb, 2. build_session_data.ipynb

 - Output files: final_dir/{type}/{user_id}.csv
 	- type = train, test, or validate

 - Columns per user:
 	- user_id                 : Integer representation of user_id
 	- current_timestamp       : Milliseconds since epoch
 	- start_timestamp         : Start time of current session in milliseconds since epoch
 	- session_id              : Unique sequence per user session
 	- previous_session_length : Length of previous session in seconds
 	- average_session_length  : Average session length of previous sessions for the user in seconds
 	- gender                  : 1 for male, 0 for female, -1 otherwise
 	- age                     : integer
 	- country                 : integer representation of country
 	- registered              : Milliseconds since epoch (or a number for UNK)
 	- track_duration          : Track length in seconds
 	- times_played            : Number of times track is played in one session
 	- artist_name             : Integer representation of artist name
 	- track_name              : Integer representation of track name
 	- session_length          : Current session length in seconds

STEP 2: Build a user profile for cluster analysis. Execute "3. build_user_profiles.ipynb" to generate user profiles:

 - Output files: final_dir/user_profile_cluster.csv
 - Columns:
 	- user_id                 : Integer representation of user_id
 	- gender                  : 1 for male, 0 for female, -1 otherwise
 	- age                     : integer
 	- country                 : integer representation of country
 	- registered              : Milliseconds since epoch (or a number for UNK)
 	- top_artist              : Artist with highest occurrence count across sessions for user in training data
 	- top_track               : Track with highest occurrence count across sessions for user in training data
 	- total_sessions          : Total number of sessions for user in training data
 	- average_session_length  : Average session length for the user in seconds in training data
 	- max_session_length      : Max session length for the user in seconds in training data
 	- median_session_length   : Median session length for the user in seconds in training data
 	- total_session_rows      : Total number of session records present in training data for the user
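
The profile columns above are simple per-user aggregates. The sketch below shows one way they could be computed from a user's training rows using only the standard library. The function `build_profile` is hypothetical, and it assumes each row repeats the final length of its session.

```python
from collections import Counter
from statistics import mean, median


def build_profile(rows):
    """Aggregate one user's session rows (a list of dicts) into profile fields."""
    # Assumption: every row of a session carries that session's length,
    # so keeping the last value per session_id gives one length per session.
    length_by_session = {}
    for row in rows:
        length_by_session[row["session_id"]] = row["session_length"]
    lengths = list(length_by_session.values())
    return {
        "top_artist": Counter(r["artist_name"] for r in rows).most_common(1)[0][0],
        "top_track": Counter(r["track_name"] for r in rows).most_common(1)[0][0],
        "total_sessions": len(lengths),
        "average_session_length": mean(lengths),
        "max_session_length": max(lengths),
        "median_session_length": median(lengths),
        "total_session_rows": len(rows),
    }


rows = [
    {"session_id": 1, "artist_name": 10, "track_name": 100, "session_length": 300},
    {"session_id": 1, "artist_name": 10, "track_name": 101, "session_length": 300},
    {"session_id": 2, "artist_name": 11, "track_name": 100, "session_length": 900},
    {"session_id": 3, "artist_name": 10, "track_name": 100, "session_length": 600},
]
profile = build_profile(rows)
print(profile)
```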

STEP 3: Cluster analysis. Run cluster analysis to decide which kind of clustering to use (refer to 4. cluster_analysis.ipynb and 5. create_model_utility.ipynb)
- Use util.plot_cluster_elbow() to determine the best number of clusters to use. Refer to the Elbow Method for more details.
- Use util.plot_clusters() to visualize clusters based on 2 dimensions.
- Use util.silhouette_analysis() to draw a silhouette plot and analyze clusters based on it. Refer to the Silhouette Method for more details.
- Use util.get_baseline_mae() to get the baseline scores either in a standardized or raw form.
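
For orientation, the snippet below sketches what a raw versus standardized baseline MAE could look like. It is not the repository's util.get_baseline_mae(): the function `mean_baseline_mae` is hypothetical and assumes the baseline simply predicts the mean training session length.

```python
from statistics import mean, pstdev


def mean_baseline_mae(train_lengths, test_lengths, standardize=False):
    """MAE of always predicting the mean training session length."""
    mu = mean(train_lengths)
    mae = mean(abs(y - mu) for y in test_lengths)
    if standardize:
        sigma = pstdev(train_lengths)
        if sigma == 0:
            raise ValueError("training lengths have zero variance")
        # Dividing by the training standard deviation expresses the error
        # in the same units as z-scored session lengths.
        mae = mae / sigma
    return mae


train = [100, 200, 300]
test = [100, 400]
print(mean_baseline_mae(train, test))
print(mean_baseline_mae(train, test, standardize=True))
```
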

STEP 4: Train and test models with clustering (refer to 5. create_model_utility.ipynb and 6. train_and_test_model.ipynb). Use the following hyperparameters to test various models:
- clusters : Number of clusters to use
- Spectral Clustering : To use spectral clustering set "use_spectral_clustering" to True
- KMeans Clustering : To use KMeans, set "use_spectral_clustering" to False
- cluster dimensions : Use "cluster_columns" to specify the column numbers (as a tuple) to be used to determine the clusters.
- Standardize data : To standardize the data, set "standardize" to True, else set it to False
- mixed standardization : Setting "mix_std" to True standardizes the data if there are fewer than 200 users in a cluster; otherwise non-standardized data is used.
- Refer to Baseline model Step 2 for details on the other hyperparameters.
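
The mixed standardization rule can be sketched as below. The 200-user threshold comes from the description above; the function `prepare_cluster`, the choice of z-scoring, and the assumption that "mix_std" overrides "standardize" are illustrative only.

```python
from statistics import mean, pstdev

MIN_USERS_FOR_RAW = 200  # threshold from the "mix_std" description


def prepare_cluster(values, n_users, standardize=False, mix_std=False):
    """Return the cluster's values, z-scored or raw depending on the flags."""
    # Assumption: when mix_std is on, cluster size alone decides standardization.
    use_std = (n_users < MIN_USERS_FOR_RAW) if mix_std else standardize
    if not use_std:
        return list(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


small = prepare_cluster([1, 2, 3], n_users=50, mix_std=True)    # standardized
large = prepare_cluster([1, 2, 3], n_users=500, mix_std=True)   # left raw
print(small, large)
```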

Results

Below is a depiction of the results obtained over the timeline of this project

As seen above, a simple LSTM model with spectral clustering and mixed standardization seems to perform the best for the data at hand.

Dependencies

Tensorflow
Keras
PySpark
Python 3
Original Data: http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html
