The session length for a user is computed as depicted below:
Initial data processing (data_processing/):
Contains code for data exploration as well as preprocessing.
- STEP 1: Execute "1. get_unique_tracks.ipynb" to get unique track information across all users.
- STEP 2: Execute "2. get_track_metadata.ipynb" to get track duration and genres using the Last.fm TrackInfo API.
- STEP 3: Execute "3. create_denormalized_users.ipynb" to combine profile, track, and session data into a denormalized view for each user.
  - Output files: user_dir/{user_id}/*.csv
  - Output contains the following per user:
    - user_id : User ID
    - timestamp : Current timestamp in UTC
    - artist_name : Name of the artist
    - track_name : Name of the track being listened to
    - gender : User gender (m, f, null)
    - age : Current age of the user (computed as age at registration + difference in years between the registered date and the timestamp)
    - country : User's country
    - registered : Date registered
    - duration : Track duration in seconds
    - genre : List of genres associated with the track
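The age rule described above can be illustrated with a short sketch. This is not the notebook's code: the function name `age_at_event` and the 365.25-day approximation of a year are assumptions made for the example.

```python
from datetime import datetime, timezone


def age_at_event(age_at_registration, registered, timestamp):
    """Return the age at `timestamp`: age at registration plus whole years elapsed."""
    elapsed_days = (timestamp - registered).days
    # Assumption: one year is approximated as 365.25 days.
    return age_at_registration + int(elapsed_days // 365.25)


registered = datetime(2006, 1, 1, tzinfo=timezone.utc)
played = datetime(2009, 1, 2, tzinfo=timezone.utc)
print(age_at_event(22, registered, played))
```

Whole years are used here so that the result stays an integer, matching the integer age column used by the later steps.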
Baseline model (baseline/)
Code used to measure the performance of the baseline sequence-to-sequence model.
- STEP 1: Create summarized session details per user. Execute "0. create_data_utility.ipynb", "0. create_model_utility.ipynb", and "1. create_summarized_user_session.ipynb".
  - Output files: summary_dir/{type}/{user_id}.csv, where type is train, test, or validate
  - Columns per user:
    - timestamp : Milliseconds since epoch
    - user_id : Integer representation of user_id
    - session_id : Unique sequence per user session
    - gender : 1 for male, 0 for female, -1 otherwise
    - age : Integer
    - country : Integer representation of country
    - registered : Milliseconds since epoch (or a number for UNK)
    - previous_session_length : Length of previous session in seconds
    - average_session_length : Average session length of previous sessions for the user in seconds
    - current_session_length : Current session length in seconds
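As a rough illustration of the three session-length columns, the sketch below derives previous_session_length and average_session_length from a user's ordered session lengths. The function name `add_history_features` and the -1 placeholder for a user's first session are assumptions; the notebooks may handle missing history differently.

```python
def add_history_features(session_lengths, unknown=-1):
    """Pair each session length with the previous length and the mean of all earlier sessions.

    Assumption: `unknown` marks a session that has no earlier history.
    """
    rows = []
    total = 0
    for i, current in enumerate(session_lengths):
        previous = session_lengths[i - 1] if i > 0 else unknown
        average = total / i if i > 0 else unknown
        rows.append({
            "previous_session_length": previous,
            "average_session_length": average,
            "current_session_length": current,
        })
        total += current  # running sum of earlier sessions only
    return rows


rows = add_history_features([600, 1200, 300])
for row in rows:
    print(row)
```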
- STEP 2: To train and test, execute "2. Train_and_test_model.ipynb". Use the following hyperparameters to test various models:
  - GRU : Set "model_lstm" to False
  - LSTM : Set "model_lstm" to True and "layered" to False
  - Layered LSTM : Set "model_lstm" to True, "layered" to True, and "no_layers" to 1 (Note: the code supports only 2 layers, not more, so set the value to 1)
  - Add dropout : Set "dropout" to an array of values
  - Additional hyperparameters:
    - "train_file" : Path to training data
    - "test_file" : Path to test data
    - "validation_file" : Path to validation data
    - "loss_func" : Loss function to be used in the Keras model
    - "optimizer" : Optimizer to be used
    - "hidden_dim" : Number of hidden dimensions in the network
    - "Batch_size" : Batch size for training
    - "epochs" : Number of training epochs
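The hyperparameters listed for STEP 2 could be collected as in the sketch below. Only the key names come from the list; every value, the file paths, the use of a dictionary, and the helper `model_variant` are illustrative assumptions rather than the notebook's actual layout.

```python
# Hypothetical values: only the key names are taken from the README.
params = {
    "model_lstm": True,
    "layered": True,
    "no_layers": 1,
    "dropout": [0.2],
    "train_file": "path/to/train",          # placeholder path
    "test_file": "path/to/test",            # placeholder path
    "validation_file": "path/to/validate",  # placeholder path
    "loss_func": "mean_absolute_error",
    "optimizer": "adam",
    "hidden_dim": 64,
    "Batch_size": 32,
    "epochs": 10,
}


def model_variant(params):
    """Describe which model the flag combination selects, following the rules above."""
    if not params["model_lstm"]:
        return "GRU"
    return "Layered LSTM" if params["layered"] else "LSTM"


print(model_variant(params))
```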
NOTE: We see similar results (0.87) for GRU and layered LSTMs, with or without dropout.
Sequence model with clusters (models_with_clusters/)
- STEP 1: Build session data for analysis. Execute the following to build the data: "0. create_data_utility.ipynb", "1. build_complete_vocab.ipynb", "2. build_session_data.ipynb".
  - Output files: final_dir/{type}/{user_id}.csv, where type is train, test, or validate
  - Columns per user:
    - user_id : Integer representation of user_id
    - current_timestamp : Milliseconds since epoch
    - start_timestamp : Start time of current session in milliseconds since epoch
    - session_id : Unique sequence per user session
    - previous_session_length : Length of previous session in seconds
    - average_session_length : Average session length of previous sessions for the user in seconds
    - gender : 1 for male, 0 for female, -1 otherwise
    - age : Integer
    - country : Integer representation of country
    - registered : Milliseconds since epoch (or a number for UNK)
    - track_duration : Track length in seconds
    - times_played : Number of times the track is played in one session
    - artist_name : Integer representation of artist name
    - track_name : Integer representation of track name
    - session_length : Current session length in seconds
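The integer representations and the times_played column can be pictured with the sketch below. It is only an illustration: the helper names `build_vocab` and `encode`, and the choice to reserve id 0 for unknown values, are assumptions and may not match "1. build_complete_vocab.ipynb".

```python
from collections import Counter


def build_vocab(values, unknown_token="UNK"):
    """Map each distinct string to an integer id (assumption: id 0 is reserved for unknowns)."""
    vocab = {unknown_token: 0}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def encode(value, vocab, unknown_token="UNK"):
    """Return the integer id for `value`, falling back to the unknown id."""
    return vocab.get(value, vocab[unknown_token])


session_tracks = ["Song A", "Song B", "Song A"]
vocab = build_vocab(session_tracks)
times_played = Counter(session_tracks)  # plays per track within this one session

print([encode(track, vocab) for track in session_tracks])
print(times_played["Song A"])
```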
- STEP 2: Build a user profile for cluster analysis. Execute "3. build_user_profiles.ipynb" to generate user profiles.
  - Output files: final_dir/user_profile_cluster.csv
  - Columns:
    - user_id : Integer representation of user_id
    - gender : 1 for male, 0 for female, -1 otherwise
    - age : Integer
    - country : Integer representation of country
    - registered : Milliseconds since epoch (or a number for UNK)
    - top_artist : Artist with the highest occurrence count across sessions for the user in training data
    - top_track : Track with the highest occurrence count across sessions for the user in training data
    - total_sessions : Total number of sessions for the user in training data
    - average_session_length : Average session length for the user in seconds in training data
    - max_session_length : Max session length for the user in seconds in training data
    - median_session_length : Median session length for the user in seconds in training data
    - total_session_rows : Total number of session records present in training data for the user
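The aggregated profile columns could be derived roughly as follows. This sketch assumes that each input row is one track play carrying its session's final session_length; the function name `build_user_profile` and that row layout are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean, median


def build_user_profile(rows):
    """Aggregate one user's training rows into profile statistics.

    Assumption: every row of a session carries that session's length.
    """
    length_by_session = {row["session_id"]: row["session_length"] for row in rows}
    lengths = list(length_by_session.values())
    return {
        "top_artist": Counter(r["artist_name"] for r in rows).most_common(1)[0][0],
        "top_track": Counter(r["track_name"] for r in rows).most_common(1)[0][0],
        "total_sessions": len(lengths),
        "average_session_length": mean(lengths),
        "max_session_length": max(lengths),
        "median_session_length": median(lengths),
        "total_session_rows": len(rows),
    }


rows = [
    {"session_id": 1, "artist_name": 7, "track_name": 70, "session_length": 600},
    {"session_id": 1, "artist_name": 7, "track_name": 71, "session_length": 600},
    {"session_id": 2, "artist_name": 8, "track_name": 70, "session_length": 1200},
    {"session_id": 3, "artist_name": 7, "track_name": 70, "session_length": 300},
]
profile = build_user_profile(rows)
print(profile)
```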
- STEP 3: Cluster analysis. Run cluster analysis to determine what kind of clustering to use (refer to "4. cluster_analysis.ipynb" and "5. create_model_utility.ipynb").
  - Use util.plot_cluster_elbow() to determine the best number of clusters to use. Refer to the Elbow Method for more details.
  - Use util.plot_clusters() to visualize clusters based on 2 dimensions.
  - Use util.silhouette_analysis() to visualize a silhouette plot and analyze clusters based on the plot. Refer to the Silhouette Method for more details.
  - Use util.get_baseline_mae() to get the baseline scores in either standardized or raw form.
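To make the idea of a raw versus standardized baseline score concrete, here is a minimal sketch. It assumes the baseline simply predicts the mean training session length, which may differ from what util.get_baseline_mae() actually does; the function name `baseline_mae` is hypothetical.

```python
from statistics import mean, pstdev


def baseline_mae(train_lengths, test_lengths, standardized=False):
    """MAE of always predicting the training mean (an assumed baseline)."""
    mu = mean(train_lengths)
    mae = mean(abs(y - mu) for y in test_lengths)
    if standardized:
        # Dividing by the training standard deviation gives the same value
        # as computing the MAE on z-scored session lengths.
        mae /= pstdev(train_lengths)
    return mae


train = [100, 200, 300]
test = [150, 250]
print(baseline_mae(train, test))
print(baseline_mae(train, test, standardized=True))
```

Reporting both forms makes it possible to compare the baseline against models trained on standardized data and models trained on raw seconds.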
- STEP 4: Train and test models with clustering (refer to "5. create_model_utility.ipynb" and "6. train_and_test_model.ipynb"). Use the following hyperparameters to test various models:
  - clusters : Number of clusters to use
  - Spectral clustering : To use spectral clustering, set "use_spectral_clustering" to True
  - KMeans clustering : To use KMeans, set "use_spectral_clustering" to False
  - Cluster dimensions : Use "cluster_columns" to specify the column numbers (as a tuple) used to determine the clusters
  - Standardize data : To standardize the data, set "standardize" to True; otherwise set it to False
  - Mixed standardization : Setting "mix_std" to True standardizes the data for a cluster with fewer than 200 users; otherwise non-standardized data is used
  - Refer to Baseline model STEP 2 for details on the other hyperparameters.
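The mixed standardization rule can be sketched as below. The 200-user threshold comes from the description of "mix_std"; the function name `prepare_cluster`, the z-score form of standardization, and the handling of zero variance are assumptions for illustration.

```python
from statistics import mean, pstdev

MIN_USERS_FOR_RAW = 200  # threshold stated for "mix_std"


def prepare_cluster(values, n_users, mix_std=True):
    """Standardize `values` only when the cluster has fewer than 200 users."""
    if mix_std and n_users < MIN_USERS_FOR_RAW:
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            # Assumption: a constant column maps to zeros to avoid dividing by zero.
            return [0.0 for _ in values]
        return [(v - mu) / sigma for v in values]
    return list(values)


large = prepare_cluster([100, 200, 300], n_users=250)  # kept raw
small = prepare_cluster([100, 200, 300], n_users=50)   # z-scored
print(large)
print(small)
```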
Below is a depiction of the results obtained over the timeline of this project
As seen above, a simple LSTM model with spectral clustering and mixed standardization seems to perform the best for the data at hand.
- TensorFlow
- Keras
- PySpark
- Python 3
- Original Data: http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html