This repo contains files of recommendation system for
+-- - Bash script to fill crontab tasks for a model rebuilding
+-- - Bash script to clean crontab and to stop all daemons
+-- - Package configuration
+-- golosio_recommendation_model
+-- - Overall model configuration
+-- - Function for making daemon of a specified function
+-- server
+-- - Flask server for recommendation system
+-- - Server configuration
+-- sync
+-- - Convert events in MongoDB for training FFM model
+-- - Synchronizing MongoDB with Golos node
+-- - Synchronizing Golosio MySQL events with MongoDB
+-- - Synchronizing Golosio MySQL accounts with MongoDB
+-- model
+-- - Helpers for preprocessing, process management, etc.
+-- train
+-- - Process of training model to find similar posts
+-- - Process of training model to find doc2vec vectors for each post
+-- - Process of training FFM model to arrange recommendations for each user
+-- predict
+-- - Process of finding similar posts for new posts in database
+-- - Process of finding doc2vec vectors for each new post in database
+-- - Process of creating recommendations list for each active user
+-- bin - These scripts will appear in /usr/local/bin directory
+-- doc2vec_train - Daemon that trains doc2vec model
+-- doc2vec_predict - Daemon that makes doc2vec predictions for all posts in database
+-- ann_train - Daemon that trains ANN model
+-- ann_predict - Daemon that makes ANN predictions for all posts in database
+-- ffm_train - Daemon that trains FFM model
+-- ffm_predict - Daemon that makes FFM predictions and stores them to a database
+-- recommendations_server - Daemon for a recommendation model server
+-- sync_comments - Daemon that loads new comments from a golos node to a database
+-- sync_events - Daemon that loads events from a specified mysql DB to a database
+-- sync_accounts - Daemon that loads accounts from a specified csv file to a database
Recommendation model architecture:
Install LibFFM before use. Instructions can be found here:
Prepare the Mongo database before installation. You can download the current Mongo dumps here:
$ scp ./
$ scp ./
Prepare the config file before installation. It should look like this:
# golosio_recommendation_model/
config = {
    'database_url': "localhost:27017",  # Your mongo database url
    'database_name': "golos_comments",  # Mongo database with dumps content
    'accounts_path': "/home/anatoli/Documents/golosio_recommendation_model/accounts.csv",  # Path to csv file with accounts, only for debug
    'node_url': 'ws://localhost:8090',  # websocket url
    'model_path': "/tmp/",  # Path to model files
    'log_path': "/tmp/recommendation_model.log",  # Path to model log
    'events_database': {  # Credentials for mysql database with events
        'host': 'localhost',
        'database': 'golos',
        'user': 'root',
        'password': 'root'
    }
}
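Before installing, it can help to check that the config has every key shown in the example. The sketch below is not part of the package: `validate_config`, `REQUIRED_KEYS`, and `REQUIRED_EVENTS_KEYS` are hypothetical names, and treating every key from the example as required is an assumption.

```python
# Hypothetical helper, not part of the package. The key lists are copied
# from the example config; treating all of them as required is an assumption.
REQUIRED_KEYS = [
    "database_url", "database_name", "accounts_path",
    "node_url", "model_path", "log_path", "events_database",
]
REQUIRED_EVENTS_KEYS = ["host", "database", "user", "password"]


def validate_config(config):
    """Return a list of missing keys (empty if the config looks complete)."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    events = config.get("events_database", {})
    # Report nested keys with a dotted prefix so they are easy to locate.
    missing += [
        "events_database." + key
        for key in REQUIRED_EVENTS_KEYS
        if key not in events
    ]
    return missing
```

Calling `validate_config(config)` on the dictionary above should return an empty list; any names it returns are keys that still need to be filled in.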
Install a package with:
$ pip3 install .
To add model daemons to a crontab, use:
This script adds training tasks to the crontab and starts comment synchronization.
It takes some time to generate a new version of the model. For example, if you run the installation script at 22:00, you will get a new model after a full day. To get the first version as quickly as possible, run the daemons manually:
$ doc2vec_train start
$ ann_train start
$ ffm_train start
To stop the model daemons and clean the crontab, run:
To add new events to a database, run:
$ sync_events start
To update accounts in a database, run:
$ sync_accounts start
To start server, run:
$ recommendations_server start
To get similar posts and distances to each of them for a specified one, run:
$ curl http://localhost:8080/similar?permlink=POST_PERMLINK
For example:
$ curl http://localhost:8080/similar?permlink=@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50
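The same request can be made from Python. The sketch below uses only the standard library; `BASE_URL`, `build_url`, and `get_json` are hypothetical names, and it assumes the server answers with JSON, as the example responses in this README suggest. Percent-encoding the query should also take care of characters such as `@` and `/` in permlinks.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the server runs locally on the default port from this README.
BASE_URL = "http://localhost:8080"


def build_url(endpoint, **params):
    """Build a request URL, percent-encoding the query parameters."""
    return "{}/{}?{}".format(BASE_URL, endpoint, urlencode(params))


def get_json(endpoint, **params):
    """Fetch an endpoint and decode the body, assuming it is JSON."""
    with urlopen(build_url(endpoint, **params)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `get_json("similar", permlink="@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50")` would issue the request from the example above.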
To get recommendations for a specified user, run:
$ curl http://localhost:8080/recommendations?user=USER_ID
For example:
$ curl http://localhost:8080/recommendations?user=58158
[
  {
    "post_permlink": "@tarimta/obektivnyi-marafon-etap-3",
    "prediction": 0.9400154948234558
  },
  {
    "post_permlink": "@lumia/estafeta-prodolzhi-pesnyu-zadushevnaya",
    "prediction": 0.9309653043746948
  },
  {
    "post_permlink": "@oksi969/dizain-cheloveka-lyubov-i-napravlenie-g-centr",
    "prediction": 0.9016984701156616
  },
  {
    "post_permlink": "@is-pain/vzveshennye-lyudi-or-minus-16-kilogramm-za-dva-mesyaca",
    "prediction": 0.8760964870452881
  },
  {
    "post_permlink": "@miroslav/golos-photography-awards-edinstvennaya",
    "prediction": 0.8590876460075378
  }
]
To get recommendations for a specified user and a specified post, run (the URL is quoted so that the shell does not interpret `&`):
$ curl "http://localhost:8080/post_recommendations?user=USER_ID&permlink=POST_PERMLINK"
For example:
$ curl "http://localhost:8080/post_recommendations?user=71116&permlink=@cka3ka/golos-tuzemun"
[
  {
    "post_permlink": "@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50",
    "prediction": 0.34954845905303955
  },
  {
    "post_permlink": "@igrinov50-50/skonchalsya-leonid-bronevoi",
    "prediction": 0.3138478994369507
  },
  {
    "post_permlink": "@ljpromo/isportilas-autentichnost",
    "prediction": 0.16339488327503204
  },
  {
    "post_permlink": "@cka3ka/bitcoin-stal-shestym-po-populyarnosti-sredi-mirovykh-valyut",
    "prediction": 0.07751113921403885
  }
]
To get the supported user ids, run:
$ curl http://localhost:8080/users
To find user id by username, run:
$ curl http://localhost:8080/user_id?user_name=USER_NAME
For example:
$ curl http://localhost:8080/user_id?user_name=smartcity-admin
{
  "user_id": 60837
}
To get page views for some user, run:
$ curl http://localhost:8080/history?user=USER_ID
For example:
$ curl http://localhost:8080/history?user=58158
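The endpoints above take a numeric user id, so a client that starts from a user name needs two requests: one to `/user_id` and one to the endpoint of interest. A minimal sketch of that chain follows. `recommendations_for_name` is a hypothetical helper, and `fetch` is any callable that takes an endpoint name plus query parameters and returns the decoded JSON body (an assumed interface, not something the package provides). It also assumes the `/user_id` response is an object with a `user_id` field, as in the example above.

```python
def recommendations_for_name(user_name, fetch):
    """Resolve a user name to an id, then request recommendations for it.

    `fetch(endpoint, **params)` is a hypothetical callable that performs the
    HTTP request and returns the decoded JSON body.
    """
    # First request: /user_id?user_name=... is assumed to return {"user_id": N}.
    user_id = fetch("user_id", user_name=user_name)["user_id"]
    # Second request: /recommendations?user=N
    return fetch("recommendations", user=user_id)
```

Passing `fetch` in as an argument keeps the helper independent of any particular HTTP library and makes it easy to try out with a stub.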
The overall service configuration is kept in the config file, but most settings are defined in .py files deep inside the package. You can additionally modify the lines below to get better results:
You can change the service port here:
# server/
port = 8080 # Use desired port
It is strongly recommended to leave the following parameters alone, but you can change them at your own risk in these files:
# sync/
HOURS_LIMIT = 14 * 24 # Time window size (in hours) for events extraction. Bigger values make recommendations less sensitive to changes in preferences
# model/train/
WORD_LENGTH_QUANTILE = 10 # Remove words shorter than the 10th percentile of word length
TEXT_LENGTH_QUANTILE = 66 # Remove texts shorter than the 66th percentile of text length
HIGH_WORD_FREQUENCY_QUANTILE = 99.5 # Remove words that appear more often than the 99.5th percentile of word frequency
LOW_WORD_FREQUENCY_QUANTILE = 60 # Remove words that appear less often than the 60th percentile of word frequency
# Parameters of doc2vec model. You can read about them in this article:
'size': 300,
'window': 20,
'min_count': 5,
'workers': 13
# model/train/
NUMBER_OF_TREES = 1000 # Number of trees in the ANN model. Bigger values mean slower prediction and higher-quality results
NUMBER_OF_VALUES = 1000 # Number of values for one-hot encoding of categorical features. Bigger values mean slower preparation and higher-quality results
# model/train/
# FFM parameters. You can read about them in this article:
'eta': 0.1,
'lam': 0.01,
'k': 70
ITERATIONS = 10 # Iterations of training process
WORKERS = 13 # Number of workers for dataset processing. Should be equal to AVAILABLE_CORES + 1
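`WORKERS` is expected to equal the number of available cores plus one. A minimal sketch of deriving that value instead of hard-coding it, assuming "available cores" means the logical CPU count reported by `os.cpu_count()`; `default_workers` is a hypothetical helper, not part of the package.

```python
import os


def default_workers():
    """Return the logical CPU count plus one (2 if the count is unknown)."""
    # os.cpu_count() may return None on some platforms, so fall back to 1.
    cores = os.cpu_count() or 1
    return cores + 1
```

On a 12-thread CPU such as the i7-5930K this gives 13, which matches the value used above.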
# model/predict/
DOC2VEC_STEPS = 2500 # Number of steps for doc2vec model. Bigger values mean slow prediction and fast convergence
DOC2VEC_ALPHA = 0.03 # Learning rate for doc2vec model. Bigger values mean fast prediction and slow convergence
# model/predict/
NUMBER_OF_RECOMMENDATIONS = 50 # Number of similar posts for a post
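To make the quantile thresholds in the training settings (for example `WORD_LENGTH_QUANTILE`) more concrete, here is a small illustration of percentile-based filtering. It is not the package's preprocessing code: `percentile` and `drop_short_words` are hypothetical helpers, linear interpolation between ranks is an assumed convention, and whether the package drops values "below" or "at or below" the threshold is not stated in this README.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional index of the requested percentile in the sorted list.
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def drop_short_words(words, length_quantile):
    """Drop words whose length is below the given percentile of lengths."""
    threshold = percentile([len(word) for word in words], length_quantile)
    return [word for word in words if len(word) >= threshold]
```

With a small quantile such as 10, only the shortest words are removed; raising the value removes progressively more of the vocabulary.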
To run the load tests, download the first version of the model and use:
$ python3 -m unittest tests.load_test_case
It shows the average response time for the actions that return recommendations and similar posts.
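For a quick measurement outside the test suite, the average response time of any action can be estimated as below. This is only a sketch and not how `tests.load_test_case` is implemented; `average_response_time` is a hypothetical helper, and `action` stands for any zero-argument callable that performs one request.

```python
import time


def average_response_time(action, repeats=10):
    """Call `action` repeatedly and return the mean wall-clock time in seconds."""
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    started = time.perf_counter()
    for _ in range(repeats):
        action()
    return (time.perf_counter() - started) / repeats
```

The result includes network and client overhead, so it is an upper bound on the time spent inside the server.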
To see the model logs, run:
$ tail -f ./model.log
Processing time:
- doc2vec - 2.5h
  - train - 1.5h
  - predict - 1h
- ann - 2h
  - train - 1h
  - predict - 1h
- ffm - 5h
  - train - 4h
  - predict - 1h
Tested on a server with an i7-5930K, 128 GB DDR4, and a 1 TB PCIe SSD.
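As a rough check on how long one full rebuild takes, the stage times listed above can be added up. The sketch assumes the stages run one after another, which this README does not state explicitly; `STAGE_HOURS` and `total_hours` are hypothetical names.

```python
# Stage durations in hours, copied from the list above.
STAGE_HOURS = {
    "doc2vec": {"train": 1.5, "predict": 1.0},
    "ann": {"train": 1.0, "predict": 1.0},
    "ffm": {"train": 4.0, "predict": 1.0},
}


def total_hours(stages):
    """Sum all train and predict times, assuming strictly sequential runs."""
    return sum(sum(steps.values()) for steps in stages.values())


print(total_hours(STAGE_HOURS))  # 9.5
```

Under that assumption a complete rebuild takes about 9.5 hours on the hardware above; the actual wall-clock time depends on how the cron tasks are scheduled.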