Run TF-IDF and LSI on existing subreddit comments and, given a user's new comment, predict and recommend a subreddit.
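The recommendation step can be illustrated with a minimal sketch. It covers only TF-IDF weighting plus cosine similarity and leaves out the LSI projection. The toy `corpus` and the helper names (`build_idf`, `tfidf`, `recommend`) are hypothetical and not taken from the project code.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one concatenated comment string per subreddit.
corpus = {
    "python": "python code function list import error script",
    "cooking": "recipe oven bake butter flour sauce dinner",
    "fitness": "workout squat protein gym run muscle",
}

def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also strip punctuation.
    return text.lower().split()

def build_idf(docs):
    # Smoothed inverse document frequency, so every known term gets a positive weight.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    # Terms never seen in the corpus are ignored.
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(comment, corpus):
    # Return the subreddit whose TF-IDF vector is closest to the new comment.
    names = list(corpus)
    docs = [tokenize(corpus[name]) for name in names]
    idf = build_idf(docs)
    vectors = {name: tfidf(doc, idf) for name, doc in zip(names, docs)}
    query = tfidf(tokenize(comment), idf)
    return max(names, key=lambda name: cosine(query, vectors[name]))
```

Calling `recommend("my script throws an import error", corpus)` scores the comment against every subreddit vector and returns the best match. An LSI step would project both the corpus vectors and the query into a low-rank space before computing the cosine.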
Build a document-term matrix from BigQuery data, run LDA to find the topic distribution of each subreddit, and apply t-SNE dimensionality reduction with matplotlib visualization.
Construct a bipartite graph between authors and topics, and propagate labels back and forth to identify generalists and specialists among Reddit authors in different communities.
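One simplified way to read the generalist/specialist split is sketched below. It is not the full back-and-forth propagation: it performs a single topic-to-author step over the weighted edges and scores each author by the normalized entropy of the resulting topic distribution. The `edges` data, the `0.5` threshold, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical bipartite edges: author -> {topic: number of comments}.
# Counts are assumed to be positive.
edges = {
    "alice": {"t0": 10},
    "bob": {"t0": 3, "t1": 3, "t2": 4},
}

def topic_distribution(weights):
    # Normalize an author's edge weights into a probability distribution.
    total = sum(weights.values())
    return {topic: w / total for topic, w in weights.items()}

def normalized_entropy(dist, n_topics):
    # 0.0 means all mass on one topic, 1.0 means uniform over all topics.
    if n_topics < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return h / math.log(n_topics)

def classify(edges, threshold=0.5):
    # Label authors with a spread-out topic distribution as generalists.
    topics = {topic for weights in edges.values() for topic in weights}
    labels = {}
    for author, weights in edges.items():
        score = normalized_entropy(topic_distribution(weights), len(topics))
        labels[author] = "generalist" if score >= threshold else "specialist"
    return labels
```

Here `classify(edges)` labels an author who comments on a single topic as a specialist and one who spreads comments across topics as a generalist. A full propagation scheme would alternate author-to-topic and topic-to-author updates until the labels stabilize.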
Fine-tune the model from week3 by applying TF-IDF weights to the BOW matrix while keeping the values at the same magnitude.
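As a rough sketch of what keeping the same magnitude could mean, the snippet below multiplies each count by an IDF weight and then rescales every row so that its sum matches the original count total. This row-sum interpretation, the smoothed IDF formula, and the name `reweight_bow` are assumptions, not the project's confirmed method.

```python
import math

def reweight_bow(bow):
    """Apply IDF weights to a bag-of-words matrix, preserving each row's sum.

    bow: list of documents, each a list of term counts of equal length.
    """
    n_docs = len(bow)
    n_terms = len(bow[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    out = []
    for row in bow:
        weighted = [count * w for count, w in zip(row, idf)]
        total = sum(weighted)
        # Rescale so the reweighted row has the same total mass as the raw counts.
        scale = sum(row) / total if total > 0 else 0.0
        out.append([v * scale for v in weighted])
    return out
```

With `bow = [[2, 1, 0], [2, 0, 3]]`, the first column appears in both documents and is down-weighted, while the rarer columns gain weight and each row total is unchanged. This keeps the input count-like, which matters for models such as LDA that expect counts rather than arbitrary real-valued weights.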
Examine the validity of the models obtained from week4, and refine them by tuning hyperparameters.
Apply non-semantic techniques (finding overlapping commenters) and semantic techniques (such as LSA and word2vec) to examine the similarity between subreddits.
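The non-semantic comparison can be sketched with Jaccard similarity over sets of commenters. Jaccard is one common overlap measure and is an assumption here, as are the toy `commenters` data and the `most_similar` helper.

```python
def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical data: subreddit -> set of author names who commented there.
commenters = {
    "python": {"alice", "bob", "carol"},
    "learnprogramming": {"bob", "carol", "dave"},
    "cooking": {"erin", "alice"},
}

def most_similar(target, commenters):
    # Return the other subreddit that shares the largest fraction of commenters.
    others = [name for name in commenters if name != target]
    return max(others, key=lambda name: jaccard(commenters[target], commenters[name]))
```

For the toy data, `most_similar("python", commenters)` compares the target against every other subreddit and picks the one with the highest overlap. The semantic techniques would instead compare vector representations of the comment text.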