Public repo for my master's thesis at the Chair of Data and Web Science.

First of all, the WebIsALOD dataset should be downloaded, extracted, and saved in the `data` folder.
- Fix the dataset URIs: To fix the dataset URIs, run the Python script `fix_dataset_uris.py`.
- Extract concept document files and save preprocessed clean files: To save the clean preprocessed files, run the Python script `Read_And_Clean.py`.
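
  The actual cleaning steps live in `Read_And_Clean.py`; purely as a rough illustration, a typical preprocessing pass for topic modeling (tokenization, lowercasing, stopword removal) could look like the sketch below, which uses Gensim helpers rather than the thesis code:

  ```python
  from gensim.utils import simple_preprocess
  from gensim.parsing.preprocessing import remove_stopwords

  def clean_document(text):
      """Illustrative cleaning only: drop stopwords, then tokenize,
      lowercase, and strip accents/punctuation via simple_preprocess."""
      return simple_preprocess(remove_stopwords(text), deacc=True)
  ```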
- Download Wikipedia data: Use the following command to download the latest English Wikipedia articles dump:

  `curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
- Preprocess Wikipedia data using Gensim: To preprocess the Wikipedia dump, use Gensim's `make_wiki` script with the dump file and an output prefix (here `wiki`, matching the `wiki_wordids.txt` and `wiki_tfidf.mm` file names used in the next step):

  `python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki`
- Train LDA model with Wikipedia data: The `wiki_wordids.txt` and `wiki_tfidf.mm` files generated in the previous step are required by the models that use Wikipedia data. To train the LDA models with Wikipedia data, run the Python script `wiki_lda.py`.
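
  The training itself is done by `wiki_lda.py`; its core presumably resembles the standard Gensim workflow sketched below (file paths and the number of topics are assumptions, not values taken from the thesis):

  ```python
  from gensim import corpora, models

  # Vocabulary and TF-IDF corpus produced by Gensim's make_wiki script
  id2word = corpora.Dictionary.load_from_text('data/wiki_wordids.txt')
  corpus = corpora.MmCorpus('data/wiki_tfidf.mm')

  # Train an LDA model on the Wikipedia corpus; num_topics is illustrative
  lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100, passes=1)
  lda.save('data/wiki_lda.model')
  ```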
- Train LDA model with WebIsALOD data: To train the LDA models with WebIsALOD data, run the Python script `webisalod_lda.py`.
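
  `webisalod_lda.py` presumably builds a dictionary and bag-of-words corpus from the cleaned WebIsALOD concept documents before training; a sketch under that assumption (the input path and number of topics are hypothetical):

  ```python
  from gensim import corpora, models

  # Hypothetical input: one preprocessed concept document per line, tokens space-separated
  with open('data/webisalod_clean.txt', encoding='utf-8') as f:
      docs = [line.split() for line in f]

  dictionary = corpora.Dictionary(docs)
  corpus = [dictionary.doc2bow(doc) for doc in docs]

  lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)
  lda.save('data/webisalod_lda.model')
  ```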
- Train HDP model: To train the HDP model with Wikipedia data, run the Python script `wiki_hdp.py`.
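
  `wiki_hdp.py` presumably follows the same pattern as the Wikipedia LDA training, using Gensim's HDP implementation (file paths are assumptions):

  ```python
  from gensim import corpora, models

  id2word = corpora.Dictionary.load_from_text('data/wiki_wordids.txt')
  corpus = corpora.MmCorpus('data/wiki_tfidf.mm')

  # Unlike LDA, HDP infers the number of topics from the data
  hdp = models.HdpModel(corpus=corpus, id2word=id2word)
  hdp.save('data/wiki_hdp.model')
  ```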
- Classification using only topic modeling: To run the classification model with only topic modeling, run the Python script `polysemous_words.py`.
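
  A rough idea of what a purely topic-model-based classifier can look like (the model path, threshold, and decision rule are illustrative assumptions, not the logic of `polysemous_words.py`):

  ```python
  from gensim import models

  lda = models.LdaModel.load('data/wiki_lda.model')

  def is_polysemous(tokens, threshold=0.2):
      """Label a concept document as polysemous if its inferred topic
      distribution has more than one topic above the threshold."""
      bow = lda.id2word.doc2bow(tokens)
      topics = lda.get_document_topics(bow, minimum_probability=threshold)
      return len(topics) > 1
  ```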
- Classification using topic modeling and supervised machine learning algorithms: To run the classification model that combines topic modeling with supervised machine learning, run the Python script `supervised_classifier.py`.
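
  In this setting the topic distributions serve as feature vectors for a supervised classifier. A minimal sketch assuming scikit-learn and an illustrative random forest (the actual classifier, features, and labelled data come from `supervised_classifier.py`):

  ```python
  import numpy as np
  from gensim import models
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  def topic_features(lda, tokens):
      """Dense topic-distribution vector for one concept document."""
      bow = lda.id2word.doc2bow(tokens)
      vec = np.zeros(lda.num_topics)
      for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
          vec[topic_id] = prob
      return vec

  def evaluate(lda, labelled_docs, labels):
      """Cross-validate a classifier on topic-vector features;
      random forest and 5-fold CV are illustrative choices."""
      X = np.vstack([topic_features(lda, doc) for doc in labelled_docs])
      clf = RandomForestClassifier(n_estimators=100, random_state=0)
      return cross_val_score(clf, X, np.asarray(labels), cv=5)
  ```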