Public repo for my master's thesis at the Chair of Data and Web Science.

First of all, the WebIsALOD dataset should be downloaded, extracted, and saved in the `data` folder.
- Fix the dataset URIs: To fix the dataset URIs, run the Python script `fix_dataset_uris.py`.
- Extract concept document files and save preprocessed clean files: To save the clean preprocessed files, run the Python script `Read_And_Clean.py`.
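
  The actual cleaning steps live in `Read_And_Clean.py`; purely as a rough illustration, a typical preprocessing pass for topic modeling (tokenization, lowercasing, stopword removal) could look like the sketch below, which uses Gensim helpers rather than the thesis code:

  ```python
  from gensim.utils import simple_preprocess
  from gensim.parsing.preprocessing import remove_stopwords

  def clean_document(text):
      """Illustrative cleaning only: drop stopwords, then tokenize,
      lowercase, and strip accents/punctuation via simple_preprocess."""
      return simple_preprocess(remove_stopwords(text), deacc=True)
  ```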
- Download Wikipedia data: Use the following command to download the latest English Wikipedia articles dump:

  `curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
- Preprocess Wikipedia data using Gensim: To preprocess the Wikipedia dump, use Gensim's `make_wiki` script with the dump file and an output prefix (here `wiki`, matching the `wiki_wordids.txt` and `wiki_tfidf.mm` file names used in the next step):

  `python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki`
- Train LDA model with Wikipedia data: The `wiki_wordids.txt` and `wiki_tfidf.mm` files generated in the previous step are required by the models that use Wikipedia data. To train the LDA models with Wikipedia data, run the Python script `wiki_lda.py`.
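
  The training itself is done by `wiki_lda.py`; its core presumably resembles the standard Gensim workflow sketched below (file paths and the number of topics are assumptions, not values taken from the thesis):

  ```python
  from gensim import corpora, models

  # Vocabulary and TF-IDF corpus produced by Gensim's make_wiki script
  id2word = corpora.Dictionary.load_from_text('data/wiki_wordids.txt')
  corpus = corpora.MmCorpus('data/wiki_tfidf.mm')

  # Train an LDA model on the Wikipedia corpus; num_topics is illustrative
  lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100, passes=1)
  lda.save('data/wiki_lda.model')
  ```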
- Train LDA model with WebIsALOD data: To train the LDA models with WebIsALOD data, run the Python script `webisalod_lda.py`.
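
  `webisalod_lda.py` presumably builds a dictionary and bag-of-words corpus from the cleaned WebIsALOD concept documents before training; a sketch under that assumption (the input path and number of topics are hypothetical):

  ```python
  from gensim import corpora, models

  # Hypothetical input: one preprocessed concept document per line, tokens space-separated
  with open('data/webisalod_clean.txt', encoding='utf-8') as f:
      docs = [line.split() for line in f]

  dictionary = corpora.Dictionary(docs)
  corpus = [dictionary.doc2bow(doc) for doc in docs]

  lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)
  lda.save('data/webisalod_lda.model')
  ```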
- Train HDP model: To train the HDP model with Wikipedia data, run the Python script `wiki_hdp.py`.
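
  `wiki_hdp.py` presumably follows the same pattern as the Wikipedia LDA training, using Gensim's HDP implementation (file paths are assumptions):

  ```python
  from gensim import corpora, models

  id2word = corpora.Dictionary.load_from_text('data/wiki_wordids.txt')
  corpus = corpora.MmCorpus('data/wiki_tfidf.mm')

  # Unlike LDA, HDP infers the number of topics from the data
  hdp = models.HdpModel(corpus=corpus, id2word=id2word)
  hdp.save('data/wiki_hdp.model')
  ```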
- Classification using only topic modeling: To run the classification model with only topic modeling, run the Python script `polysemous_words.py`.
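
  A rough idea of what a purely topic-model-based classifier can look like (the model path, threshold, and decision rule are illustrative assumptions, not the logic of `polysemous_words.py`):

  ```python
  from gensim import models

  lda = models.LdaModel.load('data/wiki_lda.model')

  def is_polysemous(tokens, threshold=0.2):
      """Label a concept document as polysemous if its inferred topic
      distribution has more than one topic above the threshold."""
      bow = lda.id2word.doc2bow(tokens)
      topics = lda.get_document_topics(bow, minimum_probability=threshold)
      return len(topics) > 1
  ```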
- Classification using topic modeling and supervised machine learning algorithms: To run the classification model that combines topic modeling with supervised machine learning, run the Python script `supervised_classifier.py`.
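
  In this setting the topic distributions serve as feature vectors for a supervised classifier. A minimal sketch assuming scikit-learn and an illustrative random forest (the actual classifier, features, and labelled data come from `supervised_classifier.py`):

  ```python
  import numpy as np
  from gensim import models
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  def topic_features(lda, tokens):
      """Dense topic-distribution vector for one concept document."""
      bow = lda.id2word.doc2bow(tokens)
      vec = np.zeros(lda.num_topics)
      for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
          vec[topic_id] = prob
      return vec

  def evaluate(lda, labelled_docs, labels):
      """Cross-validate a classifier on topic-vector features;
      random forest and 5-fold CV are illustrative choices."""
      X = np.vstack([topic_features(lda, doc) for doc in labelled_docs])
      clf = RandomForestClassifier(n_estimators=100, random_state=0)
      return cross_val_score(clf, X, np.asarray(labels), cv=5)
  ```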