Classifying Conflict Event Data: A Comparison of Flat and Hierarchical Classification Techniques

This repository accompanies the master's thesis 'Classifying Conflict Event Data: A Comparison of Flat and Hierarchical Classification Techniques' by Friederike Bauer, written as part of the master's programme 'Social and Economic Data Science' at the University of Konstanz.

It contains the code for the analysis that compares flat and hierarchical text classification.

Each model combines a feature extractor, a classifier, and a classification mode (flat or hierarchical); see model_combinations.png in src/predictions.
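The set of combinations can also be enumerated in a few lines. This is only an illustrative sketch: the names are taken from the file names in src/feature extractors and src/models, and the count ignores the further split of the scripts into level 1, level 2 and three-level variants.

```python
from itertools import product

# Names copied from the file tree of this repository (without the .py suffix).
FEATURE_EXTRACTORS = ["BERT_vectorizer", "CountVectorizer", "FastText", "TFIDF", "Word2Vec"]
CLASSIFIERS = ["LR", "RF", "SVM"]
MODES = ["flat", "hierarchical"]

# Every (feature extractor, classifier, mode) triple is one model combination.
combinations = list(product(FEATURE_EXTRACTORS, CLASSIFIERS, MODES))

for extractor, classifier, mode in combinations:
    print(f"{extractor} + {classifier} ({mode})")

print(len(combinations))  # 5 * 3 * 2 = 30
```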

The file structure is as follows:

acled-classification
├── data                        # folder needs to be created
│ ├── raw                       # includes all the downloads from the ACLED data export tool, needs to be manually filled by user
│ └── processed                 # all training and test data that was processed is saved here
│
├── paths
│ └── .env.eda                  # file declaring all paths
│
├── src
│ ├── eda
│ │ ├── class_distribution_visualization.py
│ │ └── crossvalidation_analysis.py
│ │
│ ├── evaluation_graphs
│ │ ├── level2.py
│ │ ├── predictions_shares.py
│ │ └── runtime.py
│ │
│ ├── feature extractors
│ │ ├── BERT_vectorizer.py
│ │ ├── CountVectorizer.py
│ │ ├── FastText.py
│ │ ├── TFIDF.py
│ │ └── Word2Vec.py
│ │
│ ├── logs                      # logs are created to oversee the scripts
│ ├── models
│ │ ├── LR.py
│ │ ├── RF.py
│ │ └── SVM.py
│ │
│ ├── predictions
│ │ ├── model_combinations.png
│ │ ├── lr_flat_level2.py
│ │ ├── ...
│ │ └── svm_hierarchical.py
│ │
│ ├── preprocess
│ │ ├── dataset_creation.py
│ │ └── preprocessing.py
│ │
│ ├── results
│ │ ├── cross_validation
│ │ │   ├── lr
│ │ │   ├── svm
│ │ │   └── rf
│ │ └── prediction_results
│ │     ├── lr
│ │     ├── svm
│ │     └── rf
│ │
│ ├── utils
│ │ ├── conversion_functions.py
│ │ ├── cross_validation.py
│ │ └── evaluation_functions.py
│ └── settings.py
│
├── .gitignore
├── poetry.lock
├── pyproject.toml
└── README.md

Steps in order:

1) Dataset creation

The raw data can be downloaded from the ACLED website. All years from 1997 to 2022, all available countries, and all event types are used.

-> file src.preprocess.dataset_creation.py
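As a rough illustration of this step, the sketch below concatenates several CSV exports into one file using only the standard library. It is not the code in dataset_creation.py: the function name `merge_raw_exports`, the output file name, and the comma delimiter are assumptions, and the delimiter may need adjusting to match the actual export.

```python
import csv
from pathlib import Path


def merge_raw_exports(raw_dir, out_file, delimiter=","):
    """Concatenate all CSV files in raw_dir into out_file; return the row count."""
    raw_files = sorted(Path(raw_dir).glob("*.csv"))
    if not raw_files:
        raise FileNotFoundError(f"no CSV files found in {raw_dir}")

    n_rows = 0
    writer = None
    with open(out_file, "w", newline="", encoding="utf-8") as out:
        for path in raw_files:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src, delimiter=delimiter)
                if reader.fieldnames is None:  # skip completely empty files
                    continue
                if writer is None:
                    # The header of the first file defines the output columns.
                    writer = csv.DictWriter(
                        out,
                        fieldnames=reader.fieldnames,
                        delimiter=delimiter,
                        extrasaction="ignore",
                    )
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


# Hypothetical usage with the folders from the file tree:
# merge_raw_exports("data/raw", "data/processed/acled_all.csv")
```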

2) Dataset preprocessing

The raw data files are reduced to the notes section and the event type and sub-event type labels. Further steps clean the notes text and enrich the data.

-> file src.preprocess.preprocessing.py
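The following sketch shows what such a reduction and cleaning step could look like. The column names `notes`, `event_type` and `sub_event_type` are assumed to match the ACLED export, and the cleaning rules (lowercasing, dropping everything except letters) are placeholders; the actual rules live in preprocessing.py and may differ.

```python
import re

# Columns assumed to be present in the ACLED export.
KEEP_COLUMNS = ("notes", "event_type", "sub_event_type")


def clean_notes(text):
    """Lowercase the text, drop digits and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess_row(row):
    """Keep only the text and the two label columns of one raw record."""
    slim = {column: row[column] for column in KEEP_COLUMNS}
    slim["notes"] = clean_notes(slim["notes"])
    return slim


example = {
    "notes": "On 12 May 2020, protesters gathered.",
    "event_type": "Protests",
    "sub_event_type": "Peaceful protest",
    "country": "Germany",
}
print(preprocess_row(example))
```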

3) Predictions and Evaluation

The scripts in src.predictions are written for the predictions that are analysed in the thesis. An overview of them can be found in the following table:

| Classifier | Flat | Hierarchical |
| --- | --- | --- |
| Logistic Regression | lr_flat_level1.py, lr_flat_level2.py | lr_hierarchical.py, lr_hierarchical_three_levels.py |
| Support Vector Machine | svm_flat_level1.py, svm_flat_level2.py | svm_hierarchical.py, svm_hierarchical_three_levels.py |
| Random Forest | rf_flat_level1.py, rf_flat_level2.py | rf_hierarchical.py, rf_hierarchical_three_levels.py |

Each script runs its classifier with every feature extraction method. The cross-validation option is commented out because of its long runtime; un-comment it to run the cross-validation analysis. The same scripts evaluate each combination of feature extractor and classifier. Result files are written to src.results (in either cross_validation or prediction_results) and sorted into folders by classifier.
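To make the difference between the two modes concrete, here is a minimal two-level sketch. It is not the thesis code: it assumes a "local classifier per parent node" scheme, the class names `KeywordClassifier` and `TwoLevelHierarchicalClassifier` are hypothetical, and the keyword classifier is only a tiny stand-in for LR, SVM or RF so that the example runs without extra dependencies. The notes are invented; the labels follow the ACLED event and sub-event types.

```python
from collections import Counter, defaultdict


class KeywordClassifier:
    """Toy stand-in for LR/SVM/RF: picks the label with the largest word overlap."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        return self

    def predict(self, texts):
        def best(text):
            words = text.lower().split()
            # Ties are broken alphabetically because labels are sorted first.
            return max(
                sorted(self.word_counts),
                key=lambda label: sum(self.word_counts[label][w] for w in words),
            )
        return [best(text) for text in texts]


class TwoLevelHierarchicalClassifier:
    """One model for the event type, plus one sub-event model per event type."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, texts, parents, children):
        self.parent_model = self.make_classifier().fit(texts, parents)
        self.child_models = {}
        for parent in set(parents):
            idx = [i for i, p in enumerate(parents) if p == parent]
            self.child_models[parent] = self.make_classifier().fit(
                [texts[i] for i in idx], [children[i] for i in idx]
            )
        return self

    def predict(self, texts):
        predictions = []
        for text, parent in zip(texts, self.parent_model.predict(texts)):
            # The sub-event model is chosen by the predicted event type.
            child = self.child_models[parent].predict([text])[0]
            predictions.append((parent, child))
        return predictions


texts = [
    "students marched peacefully in the capital",
    "police dispersed demonstrators with tear gas",
    "rebels and army exchanged fire near the border",
    "army retook the town from rebels",
]
parents = ["Protests", "Protests", "Battles", "Battles"]
children = [
    "Peaceful protest",
    "Protest with intervention",
    "Armed clash",
    "Government regains territory",
]

model = TwoLevelHierarchicalClassifier(KeywordClassifier).fit(texts, parents, children)
print(model.predict(["demonstrators marched peacefully"]))
```

A flat counterpart would fit a single classifier directly on `children` and ignore the event type level during training.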
