Amharic Named Entity Recognition (NER) System for EthioMart

Overview

This repository contains the implementation of a Named Entity Recognition (NER) system tailored for the Amharic language. The system is built on XLM-RoBERTa, a state-of-the-art multilingual transformer model, and extracts key entities such as product names, prices, and locations from Amharic text. Its primary application is EthioMart, where it is used to extract insights from Telegram messages.

Features

  • Amharic Text Segmentation: Utilizes the amseg library for accurate tokenization of Amharic text.
  • Data Cleaning and Preprocessing: Handles morphological complexities, removes unnecessary characters, and aligns tokens with labels (see the alignment sketch after this list).
  • Fine-Tuned Transformer Model: Employs XLM-RoBERTa for robust multilingual NER tasks.
  • Performance Metrics: Evaluates model performance using precision, recall, and F1-score.
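
The token-to-label alignment mentioned above comes down to mapping one label per Amharic word onto the subword pieces produced by the XLM-RoBERTa tokenizer. The sketch below shows one common way to do this with the Hugging Face tokenizer; the example words, entity labels, and label-to-id mapping are placeholders, and the repository's own preprocessing (scripts/preprocessing.py with amseg-based tokenization) may differ in detail.

    # Sketch: align word-level NER labels with XLM-RoBERTa subword tokens.
    # The words, labels, and label2id mapping below are illustrative only.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

    words = ["ዋጋ", "1000", "ብር", "አዲስ", "አበባ"]
    labels = ["B-PRICE", "I-PRICE", "I-PRICE", "B-LOC", "I-LOC"]
    label2id = {"O": 0, "B-PRICE": 1, "I-PRICE": 2, "B-LOC": 3, "I-LOC": 4}

    encoding = tokenizer(words, is_split_into_words=True, truncation=True)

    aligned = []
    previous_word = None
    for word_idx in encoding.word_ids():
        if word_idx is None:                 # special tokens (<s>, </s>)
            aligned.append(-100)             # ignored by the loss
        elif word_idx != previous_word:      # first subword of a word
            aligned.append(label2id[labels[word_idx]])
        else:                                # remaining subwords of the same word
            aligned.append(-100)
        previous_word = word_idx

    print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]), aligned)))

Only the first subword of each word keeps the label; the remaining pieces are masked with -100 so they are skipped when computing the loss and the metrics.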

Dataset

The dataset consists of Amharic text messages scraped from Telegram channels such as @MerttEka. These messages provide a rich source of information for identifying products, prices, and locations.

Installation

  1. Clone the repository:
    git clone https://github.com/your_username/dagiteferi-ethiomart-amharic-nerllm-model.git
    cd dagiteferi-ethiomart-amharic-nerllm-model
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
    pip install -r requirements.txt

Usage

Data Preparation

  • Run the scrapper.py script in the scripts/ directory to scrape messages from the Telegram channels (a sketch of this step is shown below).
  • Use the provided preprocessing scripts to clean and label the data:
    python scripts/preprocessing.py
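
The scraping_session.session file in the repository suggests that scrapper.py uses the Telethon client. The following is a hypothetical sketch of that step, not the script itself: the API credentials, channel handle, output file, and message limit are placeholders, and the actual scrapper.py may be organized differently.

    # Hypothetical Telethon-based scraper sketch; scripts/scrapper.py is the
    # actual implementation. API_ID, API_HASH, and the output path are placeholders.
    import asyncio
    import csv

    from telethon import TelegramClient

    API_ID = 123456               # placeholder: obtain from https://my.telegram.org
    API_HASH = "your_api_hash"    # placeholder
    CHANNEL = "@MerttEka"

    async def scrape(limit: int = 500) -> None:
        # Reuses the saved session file so login is only required once.
        async with TelegramClient("scraping_session", API_ID, API_HASH) as client:
            with open("telegram_messages.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(["date", "text"])
                async for message in client.iter_messages(CHANNEL, limit=limit):
                    if message.text:
                        writer.writerow([message.date.isoformat(), message.text])

    if __name__ == "__main__":
        asyncio.run(scrape())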

Model Training

  • Fine-tune the model using the notebook (a condensed training sketch follows below):
    notebooks/XLM_Fine-tune.ipynb
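
For orientation, the core of the fine-tuning step can be sketched with the Hugging Face Trainer API as below; the notebook remains the source of truth. The label list and hyperparameters are illustrative, and train_ds / eval_ds are assumed to be tokenized datasets with labels already aligned as described under Features.

    # Condensed fine-tuning sketch; see notebooks/XLM_Fine-tune.ipynb for the
    # full workflow. Label names and hyperparameters are illustrative only.
    from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                              DataCollatorForTokenClassification, Trainer,
                              TrainingArguments)

    label_list = ["O", "B-PRODUCT", "I-PRODUCT", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC"]

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base",
        num_labels=len(label_list),
        id2label=dict(enumerate(label_list)),
        label2id={label: i for i, label in enumerate(label_list)},
    )

    args = TrainingArguments(
        output_dir="xlmr-amharic-ner",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,   # assumed: tokenized dataset with aligned labels
        eval_dataset=eval_ds,     # assumed: held-out split in the same format
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()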

Evaluation

  • Evaluate the model’s performance (an entity-level metric sketch follows below):
    python scripts/evaluate_model.py
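
NER metrics are usually reported at the entity level; seqeval is a common choice for computing them from predicted and gold label sequences. The sketch below assumes seqeval is installed and uses placeholder sequences; scripts/evaluate_model.py may compute its metrics differently.

    # Entity-level precision/recall/F1 with seqeval; the label sequences here
    # are placeholders for real model predictions and gold annotations.
    from seqeval.metrics import (classification_report, f1_score,
                                 precision_score, recall_score)

    y_true = [["B-PRICE", "I-PRICE", "O", "B-LOC", "I-LOC"]]
    y_pred = [["B-PRICE", "I-PRICE", "O", "B-LOC", "O"]]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("f1:       ", f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))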

Results

The fine-tuned model achieved the following performance metrics:

  • Precision: 99.9%
  • Recall: 100%
  • F1-Score: 100%

Folder Structure

└── dagiteferi-ethiomart-amharic-nerllm-model/
   ├── README.md
   ├── requirements.txt
   ├── scraping_session.session
   ├── notebooks/
   │   ├── README.md
   │   ├── XLM_Fine-tune.ipynb
   │   ├── __init__.py
   │   ├── label.ipynb
   │   ├── preprocessing.ipynb
   │   └── token.ipynb
   ├── scripts/
   │   ├── README.md
   │   ├── __init__.py
   │   ├── labeling.py
   │   ├── preprocessing.py
   │   ├── scrapper.py
   │   └── __pycache__/
   ├── src/
   │   ├── __init__.py
   │   └── file_structure.py
   ├── tests/
   │   └── __init__.py
   └── .github/
       └── workflows/
           └── unittests.yml

Limitations and Future Work

  • Incomplete Coverage: Tasks such as model comparison and interpretability analysis have not yet been addressed.
  • Dataset Diversity: Incorporate more diverse sources to improve generalization.
  • Documentation: Enhance inline comments and examples for better clarity.

Acknowledgements

  • 10 Academy Team

  • Telegram Channels for Data Collection