Part of Speech Tagger using Hidden Markov Model (HMM)
This repository contains a Jupyter notebook that demonstrates the implementation of a Part of Speech (POS) tagger using a Hidden Markov Model (HMM).
The notebook builds the HMM-based POS tagger in the following stages:
- Data Preparation: Reading and processing input data to extract words and their corresponding tags.
- Model Initialization: Creating an instance of the HMM model.
- State Creation: Defining states with emission probabilities.
- Adding Transitions: Setting up transitions between states based on observed data.
- Finalizing the Model: Baking the model to make it ready for use.
- Evaluation: Evaluating the tagger on a test dataset.
Initialization:
- Import necessary libraries and modules.
Data Preparation:
- Set up data streams to extract words and tags.
- Generate tag and word lists from the data stream.
- Calculate emission counts for the words given tags.
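The counting step can be sketched in plain Python. This sketch assumes the training data is a list of sentences, each a list of (word, tag) pairs. The function name `emission_counts` and the toy sentences are illustrative and are not taken from the notebook.

```python
from collections import Counter, defaultdict

def emission_counts(tagged_sentences):
    """Count how often each word occurs with each tag: counts[tag][word]."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[tag][word] += 1
    return counts

# Toy data (hypothetical), standing in for the notebook's training stream.
toy_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
]
counts = emission_counts(toy_sentences)
print(counts["DET"]["the"])  # 2
```

Keying the outer dictionary by tag keeps all words emitted by one tag together, which is the shape needed later when each tag becomes one HMM state.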
Most Frequent Class Tagger (MFCTagger):
- Implement a simple baseline tagger that assigns the most frequent tag for each word.
- Compare the performance of this baseline with the HMM tagger.
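A most-frequent-class baseline might look like the sketch below. It is not the notebook's `MFCTagger` class: the function names `train_mfc` and `mfc_tag`, the `unknown_tag` parameter, and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_mfc(tagged_sentences):
    """Map each word to the tag it was most often seen with in training."""
    word_tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0]
            for word, tags in word_tag_counts.items()}

def mfc_tag(table, words, unknown_tag=None):
    """Tag a sentence word by word; unseen words get `unknown_tag`."""
    return [table.get(word, unknown_tag) for word in words]

# Toy data (hypothetical): "run" is seen once as NOUN and twice as VERB.
toy_sentences = [
    [("the", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
]
table = train_mfc(toy_sentences)
print(mfc_tag(table, ["they", "run", "fast"]))  # ['PRON', 'VERB', None]
```

The baseline ignores context entirely, so any accuracy the HMM gains over it comes from the transition probabilities between tags.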
HMM Model Initialization:
- Create an instance of the HMM model.
State Creation:
- For each tag, calculate the emission probabilities and create states using these probabilities.
- Add these states to the HMM model.
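The emission probability of a word given a tag is the count of that (tag, word) pair divided by the total count of the tag. The sketch below shows only this normalisation, in plain Python. Wrapping each distribution in a state object and adding it to the model depends on the HMM library the notebook uses, so that part is left out. The function name and the toy counts are hypothetical.

```python
from collections import Counter

def emission_probabilities(tag_word_counts):
    """Normalise counts[tag][word] into P(word | tag) for each tag."""
    probabilities = {}
    for tag, word_counts in tag_word_counts.items():
        total = sum(word_counts.values())
        probabilities[tag] = {word: n / total
                              for word, n in word_counts.items()}
    return probabilities

# Toy counts (hypothetical), in the counts[tag][word] layout.
toy_counts = {
    "NOUN": Counter({"dog": 3, "run": 1}),
    "VERB": Counter({"run": 2}),
}
probs = emission_probabilities(toy_counts)
print(probs["NOUN"]["dog"])  # 0.75
```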
Adding Transitions:
- Add transitions from the start state to each tag state.
- Add transitions from each tag state to the end state.
- Add transitions between tag states based on bigram counts.
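The three kinds of transitions can be estimated together by padding each tag sequence with start and end markers and counting bigrams. This is a sketch under that assumption; the notebook may compute start, end, and tag-to-tag counts separately. The marker names `<start>` and `<end>`, the function name, and the toy sequences are hypothetical.

```python
from collections import Counter

START, END = "<start>", "<end>"  # hypothetical marker names

def transition_probabilities(tag_sequences):
    """Estimate P(next | previous) from tag bigrams, including start and end."""
    bigrams = Counter()
    outgoing = Counter()
    for tags in tag_sequences:
        path = [START, *tags, END]
        for prev, nxt in zip(path, path[1:]):
            bigrams[(prev, nxt)] += 1
            outgoing[prev] += 1
    return {pair: n / outgoing[pair[0]] for pair, n in bigrams.items()}

# Toy tag sequences (hypothetical), one list per sentence.
toy_tags = [["DET", "NOUN", "VERB"], ["DET", "NOUN"], ["PRON", "VERB"]]
trans = transition_probabilities(toy_tags)
print(trans[("NOUN", "VERB")])  # 0.5
print(trans[("VERB", END)])     # 1.0
```

Each resulting probability would become the weight of one edge in the model: `(START, tag)` pairs for start transitions, `(tag, END)` pairs for end transitions, and the rest for transitions between tag states.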
Finalizing the Model:
- Finalize the HMM model by baking it.
Evaluation:
- Evaluate the HMM tagger on a test dataset.
- Compare the accuracy of the HMM tagger and the MFCTagger.
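Comparing two taggers comes down to token-level accuracy on the same gold-standard tags. The sketch below assumes each tagger's output is a list of tag lists, one per sentence. The function name and the toy predictions are hypothetical and do not reflect the notebook's results.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        for p, g in zip(pred_tags, gold_tags):
            correct += (p == g)
            total += 1
    return correct / total if total else 0.0

# Toy data (hypothetical): two taggers scored against the same gold tags.
gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
pred_a = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]
pred_b = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
print(accuracy(pred_a, gold))  # 0.8
print(accuracy(pred_b, gold))  # 1.0
```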
To use this notebook, follow these steps:
Clone the Repository:
git clone <repository_url>
cd <repository_directory>
Set Up the Environment:
- Ensure you have Conda installed.
- Create and activate the Conda environment using the provided hmm-tagger.yaml file:
conda env create -f hmm-tagger.yaml
conda activate hmm-tagger
Run the Notebook:
- Launch Jupyter Notebook:
jupyter notebook
- Open HMM Tagger.ipynb and run the cells to execute the POS tagging process.
Execute the Cells:
- Run the cells in the notebook sequentially to execute the POS tagging process.
- Most Frequent Class Tagger (MFCTagger): This section implements a simple baseline tagger that assigns the most frequent tag for each word. The MFCTagger class is provided to mock the interface of the HMM models so that they can be used interchangeably.
- Evaluation: Evaluate the tagger on a test dataset and compare the accuracy of the HMM tagger and the MFCTagger.
For more information, please refer to the provided README from Udacity.