I wanted to create a 'prototype' spell checker by fine-tuning different LLMs. I prepared some data and also downloaded some public datasets from Kaggle:
https://www.kaggle.com/datasets/bittlingmayer/spelling?resource=download&select=aspell.txt
In this repo you will find:
· archive.zip: the dataset I downloaded from https://www.kaggle.com/datasets/bittlingmayer/spelling?resource=download&select=aspell.txt
· .csv file: a sample of misspelled sentences and their correctly spelled versions.
· .ipynb: Python code to fine-tune an LLM and to evaluate it. I also used ChatGPT to help develop and improve it.
I have also explored another prototype, which you can find in my repo: Spellchecker_LLM02. It is a more fine-grained approach that uses more sentences instead of a mix of single words and sentences. That mix may have led to problems during training because the samples differ so much in length.
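One way to see how mixed sample lengths could matter is to measure how much of each batch ends up as padding when every sample is padded to the longest one in its batch. The snippet below is only an illustrative sketch, not code from the notebook: the function name padding_fraction, the whitespace tokenization, the batch size and the example strings are all assumptions made for this illustration.

```python
def padding_fraction(samples, batch_size):
    """Return the fraction of token slots that are padding when each
    batch is padded to the length of its longest sample."""
    # Assumption: whitespace tokens stand in for real model tokens.
    lengths = [len(s.split()) for s in samples]
    total_slots = 0
    real_tokens = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total_slots += max(batch) * len(batch)  # every sample padded to the longest
        real_tokens += sum(batch)
    if total_slots == 0:
        return 0.0
    return 1 - real_tokens / total_slots


# Hypothetical samples: single words mixed with sentences vs. sentences only.
mixed = [
    "teh",
    "I recieved teh letter yesterday",
    "adress",
    "She definately knows the anser",
]
sentences_only = [
    "I recieved teh letter yesterday",
    "She definately knows the anser",
]

print(padding_fraction(mixed, batch_size=2))
print(padding_fraction(sentences_only, batch_size=2))
```

With these toy samples the mixed set spends a noticeable share of each batch on padding, while the sentence-only set spends none. Whether this is what actually hurt training in the first prototype is not established here; it is just one quick check to run on the real data.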
Some other interesting and more advanced projects:
NeuSpell:
https://github.com/neuspell/neuspell#Datasets
Spelling corrector:
https://www.kaggle.com/datasets/bittlingmayer/spelling?resource=download&select=aspell.txt