imputingTokens

Some quick perusal of the literature pointed me toward treating this as a missing-data imputation problem: use the information still present in a corrupted token to recover the missing letters. I chose to tackle the problem by predicting the full missing word rather than the specific missing letter.

My first task was creating additional data attributes from the corrupted and training datasets that could be used to find exploitable patterns (a sketch of the kind of features I mean follows below). I treated the prediction as a classification problem and chose support vector machines (SVMs), mainly because they remain effective when the number of dimensions is greater than the number of samples. This mattered here because I expected certain words, like "the", to be much more common than others regardless of the surrounding words, while structurally identical words, like "tie", would be far more sensitive to their context.

To run the training on my laptop efficiently, I split the training data into subsets keyed on the length of the corrupted token and its first letter (sketched below). I felt comfortable with this because it essentially weights the first letter and the length of the word as the most important variables, which seems reasonable when predicting a word. I also tried using the other known letters as features, but the model overfit: with `t#e` as an example, it essentially defaulted to the most common word containing those letters and did not weigh the surrounding-word features at all.

The final model has a few weaknesses that need to be corrected. The largest is that it has no prediction for first-letter and length combinations that do not appear in the training dataset. I tried a naïve fallback of replacing every `#` with `e`, since it is the most common letter in the English language, and then running a spell-checker function over the result, but the results were lackluster (see the last sketch below). With additional time I would train on a larger vocabulary to ensure the model covers every combination of first letter and length, and I would like to build a more generic model that does not use the first letter and length as subsets; that would require creating more features and fine-tuning the model parameters.
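To make the feature-creation step concrete, here is a minimal sketch of the kind of attributes described above. The function name, feature names, and sentence markers are illustrative choices of mine, not code from this repository:

```python
# A minimal sketch of the feature-creation step, assuming whitespace-tokenized
# sentences. Names here are illustrative, not the repository's actual code.

def extract_features(tokens, i):
    """Build a feature dict for the corrupted token at position i."""
    token = tokens[i]
    return {
        "length": len(token),                            # token length
        "first_letter": token[0],                        # first (known) letter
        "prev_word": tokens[i - 1] if i > 0 else "<s>",  # left context
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</s>",  # right context
    }

# Example: the corrupted token `t#e` in a short sentence.
sentence = ["he", "wore", "a", "t#e", "to", "work"]
print(extract_features(sentence, 3))
# {'length': 3, 'first_letter': 't', 'prev_word': 'a', 'next_word': 'to'}
```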
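The per-subset training could then look roughly like the following, assuming scikit-learn's `SVC` and `DictVectorizer`. The function names and the handling of single-class subsets and unseen keys are my own assumptions, sketched to show the shape of the approach rather than to reproduce the actual implementation:

```python
# A hedged sketch of one-SVM-per-(first letter, length) training, assuming
# scikit-learn. train_subset_models/predict_word are hypothetical names.
from collections import defaultdict

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_subset_models(examples):
    """examples: iterable of (feature_dict, target_word) pairs."""
    groups = defaultdict(list)
    for feats, word in examples:
        key = (feats["first_letter"], feats["length"])
        groups[key].append((feats, word))

    models = {}
    for key, pairs in groups.items():
        X = [feats for feats, _ in pairs]
        y = [word for _, word in pairs]
        if len(set(y)) < 2:
            models[key] = y[0]   # SVC needs two classes; store the constant answer
            continue
        # DictVectorizer one-hot encodes the string features for the SVM.
        model = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
        model.fit(X, y)
        models[key] = model
    return models

def predict_word(models, feats, fallback=None):
    """Route a corrupted token's features to the model for its subset."""
    key = (feats["first_letter"], feats["length"])
    model = models.get(key)
    if model is None:
        return fallback          # the coverage weakness noted above: unseen key
    if isinstance(model, str):   # degenerate single-class subset
        return model
    return model.predict([feats])[0]
```

Keying the dictionary on (first letter, length) is what makes each SVM small enough to train quickly on a laptop; the cost is exactly the coverage gap described above, surfaced here as the `fallback` path.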
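The naïve fallback is easy to sketch as well, here assuming the third-party pyspellchecker package as one possible spell-check backend. The `t#e` example also shows why the results were lackluster: replacing `#` with `e` yields "tee", which is already a valid word, so the checker has no reason to recover "the" or "tie":

```python
# A minimal sketch of the naïve fallback, assuming the third-party
# pyspellchecker package as the spell-check backend (one possible choice).
from spellchecker import SpellChecker

_checker = SpellChecker()

def naive_fallback(corrupted):
    """Replace every '#' with 'e', then let a spell checker correct the guess."""
    guess = corrupted.replace("#", "e")
    corrected = _checker.correction(guess)  # may return None if no suggestion
    return corrected or guess

# `t#e` becomes "tee", already a valid word, so the checker leaves it alone:
# one reason the results were lackluster.
print(naive_fallback("t#e"))
```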
