Given a data set of 850 sentences, we are going to create a 10-sentence summary.
For the first iteration, we will accomplish this as follows:
- Split the document into an array of sentences
- Tokenize each sentence, remove stop words, and stem the remaining tokens (a preprocessing sketch follows this list)
- Cluster the sentences into 10 groups using NLTK's k-means clustering
- Calculate the cosine similarity matrix for all sentence pairs within a given cluster, yielding 10 matrices, one per cluster
- For each cosine similarity matrix, sum the scores in each row to determine which row (sentence) has the highest total similarity with the other sentences in its cluster
- Return the sentence from each cluster that is most similar to the rest of its cluster, as identified in the previous step (see the clustering and selection sketch after this list)
- Combine the 10 selected sentences into one paragraph, which becomes the summary
- Sort the summary's sentences into the order in which they appear in the original document, preserving the positional relationships of the original
- Return the summary
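
A minimal sketch of the first two steps, assuming NLTK's `sent_tokenize`/`word_tokenize`, its English stop word list, and the Porter stemmer; the helper name `split_and_clean` is illustrative, not part of the plan:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words("english"))

def split_and_clean(document):
    """Steps 1-2: split the document into sentences, then tokenize,
    drop stop words, and stem each one."""
    sentences = nltk.sent_tokenize(document)
    cleaned = []
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence.lower())
        cleaned.append(" ".join(
            STEMMER.stem(tok) for tok in tokens
            if tok.isalnum() and tok not in STOP_WORDS))
    return sentences, cleaned
```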
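
Continuing from that helper, a hedged sketch of the clustering, selection, and reordering steps. The `summarize` name is illustrative; the TF-IDF vectorizer here is scikit-learn's `TfidfVectorizer` (an assumption, since the plan names a tf-idf vectorizer but NLTK does not provide one), and the clusterer is NLTK's `KMeansClusterer` with cosine distance:

```python
import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, cleaned, k=10):
    """Steps 3-9: cluster the sentence vectors, keep the most central
    sentence per cluster, and reassemble in document order."""
    # Vectorize the cleaned sentences with TF-IDF (scikit-learn here,
    # as NLTK has no vectorizer class).
    vectors = TfidfVectorizer().fit_transform(cleaned).toarray()

    # Step 3: NLTK k-means with cosine distance. Sentences reduced to
    # nothing by stop word removal yield all-zero vectors, which
    # cosine distance cannot handle; real input would need those
    # filtered out first.
    clusterer = KMeansClusterer(k, cosine_distance, repeats=5,
                                avoid_empty_clusters=True)
    labels = clusterer.cluster(vectors, assign_clusters=True)

    # Steps 4-6: per cluster, build the cosine similarity matrix, sum
    # each row, and keep the sentence with the largest row sum (the
    # one most similar to the rest of its cluster).
    chosen = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if not members:
            continue
        sim = cosine_similarity(vectors[members])
        chosen.append(members[int(np.argmax(sim.sum(axis=1)))])

    # Steps 7-9: restore original document order and join into one
    # paragraph (the summary).
    return " ".join(sentences[i] for i in sorted(chosen))
```

With both pieces in place, the whole pipeline is `summary = summarize(*split_and_clean(document))`.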
Python's Natural Language Toolkit (NLTK) supplies most of the required components:
- tokenizer
- stop words
- Porter stemmer
- TF-IDF vectorizer (assumed to be scikit-learn's TfidfVectorizer, since NLTK does not provide one)
- k-means clustering algorithm
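
For reference, one plausible mapping from that list to imports; the `nltk.download` calls are one-time fetches of the tokenizer models and stop word corpus, and, as noted above, the TF-IDF vectorizer is assumed to come from scikit-learn:

```python
import nltk

# One-time resource downloads (newer NLTK releases may also need "punkt_tab").
nltk.download("punkt")        # tokenizer models
nltk.download("stopwords")    # stop word corpus

from nltk.tokenize import sent_tokenize, word_tokenize        # tokenizer
from nltk.corpus import stopwords                             # stop words
from nltk.stem import PorterStemmer                           # Porter stemmer
from nltk.cluster import KMeansClusterer, cosine_distance     # k-means clustering
from sklearn.feature_extraction.text import TfidfVectorizer   # TF-IDF vectorizer
```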