Hi, in this project I have used two corpora:
- part of the COCA corpus (Corpus of Contemporary American English), an English-language corpus, and
- the Coronavirus Corpus
It has 11,946,296 words.
I performed these analyses:
1) Word frequency analysis
2) Part-of-speech tagging
3) Chunking and chinking
4) Word feature extraction
5) N-grams
6) Named entity recognition
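To give a rough idea of what the word frequency and n-gram analyses involve, here is a minimal sketch using only the Python standard library. It is not the code from the code folder: the function names (`tokenize`, `word_frequencies`, `ngrams`), the simple regex tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens.

    This simple regex tokenizer is an assumption for illustration only.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text, not taken from either corpus.
sample = "The virus spread quickly. The virus changed daily life."
tokens = tokenize(sample)
freq = word_frequencies(tokens)
bigrams = ngrams(tokens, 2)

print(freq.most_common(3))
print(Counter(bigrams).most_common(1))
```

On a real corpus the same functions would be applied to the text read from the .txt files instead of the sample string.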
The outputs are in the outputs folder. The code is in the code folder. The corpora are in the new corpus folder.
This is the directory structure, with these subfolders:
* new corpus - contains all the .txt files of the corpora
* code - contains all the .py files
* outputs - contains the outputs of the .py files
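With the layout above, the corpus files can be read and counted with a short script like the one below. This is only a sketch: the helper name `count_words` is made up for illustration, it assumes the .txt files sit directly inside the new corpus folder and can be read as UTF-8, and it counts whitespace-separated words, so its total may not match the word count given above.

```python
from pathlib import Path


def count_words(corpus_dir):
    """Return {file name: word count} for each .txt file directly in corpus_dir.

    Words are whitespace-separated strings (an assumption for illustration).
    """
    counts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # errors="ignore" skips bytes that are not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts[path.name] = len(text.split())
    return counts


if __name__ == "__main__":
    # "new corpus" is the folder name from the directory structure above.
    counts = count_words("new corpus")
    for name, n in counts.items():
        print(f"{name}: {n}")
    print("total:", sum(counts.values()))
```

Run it from the top-level folder so that the relative path "new corpus" resolves.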