Indexes and queries paragraphs in a book, using the NLTK and Gensim libraries for text processing.

grooleivsgard/information-retrieval

document-processing

Assignment 3 in TDT4117 Information Retrieval at NTNU.

Gensim-NLTK Paragraph Processor

This repository presents tools for indexing and querying paragraphs from the book 'An Inquiry into the Nature and Causes of the Wealth of Nations' by Adam Smith using the Gensim library.

Features:

  • Document Partitioning: Splits the book into individual paragraphs, each saved as a separate file.
  • Text Preprocessing: Uses NLTK for tokenization, stemming, and other preprocessing steps.
  • Indexing and Querying: Indexes the paragraphs and queries them to retrieve the most relevant content, using an LSI model built on top of TF-IDF weights.
  • Visualization: Plots a frequency distribution graph of the top 15 words after preprocessing.
