Information Retrieval from Semi-Structured Data
In this project, I built an information retrieval (IR) system for semi-structured data, in this case CSV (comma-separated values) data. The dataset contains 8807 records of shows on Netflix, including the show name, director's name, cast, date of release, length/duration, genre and plot.
The aim is to build a search engine for Netflix shows, based on topics learnt throughout the information retrieval course and some concepts from natural language processing.
This project is also an example of domain-specific information retrieval: while creating the IR system, I took into account the kind of data present in the CSV file and the kinds of queries a user may make.
First clone the repo, then use pip to install all the dependencies listed in requirements.txt inside a virtual environment.
For development mode, run
./dev
For production mode, run
./prod
These commands will do the following:
- Clean the raw CSV data
- Train the Doc2Vec model
- Build the indexes
- Run the Flask server in dev/prod mode
The web application will be served at http://localhost:5000
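As a rough illustration of the cleaning and index-building steps listed above, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the column names (title, director, description), the sample rows, and the helpers clean_rows, tokenize, build_index and search are all hypothetical, and the Doc2Vec and Flask steps are left out.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical sample in the shape described above; the column names
# are assumptions, not the real headers of the project's CSV file.
SAMPLE_CSV = """title,director,description
Dark Waters ,Jane Doe,A detective hunts a killer in a small coastal town.
Space Kids,,Two kids build a rocket and fly to the moon.
Town Secrets,John Roe,  A small town hides a dark secret.
"""


def clean_rows(csv_text):
    """Strip stray whitespace and replace empty fields with 'unknown'."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({k: (v or "").strip() or "unknown" for k, v in row.items()})
    return rows


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows, fields=("title", "description")):
    """Build an inverted index mapping each token to a set of row ids."""
    index = defaultdict(set)
    for doc_id, row in enumerate(rows):
        for field in fields:
            for token in tokenize(row[field]):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Boolean AND search: return ids of rows containing every query token."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


rows = clean_rows(SAMPLE_CSV)
index = build_index(rows)
print(search(index, "small town"))
```

Only the title and description fields are indexed in this sketch, so the "unknown" placeholder written into an empty director field never becomes a searchable term. A real system would more likely rank results (for example with TF-IDF or the Doc2Vec vectors) than return a plain boolean match.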