This code implements a TF-IDF (term frequency-inverse document frequency) search engine over a corpus of documents. Documents are ranked against a query by their normalized TF-IDF scores, and the most relevant document is returned. The weighting scheme is ltc.lnc (document weighting first, then query weighting):
- Document: logarithmic tf, logarithmic idf, cosine normalization
- Query: logarithmic tf, no idf, cosine normalization
- Normalized TF-IDF scores for queries and documents
- Returns the name and score of the document most relevant to the query
- Tokenization, stop word removal, stemming
- Clone this repo locally
- Install and update relevant libraries
- Identify corpus directory and update 'corpusroot'
- Use the provided functions to perform document searches based on a query
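
As a rough illustration of the ltc.lnc scheme described above, here is a minimal sketch using only the Python standard library. It is not the repository's actual code: the function names (`log_tf`, `cosine_normalize`, `build_doc_vectors`, `query_vector`, `best_match`) and the toy corpus are hypothetical, base-10 logarithms are an assumption, and the documents are assumed to be already tokenized, stop-word filtered, and stemmed.

```python
import math
from collections import Counter


def log_tf(tokens):
    """Logarithmic term frequency: 1 + log10(count) for each term present."""
    return {term: 1 + math.log10(count) for term, count in Counter(tokens).items()}


def cosine_normalize(weights):
    """Scale a weight vector to unit length; return {} if all weights are zero."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {term: w / norm for term, w in weights.items()}


def build_doc_vectors(docs):
    """Build ltc vectors (log tf * idf, cosine-normalized).

    docs maps a document name to its list of preprocessed tokens.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    return {
        name: cosine_normalize({t: w * idf[t] for t, w in log_tf(tokens).items()})
        for name, tokens in docs.items()
    }


def query_vector(tokens):
    """Build an lnc vector (log tf, no idf, cosine-normalized)."""
    return cosine_normalize(log_tf(tokens))


def best_match(query_tokens, doc_vectors):
    """Return (document name, cosine score) for the highest-scoring document."""
    q = query_vector(query_tokens)
    scores = {
        name: sum(w * vec.get(term, 0.0) for term, w in q.items())
        for name, vec in doc_vectors.items()
    }
    return max(scores.items(), key=lambda item: item[1])


# Toy corpus (hypothetical file names and tokens).
corpus = {
    "a.txt": ["cat", "sat", "mat"],
    "b.txt": ["dog", "sat", "log"],
    "c.txt": ["cat", "cat", "dog"],
}
vectors = build_doc_vectors(corpus)
name, score = best_match(["cat", "mat"], vectors)
print(name, round(score, 3))
```

Because both vectors are unit length, the dot product in `best_match` is the cosine similarity, so scores fall between 0 and 1. Leaving idf out of the query side avoids counting a term's rarity twice, since it is already included in the document weights.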