A blazingly fast search engine for arXiv papers using TF-IDF ranking and parallel processing. Built with Rust because I like living life on the edge (and really fast compile times).
This project creates a search engine for arXiv papers that:
- Indexes papers using TF-IDF (Term Frequency-Inverse Document Frequency)
- Processes data in parallel using Rayon
- Uses concurrent data structures like DashMap
- Stems words to improve search accuracy (e.g., "running" → "run")
- Saves the processed model for lightning-fast subsequent searches
Example search for "Philosophy":
Query Results (took 67.67ms):
Score: 2.0249, Title: Extended Version of "The Philosophy of the Trajectory Representation of Quantum Mechanics"
Score: 0.7087, Title: Kurt Goedel and His Universe
Score: 0.6704, Title: Science and Philosophy: A Love-Hate Relationship
...
First, we clean the raw arXiv dataset using a Python script that:
- Reads the JSON dataset line by line
- Processes each paper's metadata
- Outputs a cleaned version ready for indexing
The main Rust program then:
- Tokenizes paper abstracts by:
  - Removing punctuation
  - Converting to lowercase
  - Removing stop words (common words like "the", "and", "is")
  - Stemming words to their root form
- Creates two main data structures:
  - A word map (word → set of paper IDs containing that word)
  - A paper map (paper ID → paper details)
- Uses parallel processing to speed up indexing:
  - Rayon for parallel iteration
  - DashMap for concurrent hash map updates
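Here is a rough, dependency-free sketch of the indexing steps above. It is not the project's actual code: `toy_stem` is a made-up stand-in for a real stemmer, the stop-word list is truncated, the `Paper` fields are assumed, and `std::thread::scope` plus a merge step stands in for Rayon and DashMap so the example compiles with only the standard library.

```rust
use std::collections::{HashMap, HashSet};
use std::thread;

/// Truncated stop-word list, for illustration only.
const STOP_WORDS: &[&str] = &["the", "and", "is", "a", "of", "in", "to"];

/// Hypothetical paper record; the field names are assumptions.
#[derive(Debug, Clone)]
struct Paper {
    id: String,
    title: String,
    abstract_text: String,
}

/// Word map: word -> set of paper IDs containing that word.
type WordMap = HashMap<String, HashSet<String>>;

/// Toy stemmer: strips "ing" or "ed" and undoes consonant doubling,
/// so "running" becomes "run". A real stemmer handles far more cases.
fn toy_stem(word: &str) -> String {
    for suffix in ["ing", "ed"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            if stem.len() < 4 {
                continue; // leave short words such as "string" or "used" alone
            }
            let mut rev = stem.chars().rev();
            let doubled = match (rev.next(), rev.next()) {
                (Some(a), Some(b)) => a == b && !"aeiou".contains(a),
                _ => false,
            };
            let mut stem = stem.to_string();
            if doubled {
                stem.pop();
            }
            return stem;
        }
    }
    word.to_string()
}

/// Lowercases, treats punctuation as a separator, drops stop words, stems.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|w| !w.is_empty() && !STOP_WORDS.contains(w))
        .map(toy_stem)
        .collect()
}

/// Builds the word map for one slice of papers.
fn index_chunk(papers: &[Paper]) -> WordMap {
    let mut word_map = WordMap::new();
    for paper in papers {
        for token in tokenize(&paper.abstract_text) {
            word_map.entry(token).or_default().insert(paper.id.clone());
        }
    }
    word_map
}

/// Indexes chunks of papers on scoped threads, then merges the partial maps.
fn build_word_map(papers: &[Paper], workers: usize) -> WordMap {
    let workers = workers.max(1);
    let chunk_size = ((papers.len() + workers - 1) / workers).max(1);
    let partials: Vec<WordMap> = thread::scope(|scope| {
        let handles: Vec<_> = papers
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || index_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("indexing thread panicked"))
            .collect()
    });
    let mut merged = WordMap::new();
    for partial in partials {
        for (word, ids) in partial {
            merged.entry(word).or_default().extend(ids);
        }
    }
    merged
}

/// Paper map: paper ID -> paper details.
fn build_paper_map(papers: &[Paper]) -> HashMap<String, Paper> {
    papers.iter().map(|p| (p.id.clone(), p.clone())).collect()
}
```

Calling `build_word_map(&papers, 8)` splits the papers into at most eight chunks and indexes each on its own thread. Merging per-thread maps at the end avoids locking entirely; a concurrent map like DashMap lets every worker write into one shared map instead.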
For each search query, the engine:
- Tokenizes the search terms
- For each term, calculates:
  - Term Frequency (TF): How often the word appears in a paper
  - Inverse Document Frequency (IDF): How rare the word is across all papers
- Combines scores to rank papers by relevance
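To make the scoring steps above concrete, here is a minimal sketch. It assumes the textbook weighting, where TF is the term count divided by the abstract length, IDF is `ln(N / df)`, and a paper's score is the sum of TF × IDF over the query terms. The exact formula this project uses may differ, and `rank` and the shape of `docs` are made up for the example.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Ranks papers against already-tokenized query terms.
/// `docs` maps a paper ID to its tokenized abstract (an assumed layout).
fn rank(query_terms: &[&str], docs: &HashMap<String, Vec<String>>) -> Vec<(String, f64)> {
    let n_docs = docs.len() as f64;
    let mut scores: HashMap<&str, f64> = HashMap::new();

    for &term in query_terms {
        // Document frequency: how many papers contain the term at all.
        let df = docs
            .values()
            .filter(|tokens| tokens.iter().any(|t| t.as_str() == term))
            .count();
        if df == 0 {
            continue; // unknown term contributes nothing
        }
        let idf = (n_docs / df as f64).ln();

        for (id, tokens) in docs {
            let count = tokens.iter().filter(|t| t.as_str() == term).count();
            if count > 0 {
                let tf = count as f64 / tokens.len() as f64;
                *scores.entry(id.as_str()).or_insert(0.0) += tf * idf;
            }
        }
    }

    // Highest score first.
    let mut ranked: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    ranked
}
```

This version scans every abstract for every term, which is fine for a demo but far too slow for the full dataset. With a word map, the candidate papers for a term can be looked up directly instead of scanned.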
- Rust (latest stable)
- Python 3.7+
- At least 16GB RAM
- Download the dataset: Go to ArXiv Dataset on Kaggle
- Clone the repository:
git clone https://github.com/NikSchaefer/arxiv-search
cd arxiv-search
- Clean the dataset:
python preprocess.py
- Build and run:
cargo build --release
cargo run --release
The first run will take a while (about 80 minutes on an M3 MacBook Pro) as it builds the index. Subsequent runs will be much faster as they load the saved model.
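The build-once, load-afterwards behaviour boils down to a small caching pattern. The sketch below is only an illustration: `load_or_build` is a hypothetical helper, and the model is a plain `String` so the example needs no serialization crate. The real on-disk format is whatever the project writes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the cached model at `path` if it exists; otherwise runs the
/// (slow) `build` step, saves the result, and returns it.
fn load_or_build<F>(path: &Path, build: F) -> io::Result<String>
where
    F: FnOnce() -> String,
{
    if path.exists() {
        // Fast path: a previous run already saved the model.
        return fs::read_to_string(path);
    }
    let model = build();
    fs::write(path, &model)?;
    Ok(model)
}
```

Deleting the saved file forces a rebuild on the next run, which is the thing to do after the dataset changes.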
- Initial indexing: ~80 minutes on M3 MacBook Pro
- Query time: ~70ms average
- Model size: Depends on dataset size, but expect several GB
- Add web interface
- Implement more advanced ranking algorithms
- Add support for boolean queries (AND, OR, NOT)
- Add citation graph analysis
- Make it even faster (because why not?)
Found a bug? Have an idea? Feel free to open an issue or submit a PR. I'm always looking to make things better (or at least more interesting).
MIT - Do whatever you want with it, just don't blame me if your computer starts mining bitcoin instead of searching papers.