GraphRAG is a Python implementation that combines graph theory with Retrieval Augmented Generation (RAG) to improve information retrieval and generation for large language models.
- Introduction
- Installation
- Usage
- File Structure
- Dependencies
- Core Components
- Detailed Function Explanations
- Example Usage
- Visualization
- Limitations and Future Work
GraphRAG enhances traditional RAG by using graph structures to represent relationships between text chunks, enabling more contextually relevant retrieval for language model queries.
- Clone the repository and change into its directory:
    git clone https://github.com/hr1juldey/SimpleGRAPHRAG.git
    cd SimpleGRAPHRAG
- Install the required dependencies:
    pip install -r requirements.txt
- Prepare your text data (Markdown or PDF format) and place it in the data/ folder.
- Open Example.ipynb and run the cells one by one.
SimpleGRAPHRAG/
├── Example.ipynb
├── data/
│ ├── st5.md
│ └── other_text_files.md
├── graphs/
│ └── st5.gpickle
├── requirements.txt
└── README.md
- networkx
- numpy
- matplotlib
- PyPDF2
- nltk
- rake_nltk
- ollama
- Text Processing: Converts input text into a hierarchical structure.
- Graph Creation: Builds a NetworkX graph from the processed text.
- Embedding Generation: Uses Ollama to generate embeddings for text chunks.
- Retrieval: Finds relevant chunks based on query similarity.
- Answer Generation: Uses a language model to generate answers based on retrieved context.
Reads content from Markdown or PDF files.
Parameters:
- file_path: Path to the input file
Returns:
- String containing the file content
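A minimal sketch of such a reader follows, assuming PyPDF2 2.x or later (where `PdfReader` is available). The name `read_file` and the set of accepted extensions are illustrative assumptions, not taken from the repository.

```python
from pathlib import Path


def read_file(file_path: str) -> str:
    """Return the text of a Markdown or PDF file (hypothetical helper name)."""
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        # Imported lazily so Markdown-only use does not require PyPDF2.
        from PyPDF2 import PdfReader

        reader = PdfReader(str(path))
        # extract_text() may return None or "" for image-only pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix in {".md", ".markdown", ".txt"}:  # assumed set of text extensions
        return path.read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Raising on unknown extensions, rather than guessing the format, keeps a mistyped path from silently producing an empty graph later on.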
Attempts to detect a table of contents in the input text.
Parameters:
- text: Input text
Returns:
- List of detected table of contents entries
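One possible heuristic for this step is sketched below. It is an assumption about how detection could work, not the repository's actual logic: look for a "Table of Contents" or "Contents" heading, then collect the bullet or numbered lines that follow it. The function name `detect_table_of_contents` is hypothetical.

```python
import re

# A heading line such as "## Table of Contents" or "Contents:".
_TOC_HEADING = re.compile(r"^\s*#*\s*(table of contents|contents)\s*:?\s*$", re.IGNORECASE)
# A bullet ("- x", "* x", "+ x") or numbered ("1. x", "2) x") list entry.
_TOC_ENTRY = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+(.*\S)\s*$")
# A Markdown link "[label](target)"; only the label is kept.
_MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")


def detect_table_of_contents(text: str) -> list[str]:
    """Return the list entries that directly follow a contents heading."""
    entries: list[str] = []
    in_toc = False
    for line in text.splitlines():
        if not in_toc:
            in_toc = bool(_TOC_HEADING.match(line))
            continue
        if not line.strip():
            if entries:
                break  # a blank line after the entries ends the list
            continue  # blank lines between heading and first entry are skipped
        match = _TOC_ENTRY.match(line)
        if match is None:
            break  # first non-list line ends the table of contents
        entries.append(_MD_LINK.sub(r"\1", match.group(1)))
    return entries
```

The function returns an empty list when no contents heading is found, so callers can fall back to heading-based splitting.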
Splits the input text into a hierarchical structure and creates a graph.
Parameters:
- text: Input text
Returns:
- NetworkX graph representing the text structure
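As an illustration of what a hierarchical text graph can look like, the sketch below builds a NetworkX `DiGraph` from Markdown headings and paragraphs. The node names, the `kind`, `level`, and `text` attributes, and the function name `build_text_graph` are all assumptions made for this example; the repository may structure its graph differently.

```python
import re

import networkx as nx

_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_text_graph(text: str) -> nx.DiGraph:
    """Build a tree: root -> sections (nested by heading level) -> paragraph chunks."""
    graph = nx.DiGraph()
    graph.add_node("root", kind="document", level=0, text="")
    stack = ["root"]  # open headings, from the root down to the current section
    buffer: list[str] = []
    chunk_count = 0

    def flush() -> None:
        """Turn buffered lines into one chunk node under the current section."""
        nonlocal chunk_count
        paragraph = " ".join(buffer).strip()
        buffer.clear()
        if paragraph:
            chunk_count += 1
            node = f"chunk-{chunk_count}"
            graph.add_node(node, kind="chunk", text=paragraph)
            graph.add_edge(stack[-1], node)

    for line in text.splitlines():
        heading = _HEADING.match(line)
        if heading:
            flush()
            level = len(heading.group(1))
            # Close sections at the same or a deeper level than the new heading.
            while graph.nodes[stack[-1]]["level"] >= level:
                stack.pop()
            node = f"section-{graph.number_of_nodes()}"
            graph.add_node(node, kind="section", level=level, text=heading.group(2))
            graph.add_edge(stack[-1], node)
            stack.append(node)
        elif line.strip():
            buffer.append(line.strip())
        else:
            flush()  # a blank line ends the current paragraph
    flush()
    return graph
```

Keeping sections and chunks as separate node kinds lets retrieval score only chunk nodes while still walking up to the enclosing section for context.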
Save and load graph structures to/from disk using pickle.
Parameters:
- graph: NetworkX graph object
- filepath: Path to save/load the graph
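A sketch of pickle-based persistence is shown below. `load_graph` matches the name used in the example usage; `save_graph` is an assumed counterpart. The standard `pickle` module is used directly because `nx.write_gpickle` and `nx.read_gpickle` were removed in NetworkX 3.0.

```python
import pickle
from pathlib import Path


def save_graph(graph, filepath) -> None:
    """Serialize a graph object to disk (assumed name, mirrors load_graph)."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)  # create graphs/ if missing
    with path.open("wb") as handle:
        pickle.dump(graph, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_graph(filepath):
    """Load a previously pickled graph. Only unpickle files you trust."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)
```

Because unpickling can execute arbitrary code, graphs obtained from untrusted sources should not be loaded this way.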
Generates an embedding for the given text using the Ollama API.
Parameters:
- text: Input text
- model: Embedding model to use
Returns:
- Embedding vector
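The call could be wrapped roughly as follows. This is a sketch that assumes the `ollama` Python package and a running Ollama server. The default model name `nomic-embed-text`, the `backend` parameter, and the in-memory cache are illustrative additions, not features confirmed by the repository.

```python
from typing import Callable, Optional

# In-memory cache keyed by (model, text); an optional convenience, assumed here.
_CACHE: dict = {}


def _ollama_backend(text: str, model: str) -> list:
    """Request an embedding from a local Ollama server."""
    import ollama  # imported lazily so the module loads without Ollama installed

    response = ollama.embeddings(model=model, prompt=text)
    return response["embedding"]


def get_embedding(
    text: str,
    model: str = "nomic-embed-text",  # assumed default; use any pulled embedding model
    backend: Optional[Callable[[str, str], list]] = None,
) -> list:
    """Return the embedding vector for text, reusing cached results."""
    key = (model, text)
    if key not in _CACHE:
        embed = backend or _ollama_backend
        _CACHE[key] = list(embed(text, model))
    return _CACHE[key]
```

Passing a custom `backend` makes it possible to swap in another embedding provider or a stub for offline testing, and the cache avoids re-embedding the same chunk on repeated runs.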
Calculates cosine similarity between chunk and query embeddings.
Parameters:
- chunk: Text chunk
- query_embedding: Query embedding vector
- embedding: Chunk embedding vector
Returns:
- Tuple of (chunk, similarity score)
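Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below uses only the standard library; the function name `calculate_similarity` and the choice to return 0.0 for a zero-length vector are assumptions.

```python
import math
from typing import Sequence, Tuple


def calculate_similarity(
    chunk: str,
    query_embedding: Sequence[float],
    embedding: Sequence[float],
) -> Tuple[str, float]:
    """Return (chunk, cosine similarity between the two embedding vectors)."""
    dot = sum(q * e for q, e in zip(query_embedding, embedding))
    norm_q = math.sqrt(sum(q * q for q in query_embedding))
    norm_e = math.sqrt(sum(e * e for e in embedding))
    if norm_q == 0.0 or norm_e == 0.0:
        # Cosine similarity is undefined for a zero vector; treat it as no match.
        return chunk, 0.0
    return chunk, dot / (norm_q * norm_e)
```

Returning the chunk alongside its score keeps the pairing intact when the function is mapped over many chunks, for example in a thread pool.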
Finds the most relevant chunks in the graph based on the query.
Parameters:
- query: Input query
- graph: NetworkX graph of the text
Returns:
- List of tuples containing (chunk, similarity score)
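The retrieval step can be sketched as a scan over the graph's nodes that keeps the highest-scoring chunks. This example assumes each chunk node stores `text` and `embedding` attributes and that the caller supplies an `embed` function for the query. The attribute names, the `embed` and `top_k` parameters, and the function name are hypothetical.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

import networkx as nx


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, with 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_most_relevant_chunks(
    query: str,
    graph: nx.Graph,
    embed: Callable[[str], Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to top_k (chunk text, similarity) pairs, best match first."""
    query_embedding = embed(query)
    scored = []
    for _node, data in graph.nodes(data=True):
        if "embedding" not in data:
            continue  # skip structural nodes such as sections
        scored.append((data["text"], _cosine(query_embedding, data["embedding"])))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
```

This scan ignores the edges entirely. A graph-aware variant could also pull in the neighbours or parent section of each top-scoring chunk.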
Generates an answer to the query using the graph and a language model.
Parameters:
- query: Input query
- graph: NetworkX graph of the text
Returns:
- Generated answer string
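The final step usually amounts to placing the retrieved chunks into a prompt and sending it to the model. The sketch below shows one way to do that with the `ollama` package. The prompt wording, the helper names `build_prompt` and `generate_answer`, and the default model `llama3` are assumptions, not the repository's actual choices.

```python
from typing import List, Tuple


def build_prompt(query: str, chunks: List[Tuple[str, float]]) -> str:
    """Assemble a prompt from (chunk text, similarity) pairs and the question."""
    context = "\n\n".join(
        f"[{index}] {text}" for index, (text, _score) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def generate_answer(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to a local Ollama server (assumed model name)."""
    import ollama  # imported lazily; requires a running Ollama server

    response = ollama.generate(model=model, prompt=prompt)
    return response["response"]
```

Numbering the context passages makes it easier to ask the model to cite which chunk supports its answer, and it helps when debugging poor retrievals.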
Visualizes the graph structure using matplotlib.
Parameters:
- graph: NetworkX graph object
Figure: Graph visualisation of a story by Ruskin Bond (GraphRAG)
# Load a graph
graph = load_graph("./graphs/sample_graph.gpickle")
# Ask a question
query = "What is the significance of the cherry seed in the story?"
answer = answer_query(query, graph)
print(f"Question: {query}")
print(f"Answer: {answer}")
The visualize_graph function can be used to create a visual representation of the graph structure. This is useful for small to medium-sized graphs but may become cluttered for very large texts.
- The current implementation may be slow for very large texts.
- Graph visualization can be improved for better readability.
- More advanced graph algorithms could be implemented for better retrieval.
- Integration with other embedding models and language models could be explored.
Feel free to contribute to this project by submitting pull requests or opening issues for bugs and feature requests. You can also read about projects like this one on our website, AI&U.