GraphRAG: Graph-based Retrieval Augmented Generation

GraphRAG is a Python implementation that combines graph theory with Retrieval Augmented Generation (RAG) to improve information retrieval and generation for large language models.

Introduction

GraphRAG enhances traditional RAG by using graph structures to represent relationships between text chunks, enabling more contextually relevant retrieval for language model queries.

Installation

Clone the repository:

git clone https://github.com/hr1juldey/SimpleGRAPHRAG.git
cd graphrag

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Prepare your text data (Markdown or PDF format). Keep it in the data/ folder.
Open Example.ipynb and run the cells one by one.

File Structure

SimpleGRAPHRAG/
├── Example.ipynb
├── data/
│   ├── st5.md
│   └── other_text_files.md
├── graphs/
│   └── st5.gpickle
├──requirements.txt
└── README.md

Dependencies

networkx
numpy
matplotlib
PyPDF2
nltk
rake_nltk
ollama

Core Components

Text Processing: Converts input text into a hierarchical structure.
Graph Creation: Builds a NetworkX graph from the processed text.
Embedding Generation: Uses Ollama to generate embeddings for text chunks.
Retrieval: Finds relevant chunks based on query similarity.
Answer Generation: Uses a language model to generate answers based on retrieved context.

Detailed Function Explanations

`read_file(file_path)`

Reads content from Markdown or PDF files.

Parameters:

file_path: Path to the input file

Returns:

String containing the file content

`detect_table_of_contents(text)`

Attempts to detect a table of contents in the input text.

Parameters:

text: Input text

Returns:

List of detected table of contents entries

`split_text_into_sections(text)`

Splits the input text into a hierarchical structure and creates a graph.

Parameters:

text: Input text

Returns:

NetworkX graph representing the text structure

`save_graph(graph, filepath)` and `load_graph(filepath)`

Save and load graph structures to/from disk using pickle.

Parameters:

graph: NetworkX graph object
filepath: Path to save/load the graph

`get_embedding(text, model="mxbai-embed-large")`

Generates embeddings for given text using Ollama API.

Parameters:

text: Input text
model: Embedding model to use

Returns:

Embedding vector

`calculate_cosine_similarity(chunk, query_embedding, embedding)`

Calculates cosine similarity between chunk and query embeddings.

Parameters:

chunk: Text chunk
query_embedding: Query embedding vector
embedding: Chunk embedding vector

Returns:

Tuple of (chunk, similarity score)

`find_most_relevant_chunks(query, graph)`

Finds the most relevant chunks in the graph based on the query.

Parameters:

query: Input query
graph: NetworkX graph of the text

Returns:

List of tuples containing (chunk, similarity score)

`answer_query(query, graph)`

Generates an answer to the query using the graph and a language model.

Parameters:

query: Input query
graph: NetworkX graph of the text

Returns:

Generated answer string

`visualize_graph(graph)`

Visualizes the graph structure using matplotlib.

Parameters:

graph: NetworkX graph object

Example Usage

# Load a graph
graph = load_graph("./graphs/sample_graph.gpickle")

# Ask a question
query = "What is the significance of the cherry seed in the story?"
answer = answer_query(query, graph)
print(f"Question: {query}")
print(f"Answer: {answer}")

Visualization

The visualize_graph function can be used to create a visual representation of the graph structure. This is useful for small to medium-sized graphs but may become cluttered for very large texts.

Limitations and Future Work

The current implementation may be slow for very large texts.
Graph visualization can be improved for better readability.
More advanced graph algorithms could be implemented for better retrieval.
Integration with other embedding models and language models could be explored.

Feel free to contribute to this project by submitting pull requests or opening issues for bugs and feature requests. You can also read about projects like these in our website AI&U

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
graphs		graphs
plots		plots
.gitignore		.gitignore
1595345731-great-stories-for-children---ruskin-bond.pdf		1595345731-great-stories-for-children---ruskin-bond.pdf
Example.ipynb		Example.ipynb
ReadME.md		ReadME.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GraphRAG: Graph-based Retrieval Augmented Generation

Table of Contents

Introduction

Installation

Usage

File Structure

Dependencies

Core Components

Detailed Function Explanations

`read_file(file_path)`

`detect_table_of_contents(text)`

`split_text_into_sections(text)`

`save_graph(graph, filepath)` and `load_graph(filepath)`

`get_embedding(text, model="mxbai-embed-large")`

`calculate_cosine_similarity(chunk, query_embedding, embedding)`

`find_most_relevant_chunks(query, graph)`

`answer_query(query, graph)`

`visualize_graph(graph)`

Example Usage

Visualization

Limitations and Future Work

About

Releases

Packages

Languages

hr1juldey/SimpleGRAPHRAG

Folders and files

Latest commit

History

Repository files navigation

GraphRAG: Graph-based Retrieval Augmented Generation

Table of Contents

Introduction

Installation

Usage

File Structure

Dependencies

Core Components

Detailed Function Explanations

read_file(file_path)

detect_table_of_contents(text)

split_text_into_sections(text)

save_graph(graph, filepath) and load_graph(filepath)

get_embedding(text, model="mxbai-embed-large")

calculate_cosine_similarity(chunk, query_embedding, embedding)

find_most_relevant_chunks(query, graph)

answer_query(query, graph)

visualize_graph(graph)

Example Usage

Visualization

Limitations and Future Work

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`read_file(file_path)`

`detect_table_of_contents(text)`

`split_text_into_sections(text)`

`save_graph(graph, filepath)` and `load_graph(filepath)`

`get_embedding(text, model="mxbai-embed-large")`

`calculate_cosine_similarity(chunk, query_embedding, embedding)`

`find_most_relevant_chunks(query, graph)`

`answer_query(query, graph)`

`visualize_graph(graph)`

Packages