Welcome to the GCP-based Retrieval-Augmented Generation (RAG) System repository. This project uses Google Cloud Platform (GCP) to build a scalable RAG system for handling large amounts of data. Source files arrive in various formats and conditions, so they are preprocessed before being ingested into a GCP Datastore. The system uses the Gemini API to search the data and generate summaries, and provides a user interface built with Streamlit.
This project involves several key steps:
- Data Preprocessing: Convert and format data files from various formats (doc, pdf) to a consistent format.
- Local Database Creation: Build a local version of the company's database.
- Data Ingestion: Sequentially process the files and make necessary format changes.
- Cloud Storage: Store the processed data in GCP Cloud Buckets.
- Datastore Creation: Use GCP Console to create a scalable Datastore, serving as the vector database.
- API Integration: Utilize the Gemini API for data search and summary generation.
- User Interface: Implement a Streamlit-based UI for interaction.
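The retrieve-then-generate flow behind the steps above can be sketched as follows. This is a minimal illustration, not the repository's code: `build_prompt`, `answer`, and the `retrieve` and `generate` callables are hypothetical names standing in for the Datastore search and the Gemini call.

```python
from typing import Callable, Sequence


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Combine retrieved passages and the user question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered passages below.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question.strip()}\n"
    )


def answer(
    question: str,
    retrieve: Callable[[str], Sequence[str]],  # assumed: wraps the Datastore search
    generate: Callable[[str], str],            # assumed: wraps the Gemini call
) -> str:
    """Retrieve passages for the question, then ask the model to answer."""
    passages = retrieve(question)
    if not passages:
        return "No relevant documents were found."
    return generate(build_prompt(question, passages))
```

Passing the search and generation steps in as callables keeps the flow testable without cloud credentials; in the real application they would be replaced by the project's own search and Gemini functions.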
- Clone the repository:
      git clone https://github.com/your-username/your-repository.git
      cd your-repository
- Create a virtual environment:
      python -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the dependencies:
      pip install -r requirements.txt
- Install Google Cloud SDK:
  Follow the instructions to install the Google Cloud SDK.
- Preprocess and Ingest Data:
  - Modify and run the scripts in the `Doc_ingestion` folder to preprocess and format your data.
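Before adapting the preprocessing scripts, it can help to see which input files you actually have. The sketch below is an assumption about how that inventory might look and is not taken from the `Doc_ingestion` scripts; `group_by_format` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix list only mirrors the doc and pdf formats mentioned in this README.

```python
from collections import defaultdict
from pathlib import Path

# Assumed input formats, based on the "doc, pdf" examples in this README.
SUPPORTED_SUFFIXES = {".doc", ".pdf"}


def group_by_format(root: Path) -> dict[str, list[Path]]:
    """Return supported files under root, grouped by lowercase suffix."""
    groups: dict[str, list[Path]] = defaultdict(list)
    # sorted() gives a stable order so files are processed sequentially
    # and reproducibly from run to run.
    for path in sorted(root.rglob("*")):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in SUPPORTED_SUFFIXES:
            groups[suffix].append(path)
    return dict(groups)
```

Each group can then be handed to the converter for that format, one file at a time.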
- Upload Data to GCP:
  - Store the processed data in GCP Cloud Buckets.
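One way to script the upload, rather than using the console, is the `google-cloud-storage` client library. The sketch below assumes that package is installed and that Application Default Credentials are configured; `object_name`, `upload_folder`, and the `processed` prefix are hypothetical choices, not names from this repository.

```python
from pathlib import Path


def object_name(local_path: Path, root: Path, prefix: str = "processed") -> str:
    """Map a local file to a bucket object name, keeping its relative path."""
    relative = local_path.relative_to(root).as_posix()
    prefix = prefix.strip("/")
    return f"{prefix}/{relative}" if prefix else relative


def upload_folder(root: Path, bucket_name: str, prefix: str = "processed") -> list[str]:
    """Upload every file under root to the bucket and return the object names."""
    # Imported here so object_name() works without the client library installed.
    from google.cloud import storage

    client = storage.Client()  # picks up Application Default Credentials
    bucket = client.bucket(bucket_name)
    uploaded = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        name = object_name(path, root, prefix)
        bucket.blob(name).upload_from_filename(str(path))
        uploaded.append(name)
    return uploaded
```

Keeping the relative path in the object name preserves the folder layout produced by preprocessing, which makes it easier to trace a stored object back to its source file.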
- Create Datastore:
  - Use the GCP Console to create a Datastore for your project. Remember to replace `ProjectID`, `Location`, and `Datastore` with your project-specific details.
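A small guard can catch a placeholder that was never replaced. The sketch below is only one possible approach and assumes the three values are supplied through environment variables; the variable names (`RAG_PROJECT_ID`, `RAG_LOCATION`, `RAG_DATASTORE_ID`) and the `RagConfig` class are hypothetical and do not come from this repository.

```python
import os
from dataclasses import dataclass

# Placeholder names from this README; a value equal to one was never replaced.
PLACEHOLDERS = {"ProjectID", "Location", "Datastore"}


@dataclass(frozen=True)
class RagConfig:
    project_id: str
    location: str
    datastore_id: str


def load_config(env=os.environ) -> RagConfig:
    """Read the three settings from the environment and reject placeholders."""
    # Hypothetical variable names; use whatever your scripts expect.
    names = {
        "project_id": "RAG_PROJECT_ID",
        "location": "RAG_LOCATION",
        "datastore_id": "RAG_DATASTORE_ID",
    }
    values = {}
    for field, var in names.items():
        value = env.get(var, "").strip()
        if not value or value in PLACEHOLDERS:
            raise ValueError(f"Set {var} to your project-specific value")
        values[field] = value
    return RagConfig(**values)
```

Failing at startup with a clear message is usually easier to debug than a rejected request from GCP later on.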
- Run the Application:
  - Use Streamlit to launch the UI and interact with your data.
        streamlit run main.py
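For orientation, a Streamlit page of this kind can be very small. The sketch below is not the repository's `main.py`; it assumes a single text box and button, and `search_and_summarize` is a hypothetical stand-in for the project's search and Gemini summary call.

```python
def search_and_summarize(query: str) -> str:
    """Placeholder for the project's search and summary call (hypothetical)."""
    query = query.strip()
    if not query:
        return "Please enter a question."
    # Replace this line with the real retrieval and Gemini call.
    return f"(no backend connected) You asked: {query}"


def render() -> None:
    """Draw a one-field search page."""
    import streamlit as st  # imported lazily so the helper is testable alone

    st.title("Document search")
    query = st.text_input("Ask a question about your documents")
    if st.button("Search"):
        st.write(search_and_summarize(query))
```

To try it, call `render()` at the bottom of a script and start that script with `streamlit run`.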
- Replace `ProjectID`, `Location`, and `Datastore` with your specific project details when setting up the GCP components.
- Ensure all dependencies are installed using the `requirements.txt` file.
- Google Cloud SDK must be installed and authenticated for proper GCP interaction.
For any further questions or issues, feel free to open an issue on this repository.