🤖 ✨ 🔍 Generate precise, realistic user-focused search queries from text 🛒 🚀 📊
This project contains the code used to create the fine-tuned model GenQ. This model has been specifically designed to generate high-quality customer-like queries for e-commerce products, achieving improved performance compared to the base model.
This repository serves as a comprehensive resource for:
- **Data Preprocessing:** Scripts and utilities for preparing the dataset used in fine-tuning, ensuring a robust and effective training process.
- **Model Fine-Tuning:** Code and configurations for fine-tuning the base model on the customized dataset.
- **Performance Insights:** Configurations and examples showcasing the model's performance improvements and applications.
By leveraging the GenQ model, e-commerce platforms and others can enhance search quality and generate more relevant queries tailored to their products.
Whether you're looking to understand the data preparation process, fine-tune your own model, or integrate this solution into your workflow, this repository has you covered.
- **Model Name:** Fine-Tuned Query-Generation Model
- **Model Type:** Text-to-Text Transformer
- **Architecture:** Based on the pre-trained transformer model BeIR/query-gen-msmarco-t5-base-v1
- **Primary Use Case:** Generating accurate, relevant, human-like search queries from product descriptions or articles
- **Dataset:** smartcat/Amazon-2023-GenQ
Our collection includes four models trained on different inputs, with T5-GenQ-TDC-v1 being our best-performing model.
- T5-GenQ-T-v1: Trained on product titles only
- T5-GenQ-TD-v1: Trained on product titles + descriptions
- T5-GenQ-TDE-v1: Trained on product titles + descriptions, plus an additional set of products with titles only (doubling the dataset size)
- T5-GenQ-TDC-v1: Trained on product titles + descriptions, plus a subset of title-only products whose similarity score with short queries was above 85%
- max_input_length: 512
- max_target_length: 30
- batch_size: 48
- num_train_epochs: 8
- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 5.6e-05
- weight_decay: 0.01
- predict_with_generate: true
- load_best_model_at_end: true
- metric_for_best_model: eval_rougeL
- greater_is_better: true
- logging_strategy: epoch
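Most of these settings map directly onto Hugging Face `Seq2SeqTrainingArguments`. Below is a minimal sketch of that mapping (the `output_dir` is a placeholder, and `max_input_length`/`max_target_length` are applied when tokenizing the data rather than here); the actual values used for training come from config/config.yaml:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="results/train_example",  # placeholder output path
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    num_train_epochs=8,
    learning_rate=5.6e-05,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    predict_with_generate=True,       # generate sequences during evaluation
    load_best_model_at_end=True,      # reload the checkpoint with the best eval_rougeL
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
)
```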
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation in NLP. The metrics compare an automatically produced summary or translation against a human-produced reference (or a set of references). ROUGE scores range between 0 and 1, with higher values indicating greater similarity between the generated text and the reference.
In our evaluation, ROUGE scores are scaled to a 0–100 range for better interpretability. The metric used during training was ROUGE-L.
The results of our model variations are:
Model | Epoch | Step | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
---|---|---|---|---|---|---|
T5-GenQ-T-v1 | 7.0 | 29995 | 75.2151 | 54.8735 | 74.5142 | 74.5262 |
T5-GenQ-TD-v1 | 8.0 | 34280 | 78.2570 | 58.9586 | 77.5308 | 77.5466 |
T5-GenQ-TDE-v1 | 8.0 | 68552 | 76.9075 | 57.0980 | 76.1464 | 76.1502 |
T5-GenQ-TDC-v1 | 8.0 | 41448 | 80.0754 | 61.5974 | 79.3557 | 79.3427 |
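As a reference for how such scores are computed, here is a minimal sketch using the Hugging Face `evaluate` library (illustrative only, not necessarily the exact evaluation code in this repository); the example strings are taken from the before/after table further below:

```python
import evaluate

# Load the ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum).
rouge = evaluate.load("rouge")

predictions = ["women's plaid pajama set"]   # query generated by the model
references = ["flannel pajama set women's"]  # target (human-like) query

scores = rouge.compute(predictions=predictions, references=references)
# Scores come back in [0, 1]; scale by 100 to match the table above.
print({name: round(value * 100, 2) for name, value in scores.items()})
```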
A6000 GPU:
- Memory Size: 48 GB
- Memory Type: GDDR6
- CUDA: 12.4
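To confirm that your own environment sees a comparable GPU and CUDA runtime before training, you can run a quick PyTorch check (a minimal sketch, assuming PyTorch is installed as part of the project dependencies):

```python
import torch

# Report the visible CUDA device and its memory, if any.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {props.total_memory / 1024**3:.1f} GB")
    print(f"CUDA runtime: {torch.version.cuda}")
else:
    print("No CUDA device found")
```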
To get started, clone the repository and install the required dependencies:

```bash
git clone https://github.com/smartcat-labs/product2query.git
```

To install and set up Poetry, see the Poetry Documentation.

To keep the environment organized, configure Poetry to create the virtual environment inside the project directory:

```bash
poetry config virtualenvs.in-project true
```

After installing and setting up Poetry, run:

```bash
poetry install --no-root
```

to install all necessary dependencies.
To run the training, prepare the config.yaml file. If you don't want to modify it, you can simply run the training with:

```bash
python -m modules.train.train -c config/config.yaml
```

If you want to test the training on a sample of the dataset, set the `dev` flag to `True` in config.yaml, or simply run the training with:

```bash
python -m modules.train.train -c config/test_config.yaml
```

The best three models will be saved to `results/train_date_time/models` by default.
The checkpoint with the highest ROUGE-L score, which you can check in evaluation_metrics.csv, should be your best-performing model.
To compare checkpoints, you can run the evaluation.
The evaluation consists of generating queries with two models and computing each ROUGE metric for both. In our case, we ran the evaluation with the pre-trained model and our fine-tuned model.
To run the evaluation, prepare the eval_config.yaml file. You must set the `model_paths` in the file to your checkpoint path to test your model. If you don't want to modify the file, you can simply run the evaluation with:

```bash
python -m modules.eval.model_eval -c config/eval_config.yaml
```

This will run the evaluation with our fine-tuned model by default.
After it's finished, you can look at the results in the generated_results.csv file, saved to `results/eval_date_time` by default.
For further analysis, use the results_analysis.py script with your generated_results.csv to create plots and inspect specific cases where your model performed better or worse. To run the script, specify the path to your generated results in the analysis_config.yaml file, set the parameters to your liking, and run the script with:

```bash
python -m modules.eval.results_analysis config/analysis_config.yaml
```
This model is designed to enhance search functionality by generating user-like search queries based on textual descriptions. It is particularly suited for applications where product/item text is the primary input, and the goal is to create concise, relevant queries that align with user search intent.
Target Query | Before Fine-tuning | After Fine-tuning |
---|---|---|
flannel pajama set women's | what to wear with a pajama set | women's plaid pajama set |
custom name necklace | what is casey name necklace | personalized name necklace |
Large Satin Sleep Cap | what is the size of a silk bonnet | satin sleep cap |
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSeq2SeqLM.from_pretrained("smartcat/T5-GenQ-TDC-v1")
tokenizer = AutoTokenizer.from_pretrained("smartcat/T5-GenQ-TDC-v1")

description = "Silver-colored cuff with embossed braid pattern. Made of brass, flexible to fit wrist."

# Tokenize the description and generate a query with beam search
inputs = tokenizer(description, return_tensors="pt", padding=True, truncation=True)
generated_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4, early_stopping=True)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
```
For each of the four model variants, the repository includes evaluation plots: average scores by model, density comparison, histogram comparison, scores by generated query length, semantic similarity distribution, and semantic similarity score against ROUGE scores.
To assess the performance of our fine-tuned query generation model, we conducted evaluations on a dataset containing real user queries, which was not part of the fine-tuning data. The goal was to verify the model's generalizability and effectiveness in generating high-quality queries for e-commerce products.
For this experiment, we utilized milistu/amazon-esci-data. From this dataset, we specifically selected queries that were in English and closely aligned with the corresponding product descriptions.
Plots for this experiment (average scores by model and semantic similarity distribution) are included in the repository.
- **Better Generalization:** Even though the fine-tuned model was not trained on this dataset, it generalizes better than the original model on real user queries.
- **Improved Query Quality:** The fine-tuned model produces more relevant, structured, and user-aligned queries, which is critical for enhancing search and recommendation performance in e-commerce.
- **Robust Semantic Alignment:** Higher semantic similarity scores indicate that queries generated by the fine-tuned model better match user intent, leading to improved search and retrieval performance (see the sketch below for one way such similarity scores can be computed).
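As an illustration of how a semantic-similarity score between a generated query and a target query can be computed, here is a minimal sketch using sentence-transformers. The embedding model (all-MiniLM-L6-v2) and the example strings are assumptions for illustration, not necessarily what was used to produce the plots above:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; swap in whichever sentence encoder you prefer.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

target = "flannel pajama set women's"     # reference query
generated = "women's plaid pajama set"    # query produced by the model

# Encode both queries and compute their cosine similarity.
embeddings = embedder.encode([target, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {similarity:.3f}")
```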