This repository contains an implementation of the techniques presented in the research paper "MathPrompter: Mathematical Reasoning Using Large Language Models" by Shima Imani, Liang Du, and Harsh Shrivastava from Microsoft Research. The implementation aims to replicate the improved performance of Large Language Models (LLMs) in arithmetic reasoning tasks using the MathPrompter technique.
This project is an independent implementation of the techniques described in the "MathPrompter: Mathematical Reasoning Using Large Language Models" paper by Microsoft researchers. It is not officially associated with the original authors or Microsoft. For the official and original research, please refer to the cited paper.
The original paper omits the units (e.g. "$50" is replaced by A). I believe that units are important for reasoning (consider the question "What is 1m divided by 20cm?"). Therefore, this implementation keeps the units as part of the question.
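As a rough illustration of this choice, the sketch below replaces each number in a question with a variable while leaving any adjacent unit text untouched. The function name `to_template` and the regular expression are assumptions made for this example and are not taken from the repository's code; the example question adds a space between number and unit so the result stays readable.

```python
import re
import string


def to_template(question):
    """Replace each number with a variable (A, B, ...) and keep the units.

    Hypothetical helper for illustration; supports at most 26 numbers.
    """
    values = {}
    names = iter(string.ascii_uppercase)

    def substitute(match):
        name = next(names)
        values[name] = float(match.group())
        return name

    # Match integers and simple decimals; everything else (units, "$") is kept.
    template = re.sub(r"\d+(?:\.\d+)?", substitute, question)
    return template, values


template, values = to_template("What is 1 m divided by 20 cm?")
print(template)  # What is A m divided by B cm?
print(values)    # {'A': 1.0, 'B': 20.0}
```

With this approach a currency amount such as "$50" becomes "$A", so the unit stays visible to the model.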
We have adopted customized few-shot prompts instead of those proposed in the original paper. These prompts have shown improved performance and consistency on platforms like Google Gemini Pro and Azure OpenAI GPT-3.5-Turbo.
Example prompt for arithmetic expression generation:
<Question>: John has A apples. He gives B apples to his friend. How many apples does John have left?
Answer = A - B
...
<Question>: {question}
However, you should experiment with different prompts and test which prompts work best for your specific LLM API, as effectiveness can vary depending on the model.
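The following is a minimal sketch of how such a template could be filled in and how a returned expression could be checked by evaluating it on random variable values, which is the kind of consistency check MathPrompter relies on. The template here contains only the single example shown above (the remaining few-shot examples are omitted), and the helper names `build_prompt`, `evaluate_expression`, and `expressions_agree` are assumptions for illustration, not the repository's API.

```python
import random

# Abbreviated template: only the one example shown above is included.
PROMPT_TEMPLATE = (
    "<Question>: John has A apples. He gives B apples to his friend. "
    "How many apples does John have left?\n"
    "Answer = A - B\n"
    "<Question>: {question}\n"
)


def build_prompt(question):
    """Insert the templated question into the few-shot prompt."""
    return PROMPT_TEMPLATE.format(question=question)


def evaluate_expression(expression, values):
    """Evaluate an arithmetic expression with the given variable values.

    eval is used for brevity only; do not run it on untrusted model output.
    """
    return eval(expression, {"__builtins__": {}}, dict(values))


def expressions_agree(expr_a, expr_b, variables, trials=5, seed=0):
    """Return True if both expressions match on several random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = {name: rng.randint(1, 100) for name in variables}
        difference = evaluate_expression(expr_a, values) - evaluate_expression(expr_b, values)
        if abs(difference) > 1e-9:
            return False
    return True


print(evaluate_expression("A - B", {"A": 5, "B": 3}))     # 2
print(expressions_agree("A - B", "-B + A", ["A", "B"]))  # True
print(expressions_agree("A - B", "A + B", ["A", "B"]))   # False
```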
For evaluating our implementation, we benchmarked it on the SVAMP dataset (which contains 1000 math word problems). For the detailed evaluation process, please look at the README in the evaluation folder.
To manage costs effectively, each prompt was processed exactly once, an approach we term self_consistency=1. Given that each problem is addressed in a single model run, we set the temperature parameter to 0. This makes the model response as deterministic as possible, providing the most probable and stable output for each input without the variability that higher temperature settings would introduce.
The evaluation of MathPrompter on the SVAMP dataset achieved an accuracy of 63.9%.
The evaluation of MathPrompter using the SVAMP dataset reveals distinct strengths across different methodologies. The comprehensive approach (MathPrompter Total) demonstrates respectable accuracy (63.9%) with a low hallucination rate (10.9%), indicating reliable problem-solving capabilities. Specialized methods such as Algebraic Only and Python Only exhibit higher accuracy rates of 77.3% and 70.5% respectively but also show increased hallucination rates. This points to their heightened sensitivity and potential to overfit specific problem types.
To compare our results against current state-of-the-art methods, see the Papers with Code leaderboard for the SVAMP benchmark.
Notably, when considering only models that operate without the use of additional training data, our implementation ranks 4th.
Clone this repository and navigate into the project directory. Install the required dependencies:
git clone https://github.com/RamonKaspar/MathPrompter.git
cd MathPrompter
pip install -r requirements.txt
To use MathPrompter, run the main script or import functions directly into your Python projects:
python main.py
Detailed documentation on function usage and parameters can be found in the docstrings within the code.
For instructions on setting up a connection to an LLM API, please consult the README.md file located in the llm_inference directory.
While the current implementation of MathPrompter provides a foundational approach to solving algebraic questions, there are several enhancements and optimizations that can further improve its performance and functionality:
- Parallelization of API Calls:
  - Implement parallel processing to handle API calls more efficiently. This could significantly speed up computations by making simultaneous requests.
- Probabilistic Results Implementation:
  - Refine the result evaluation mechanism to return a probability along with each result. Currently, a consensus needs to be reached across all iterations for a result to be returned.
- Use Different Model Parameters:
  - We typically prompt the LLM $N$ times (default is $N=5$) to generate algebraic expressions and Python code. By varying model parameters such as temperature and top_p in future iterations, we could potentially obtain more diverse answers.
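The first two ideas above could be sketched roughly as follows: requests are issued concurrently with a thread pool, and the most common answer is returned together with its share of the votes instead of requiring full consensus. The function `fake_llm_call` is a stand-in that returns canned answers, and all names here are assumptions for illustration; none of this reflects the repository's current code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(prompt, sample_index):
    """Stand-in for a real API request; returns a canned answer."""
    canned = ["8", "8", "8", "6", "8"]
    return canned[sample_index % len(canned)]


def sample_answers(prompt, n=5, llm_call=fake_llm_call):
    """Issue n requests concurrently and collect the answers in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(llm_call, prompt, i) for i in range(n)]
        return [future.result() for future in futures]


def vote(answers):
    """Return the most common answer and the fraction of samples agreeing."""
    counts = Counter(answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)


answers = sample_answers("<Question>: ...", n=5)
print(vote(answers))  # ('8', 0.8)
```

Threads suit this case because the work is dominated by waiting on network responses; a real client may also need rate limiting, which is not shown here.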
This implementation is based on the following work:
@article{imani2023mathprompter,
title={MathPrompter: Mathematical Reasoning Using Large Language Models},
author={Imani, Shima and Du, Liang and Shrivastava, Harsh},
journal={arXiv preprint arXiv:2303.05398},
year={2023}
}
For more detail, read the paper on arXiv (arXiv:2303.05398).