Q-Star (Q*) in Reinforcement Learning
Author: Miquel Noguer i Alonso - Founder at AI Finance Institute
Date: November 23, 2023
Q* is the standard notation for the optimal action-value function in RL.
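To make the definition concrete, here is a minimal sketch of how Q* can be computed for a toy problem. It runs Q-value iteration, i.e. repeated application of the Bellman optimality operator Q(s,a) ← R(s,a) + γ · max_a' Q(s',a'), on a tiny hand-made MDP; the MDP itself is an illustrative assumption, not anything from the rumored algorithm.

```python
import numpy as np

# Tiny deterministic 2-state MDP, chosen purely for illustration:
# state 0: action 0 stays in 0 (reward 0), action 1 moves to 1 (reward 1);
# state 1 is absorbing with reward 0 for both actions.
P = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}          # next state
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.0}  # reward
gamma = 0.9

# Q-value iteration: apply the Bellman optimality operator
# Q(s,a) <- R(s,a) + gamma * max_a' Q(s',a') until convergence to Q*.
Q = np.zeros((2, 2))
for _ in range(200):
    Q_new = np.array([[R[s, a] + gamma * Q[P[s, a]].max()
                       for a in (0, 1)] for s in (0, 1)])
    if np.abs(Q_new - Q).max() < 1e-10:
        break
    Q = Q_new

print(Q)  # Q*(0,1) = 1.0: moving to state 1 is optimal in state 0
```

At the fixed point, Q*(0,1) = 1.0 and Q*(0,0) = γ · 1.0 = 0.9, so the greedy policy with respect to Q* picks the transition to state 1.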
The rumored Q* RL algorithm might use AI-generated data (logic and maths) to teach an LLM to solve multi-step logic problems. Q* might be applied to GPT-5, giving it excellent reasoning and retrieval skills.
The biggest gains in reasoning appear to come from strong reward models, rather than from more SFT data or tools.
Much of the (unpublished) research is now focused on finding a general planning algorithm for LLMs, i.e. some equivalent of the dorsolateral prefrontal cortex (dlPFC). So PLANNING is the name of the game.
In the literature, we have seen different approaches to teaching math to AI models: Transformers combined with beam search, and large language models that solve tasks requiring complex multi-step reasoning by generating solutions in a step-by-step chain-of-thought format.
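The first approach can be sketched as a generic beam search over partial step-by-step solutions. In the sketch below, `expand` and `score` are hypothetical stand-ins for a model's next-step proposals and a reward model's score of a partial solution; the toy usage at the end just builds digit strings so the search is easy to follow.

```python
import heapq

def beam_search(start, expand, score, beam_width=2, max_steps=3):
    """Keep the `beam_width` best partial solutions at each step."""
    beam = [(score(start), start)]
    for _ in range(max_steps):
        candidates = []
        for _, partial in beam:
            for nxt in expand(partial):
                candidates.append((score(nxt), nxt))
        if not candidates:
            break
        # prune to the top-scoring partial solutions
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]

# Toy usage: grow digit strings; "score" is the numeric value so far.
best = beam_search(
    "",
    expand=lambda s: [s + d for d in "0123456789"] if len(s) < 3 else [],
    score=lambda s: int(s) if s else 0,
)
print(best)  # "999"
```

In the math-reasoning setting, `expand` would sample candidate next reasoning steps from the model and `score` would come from a trained reward model, but the pruning logic is the same.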
One effective method with the second approach involves training reward models to discriminate between desirable and undesirable outputs.
This document gives a comprehensive overview of the Q-Star (Q*) concept in reinforcement learning: its mathematical formulation, its significance, and the methods used to approximate it in learning algorithms.
In the literature, we see two distinct methods for training reward models: outcome supervision and process supervision.
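The difference between the two labeling schemes can be sketched as follows. This is an illustrative assumption about how training targets are built, not any published pipeline: outcome supervision assigns one label to the whole solution based on the final answer, while process supervision assigns a label to every reasoning step.

```python
def outcome_labels(steps, final_answer, correct_answer):
    # Outcome supervision: a single label for the whole solution,
    # based only on whether the final answer is correct.
    return [1.0 if final_answer == correct_answer else 0.0]

def process_labels(steps, step_is_valid):
    # Process supervision: one label per reasoning step, typically
    # obtained from annotators judging each step's validity.
    return [1.0 if step_is_valid(s) else 0.0 for s in steps]

def check_arithmetic(step):
    # Toy validity check: evaluate the left-hand side of "lhs = rhs".
    lhs, rhs = step.split("=")
    return eval(lhs) == float(rhs)

steps = ["2 + 3 = 5", "5 * 4 = 21"]
print(outcome_labels(steps, "21", "20"))        # [0.0]
print(process_labels(steps, check_arithmetic))  # [1.0, 0.0]
```

Note how process supervision localizes the error: the outcome label only says the solution failed, while the per-step labels show the first step was fine and the second introduced the mistake.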