Multi Objective Reward Discussion #110

Closed
JaCoderX opened this issue Jun 10, 2019 · 4 comments
Comments

@JaCoderX
Contributor

I have recently been thinking about how to incorporate risk-adjusted returns, such as the Sharpe ratio, as a way to form a richer and more complex reward function.
The idea is to find the optimal policy for high returns, but in a way that also minimizes risk.

The currently available way to play around with the idea is to create a custom reward function of the following form:

Reward = a * profit + b * risk

where 'a' and 'b' are hyperparameters that need to be manually tuned.
The risk itself can be obtained by using a backtrader analyzer (I haven't checked yet how it integrates, but I think it is possible).
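
For illustration, here is a minimal sketch of such a hand-weighted reward (the function name, the profit/returns inputs and the downside-deviation risk proxy are hypothetical placeholders, not existing BTGym or backtrader code):

import numpy as np

def custom_reward(step_profit, recent_returns, a=1.0, b=-0.5, threshold=0.0):
    """
    Hand-weighted reward of the form Reward = a * profit + b * risk,
    with 'a' and 'b' as the manually tuned hyperparameters discussed above
    (b is negative so that risk is penalized).
    """
    # crude risk proxy: downside deviation of recent returns below the threshold
    downside = np.clip(np.asarray(recent_returns) - threshold, a_min=None, a_max=0.0)
    risk = np.sqrt((downside ** 2).mean())
    return a * step_profit + b * risk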

During a survey on the subject, I came across the following paper:
Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning

Many reinforcement-learning researchers treat the reward function as a part of the environment, meaning that the agent can only know the reward of a state if it encounters that state in a trial run. However, we argue that this is an unnecessary limitation and instead, the reward function should be provided to the learning algorithm. The advantage is that the algorithm can then use the reward function to check the reward for states that the agent hasn't even encountered yet. In addition, the algorithm can simultaneously learn policies for multiple reward functions. For each state, the algorithm would calculate the reward using each of the reward functions and add the rewards to its experience replay dataset. The Hindsight Experience Replay algorithm developed by Andrychowicz et al. (2017) does just this, and learns to generalize across a distribution of sparse, goal-based rewards. We extend this algorithm to linearly-weighted, multi-objective rewards and learn a single policy that can generalize across all linear combinations of the multi-objective reward.

According to the paper, it is possible to learn the relation between the different reward objectives as part of the learning process, in an off-policy setup, using a more general form of Hindsight Experience Replay, where: final reward = W * (reward vector).
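
A minimal sketch of that scalarization (the variable names are illustrative, not taken from the paper or BTGym): each replay entry keeps the full per-objective reward vector, so the same transition can later be re-weighted with a different W by the off-policy learner.

import numpy as np

# per-step reward vector, e.g. [profit term, risk term]
reward_vector = np.array([0.8, -0.3])

# weight vector W defining one particular profit/risk trade-off
W = np.array([1.0, 0.5])

# scalar reward actually fed to the learner
final_reward = W @ reward_vector

# a replay entry stores the whole vector (state/action left as placeholders),
# so any other choice of W can be applied to it after the fact
replay_entry = {'state': None, 'action': None, 'reward_vector': reward_vector}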

I think this could be a powerful tool that would allow risk to be controlled depending on the scenario.

Thoughts and ideas on the topic would be appreciated.

@Kismuz
Owner

Kismuz commented Jun 10, 2019

@JacobHanouna, I would look at 'Lower Partial Moment'-based risk measures:
http://w.performance-measurement.org/KaplanKnowles2004.pdf

@JaCoderX
Contributor Author

JaCoderX commented Jun 12, 2019

@Kismuz Thank you for sharing, this is a very interesting paper. I was previously interested in the Sortino and Omega ratios for modeling drawdowns, so it is great to have one formalism that unites them.

Backtrader actually has limited support for the PyFolio project (cerebro.addanalyzer(bt.analyzers.PyFolio)), which can calculate the Sortino and Omega ratios directly, or the parameters used in Kappa if one wants to implement higher-order risk terms.

I think the most challenging part is deciding what to do once you have one of these risk ratios (or a family of them): in the end we have only one reward function, so each part of the reward needs to be scalarized and weighted to construct the final reward value.

The paper above offers a framework to tackle this challenge by dynamically learning the weights that relate the parts of the reward, instead of manually trying to find a static weighting between all of them.

@Kismuz, I think it might be worth looking into as part of the design for BTGym 2.0

@Kismuz
Owner

Kismuz commented Jun 12, 2019

@JacobHanouna,

import numpy as np


def lpm(x, t, n):
    """
    Lower partial moment of the data.

    Args:
        x:      1d array of observations of a random variable
        t:      float scalar, threshold value
        n:      int, moment order

    Returns:
        float scalar, lower partial moment of order `n` with threshold `t`
    """
    return (np.clip(x - t, a_min=None, a_max=0.0) ** n).mean()


def kappa_ratio(returns, threshold, order):
    """
    Estimates Kappa metric for given data.

    Args:
        returns:     1d array-like of empirical returns
        threshold:   float
        order:       int, moment order

    Returns:
        float scalar
    """
    if not isinstance(returns, np.ndarray):
        returns = np.asarray(returns)

    if not isinstance(threshold, float):
        threshold = float(threshold)

    return (returns - threshold).mean() / (abs(lpm(returns, threshold, order)) ** (1 / order))

omega = kappa_ratio(returns, threshold, order=1) + 1
sortino = kappa_ratio(returns, threshold, order=2)

The only trouble with these, when converting them to a loss or reward term, is that kappa uses the whole set of data to make a single estimate, while a loss/reward is usually estimated from a single point. In the case of a classification loss this can be tackled in an SGD-like manner (make the estimate from an i.i.d. batch instead of the whole dataset); in the case of a reward it can be tricky.
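
One possible workaround, sketched below under the i.i.d.-batch idea mentioned above (the class name, window size and usage are assumptions, not an existing BTGym mechanism; it reuses the lpm and kappa_ratio helpers defined earlier): estimate kappa on a rolling window of recent returns and emit that estimate as the per-step reward.

from collections import deque

import numpy as np

class BatchKappaReward:
    """
    Keeps a rolling window of recent returns and estimates kappa_ratio
    on that batch, SGD-style, instead of on the whole episode.
    """
    def __init__(self, window=64, threshold=0.0, order=2):
        self.buffer = deque(maxlen=window)
        self.threshold = threshold
        self.order = order

    def step(self, single_return):
        self.buffer.append(float(single_return))
        returns = np.asarray(self.buffer)
        # not enough data, or no downside observations in the window yet
        if len(returns) < 2 or lpm(returns, self.threshold, self.order) == 0.0:
            return 0.0
        return kappa_ratio(returns, self.threshold, self.order)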

@JaCoderX
Contributor Author

JaCoderX commented Jun 14, 2019

@Kismuz, first, thank you for sharing the Kappa paper and code.

The only trouble with these, when converting them to a loss or reward term, is that kappa uses the whole set of data to make a single estimate, while a loss/reward is usually estimated from a single point.

Ok, I agree.
Let me try a different approach to using risk-adjusted values.
As you mentioned, a risk-adjusted measure like kappa can be seen as an episode-level summary of the combined risk of all the actions taken in that episode.

According to the Hindsight Experience Replay paper, maybe we can use the kappa summary value of each episode as a goal we want to achieve, so that when we optimize the policy we optimize with respect to f(s, a, g)?

from the paper:

Instead of shaping the reward we propose a different solution which does not require any domain knowledge. Consider an episode with a state sequence s1, . . . , sT and a goal g != s1, . . . , sT which implies that the agent received a reward of −1 at every timestep. The pivotal idea behind our approach is to re-examine this trajectory with a different goal — while this trajectory may not help us learn how to achieve the state g, it definitely tells us something about how to achieve the state sT. This information can be harvested by using an off-policy RL algorithm and experience replay where we replace g in the replay buffer by sT. In addition we can still replay with the original goal g left intact in the replay buffer. With this modification at least half of the replayed trajectories contain rewards different from −1 and learning becomes much simpler.

The best results for the task they were trying to accomplish were achieved without using reward shaping.
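
A rough sketch of how that relabeling might look, assuming the episode-level kappa summary is used as the goal (the function name and the tolerance-based success test are hypothetical, not from the HER paper or BTGym):

def relabel_episode(transitions, target_kappa, realized_kappa, tolerance=0.05):
    """
    HER-style relabeling: besides the original goal (target_kappa), replay
    the whole episode against the goal it actually achieved (realized_kappa).

    Each transition is a dict with keys like 'state', 'action', 'next_state'.
    Reward is -1 at every step, except 0 at the final step when the episode's
    realized kappa lands within `tolerance` of the goal being replayed.
    """
    relabeled = []
    for goal in (target_kappa, realized_kappa):
        for i, tr in enumerate(transitions):
            is_last = (i == len(transitions) - 1)
            achieved = is_last and abs(realized_kappa - goal) <= tolerance
            relabeled.append({**tr, 'goal': goal, 'reward': 0.0 if achieved else -1.0})
    return relabeled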

@Kismuz, do you think such an approach is viable here as well?
