

Here, in evaluating (1 - d), we've used a Python convention of evaluating True to 1 and False to 0. Thus, when d == True (which is to say, when s' is a terminal state), the Q-function should show that the agent gets no additional rewards after the current state. (This choice of notation corresponds to what we later implement in code.)
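To make the (1 - d) convention concrete, here is a minimal sketch of how such a backup target could be computed for a batch of transitions. This is illustrative only (not the implementation referred to above); the function name, tensor shapes, and use of PyTorch are assumptions.

```python
import torch

def bellman_target(r, d, q_next, gamma=0.99):
    """Illustrative sketch: r + gamma * (1 - d) * max_a' Q(s', a') for a batch.

    r:      rewards, shape (batch,)
    d:      done flags (True when s' is terminal), shape (batch,), dtype bool
    q_next: estimates of max_a' Q(s', a'), shape (batch,)
    """
    # Casting the done flag to float gives True -> 1.0 and False -> 0.0,
    # so the discounted term is zeroed out exactly when s' is terminal.
    return r + gamma * (1.0 - d.float()) * q_next
```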

But what does it mean that DDPG is adapted specifically for environments with continuous action spaces? It relates to how we compute the max over actions in $\max_{a'} Q_{\phi}(s',a')$.

When there are a finite number of discrete actions, the max poses no problem, because we can just compute the Q-values for each action separately and directly compare them. (This also immediately gives us the action which maximizes the Q-value.) But when the action space is continuous, we can't exhaustively evaluate the space, and solving the optimization problem is highly non-trivial. Using a normal optimization algorithm would make calculating $\max_{a} Q_{\phi}(s,a)$ a painfully expensive subroutine, and since it would need to be run every time the agent wants to take an action in the environment, this is unacceptable.
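As a small illustration of that difference (a sketch with assumed shapes, not drawn from the text above): with a discrete action space the max is a single reduction over the Q-values, while a continuous action space would require solving an optimization problem over the action vector for every state.

```python
import torch

# Discrete case: a critic that outputs one Q-value per action lets us take the
# max (and the argmax) directly.
q_values = torch.randn(4)             # stand-in for Q(s, a) over 4 discrete actions
max_q, best_action = q_values.max(dim=0)

# Continuous case: Q(s, a) takes the action as an input, so computing
# max_a Q(s, a) would mean running an optimizer over a for every single state.
```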

Because the action space is continuous, the function $Q^*(s,a)$ is presumed to be differentiable with respect to the action argument. This allows us to set up an efficient, gradient-based learning rule for a policy $\mu(s)$ which exploits that fact. Then, instead of running an expensive optimization subroutine each time we wish to compute $\max_a Q_{\phi}(s,a)$, we can approximate it with $\max_a Q_{\phi}(s,a) \approx Q_{\phi}(s, \mu(s))$.
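A minimal sketch of that gradient-based rule follows; the network sizes, module names, and optimizer settings are assumptions, and it omits the rest of the DDPG machinery (replay buffer, target networks, exploration noise, and so on).

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2

# Assumed differentiable critic Q_phi(s, a) and deterministic policy mu(s).
q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
pi_optimizer = torch.optim.Adam(mu.parameters(), lr=1e-3)

def policy_update(s):
    # Approximate max_a Q(s, a) by Q(s, mu(s)) and do gradient ascent on it:
    # minimizing the negative mean Q-value updates only the policy parameters,
    # since the optimizer was built over mu.parameters().
    q_pi = q(torch.cat([s, mu(s)], dim=-1))
    loss_pi = -q_pi.mean()
    pi_optimizer.zero_grad()
    loss_pi.backward()
    pi_optimizer.step()

# Example: one update on a batch of 32 observations.
policy_update(torch.randn(32, obs_dim))
```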
