[BUG] Planning (MPCPlannerBase) should consider done #1660
Comments
Thanks for raising this, let me think about it; I guess we can do that in a vectorized way!
@vmoens Yes, I also think the efficiency of these loops may be low. However, I found another issue, which requires us to change the rollout itself. The new problem is that the default rollout is

```python
optim_tensordict = self.env.rollout(
    max_steps=self.planning_horizon,
    policy=policy,
    auto_reset=False,
    tensordict=optim_tensordict,
)
```

but we need something like a reward-truncated rollout:

```python
def reward_truncated_rollout(self, policy, tensordict):
    tensordicts = []
    ever_done = torch.zeros(*tensordict.batch_size, 1, dtype=torch.bool).to(self.device)
    for i in range(self.planning_horizon):
        tensordict = policy(tensordict)
        tensordict = self.env.step(tensordict)
        # zero the reward of every sub-env that was already done at an earlier step
        tensordict.get(("next", "reward"))[ever_done] = 0
        tensordicts.append(tensordict)
        ever_done |= tensordict.get(("next", "done"))
        if ever_done.all():
            break
    batch_size = self.batch_size if tensordict is None else tensordict.batch_size
    out_td = torch.stack(tensordicts, len(batch_size)).contiguous()
    out_td.refine_names(..., "time")
    return out_td
```

I conduct the reward truncation in this new rollout.
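For reference, the same truncation could also be done in the vectorized way mentioned above, by masking an already-stacked rollout after the fact. A minimal sketch, assuming `("next", "reward")` / `("next", "done")` entries of shape `[*batch, T, 1]` (the layout produced by the stack in the snippet above); function and variable names are illustrative:

```python
import torch

def mask_rewards_after_done(rollout_td):
    """Zero the reward of every step that comes strictly after the first done flag (sketch)."""
    done = rollout_td.get(("next", "done"))
    # "some earlier step was already done": cumulative OR along time, shifted by one step
    ever_done = done.to(torch.int64).cumsum(dim=-2) > 0
    after_done = torch.cat(
        [torch.zeros_like(ever_done[..., :1, :]), ever_done[..., :-1, :]], dim=-2
    )
    reward = rollout_td.get(("next", "reward"))
    rollout_td.set_(("next", "reward"), reward.masked_fill(after_done, 0.0))
    return rollout_td
```

The semantics match the loop above: the step on which `done` is first raised keeps its reward, and every later step is zeroed.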
There is a …
@vmoens Unfortunately not, when …
In theory, rollout will only reset the envs that are done. You can check in the doc how this is done: we assign a `"_reset"` entry in the tensordict to flag which sub-envs should be reset.
Is it a problem if the envs are reset (assuming this is done properly)? Do you mean that the …
Yep, you are right, my fault.

```python
def _reset(self, tensordict: TensorDict, **kwargs) -> TensorDict:
    # a fresh tensordict with random state and observation is drawn on every reset
    tensordict = TensorDict(
        {},
        batch_size=self.batch_size,
        device=self.device,
    )
    tensordict = tensordict.update(self.state_spec.rand())
    tensordict = tensordict.update(self.observation_spec.rand())
    return tensordict
```

It seems to be possible to make …
My expectation for the …
Note that even if we implement the logic of handling …
In short, all the above problems are caused by trying to let …
In my opinion, a special …
BTW, to make the planning more effective, we should …
You are right, and we can even do the reward truncation in this special transform, and make just a tiny change to … BTW, there are some tiny differences between …: the change is from

```python
for _ in range(self.optim_steps):
    ...
    container.set_(("stats", "_action_means"), best_actions.mean(dim=K_DIM, keepdim=True))
    container.set_(("stats", "_action_stds"), best_actions.std(dim=K_DIM, keepdim=True))
    ...
```

to

```python
for _ in range(self.optim_steps):
    ...
    self.update_stats(
        best_actions.mean(dim=K_DIM, keepdim=True),
        best_actions.std(dim=K_DIM, keepdim=True),
        container,
    )
    ...

def update_stats(self, means, stds, container):
    self.alpha = 0.1  # should be set in __init__
    # blend the new statistics with the previous ones (exponential moving average)
    new_means = self.alpha * container.get(("stats", "_action_means")) + (1 - self.alpha) * means
    new_stds = self.alpha * container.get(("stats", "_action_stds")) + (1 - self.alpha) * stds
    container.set_(("stats", "_action_means"), new_means)
    container.set_(("stats", "_action_stds"), new_stds)
```

To restore the original behaviour, just set `self.alpha = 0`.
A new PR would defo make a lot of sense for this!
Describe the bug
In the current implementation, all subclasses of `MPCPlannerBase` do not consider the `done` flag thrown by the env during the planning process, which means that MPC is invalid in a large class of environments. For example, in CEM: …
Specifically, one type of environment indicates that the agent has entered a dangerous state by throwing `done` (usually the reward is positive in non-dangerous states); this includes many gym-mujoco environments, such as InvertedPendulum and Hopper. The MPC algorithm needs to identify the `done` thrown by the environment and find the action sequence that maximizes the cumulative reward before `done`.
To Reproduce
Just try CEM on InvertedPendulum.
Reason and Possible fixes
For CEM, a simple fix could be to zero out the rewards collected after the first `done` before summing the returns (a rough sketch follows below).
I'm more than happy to submit my changes, but they may require further style uniformity and standardization. At the same time, it is likely that there is a more efficient way.
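For illustration only, the scoring step such a fix targets boils down to something like the sketch below; this is not the actual `CEMPlanner` code, and the shape layout and the prior masking step are assumptions:

```python
def candidate_returns(rollout_td):
    """Score candidate action sequences by their cumulative reward (sketch)."""
    # ("next", "reward") is assumed to have shape [*batch, num_candidates, horizon, 1].
    # If rewards after the first done were zeroed beforehand (as sketched in the
    # comments above), terminated candidates stop accumulating reward and are
    # ranked lower when the planner selects its top-k action sequences.
    return rollout_td.get(("next", "reward")).sum(dim=-2)
```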