Support for MultiDiscrete and MultiBinary action spaces in PPO #30
Conversation
Hello, btw, because of your good contributions, would you be interested in becoming an SBX maintainer? (So you won't have to fork the repo for fixing a bug/adding a feature.)
Sounds awesome, I'd be happy to become an SBX maintainer :)
For built-in multi discrete, I think there are the Atari games?
LGTM, thanks =)
Description
closes #19
Addresses #19. Adds support for `MultiDiscrete` and `MultiBinary` action spaces to `PPO`. Constructs a multivariate categorical distribution through TensorFlow Probability's `Independent` and `Categorical`. Note that the `Categorical` distribution requires every variable to have the same number of categories, so I pad the logits to the largest shape across the dimensions (padding with `-inf` to ensure that these invalid actions have zero probability). `MultiBinary` is handled as a special case of `MultiDiscrete` with two choices per categorical variable.

Only one-dimensional action spaces are supported, so using, e.g., `MultiDiscrete([[2], [3]])` or `MultiBinary([2, 3])` will result in an exception (as in stable-baselines3).
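For illustration, here is a minimal sketch (not the PR's exact code) of how such a padded, factorized categorical distribution can be built with TensorFlow Probability's JAX substrate. The helper name and the use of a large finite padding constant instead of `-inf` are assumptions made for this example.

```python
import jax
import jax.numpy as jnp
import numpy as np
import tensorflow_probability.substrates.jax as tfp

tfd = tfp.distributions


def make_multi_discrete_dist(flat_logits, nvec):
    """Factorized categorical over action dims with (possibly different) sizes `nvec`."""
    max_categories = int(np.max(nvec))
    split_points = np.cumsum(nvec)[:-1].tolist()
    logits_per_dim = jnp.split(flat_logits, split_points, axis=-1)
    padded = []
    for logits, n in zip(logits_per_dim, nvec):
        # Pad the missing categories with a very negative logit so they get ~zero
        # probability (the PR pads with -inf; a large finite value is used here only
        # to keep the sketch numerically safe).
        pad_width = int(max_categories - n)
        pad = jnp.full(logits.shape[:-1] + (pad_width,), -1e9, dtype=logits.dtype)
        padded.append(jnp.concatenate([logits, pad], axis=-1))
    stacked = jnp.stack(padded, axis=-2)  # (..., num_action_dims, max_categories)
    # Independent sums the per-dimension log-probabilities, giving one joint
    # log_prob/entropy per sample, as needed for PPO's loss.
    return tfd.Independent(tfd.Categorical(logits=stacked), reinterpreted_batch_ndims=1)


# Example: MultiDiscrete([3, 5]) -> the policy head outputs 3 + 5 = 8 logits
nvec = np.array([3, 5])
dist = make_multi_discrete_dist(jnp.zeros((1, int(nvec.sum()))), nvec)
actions = dist.sample(seed=jax.random.PRNGKey(0))  # shape (1, 2), one index per dim
log_prob = dist.log_prob(actions)                  # shape (1,)
```

A `MultiBinary(n)` space can then reuse the same code path with `nvec = [2] * n`.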
Testing

I added some tests (`tests/test_space`, similar to the tests in stable-baselines3) that check that learning runs without errors and that the correct exceptions are raised if PPO is used with multi-dimensional `MultiDiscrete` and `MultiBinary` action spaces.
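The PR's actual tests live in `tests/test_space`; the snippet below is only a rough sketch of the exception check described above. The dummy environment and the expected exception type are assumptions for illustration, not taken from the PR.

```python
import gymnasium as gym
import numpy as np
import pytest

from sbx import PPO


class DummyMultiBinaryEnv(gym.Env):
    """Tiny environment with a 2D MultiBinary action space, used only to trigger the check."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.MultiBinary([2, 3])  # multi-dimensional -> unsupported

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        return self.observation_space.sample(), 0.0, True, False, {}


def test_multidim_multibinary_raises():
    # The exact exception type is an assumption; the PR only states that the
    # correct exceptions are raised for multi-dimensional action spaces.
    with pytest.raises(AssertionError):
        PPO("MlpPolicy", DummyMultiBinaryEnv())
```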
To check whether there are issues with the learning performance, I compared the performance to stable-baselines3's PPO on `MultiDiscrete` and `MultiBinary` action space environments. Since there are no environments with these action spaces in the classic Gym benchmarks, I used a discretized-action version of Reacher and a binary-action version of Acrobot for testing purposes (see the wrappers below).

Test script for `MultiDiscrete` action spaces:
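The original script is not reproduced here. The sketch below shows the kind of setup described above, assuming a hypothetical `DiscretizeActionWrapper` that bins Reacher's continuous actions into a `MultiDiscrete` space; the wrapper, the number of bins, and the environment id are illustrative assumptions.

```python
import gymnasium as gym
import numpy as np

from sbx import PPO  # for the reference run, swap in stable_baselines3.PPO


class DiscretizeActionWrapper(gym.ActionWrapper):
    """Expose a Box action space as MultiDiscrete by mapping each dimension to evenly spaced bins."""

    def __init__(self, env, n_bins=7):
        super().__init__(env)
        low, high = env.action_space.low, env.action_space.high
        self._targets = np.linspace(low, high, n_bins)  # shape: (n_bins, act_dim)
        self.action_space = gym.spaces.MultiDiscrete([n_bins] * env.action_space.shape[0])

    def action(self, action):
        # Pick one bin per action dimension and return the corresponding continuous action.
        return self._targets[action, np.arange(len(action))]


env = DiscretizeActionWrapper(gym.make("Reacher-v4"))
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```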
Test script for `MultiBinary` action spaces:
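Again, the original script is not shown; this is just a sketch of a possible binary-action Acrobot setup, where the bit-to-torque mapping and the environment id are assumptions made for this example.

```python
import gymnasium as gym

from sbx import PPO


class BinaryActionWrapper(gym.ActionWrapper):
    """Expose Acrobot's Discrete(3) torque choice as a MultiBinary(2) action space."""

    def __init__(self, env):
        super().__init__(env)
        self.action_space = gym.spaces.MultiBinary(2)

    def action(self, action):
        apply_torque, positive = action
        if not apply_torque:
            return 1  # index 1 = zero torque in Acrobot-v1
        return 2 if positive else 0  # +1 torque or -1 torque


env = BinaryActionWrapper(gym.make("Acrobot-v1"))
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```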
Results: sbx's and stable-baselines3's PPO have the same learning performance.
Motivation and Context
Types of changes
Checklist:
(The changelog seems to be in the stable-baselines3 repository, so I would need to create a separate PR for that)
(There is no separate documentation for sbx that I could update)
`make format` (required)
`make check-codestyle` and `make lint` (required)
`make pytest` and `make type` both pass. (required)
`make doc` (required)
Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.