[Performance] Faster SliceSampler._tensor_slices_from_startend
#2423
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2423
Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 7 Unrelated Failures as of commit fec4f40 with merge base 57f0580.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -1076,9 +1076,24 @@ def _tensor_slices_from_startend(self, seq_length, start, storage_length):
     # seq_length is a 1d tensor indicating the desired length of each sequence

     if isinstance(seq_length, int):
```
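The branch above adds a fast path for integer `seq_length`. A minimal sketch of the broadcasting idea behind such a fast path (an illustration, not the exact code from the diff; the wrap-around handling via `storage_length` is an assumption):

```python
import torch

def int_fast_path(seq_length: int, start: torch.Tensor, storage_length: int) -> torch.Tensor:
    # A single arange of length seq_length is broadcast against every start
    # index at once, producing a (num_slices, seq_length) index grid in one
    # shot instead of one torch.arange + torch.cat per slice.
    arange = torch.arange(seq_length, device=start.device, dtype=start.dtype)
    indices = start.unsqueeze(-1) + arange
    # Wrap indices that run past the end of the circular storage buffer
    # (whether the real method wraps this way is an assumption of this sketch).
    return indices.remainder(storage_length)
```

Because the whole index grid comes from one `arange` plus one broadcasted add, the per-slice Python loop disappears, which is the kind of overhead a fast path like this removes.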
I also looked into the possibility of speeding up the case where `seq_length` is a tensor. It seems a lot less straightforward, and I'm not entirely sure we can get a speedup comparable to the `int` case. Since the sequence lengths are all different, it inherently requires doing something equivalent to calling `torch.arange` multiple times and `torch.cat`-ing the results together (see the sketch below).

I can continue to investigate if you'd like; I just didn't want to invest too much time in it without discussing it first.
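For reference, a minimal sketch of that equivalent computation (the function name is hypothetical):

```python
import torch

def ragged_slices(start: torch.Tensor, seq_length: torch.Tensor) -> torch.Tensor:
    # With per-slice lengths there is no common shape to broadcast into, so
    # each slice needs its own arange, and the pieces are concatenated.
    return torch.cat(
        [
            torch.arange(s, s + l, device=start.device)
            for s, l in zip(start.tolist(), seq_length.tolist())
        ]
    )
```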
LGTM!

Just a nit w.r.t. device.

IIRC this line:
https://github.com/kurtamohler/torchrl/blob/db5f5cff8a67f3854759ab78215046bc65019046/torchrl/data/replay_buffers/samplers.py#L1881
used to be the most expensive thing when the buffer is full (e.g. 1M elements). Can you reproduce that in your benchmark (see the sketch below)? If so, a follow-up should be to speed up that one too!
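For reference, a rough sketch of how one might fill a buffer to capacity and time sampling (the sizes, keys, and sampler parameters below are assumptions; the actual measurement script is in #2422):

```python
import time

import torch
from tensordict import TensorDict
from torchrl.data.replay_buffers import LazyTensorStorage, ReplayBuffer, SliceSampler

size = 1_000_000  # a full 1M-element buffer, matching the scenario above
rb = ReplayBuffer(
    storage=LazyTensorStorage(size),
    sampler=SliceSampler(num_slices=8, traj_key="episode"),
    batch_size=256,
)
data = TensorDict(
    {
        "observation": torch.randn(size, 4),
        "episode": torch.arange(size) // 1_000,  # 1000 trajectories of length 1000
    },
    batch_size=[size],
)
rb.extend(data)

n_iters = 100
t0 = time.perf_counter()
for _ in range(n_iters):
    rb.sample()
print(f"{(time.perf_counter() - t0) / n_iters:.5f} s per sample() call")
```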
Force-pushed from db5f5cf to fec4f40.
I see, I will look into that.
Description
Speeds up the `SliceSampler._tensor_slices_from_startend` method by about 8x in the case where `seq_length` is an int.

Running the performance measurement script from #2422 (comment) on my machine gives a `ReplayBuffer.sample` time of about 0.00286, whereas the corresponding time before the change was about 0.00467.
So this change provides a speedup of about (0.00467 / 0.00286) ≈ 1.63x to the `ReplayBuffer.sample` method for the particular case in that script.

I also took a performance profile of the script with `cProfile` (a sketch of the setup is below). I increased the `timeit` iterations from 30 to 3000 for better precision. Before the change, the cumulative time spent in the `_tensor_slices_from_startend` function was 7.571 s; after the change it was 0.871 s, so the speedup for `_tensor_slices_from_startend` alone was about 8x.
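For reference, a typical way to collect such a profile (the module and function names are placeholders for the actual measurement script):

```python
import cProfile
import pstats

import benchmark_script  # hypothetical module wrapping the #2422 measurement script

profiler = cProfile.Profile()
profiler.enable()
benchmark_script.main()  # the timeit loop, with iterations raised to 3000
profiler.disable()

# Sorting by cumulative time makes the total cost of
# _tensor_slices_from_startend (including its callees) easy to read off.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)
```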
Motivation and Context

Closes #2422
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an `x` in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!