[Performance] Faster SliceSampler._tensor_slices_from_startend
#2423
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2423
Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 7 Unrelated Failures as of commit fec4f40 with merge base 57f0580.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -1076,9 +1076,24 @@ def _tensor_slices_from_startend(self, seq_length, start, storage_length):
     # seq_length is a 1d tensor indicating the desired length of each sequence

     if isinstance(seq_length, int):
```
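The branch above adds a fast path for integer `seq_length`. A minimal sketch of the broadcasting idea behind such a fast path (an illustration, not the exact code from the diff; the wrap-around handling via `storage_length` is an assumption):

```python
import torch

def int_fast_path(seq_length: int, start: torch.Tensor, storage_length: int) -> torch.Tensor:
    # A single arange of length seq_length is broadcast against every start
    # index at once, producing a (num_slices, seq_length) index grid in one
    # shot instead of one torch.arange + torch.cat per slice.
    arange = torch.arange(seq_length, device=start.device, dtype=start.dtype)
    indices = start.unsqueeze(-1) + arange
    # Wrap indices that run past the end of the circular storage buffer
    # (whether the real method wraps this way is an assumption of this sketch).
    return indices.remainder(storage_length)
```

Because the whole index grid comes from one `arange` plus one broadcasted add, the per-slice Python loop disappears, which is the kind of overhead a fast path like this removes.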
I also looked into the possibility of speeding up the case where `seq_length` is a tensor. It seems a lot less straightforward, and I'm not entirely sure we can get a speedup comparable to the `int` case. Since the sequence lengths are all different, it inherently requires doing something equivalent to calling `torch.arange` multiple times and `torch.cat`-ing the results together (see the sketch below).

I can continue to investigate if you'd like; I just didn't want to invest too much time in it without discussing it first.
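For reference, a minimal sketch of that equivalent computation (the function name is hypothetical):

```python
import torch

def ragged_slices(start: torch.Tensor, seq_length: torch.Tensor) -> torch.Tensor:
    # With per-slice lengths there is no common shape to broadcast into, so
    # each slice needs its own arange, and the pieces are concatenated.
    return torch.cat(
        [
            torch.arange(s, s + l, device=start.device)
            for s, l in zip(start.tolist(), seq_length.tolist())
        ]
    )
```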
LGTM!

Just a nit w.r.t. device.

IIRC this line:
https://github.com/kurtamohler/torchrl/blob/db5f5cff8a67f3854759ab78215046bc65019046/torchrl/data/replay_buffers/samplers.py#L1881
used to be the most expensive thing when the buffer is full (e.g. 1M elements). Can you reproduce that in your benchmark (see the sketch below)? If so, a follow-up should be to speed up that one too!
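For reference, a rough sketch of how one might fill a buffer to capacity and time sampling (the sizes, keys, and sampler parameters below are assumptions; the actual measurement script is in #2422):

```python
import time

import torch
from tensordict import TensorDict
from torchrl.data.replay_buffers import LazyTensorStorage, ReplayBuffer, SliceSampler

size = 1_000_000  # a full 1M-element buffer, matching the scenario above
rb = ReplayBuffer(
    storage=LazyTensorStorage(size),
    sampler=SliceSampler(num_slices=8, traj_key="episode"),
    batch_size=256,
)
data = TensorDict(
    {
        "observation": torch.randn(size, 4),
        "episode": torch.arange(size) // 1_000,  # 1000 trajectories of length 1000
    },
    batch_size=[size],
)
rb.extend(data)

n_iters = 100
t0 = time.perf_counter()
for _ in range(n_iters):
    rb.sample()
print(f"{(time.perf_counter() - t0) / n_iters:.5f} s per sample() call")
```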
Force-pushed from db5f5cf to fec4f40.
I see, I will look into that.
Description
Speeds up the `SliceSampler._tensor_slices_from_startend` method by about 8x in the case where `seq_length` is an int.

Running the performance measurement script from #2422 (comment) on my machine gives a `ReplayBuffer.sample` time of about 0.00286, whereas the corresponding time before the change was about 0.00467.
So this change provides a speedup of about (0.00467 / 0.00286) ≈ 1.63x to the `ReplayBuffer.sample` method for the particular case in that script.

I also took a performance profile of the script with `cProfile` (a sketch of the setup is below). I increased the `timeit` iterations from 30 to 3000 for better precision. Before the change, the cumulative time spent in the `_tensor_slices_from_startend` function was 7.571 s; after the change it was 0.871 s, so the speedup for `_tensor_slices_from_startend` alone was about 8x.
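For reference, a typical way to collect such a profile (the module and function names are placeholders for the actual measurement script):

```python
import cProfile
import pstats

import benchmark_script  # hypothetical module wrapping the #2422 measurement script

profiler = cProfile.Profile()
profiler.enable()
benchmark_script.main()  # the timeit loop, with iterations raised to 3000
profiler.disable()

# Sorting by cumulative time makes the total cost of
# _tensor_slices_from_startend (including its callees) easy to read off.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)
```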
Motivation and Context

Closes #2422
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an `x` in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!