
Memoize col in conv_im2col functions to reduce allocations #635

Open · wants to merge 2 commits into master
Conversation


@Octogonapus Octogonapus commented Mar 12, 2025

This PR adds a memo to the conv_im2col functions that caches col, the scratch-space buffer those functions allocate.
This dramatically reduces allocations for convolutional layers and increases throughput.
There is a new public function, free_scratchspace_memo!, to clear that memo, should the end user wish to release the cached buffers once they are done using conv.
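For context, a minimal sketch of the caching pattern (the names SCRATCHSPACE_MEMO, SCRATCHSPACE_LOCK, and get_scratchspace are illustrative, not necessarily the PR's actual code):

```julia
# Hypothetical sketch; the PR's actual names and key structure may differ.
const SCRATCHSPACE_MEMO = Dict{Any,Array}()
const SCRATCHSPACE_LOCK = ReentrantLock()

# Return a cached `col` buffer for this element type and size,
# allocating only on the first request for each (T, dims) key.
function get_scratchspace(::Type{T}, dims::Dims) where {T}
    lock(SCRATCHSPACE_LOCK) do
        get!(SCRATCHSPACE_MEMO, (T, dims)) do
            Array{T}(undef, dims)
        end
    end
end

# Public escape hatch: drop all cached buffers so the GC can reclaim them.
function free_scratchspace_memo!()
    lock(SCRATCHSPACE_LOCK) do
        empty!(SCRATCHSPACE_MEMO)
    end
    return nothing
end
```

Keying on (eltype, size) means one buffer is retained per distinct col shape, which is why the first call at each shape still allocates (visible in the benchmark below).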

Before (using ObjectDetector as a benchmark):

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.908225 seconds (52.34 k allocations: 51.040 GiB, 18.35% gc time)

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.986337 seconds (52.34 k allocations: 51.040 GiB, 17.18% gc time)

After:

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.943823 seconds (53.09 k allocations: 14.850 GiB, 9.25% gc time) # the first call with an empty memo will allocate to fill the memo

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.800984 seconds (53.00 k allocations: 1.101 GiB, 3.18% gc time)

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.732375 seconds (53.00 k allocations: 1.101 GiB, 0.32% gc time)

We saved 50 GiB of allocations!

An alternative design is to plumb the col parameter through lots of intermediate generated functions, which seems quite complex.
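For contrast, a rough sketch of what that plumbing would look like (these are not the real NNlib signatures; conv_inner! and conv_outer are hypothetical stand-ins):

```julia
# Illustrative only; these are not the real NNlib signatures. The point is
# that every intermediate function has to accept and forward `col`.
function conv_inner!(y, x, col)
    copyto!(col, x)   # stand-in for im2col!
    y[1] = sum(col)   # stand-in for the GEMM
    return y
end

# The caller must preallocate `col` and thread it through each level.
conv_outer(x; col = similar(x)) = conv_inner!(similar(x, 1), x, col)
```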

PR Checklist

  • Tests are added
  • Documentation, if applicable

@Octogonapus (Author)

The Lux.jl test failures look unrelated to this PR. The Flux.jl test failures also look unrelated, but they are confusing to me as I am not very familiar with that project.

@Octogonapus (Author)

Eric Hanson noted that this might be a poor fit for GPU workloads, which often run out of memory. That's a good point, so maybe this could be configured via scoped values or a global configuration mechanism. I would be interested in what the maintainers think before making changes.
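A rough sketch of the scoped-value idea (requires Julia 1.11+ Base.ScopedValues or the ScopedValues.jl package; all names below are hypothetical):

```julia
# Sketch only: gate the cache behind a scoped value so GPU users can opt
# out locally. USE_SCRATCHSPACE_MEMO, COL_CACHE, and maybe_cached_col are
# invented names, not the PR's code.
using Base.ScopedValues

const USE_SCRATCHSPACE_MEMO = ScopedValue(true)
const COL_CACHE = Dict{Any,Array}()

function maybe_cached_col(::Type{T}, dims::Dims) where {T}
    USE_SCRATCHSPACE_MEMO[] || return Array{T}(undef, dims)  # uncached path
    get!(COL_CACHE, (T, dims)) do
        Array{T}(undef, dims)
    end
end

# Opt out for a memory-constrained region of code:
# @with USE_SCRATCHSPACE_MEMO => false begin
#     y = conv(x, w, cdims)
# end
```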

@ToucheSir (Member)

Rephrasing my thoughts from the Slack thread:

  "That's a good point, so maybe this could be configured via scoped values"

This sounds like a cool idea. Perhaps we could let users plug in their own allocation function, or an allocator that follows a defined interface?
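One possible shape for such an interface, purely as a sketch (AbstractScratchAllocator and allocate are invented names, not an agreed-upon API):

```julia
# Hypothetical pluggable-allocator interface.
abstract type AbstractScratchAllocator end

# Default: plain allocation, no caching.
struct DefaultAllocator <: AbstractScratchAllocator end
allocate(::DefaultAllocator, ::Type{T}, dims::Dims) where {T} =
    Array{T}(undef, dims)

# A caching allocator reuses buffers keyed by (eltype, size).
struct CachingAllocator <: AbstractScratchAllocator
    cache::Dict{Any,Array}
end
CachingAllocator() = CachingAllocator(Dict{Any,Array}())

function allocate(a::CachingAllocator, ::Type{T}, dims::Dims) where {T}
    get!(() -> Array{T}(undef, dims), a.cache, (T, dims))
end
```

With something like this, allocate(CachingAllocator(), Float32, (3, 3)) would allocate on the first call and return the cached buffer thereafter, while GPU users could pass DefaultAllocator() (or their own allocator) to avoid retaining device memory.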

@Octogonapus (Author)

Someone also brought up https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator in that thread, which I will look into.

@pxl-th pxl-th (Member) commented Mar 15, 2025

  "Someone also brought up https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator in that thread, which I will look into."

To implement it for an array type, you basically have to use the GPUArrays.cached_alloc function, where the first argument is the function that does the actual allocation (if no cache entry exists) and the second is the tuple/key that will be used to look up the cache for that allocation.
You can take a look at how AMDGPU.jl does it:
https://github.com/JuliaGPU/AMDGPU.jl/blob/822da911e34aa319808bf93d61178c45e759df75/src/array.jl#L9
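A sketch of that pattern based on the description above (cached_col and the key tuple are illustrative; consult the GPUArrays.jl docs and the linked AMDGPU.jl source for the authoritative usage):

```julia
# Sketch only; the exact cached_alloc signature should be checked against
# the GPUArrays.jl documentation.
using AMDGPU, GPUArrays

function cached_col(::Type{T}, dims::Dims) where {T}
    # The do-block runs only when no cached buffer exists for the key;
    # the key tuple identifies the cache slot for this allocation.
    GPUArrays.cached_alloc((:conv_im2col_col, AMDGPU.device(), T, dims)) do
        ROCArray{T}(undef, dims)
    end
end
```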
