# preallocation for multiplication (#108)
Comments
If you have a GBMatrix `C` that has the right internal storage type, then yes, you can.
Note that almost every operation supports a bang (`!`) variant.
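As a minimal sketch of the pattern described above (assuming the `mul!` method this package extends from LinearAlgebra, and using only constructors that appear later in this thread):

```julia
using SuiteSparseGraphBLAS, SparseArrays
using LinearAlgebra  # provides the generic mul! that the package extends

A = GBMatrix(sprand(100, 100, 0.1))
B = GBMatrix(sprand(100, 100, 0.1))

# Run the multiplication once to obtain an output with the
# right internal storage type...
C = A * B

# ...then reuse that matrix as the destination of the
# in-place (bang) method.
mul!(C, A, B)
```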
I observed the following (counterintuitive) behavior:

```julia
julia> using SuiteSparseGraphBLAS, SparseArrays, BenchmarkTools

julia> X = GBMatrix(sprand(10, 1000, 0.9));

julia> # without preallocation
       @btime $(X') * $X;
  34.615 ms (9 allocations: 8.58 MiB)

julia> # use the correct type for preallocation
       D = X' * X;

julia> # with preallocation
       @btime mul!($D, $(X'), $X);
  31.978 ms (4 allocations: 8.58 MiB)
```
Although the number of allocations is halved, the amount of data allocated is practically the same; it is unclear why 8.58 MiB is still being allocated.
@fcdimitr this is sort of a peculiarity of GraphBLAS. @DrTimothyAldenDavis, do you have any advice on the comment directly above this one?
I would have to see the underlying calls to GraphBLAS. I can only do things in place if the output is full and the accum operator is used (as in C += A*B where C is full). But D = X'*X where X is sparse should normally lead to a sparse matrix D, not a full one.
Oh duh, @fcdimitr, this is a full matrix on output; I thought it was sparse. @DrTimothyAldenDavis, despite it being full there are still allocations, as expected. I guess we'll have to wait on the JIT, though.
It's full by chance. I have to figure out that it's full, so I have to compute the pattern first. Then I realize it's full and convert the result to full.
I have to do this even with a JIT. If instead you want to force the output to be full, you can do this: C = full matrix of the right size, all zero. Then, in principle, the 'dot2' method could exploit the accumulator. But I don't think I have a kernel that exploits this case yet.
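This suggestion could be sketched in Julia as follows (an illustration only, using the `accum` keyword of `mul!` that appears later in this thread; as noted, there may be no kernel that actually exploits the accumulator here yet):

```julia
using SuiteSparseGraphBLAS, SparseArrays
using LinearAlgebra  # for the generic mul!

X = GBMatrix(sprand(10, 1000, 0.9))

# Preallocate a *full* output of the right size, all zero,
# so the accumulator could in principle update it in place.
D = GBMatrix(zeros(size(X, 2), size(X, 2)))

# Compute D += X' * X; since D starts at zero, this yields
# the plain product in the preallocated full matrix.
mul!(D, X', X; accum = +)
```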
Thank you both, @DrTimothyAldenDavis and @Wimmerer! Your suggestions successfully address the issue of unnecessary space allocation in the dense-output case. However, when dealing with sparser matrices and expecting a sparse output, I noticed that a significant amount of allocation still happens despite my efforts to enforce the desired output format, order, and sparsity pattern. Here is a new benchmark script:

```julia
julia> using SuiteSparseGraphBLAS, SparseArrays, BenchmarkTools, Random

julia> Random.seed!(0); # reproducibility

julia> X = GBMatrix(sprand(10, 1000, 0.9));

julia> # without preallocation
       @btime $(X') * $X;
  31.497 ms (9 allocations: 8.58 MiB)

julia> # use the correct type for preallocation
       D = X' * X;

julia> # with preallocation
       @btime mul!($D, $(X'), $X);
  32.156 ms (4 allocations: 8.58 MiB)

julia> # enforce accum to be + to avoid allocations
       @btime mul!($D, $(X'), $X; accum = +) setup=(D = GBMatrix(zeros(size(X,2), size(X,2))));
  29.042 ms (6 allocations: 272 bytes)

julia> ## Let's check the same with sparser matrices (and output)
       X = GBMatrix(sprand(10, 1000, 0.1));

julia> # without preallocation
       @btime $(X') * $X;
  923.561 μs (17 allocations: 1.50 MiB)

julia> # use the correct type for preallocation
       D = X' * X;

julia> # with preallocation
       @btime mul!($D, $(X'), $X);
  911.409 μs (12 allocations: 1.50 MiB)

julia> # empty out D, but keep the correct sparsity/format
       # (maybe there is a better way to do this?)
       ptr, idx, nzval, repack! = SuiteSparseGraphBLAS.tempunpack!(D);

julia> nzval .= 0;

julia> repack!();

julia> # enforce accum to be + to avoid allocations
       @btime mul!(Dtmp, $(X'), $X; accum = +) setup=(Dtmp = copy(D));
  3.278 ms (25 allocations: 3.02 MiB)

julia> # enforce accum to be + to avoid allocations, using a mask
       M = SuiteSparseGraphBLAS.Structural(D);

julia> @btime mul!(Dtmp, $(X'), $X; accum = +, mask = $M) setup=(Dtmp = copy(D));
  5.781 ms (15 allocations: 1.50 MiB)
```

I suspect this additional allocation occurs because the library needs to determine the pattern of the result first. Thank you very much!
That's a smart way to set everything to 0 without making them iso! Good find. Unfortunately, I don't think this kernel exists yet. With the new JIT it will be somewhat easier to add, but it's a rare case. I'll try to keep this use case in mind and look for solutions on both the Julia side and the SSGrB side (for instance, I'm working on a way to use Finch.jl to fill in missing kernels, so we get Tim's ultra-fast hand-tuned code plus (still fast) autogenerated code for the less common kernels). Relatedly, MKLSparse has the inspector-executor routines, which may help (and I plan to support those too). Tim's codes are pretty much always faster (often by 5x+), but if you're allocation-sensitive then the MKL codes might be useful. I can add a package extension to MKLSparse to make passing GBMatrix easier!
I've tried MKL Sparse, and GraphBLAS is often much faster. But perhaps the preallocation would differ. If there is an expected sparsity pattern of the result, it might be possible to use the pattern of the result itself as its own mask. This would be C<C,structural> += whatever. The accumulator also tells me that no entry in the output can be deleted. I exploit this kind of thing for GrB_assign, but not for GrB_mxm, where this use case would be rare. I would need to write a new kernel to exploit this particular case and to reduce the number of reallocations. This is something I could consider in the future. We would need to categorize all the use cases in Julia where this sort of thing occurs, map them to GraphBLAS, and then make sure my kernels exploit those particular use cases.
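In the Julia wrapper, the `C<C,structural> += A*B` pattern described above could be sketched with the `accum` and `mask` keywords used earlier in this thread (an illustration only; as noted, GraphBLAS has no specialized kernel for this case yet, so it does not currently avoid the extra allocations):

```julia
using SuiteSparseGraphBLAS, SparseArrays
using LinearAlgebra  # for the generic mul!

X = GBMatrix(sprand(10, 1000, 0.1))
D = X' * X                     # output with the expected sparsity pattern

# Use D's own structure as the mask and accumulate with +,
# so no entry of the output can be deleted or inserted:
# this accumulates X'*X into D restricted to D's pattern.
M = SuiteSparseGraphBLAS.Structural(D)
mul!(D, X', X; accum = +, mask = M)
```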
Yeah, I would only recommend MKLSparse for the allocation-sensitive case. Removing allocations by specifying the output up front isn't something we can fully do yet.
Preallocation isn't currently planned for this library. It may be possible in the future, but the spec doesn't really make room for preallocation.
For a multiplication of a GBMatrix A (dense) with a GBMatrix B (sparse), is it possible to do preallocation or an in-place operation to further reduce the time?