FFT MPI with threads #49

Closed
Lightup1 opened this issue Jun 22, 2022 · 10 comments

@Lightup1

Is it possible to combine PencilFFTs with FFTW's threading?
I mean having PencilFFTs distribute the work across MPI processes, with each process performing its FFTs using multiple threads.
Does this need any specific settings, or is it enough to launch Julia with -t and call FFTW.set_num_threads(Threads.nthreads())?

@jipolanco
Owner

I haven't tried, but I guess it should just work if you do FFTW.set_num_threads(Threads.nthreads()). Just make sure that MPI is initialised with threading support.

Note that all other operations besides FFTs, such as transpositions in PencilArrays.jl, are not threaded, so using M MPI processes and N threads will likely be slower than just using M×N MPI processes. But let me know how it goes if you try this. It may be worth it to implement threaded versions of certain functions in PencilArrays.jl.

See also the FFTW docs on combining MPI and threads.
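
For reference, a minimal sketch of this kind of hybrid setup might look as follows (this assumes MPI.jl's MPI.Init accepts a threadlevel keyword, as used in the script further below, and that only FFTW's per-process FFTs are threaded):

# Launched with, e.g.:  mpiexec -n 8 julia -t 4 script.jl
using MPI
using FFTW
using PencilFFTs

MPI.Init(threadlevel = :funneled)  # initialise MPI with threading support
comm = MPI.COMM_WORLD

# Let FFTW run each process-local FFT on all available Julia threads.
FFTW.set_num_threads(Threads.nthreads())

# The rest (Pencil, PencilFFTPlan, allocate_input, ...) is set up as usual;
# only the FFTs themselves are threaded, the transpositions are not.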

@Lightup1
Author

I just assumed that threads are cheaper than MPI processes, so running things in a combined way should give a speed-up.
https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

@jipolanco
Owner

I guess it's worth trying! I'm very curious to know how it compares to pure MPI.

@Lightup1
Author

Lightup1 commented Jun 22, 2022

Seems it's not what I expected:

using MPI
using PencilFFTs
using FFTW
using Random
using BenchmarkTools


MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD

FFTW.set_num_threads(Threads.nthreads())

rank=MPI.Comm_rank(comm)
sleep(0.05*rank)
print("rank:",rank,"Threads:",Threads.nthreads(),"\n")

# Input data dimensions (Nx × Ny × Nz)
dims = (5120, 32, 32)
pen = Pencil(dims, comm)
transform=Transforms.FFT!()
if rank == 0
    print("Start data allocationg\n")
end
plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)
if rank == 0
    print("Complete data allocationg\n")
end

if rank == 0
    print("Start randn data \n")
end
randn!(first(u))
if rank == 0
    print("Complete randn data \n")
end

if rank == 0
    print("Start benchmark \n")
end
b = @benchmark $plan*$u evals=1 samples=100 seconds=60 teardown=(MPI.Barrier(comm))
if rank == 0
    print("Complete benchmark \n")
end

if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end

rank:0Threads:9
rank:1Threads:9
rank:2Threads:9
rank:3Threads:9
rank:4Threads:9
rank:5Threads:9
rank:6Threads:9
rank:7Threads:9
Start data allocation
Complete data allocation
Start randn data 
Complete randn data 
Start benchmark 
Complete benchmark 
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  8.402 ms … 44.998 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.926 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.290 ms ±  3.613 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▃▇█▄                                             
  ▃▁▁▃▁▁▃▃▃▄▅████▆▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  8.4 ms         Histogram: frequency by time        10.7 ms <

 Memory estimate: 99.42 KiB, allocs estimate: 1143.
rank:0Threads:1
rank:1Threads:1
rank:2Threads:1
rank:3Threads:1
rank:4Threads:1
rank:5Threads:1
rank:6Threads:1
rank:7Threads:1
rank:8Threads:1
rank:9Threads:1
rank:10Threads:1
rank:11Threads:1
rank:12Threads:1
rank:13Threads:1
rank:14Threads:1
rank:15Threads:1
rank:16Threads:1
rank:17Threads:1
rank:18Threads:1
rank:19Threads:1
rank:20Threads:1
rank:21Threads:1
rank:22Threads:1
rank:23Threads:1
rank:24Threads:1
rank:25Threads:1
rank:26Threads:1
rank:27Threads:1
rank:28Threads:1
rank:29Threads:1
rank:30Threads:1
rank:31Threads:1
rank:32Threads:1
rank:33Threads:1
rank:34Threads:1
rank:35Threads:1
rank:36Threads:1
rank:37Threads:1
rank:38Threads:1
rank:39Threads:1
rank:40Threads:1
rank:41Threads:1
rank:42Threads:1
rank:43Threads:1
rank:44Threads:1
rank:45Threads:1
rank:46Threads:1
rank:47Threads:1
rank:48Threads:1
rank:49Threads:1
rank:50Threads:1
rank:51Threads:1
rank:52Threads:1
rank:53Threads:1
rank:54Threads:1
rank:55Threads:1
rank:56Threads:1
rank:57Threads:1
rank:58Threads:1
rank:59Threads:1
rank:60Threads:1
rank:61Threads:1
rank:62Threads:1
rank:63Threads:1
rank:64Threads:1
rank:65Threads:1
rank:66Threads:1
rank:67Threads:1
rank:68Threads:1
rank:69Threads:1
rank:70Threads:1
rank:71Threads:1
Start data allocation
Complete data allocation
Start randn data 
Complete randn data 
Start benchmark 
Complete benchmark 
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  3.226 ms … 31.161 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.520 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.007 ms ±  3.021 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▁                                                         
  ███▇█▄▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  3.23 ms      Histogram: log(frequency) by time     15.8 ms <

 Memory estimate: 28.52 KiB, allocs estimate: 255.

@jipolanco
Owner

As I mentioned, combining threads and MPI is likely slower because transpositions are not threaded. And the cost of transpositions can be comparable to, or even larger than, that of the FFTs themselves. So I'm not really surprised by these results.

You may want to see where the time is actually spent. You can use TimerOutputs.jl for this. See here for details on how to enable timers for PencilArrays / PencilFFTs functions.
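
For instance, a minimal sketch using the generic TimerOutputs API (this only times the transform as a whole; the finer-grained internal timers of PencilArrays / PencilFFTs are enabled as described in the docs linked here) could look like:

using TimerOutputs

const to = TimerOutput()

# `plan`, `u` and `comm` are the objects from the benchmark script above.
@timeit to "forward transform" plan * u

if MPI.Comm_rank(comm) == 0
    print_timer(to)  # report where the measured time was spent
end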

@jipolanco
Owner

See also the FFTW docs that I linked above:

This may or may not be faster than simply using as many MPI processes as you have processors, however. On the one hand, using threads within a node eliminates the need for explicit message passing within the node. On the other hand, FFTW’s transpose routines are not multi-threaded, and this means that the communications that do take place will not benefit from parallelization within the node. Moreover, many MPI implementations already have optimizations to exploit shared memory when it is available, so adding the multithreaded FFTW on top of this may be superfluous.

@Lightup1
Author

Thanks!

@Lightup1
Author

Hi @jipolanco, I just found that, for a large number of MPI processes, @benchmark gets stuck for a very long time, until it hits the wall time I set. Do you know what is happening there?

@jipolanco
Owner

Hi, unfortunately I don't know what is going on there. It would be good to fix this, and for this we need to know where it's hanging exactly. How many processes are you using? Do you have some minimal code that reproduces the issue (on your machine/cluster)?

@Lightup1
Author

I just reported a new issue: #51.
