FFT MPI with threads #49

Closed
Lightup1 opened this issue Jun 22, 2022 · 10 comments

@Lightup1

Is it possible to combine PencilFFTs with FFTW's threading?
I mean having PencilFFTs distribute the work across MPI processes, with each process performing its FFTs using multiple threads.
Does this need any specific settings, or is it enough to launch Julia with -t and call FFTW.set_num_threads(Threads.nthreads())?

@jipolanco
Owner

I haven't tried, but I guess it should just work if you do FFTW.set_num_threads(Threads.nthreads()). Just make sure that MPI is initialised with threading support.

Note that all other operations besides FFTs, such as transpositions in PencilArrays.jl, are not threaded, so using M MPI processes and N threads will likely be slower than just using M×N MPI processes. But let me know how it goes if you try this. It may be worth it to implement threaded versions of certain functions in PencilArrays.jl.

See also the FFTW docs on combining MPI and threads.
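
For reference, a minimal sketch of this kind of hybrid setup might look as follows (this assumes MPI.jl's MPI.Init accepts a threadlevel keyword, as used in the script further below, and that only FFTW's per-process FFTs are threaded):

# Launched with, e.g.:  mpiexec -n 8 julia -t 4 script.jl
using MPI
using FFTW
using PencilFFTs

MPI.Init(threadlevel = :funneled)  # initialise MPI with threading support
comm = MPI.COMM_WORLD

# Let FFTW run each process-local FFT on all available Julia threads.
FFTW.set_num_threads(Threads.nthreads())

# The rest (Pencil, PencilFFTPlan, allocate_input, ...) is set up as usual;
# only the FFTs themselves are threaded, the transpositions are not.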

@Lightup1
Author

I just assumed that threads are cheaper than MPI processes, so running things in a combined way should give a speed-up.
https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

@jipolanco
Owner

I guess it's worth trying! I'm very curious to know how it compares to pure MPI.

@Lightup1
Author

Lightup1 commented Jun 22, 2022

Seems it's not what I expected:

using MPI
using PencilFFTs
using FFTW
using Random
using BenchmarkTools


MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD

FFTW.set_num_threads(Threads.nthreads())

rank=MPI.Comm_rank(comm)
sleep(0.05*rank)
print("rank:",rank,"Threads:",Threads.nthreads(),"\n")

# Input data dimensions (Nx × Ny × Nz)
dims = (5120, 32, 32)
pen = Pencil(dims, comm)
transform=Transforms.FFT!()
if rank == 0
    print("Start data allocationg\n")
end
plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)
if rank == 0
    print("Complete data allocationg\n")
end

if rank == 0
    print("Start randn data \n")
end
randn!(first(u))
if rank == 0
    print("Complete randn data \n")
end

if rank == 0
    print("Start benchmark \n")
end
b = @benchmark $plan*$u evals=1 samples=100 seconds=60 teardown=(MPI.Barrier(comm))
if rank == 0
    print("Complete benchmark \n")
end

if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end

rank:0Threads:9
rank:1Threads:9
rank:2Threads:9
rank:3Threads:9
rank:4Threads:9
rank:5Threads:9
rank:6Threads:9
rank:7Threads:9
Start data allocation
Complete data allocation
Start randn data 
Complete randn data 
Start benchmark 
Complete benchmark 
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  8.402 ms … 44.998 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.926 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.290 ms ±  3.613 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▃▇█▄                                             
  ▃▁▁▃▁▁▃▃▃▄▅████▆▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  8.4 ms         Histogram: frequency by time        10.7 ms <

 Memory estimate: 99.42 KiB, allocs estimate: 1143.
rank:0Threads:1
rank:1Threads:1
rank:2Threads:1
rank:3Threads:1
rank:4Threads:1
rank:5Threads:1
rank:6Threads:1
rank:7Threads:1
rank:8Threads:1
rank:9Threads:1
rank:10Threads:1
rank:11Threads:1
rank:12Threads:1
rank:13Threads:1
rank:14Threads:1
rank:15Threads:1
rank:16Threads:1
rank:17Threads:1
rank:18Threads:1
rank:19Threads:1
rank:20Threads:1
rank:21Threads:1
rank:22Threads:1
rank:23Threads:1
rank:24Threads:1
rank:25Threads:1
rank:26Threads:1
rank:27Threads:1
rank:28Threads:1
rank:29Threads:1
rank:30Threads:1
rank:31Threads:1
rank:32Threads:1
rank:33Threads:1
rank:34Threads:1
rank:35Threads:1
rank:36Threads:1
rank:37Threads:1
rank:38Threads:1
rank:39Threads:1
rank:40Threads:1
rank:41Threads:1
rank:42Threads:1
rank:43Threads:1
rank:44Threads:1
rank:45Threads:1
rank:46Threads:1
rank:47Threads:1
rank:48Threads:1
rank:49Threads:1
rank:50Threads:1
rank:51Threads:1
rank:52Threads:1
rank:53Threads:1
rank:54Threads:1
rank:55Threads:1
rank:56Threads:1
rank:57Threads:1
rank:58Threads:1
rank:59Threads:1
rank:60Threads:1
rank:61Threads:1
rank:62Threads:1
rank:63Threads:1
rank:64Threads:1
rank:65Threads:1
rank:66Threads:1
rank:67Threads:1
rank:68Threads:1
rank:69Threads:1
rank:70Threads:1
rank:71Threads:1
Start data allocation
Complete data allocation
Start randn data 
Complete randn data 
Start benchmark 
Complete benchmark 
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  3.226 ms … 31.161 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.520 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.007 ms ±  3.021 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▁                                                         
  ███▇█▄▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  3.23 ms      Histogram: log(frequency) by time     15.8 ms <

 Memory estimate: 28.52 KiB, allocs estimate: 255.

@jipolanco
Owner

As I mentioned, combining threads and MPI is likely slower because transpositions are not threaded. And the cost of transpositions can be comparable to, or even larger than, that of the FFTs themselves. So I'm not really surprised by these results.

You may want to see where the time is actually spent. You can use TimerOutputs.jl for this. See here for details on how to enable timers for PencilArrays / PencilFFTs functions.
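
For instance, a minimal sketch using the generic TimerOutputs API (this only times the transform as a whole; the finer-grained internal timers of PencilArrays / PencilFFTs are enabled as described in the docs linked here) could look like:

using TimerOutputs

const to = TimerOutput()

# `plan`, `u` and `comm` are the objects from the benchmark script above.
@timeit to "forward transform" plan * u

if MPI.Comm_rank(comm) == 0
    print_timer(to)  # report where the measured time was spent
end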

@jipolanco
Owner

See also the FFTW docs that I linked above:

This may or may not be faster than simply using as many MPI processes as you have processors, however. On the one hand, using threads within a node eliminates the need for explicit message passing within the node. On the other hand, FFTW’s transpose routines are not multi-threaded, and this means that the communications that do take place will not benefit from parallelization within the node. Moreover, many MPI implementations already have optimizations to exploit shared memory when it is available, so adding the multithreaded FFTW on top of this may be superfluous.

@Lightup1
Author

Thanks!

@Lightup1
Author

Hi @jipolanco, I just found that, for a large number of MPI processes, @benchmark gets stuck for a very long time, until it hits the wall time I set. Do you know what is happening there?

@jipolanco
Owner

Hi, unfortunately I don't know what is going on there. It would be good to fix this, and for this we need to know where it's hanging exactly. How many processes are you using? Do you have some minimal code that reproduces the issue (on your machine/cluster)?

@Lightup1
Author

I just reported a new issue: #51.
