Track benchmarking #16

Open

ElliottKasoar opened this issue Oct 2, 2023 · 6 comments · May be fixed by #34
@ElliottKasoar
Contributor
Since #11 will hopefully be merged relatively soon, we can keep track of new benchmarks here.

I intend to rebuild to double-check versions for everything, but my first set of results on CSD3, run interactively on a single Cascade Lake node (sintr -p cclake -N1 -n8 -t 1:0:0 --qos=INTR), with:

  • module load gcc/9 (9.3.0)
  • module load python/3.8 (3.8.2)
  • torch==2.0.1+cpu

is not particularly conclusive:

==> cgdrag_forpy_1.out <==
min    time taken (s):     0.1853 [omp]
max    time taken (s):     0.2262 [omp]
mean   time taken (s):     0.2115 [omp]
stddev time taken (s):     0.0041 [omp]
sample size          :        999

==> cgdrag_torch_1.out <==
min    time taken (s):     0.1833 [omp]
max    time taken (s):     0.2357 [omp]
mean   time taken (s):     0.2213 [omp]
stddev time taken (s):     0.0060 [omp]
sample size          :        999

==> cgdrag_forpy_4.out <==
min    time taken (s):     0.0564 [omp]
max    time taken (s):     0.0737 [omp]
mean   time taken (s):     0.0632 [omp]
stddev time taken (s):     0.0029 [omp]
sample size          :        999

==> cgdrag_torch_4.out <==
min    time taken (s):     0.0556 [omp]
max    time taken (s):     0.0806 [omp]
mean   time taken (s):     0.0650 [omp]
stddev time taken (s):     0.0036 [omp]
sample size          :        999

==> cgdrag_forpy_8.out <==
min    time taken (s):     0.0369 [omp]
max    time taken (s):     0.0493 [omp]
mean   time taken (s):     0.0404 [omp]
stddev time taken (s):     0.0024 [omp]
sample size          :        999

==> cgdrag_torch_8.out <==
min    time taken (s):     0.0304 [omp]
max    time taken (s):     0.0444 [omp]
mean   time taken (s):     0.0365 [omp]
stddev time taken (s):     0.0023 [omp]
sample size          :        999
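
The summary lines in the `.out` blocks above (min/max/mean/stddev over the sample) can be reproduced with a short script along these lines — a sketch of the reporting format only, not the actual benchmark harness, and the sample timings below are made up for illustration:

```python
import statistics

def summarise(times, label="omp"):
    """Compute and print benchmark summary statistics in the .out file format."""
    stats = {
        "min": min(times),
        "max": max(times),
        "mean": statistics.mean(times),
        "stddev": statistics.stdev(times),
        "n": len(times),
    }
    print(f'min    time taken (s): {stats["min"]:10.4f} [{label}]')
    print(f'max    time taken (s): {stats["max"]:10.4f} [{label}]')
    print(f'mean   time taken (s): {stats["mean"]:10.4f} [{label}]')
    print(f'stddev time taken (s): {stats["stddev"]:10.4f} [{label}]')
    print(f'sample size          : {stats["n"]:10d}')
    return stats

# Hypothetical timings; the real harness records one value per benchmark iteration.
summarise([0.1853, 0.2262, 0.2115, 0.2100])
```
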
@ElliottKasoar
Contributor Author

ElliottKasoar commented Oct 5, 2023

Rebuilt (versions as described above) and submitted as a job to a Cascade Lake node, using my update_resnet_example branch:

For the single image given to ResNet, FTorch seems comparable, perhaps slightly faster:

==> resnet_forpy_1.out <==
min    time taken (s):     0.0394 [omp]
max    time taken (s):     0.0565 [omp]
mean   time taken (s):     0.0420 [omp]
stddev time taken (s):     0.0019 [omp]
sample size          :        999

==> resnet_torch_1.out <==
min    time taken (s):     0.0396 [omp]
max    time taken (s):     0.0894 [omp]
mean   time taken (s):     0.0426 [omp]
stddev time taken (s):     0.0035 [omp]
sample size          :        999

==> resnet_forpy_4.out <==
min    time taken (s):     0.0138 [omp]
max    time taken (s):     0.0236 [omp]
mean   time taken (s):     0.0148 [omp]
stddev time taken (s):     0.0012 [omp]
sample size          :        999

==> resnet_torch_4.out <==
min    time taken (s):     0.0133 [omp]
max    time taken (s):     0.0645 [omp]
mean   time taken (s):     0.0138 [omp]
stddev time taken (s):     0.0018 [omp]
sample size          :        999

==> resnet_forpy_8.out <==
min    time taken (s):     0.0110 [omp]
max    time taken (s):     0.9599 [omp]
mean   time taken (s):     0.0157 [omp]
stddev time taken (s):     0.0593 [omp]
sample size          :        999

==> resnet_torch_8.out <==
min    time taken (s):     0.0100 [omp]
max    time taken (s):     0.9410 [omp]
mean   time taken (s):     0.0143 [omp]
stddev time taken (s):     0.0559 [omp]
sample size          :        999

For large stride, FTorch seems significantly faster:

==> ls_forpy_1.out <==
min    time taken (s):     0.7955 [omp]
max    time taken (s):     0.9900 [omp]
mean   time taken (s):     0.8092 [omp]
stddev time taken (s):     0.0218 [omp]
sample size          :        999

==> ls_torch_1.out <==
min    time taken (s):     0.4481 [omp]
max    time taken (s):     0.5623 [omp]
mean   time taken (s):     0.4557 [omp]
stddev time taken (s):     0.0132 [omp]
sample size          :        999

==> ls_forpy_4.out <==
min    time taken (s):     0.7988 [omp]
max    time taken (s):     0.9900 [omp]
mean   time taken (s):     0.8123 [omp]
stddev time taken (s):     0.0181 [omp]
sample size          :        999

==> ls_torch_4.out <==
min    time taken (s):     0.4499 [omp]
max    time taken (s):     0.5620 [omp]
mean   time taken (s):     0.4582 [omp]
stddev time taken (s):     0.0127 [omp]
sample size          :        999

==> ls_forpy_8.out <==
min    time taken (s):     0.7779 [omp]
max    time taken (s):     0.9813 [omp]
mean   time taken (s):     0.7975 [omp]
stddev time taken (s):     0.0235 [omp]
sample size          :        999

==> ls_torch_8.out <==
min    time taken (s):     0.4610 [omp]
max    time taken (s):     0.5939 [omp]
mean   time taken (s):     0.4777 [omp]
stddev time taken (s):     0.0172 [omp]
sample size          :        999

For cgdrag, FTorch seems worse, although less so than appeared to be the case in #11 (<10%).

==> cgdrag_forpy_1.out <==
min    time taken (s):     0.1295 [omp]
max    time taken (s):     0.1756 [omp]
mean   time taken (s):     0.1379 [omp]
stddev time taken (s):     0.0083 [omp]
sample size          :        999

==> cgdrag_torch_1.out <==
min    time taken (s):     0.1391 [omp]
max    time taken (s):     0.1919 [omp]
mean   time taken (s):     0.1504 [omp]
stddev time taken (s):     0.0092 [omp]
sample size          :        999

==> cgdrag_forpy_4.out <==
min    time taken (s):     0.0415 [omp]
max    time taken (s):     0.0577 [omp]
mean   time taken (s):     0.0438 [omp]
stddev time taken (s):     0.0022 [omp]
sample size          :        999

==> cgdrag_torch_4.out <==
min    time taken (s):     0.0424 [omp]
max    time taken (s):     0.0636 [omp]
mean   time taken (s):     0.0452 [omp]
stddev time taken (s):     0.0031 [omp]
sample size          :        999

==> cgdrag_forpy_8.out <==
min    time taken (s):     0.0247 [omp]
max    time taken (s):     0.0372 [omp]
mean   time taken (s):     0.0266 [omp]
stddev time taken (s):     0.0013 [omp]
sample size          :        999

==> cgdrag_torch_8.out <==
min    time taken (s):     0.0259 [omp]
max    time taken (s):     1.4784 [omp]
mean   time taken (s):     0.0295 [omp]
stddev time taken (s):     0.0473 [omp]
sample size          :        999
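
As a quick sanity check on the "<10%" figure, the relative slowdown can be computed directly from the mean times (values taken from the single-thread cgdrag results above):

```python
forpy_mean = 0.1379  # mean time (s), cgdrag_forpy_1.out
torch_mean = 0.1504  # mean time (s), cgdrag_torch_1.out

# Relative slowdown of FTorch vs forpy for the single-thread cgdrag run.
rel_slowdown = (torch_mean - forpy_mean) / forpy_mean
print(f"FTorch slower by {rel_slowdown:.1%}")  # about 9%, i.e. under 10%
```
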

@ElliottKasoar
Contributor Author

Same binaries, but on an Icelake node:

ResNet - very similar, with FTorch very slightly faster with more threads:

==> resnet_forpy_1.out <==
min    time taken (s):     0.0586 [omp]
max    time taken (s):     0.0689 [omp]
mean   time taken (s):     0.0642 [omp]
stddev time taken (s):     0.0016 [omp]
sample size          :        999

==> resnet_torch_1.out <==
min    time taken (s):     0.0587 [omp]
max    time taken (s):     0.1012 [omp]
mean   time taken (s):     0.0645 [omp]
stddev time taken (s):     0.0021 [omp]
sample size          :        999

==> resnet_forpy_4.out <==
min    time taken (s):     0.0185 [omp]
max    time taken (s):     0.0430 [omp]
mean   time taken (s):     0.0210 [omp]
stddev time taken (s):     0.0016 [omp]
sample size          :        999

==> resnet_torch_4.out <==
min    time taken (s):     0.0181 [omp]
max    time taken (s):     0.0621 [omp]
mean   time taken (s):     0.0206 [omp]
stddev time taken (s):     0.0020 [omp]
sample size          :        999

==> resnet_forpy_8.out <==
min    time taken (s):     0.0114 [omp]
max    time taken (s):     0.0388 [omp]
mean   time taken (s):     0.0125 [omp]
stddev time taken (s):     0.0012 [omp]
sample size          :        999

==> resnet_torch_8.out <==
min    time taken (s):     0.0109 [omp]
max    time taken (s):     0.0567 [omp]
mean   time taken (s):     0.0123 [omp]
stddev time taken (s):     0.0020 [omp]
sample size          :        999

Large stride - FTorch significantly faster:

==> ls_forpy_1.out <==
min    time taken (s):     0.6752 [omp]
max    time taken (s):     0.7702 [omp]
mean   time taken (s):     0.7423 [omp]
stddev time taken (s):     0.0055 [omp]
sample size          :        999

==> ls_torch_1.out <==
min    time taken (s):     0.4092 [omp]
max    time taken (s):     0.4280 [omp]
mean   time taken (s):     0.4169 [omp]
stddev time taken (s):     0.0029 [omp]
sample size          :        999

==> ls_forpy_4.out <==
min    time taken (s):     0.7461 [omp]
max    time taken (s):     0.7749 [omp]
mean   time taken (s):     0.7569 [omp]
stddev time taken (s):     0.0044 [omp]
sample size          :        999

==> ls_torch_4.out <==
min    time taken (s):     0.4074 [omp]
max    time taken (s):     0.4224 [omp]
mean   time taken (s):     0.4134 [omp]
stddev time taken (s):     0.0029 [omp]
sample size          :        999

==> ls_forpy_8.out <==
min    time taken (s):     0.6778 [omp]
max    time taken (s):     0.7631 [omp]
mean   time taken (s):     0.7475 [omp]
stddev time taken (s):     0.0055 [omp]
sample size          :        999

==> ls_torch_8.out <==
min    time taken (s):     0.4102 [omp]
max    time taken (s):     0.4417 [omp]
mean   time taken (s):     0.4202 [omp]
stddev time taken (s):     0.0043 [omp]
sample size          :        999

cgdrag - smaller (<5%) differences, mostly with FTorch slower, although FTorch is fractionally faster with OMP_NUM_THREADS=8:

==> cgdrag_forpy_1.out <==
min    time taken (s):     0.1382 [omp]
max    time taken (s):     0.1780 [omp]
mean   time taken (s):     0.1611 [omp]
stddev time taken (s):     0.0085 [omp]
sample size          :        999

==> cgdrag_torch_1.out <==
min    time taken (s):     0.1431 [omp]
max    time taken (s):     0.2023 [omp]
mean   time taken (s):     0.1619 [omp]
stddev time taken (s):     0.0071 [omp]
sample size          :        999

==> cgdrag_forpy_4.out <==
min    time taken (s):     0.0376 [omp]
max    time taken (s):     0.0549 [omp]
mean   time taken (s):     0.0450 [omp]
stddev time taken (s):     0.0040 [omp]
sample size          :        999

==> cgdrag_torch_4.out <==
min    time taken (s):     0.0405 [omp]
max    time taken (s):     0.0575 [omp]
mean   time taken (s):     0.0473 [omp]
stddev time taken (s):     0.0029 [omp]
sample size          :        999

==> cgdrag_forpy_8.out <==
min    time taken (s):     0.0194 [omp]
max    time taken (s):     0.0410 [omp]
mean   time taken (s):     0.0295 [omp]
stddev time taken (s):     0.0043 [omp]
sample size          :        999

==> cgdrag_torch_8.out <==
min    time taken (s):     0.0239 [omp]
max    time taken (s):     0.0495 [omp]
mean   time taken (s):     0.0293 [omp]
stddev time taken (s):     0.0026 [omp]
sample size          :        999

Next, I'll try rebuilding with the Intel compilers.

@ElliottKasoar
Contributor Author

ElliottKasoar commented Oct 6, 2023

Swapping out the Fortran compiler seems to give a similar picture, although the benchmarks generally seem to run slower:

  • Single Cascade Lake node
  • ifort (IFORT) 19.0.4.243
  • Python 3.8.2
  • gcc (Spack GCC) 11.2.0

ResNet:

==> resnet_forpy_1.out <==
min    time taken (s):     0.0441 [omp]
max    time taken (s):     0.0592 [omp]
mean   time taken (s):     0.0461 [omp]
stddev time taken (s):     0.0015 [omp]
sample size          :        999

==> resnet_torch_1.out <==
min    time taken (s):     0.0443 [omp]
max    time taken (s):     0.1004 [omp]
mean   time taken (s):     0.0463 [omp]
stddev time taken (s):     0.0023 [omp]
sample size          :        999

==> resnet_forpy_4.out <==
min    time taken (s):     0.0166 [omp]
max    time taken (s):     0.0324 [omp]
mean   time taken (s):     0.0174 [omp]
stddev time taken (s):     0.0009 [omp]
sample size          :        999

==> resnet_torch_4.out <==
min    time taken (s):     0.0168 [omp]
max    time taken (s):     0.0701 [omp]
mean   time taken (s):     0.0171 [omp]
stddev time taken (s):     0.0018 [omp]
sample size          :        999

==> resnet_forpy_8.out <==
min    time taken (s):     0.0117 [omp]
max    time taken (s):     0.0368 [omp]
mean   time taken (s):     0.0122 [omp]
stddev time taken (s):     0.0012 [omp]
sample size          :        999

==> resnet_torch_8.out <==
min    time taken (s):     0.0107 [omp]
max    time taken (s):     0.0787 [omp]
mean   time taken (s):     0.0118 [omp]
stddev time taken (s):     0.0036 [omp]
sample size          :        999

Large Stride:

==> ls_forpy_1.out <==
min    time taken (s):     0.7967 [omp]
max    time taken (s):     0.8916 [omp]
mean   time taken (s):     0.8244 [omp]
stddev time taken (s):     0.0123 [omp]
sample size          :        999

==> ls_torch_1.out <==
min    time taken (s):     0.4493 [omp]
max    time taken (s):     0.5078 [omp]
mean   time taken (s):     0.4709 [omp]
stddev time taken (s):     0.0076 [omp]
sample size          :        999

==> ls_forpy_4.out <==
min    time taken (s):     0.8243 [omp]
max    time taken (s):     0.9444 [omp]
mean   time taken (s):     0.8612 [omp]
stddev time taken (s):     0.0134 [omp]
sample size          :        999

==> ls_torch_4.out <==
min    time taken (s):     0.4684 [omp]
max    time taken (s):     0.5117 [omp]
mean   time taken (s):     0.4847 [omp]
stddev time taken (s):     0.0060 [omp]
sample size          :        999

==> ls_forpy_8.out <==
min    time taken (s):     0.8721 [omp]
max    time taken (s):     0.9989 [omp]
mean   time taken (s):     0.9091 [omp]
stddev time taken (s):     0.0127 [omp]
sample size          :        999

==> ls_torch_8.out <==
min    time taken (s):     0.4516 [omp]
max    time taken (s):     0.5083 [omp]
mean   time taken (s):     0.4616 [omp]
stddev time taken (s):     0.0076 [omp]
sample size          :        999

cgdrag:

==> cgdrag_forpy_1.out <==
min    time taken (s):     0.1446 [omp]
max    time taken (s):     0.1611 [omp]
mean   time taken (s):     0.1490 [omp]
stddev time taken (s):     0.0034 [omp]
sample size          :        999

==> cgdrag_torch_1.out <==
min    time taken (s):     0.1564 [omp]
max    time taken (s):     0.2209 [omp]
mean   time taken (s):     0.1696 [omp]
stddev time taken (s):     0.0121 [omp]
sample size          :        999

==> cgdrag_forpy_4.out <==
min    time taken (s):     0.0420 [omp]
max    time taken (s):     0.0578 [omp]
mean   time taken (s):     0.0437 [omp]
stddev time taken (s):     0.0012 [omp]
sample size          :        999

==> cgdrag_torch_4.out <==
min    time taken (s):     0.0460 [omp]
max    time taken (s):     0.0687 [omp]
mean   time taken (s):     0.0482 [omp]
stddev time taken (s):     0.0015 [omp]
sample size          :        999

==> cgdrag_forpy_8.out <==
min    time taken (s):     0.0317 [omp]
max    time taken (s):     0.0438 [omp]
mean   time taken (s):     0.0326 [omp]
stddev time taken (s):     0.0011 [omp]
sample size          :        999

==> cgdrag_torch_8.out <==
min    time taken (s):     0.0313 [omp]
max    time taken (s):     0.0761 [omp]
mean   time taken (s):     0.0325 [omp]
stddev time taken (s):     0.0024 [omp]
sample size          :        999

@TomMelt
Member

TomMelt commented Oct 11, 2023

Thanks @ElliottKasoar, this looks great. Can we expand the timing functionality to time:

  • initialization of model
  • running of the forward model (currently the only part we time)
  • deletion of model objects

For now, we can focus on cgdrag and resnet tests only.
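
The three phases listed above could each be wrapped in the same wall-clock timer. A minimal sketch of the idea in Python (the names load_model, run_forward and delete_model, and the stand-in lambdas, are hypothetical — not the library's actual API):

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), report elapsed wall-clock time, and return fn's result."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label:<22}: {elapsed:10.4f} s")
    return result

# Stand-ins for the real FTorch/forpy calls; each lambda represents one phase.
model = timed("model initialisation", lambda: object())
output = timed("forward pass", lambda m: None, model)
timed("model deletion", lambda m: None, model)
```
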

@TomMelt
Member

TomMelt commented Oct 11, 2023

We also need to remember that, once we have added better timing, we should compare forpy running with a TorchScript-compiled model against calling the Python environment directly. This is enabled/disabled using the CMake flag -D USETS=1 (or -D USETS=0).

@TomMelt
Member

TomMelt commented Nov 6, 2023

To close this issue, we can produce a notebook with some test results for the ResNet and cgdrag models, once we have increased the significant figures of the timings.

@ElliottKasoar ElliottKasoar linked a pull request Nov 20, 2023 that will close this issue
@TomMelt TomMelt self-assigned this Jan 15, 2024