This repository has been archived by the owner on Jun 25, 2022. It is now read-only.

link libnvblas? #17

Open
cboettig opened this issue Mar 9, 2019 · 12 comments

Comments

@cboettig
Member

cboettig commented Mar 9, 2019

libnvblas.so gets installed with the existing cuda libraries. Apparently this can be enabled as the drop-in BLAS library for R, and is smart enough to let openblas handle things and only take over when it can provide significant acceleration(?)

EDIT

Haven't found great documentation on setup or performance, but looks like this can be done as a one-off at runtime by setting LD_PRELOAD and configuring the fallback to openblas:

## create config file:
echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL" > /etc/nvblas.conf

Run R with these env vars:

NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 R

Will have to benchmark a bit, but maybe worth adding this into our cuda/base setup @noamross ?
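The one-off setup above can be sketched as a small script that writes the config to a scratch location and only suggests the preload when the library is actually present. The helper name `write_nvblas_conf`, the temp-dir location, and the unversioned `libnvblas.so` path are assumptions for illustration; the three config keys are the ones used above.

```shell
#!/bin/sh
# Sketch: write an NVBLAS config and print the one-off launch command.
# All paths are assumptions; adjust them to your CUDA and OpenBLAS install.

write_nvblas_conf() {
  # $1: config file to create, $2: CPU BLAS library used as the fallback
  conf_path=$1
  cpu_blas=$2
  cat > "$conf_path" <<EOF
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB $cpu_blas
NVBLAS_GPU_LIST ALL
EOF
}

CONF="${TMPDIR:-/tmp}/nvblas.conf"
write_nvblas_conf "$CONF" /usr/lib/libopenblas.so

NVBLAS_LIB=/usr/local/cuda/lib64/libnvblas.so
if [ -e "$NVBLAS_LIB" ]; then
  echo "launch with: NVBLAS_CONFIG_FILE=$CONF LD_PRELOAD=$NVBLAS_LIB R"
else
  echo "libnvblas not found at $NVBLAS_LIB; R would keep its default BLAS"
fi
```

Writing to a scratch path rather than /etc avoids needing root while experimenting; the same function could target /etc/nvblas.conf in an image build.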

@eddelbuettel
Member

> Will have to benchmark a bit

Seconded. We should definitely document that it is there, but I am not convinced it will always be a winner. Then again I am also often wrong when guessing :)

@cboettig
Member Author

cboettig commented Mar 9, 2019

Yeah, it's not clear to me what the appropriate benchmark comparison is -- obviously the difference between a given operation on GPU vs CPU depends a lot on exactly what GPU vs what CPU you have on the platform.

That said, I imagine people will really only be deploying the rocker/cuda images on machines with significant GPUs available, if not on hardware explicitly optimized for GPU use (e.g. GPU-type instances on AWS). I do see substantial improvement in low-level linear algebra operations: calculating a determinant, for example, can be about 10 times faster. For typical R use I doubt many operations would see gains like that, but then this image is already aimed at more specialized applications intended for GPUs anyway.

Note that in this experimental repo we have the CPU-based rocker/ml as well as rocker/ml-gpu; only the latter builds on rocker/cuda and would thus get the GPU BLAS. Of course, a lot of the specialized ML packages (xgboost, h2o, keras) are either already linking these libs (via their calls to Python or Java) or else using other GPU-optimized algorithms, but having rocker/cuda support GPU BLAS out of the box could make it a useful image for users who need GPU linear algebra in contexts wholly apart from the ML packages.

@noamross

Agreed that we should benchmark but in principle it seems a reasonable default for the cuda-based images. If you have an experimental fork with a script I'll get to it on our hardware, and maybe others (@MarkEdmondson1234), can give it a go, too?

@cboettig
Member Author

@noamross Thanks!

Yes, I think I have an experimental version of this in cuda/base/Dockerfile on the nvblas branch. (Help testing would be great since I just had to send my System76 desktop with my GPU back to the shop for weird crashing behavior :-( ).

So one thing is that I'm following NVIDIA's advice to use LD_PRELOAD instead of re-linking. Like they say, you don't want to set LD_PRELOAD globally, since then it would get set before every shell command run on the system, so I cribbed this approach to load it just before the R, Rscript, and rserver sessions:

ml/cuda/base/Dockerfile

Lines 89 to 105 in 87726cf

# Wrap R and Rscript so that NVBLAS is preloaded only for those programs
RUN mv /usr/local/bin/R /usr/local/bin/R_ && \
  mv /usr/local/bin/Rscript /usr/local/bin/Rscript_ && \
  echo "#!/bin/sh \
\nLD_PRELOAD=$CUDA_BLAS /usr/local/bin/R_ \"\$@\"" \
  > /usr/local/bin/R && \
  chmod +x /usr/local/bin/R && \
  echo "#!/bin/sh \
\nLD_PRELOAD=$CUDA_BLAS /usr/local/bin/Rscript_ \"\$@\"" \
  > /usr/local/bin/Rscript && \
  chmod +x /usr/local/bin/Rscript
# RStudio Server service script: export LD_PRELOAD before starting rserver
RUN echo "#!/usr/bin/with-contenv bash \
\n## load /etc/environment vars first: \
\n for line in \$( cat /etc/environment ) ; do export \$line ; done \
\n export LD_PRELOAD=$CUDA_BLAS \
\n exec /usr/lib/rstudio-server/bin/rserver --server-daemonize 0" \
  > /etc/services.d/rstudio/run

I'm really not sure that's the best way to do this. If we're adding it to the library, it probably makes more sense to configure it directly as the system's BLAS, but I'd have to refresh on how to do that (particularly in a non-interactive session like the Dockerfile). @eddelbuettel has loads more experience with linking BLAS libraries and all and can probably give us some pointers (perhaps after recovering from the horror of seeing the LD_PRELOAD approach above?).

I did give this a quick run on my system before sending it back, and the results were impressive for basic matrix multiplication and determinants, particularly compared to the default (non-parallel) BLAS. For OpenBLAS it depended more on how many CPU threads and how much memory were available to the CPU relative to the GPU, but notably linking the GPU libraries was never slower (perhaps because the nvblas.conf file already names the OpenBLAS CPU library as the fallback anyway). But this could use more testing; I haven't run this exact Dockerfile yet (or run in RStudio mode), I was just running interactively on the machine...
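The wrapper idea in the Dockerfile excerpt above can also be sketched with `printf` and a single-quoted format string, which sidesteps the escaping of `!`, `$@`, and `\n` inside double quotes. `make_wrapper` is a hypothetical helper, not something in the rocker images, and the paths in the demo call are assumptions standing in for `$CUDA_BLAS` and the renamed binaries.

```shell
#!/bin/sh
# Sketch: generate a wrapper script that preloads a library for one program.
# make_wrapper is a hypothetical helper; paths without spaces are assumed.

make_wrapper() {
  # $1: wrapper to create, $2: real binary to exec, $3: library to preload
  wrapper=$1
  real_bin=$2
  preload=$3
  # Single quotes keep "$@" literal; printf expands \n and the two %s slots.
  printf '#!/bin/sh\nLD_PRELOAD=%s exec %s "$@"\n' "$preload" "$real_bin" > "$wrapper"
  chmod +x "$wrapper"
}

# Demo in a scratch directory so nothing under /usr/local/bin is touched.
DEMO_DIR=$(mktemp -d)
make_wrapper "$DEMO_DIR/R" /usr/local/bin/R_ /usr/local/cuda/lib64/libnvblas.so
cat "$DEMO_DIR/R"
```

Using `exec` in the generated wrapper replaces the shell with R, so signals and exit codes pass through unchanged. This differs slightly from the excerpt, which runs R as a child process.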

@eddelbuettel
Member

Sorry to hear about the crashes. Frustrating.

My experience with "plugging BLAS in and out" is/was limited to systems others made that already supported it :) I.e. the Debian BLAS maintainer had this brilliant idea of using the interchangeable nature of BLAS/LAPACK along with the 'update-alternatives' mechanism of setting and adjusting softlinks to really make it swappable. In that sense we could lean on that scheme and try to fold NVIDIA's BLAS into it.

Otherwise LD_PRELOAD does the same: by rejigging the search order, you get your preferred BLAS in lieu of a default. So in that sense what you did here should do the trick.

@MarkEdmondson1234
Contributor

Would be happy to do some benchmarking, but would need some demo code to run, as BLAS etc. is all over my head :)

@eddelbuettel
Member

Roughly a hundred years ago I did just that in what is now this repo using an existing R benchmark package / script. If memory serves, Colin's benchmarkme package uses the same. It all goes back to an original old script by Simon U. Can you start from that?

@MarkEdmondson1234
Contributor

Looks good!

@restonslacker
Contributor

restonslacker commented Mar 13, 2019 via email

@cboettig
Member Author

@restonslacker whoops, that was just a typo in the Dockerfile (apparently you can't escape a literal ! while using double quotes for $VARS....) should be fixed now

@lezwright

Hi, did you have any luck with LD_PRELOAD and R? When I use this approach I can hardly engage the GPU.

@cboettig
Member Author

This example should run on the GPU using our docker images (e.g. rocker/ml) with NVIDIA BLAS.

Note that this is obviously hardware-dependent -- in particular, NVIDIA BLAS uses a configuration that enables a fall-back to the CPU BLAS if it decides a call is not worth offloading, for example when the problem is too small to amortize the data transfers. Also note that there's non-trivial overhead in moving the data from CPU to GPU, which can often swamp the time saved in the actual GPU-based computation.
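A rough way to see the effect described here is to time the same matrix product with and without the preload. This is only a sketch: the matrix size, the library and config paths, and the `run_bench` helper are illustrative assumptions, and it presumes `Rscript` on the PATH is the plain binary rather than a wrapper that already sets LD_PRELOAD (on such images, point the first run at the unwrapped binary instead).

```shell
#!/bin/sh
# Sketch: time one R matrix product with the default BLAS and with NVBLAS.
# N, the paths, and the R expression are assumptions; tune them to your setup.

N=${N:-2000}
R_EXPR="n <- ${N}; a <- matrix(runif(n * n), n); print(system.time(a %*% a))"
NVBLAS_LIB=/usr/local/cuda/lib64/libnvblas.so
NVBLAS_CONF=/etc/nvblas.conf

run_bench() {
  # $1: label to print; remaining arguments: the command to run
  label=$1
  shift
  echo "== $label =="
  "$@"
}

if command -v Rscript >/dev/null 2>&1; then
  run_bench "CPU BLAS" Rscript -e "$R_EXPR"
  if [ -e "$NVBLAS_LIB" ]; then
    # env sets the variables for this one process only
    run_bench "NVBLAS" env NVBLAS_CONFIG_FILE="$NVBLAS_CONF" \
      LD_PRELOAD="$NVBLAS_LIB" Rscript -e "$R_EXPR"
  else
    echo "skipping NVBLAS run: $NVBLAS_LIB not found"
  fi
else
  echo "Rscript not found; nothing to benchmark"
fi
```

Running it with a few values of `N` (for example `N=500` and `N=8000`) should show where transfer overhead stops dominating on a given machine. The log file named by NVBLAS_LOGFILE in the config is worth checking to confirm the library was actually loaded.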
