Home
jblas is a linear algebra library for Java which uses BLAS and LAPACK libraries for maximum performance. BLAS and LAPACK are very large libraries dating back several decades with the original version being written in FORTRAN. BLAS (Basic Linear Algebra Subroutines) covers basic operations like vector addition, vector copying, and also matrix-vector and matrix-matrix multiplication.
LAPACK covers higher-level matrix routines, such as solving linear equations or computing eigenvalues. LAPACK typically uses BLAS for all the low-level computations. High performance implementations of these libraries exist, often consisting of hand-tuned assembler code for the innermost loops. One very popular open source variant is ATLAS, which automatically tunes the innermost loops for maximum throughput.
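To give a feel for what this looks like from the Java side, here is a minimal sketch using jblas's DoubleMatrix class. Matrix-matrix multiplication is the kind of operation that gets routed to BLAS, while simple vector arithmetic stays in pure Java (more on that below).

```java
import org.jblas.DoubleMatrix;

public class JblasBasics {
    public static void main(String[] args) {
        // Two random 3x3 matrices and a random 3x1 vector.
        DoubleMatrix a = DoubleMatrix.rand(3, 3);
        DoubleMatrix b = DoubleMatrix.rand(3, 3);
        DoubleMatrix x = DoubleMatrix.rand(3, 1);

        // Matrix-matrix multiplication (delegated to the native BLAS).
        DoubleMatrix c = a.mmul(b);

        // Matrix-vector multiplication and vector addition
        // (simple one-pass operations like the addition stay in pure Java).
        DoubleMatrix y = a.mmul(x).add(x);

        System.out.println("C = " + c);
        System.out.println("y = " + y);
    }
}
```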
Most Java people I know are quite reluctant to leave the world of pure Java, and I think they are partly right. Once you start to interface with native code, the build process becomes much more involved, garbage collection of native objects might be a problem, and it becomes harder to stay platform independent.
On the other hand, I can assure you that it is just not possible to get the amount of performance of native matrix libraries in pure Java code. To understand why, you need to know that there are two key ingredients to fast matrix computations:
- Filling the pipeline.
- Data locality.
To give you an idea of just how fast native implementations are, single precision matrix-matrix multiplication with ATLAS on a 2 GHz Intel Core2 processor yields about 11 GFLOPS, which amounts to more than 5 completed single precision floating point operations per clock cycle. To achieve this kind of throughput, you need to know quite precisely in which order to interleave the assembler instructions.
With Java, you have no control over whether and how your code will eventually be JIT-compiled, so you have to rely on the Java compiler to do this for you, and currently Java achieves about half a floating point operation per clock cycle.
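To get a rough feel for this number on your own machine, you can time a naive pure-Java matrix multiplication and divide the resulting GFLOP/s by your clock frequency. This is only a crude illustrative micro-benchmark (a single warm-up run and wall-clock timing), not a rigorous measurement:

```java
public class NaiveMatmulBenchmark {
    // Naive triple loop: C = A * B, all matrices n x n, stored row-major in double[] arrays.
    static void matmul(double[] a, double[] b, double[] c, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }

    public static void main(String[] args) {
        int n = 512;
        double[] a = new double[n * n], b = new double[n * n], c = new double[n * n];
        java.util.Random rnd = new java.util.Random(0);
        for (int i = 0; i < n * n; i++) { a[i] = rnd.nextDouble(); b[i] = rnd.nextDouble(); }

        matmul(a, b, c, n);               // warm-up run so the JIT compiles the loop
        long t0 = System.nanoTime();
        matmul(a, b, c, n);
        double seconds = (System.nanoTime() - t0) / 1e9;

        double flops = 2.0 * n * n * n;   // one multiply and one add per innermost iteration
        System.out.printf("%.2f GFLOP/s (checksum %f)%n", flops / seconds / 1e9, c[0]);
    }
}
```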
Second of all, when you multiply two matrices, you have to perform on the order of n³ operations on n² memory. In order to achieve maximum performance, you have to make sure that you have as few cache misses as possible and move through your data intelligently. As far as I have seen, you can control this to a certain extent in Java, but because you don't really have pointer arithmetic, certain kinds of schemes become computationally quite expensive.
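For instance, simply reordering the loops of the naive multiplication above so that the innermost loop walks B and C row by row, instead of striding down a column of B, already improves data locality considerably. The following variant (which you could drop into the benchmark class above) only sketches the idea; real BLAS kernels go much further with blocking and hand-tuned assembler:

```java
// Same computation as the naive version, but with the j and k loops swapped.
// The innermost loop now reads b[k*n + j] and writes c[i*n + j] with stride 1,
// so consecutive iterations touch consecutive memory locations and far fewer cache lines.
static void matmulIkj(double[] a, double[] b, double[] c, int n) {
    java.util.Arrays.fill(c, 0.0);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```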
As said above, you can only get really good performance with native code. On the other hand, when the computation basically amounts to a single pass through your data (for example, vector addition), it turns out that calling native code from Java is too expensive, and you have to stick to pure Java code. For jblas, this is the case for all vector operations, and also for matrix-vector multiplication, as sketched after the list below.
The reasons for this are:
- You cannot control native memory from Java very well. You might have heard about native buffers, but what they don't tell you is that Java doesn't properly garbage collect native buffers. The problem is that Java doesn't take the size of a native buffer into account, meaning that it will free a native buffer at some point, but if you allocate many of them, that point will probably come only after you have run out of memory. This means that all data has to be stored in Java objects.
- When passing Java objects to native code, arrays are copied. This means that the copying alone already amounts to one pass through the data, making it particularly inefficient when you do something as simple as adding all elements.
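Because the matrix data therefore lives in an ordinary Java double[] (DoubleMatrix exposes it as its public data field), a one-pass operation like in-place vector addition can be written as a plain Java loop over that array. The following is only an illustrative sketch of the idea, not jblas's actual implementation, which adds dimension checks and other special cases:

```java
import org.jblas.DoubleMatrix;

public class PureJavaAdd {
    // Illustrative only: add y to x in place by looping over the backing arrays.
    static DoubleMatrix addInPlace(DoubleMatrix x, DoubleMatrix y) {
        if (x.length != y.length)
            throw new IllegalArgumentException("Matrices must have the same number of elements");
        double[] xd = x.data, yd = y.data;
        for (int i = 0; i < xd.length; i++)
            xd[i] += yd[i];
        return x;
    }
}
```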
The bottom line is that it pays off really well to call native code for anything which takes more time than copying the data, which luckily holds for matrix-matrix multiplication, eigenvalue decompositions, solving linear equations, and so on.
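For those heavyweight operations, jblas wraps the corresponding LAPACK routines. A short sketch using the org.jblas.Solve and org.jblas.Eigen helper classes looks roughly like this:

```java
import org.jblas.DoubleMatrix;
import org.jblas.Eigen;
import org.jblas.Solve;

public class LapackOps {
    public static void main(String[] args) {
        // Build a random symmetric, well-conditioned matrix A and a right-hand side b.
        DoubleMatrix m = DoubleMatrix.rand(4, 4);
        DoubleMatrix a = m.mmul(m.transpose()).add(DoubleMatrix.eye(4));
        DoubleMatrix b = DoubleMatrix.rand(4, 1);

        // Solve the linear system A x = b (LAPACK under the hood).
        DoubleMatrix x = Solve.solve(a, b);
        System.out.println("x = " + x);
        System.out.println("residual = " + a.mmul(x).sub(b).norm2());

        // Eigenvalues of the symmetric matrix A.
        DoubleMatrix eigenvalues = Eigen.symmetricEigenvalues(a);
        System.out.println("eigenvalues = " + eigenvalues);
    }
}
```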