Compare C
/Java
/Python
Performance in Matrix Multiplication
- Implement matrix multiplication in C, Java and Python
- For C, two versions
- Array + for loop
- Pointer arithmetic + for loop
- For Java, three versions
- Standard for loop
- Stream (Parallel)
- Without JIT
- For Python, two versions
- Standard for loop
- Numpy
- For C, two versions
c-std
: Implementation in C using for loop and arrays (this should have been named c-for
for coherence but too late)
c-pointer
: Implementation in C using for loop and pointer arithmetic
java-for
: Implementation in Java using for loop and arrays
java-no-jit
: Same as above but disable JIT during runtime
java-stream
: Implementation in Java using Parallel Streams (JDK8+) and arrays
python-for
: Implementation in Python using for loop and arrays
python-np
: Implementation in Python using numpy library
- We'll be working with square matrices ranging from
16x16
to1024x1024
. We'll focus more on matrices above256x256
. - We'll only consider
int
types for simplicity
- Only benchmark the matrix multiplication operation
- Wont consider time taken for other actions such as I/O
- Use same matrices for all variations
- Generate matrices on first run (Java) and save them to files
- For all variations, read from file and execute
- Iterations
- For each size, we will multiply
20
times and sum the durations to get larger duration results
- For each size, we will multiply
For example, python-np: 1024, 122.44
means, it took 122.44 seconds
to multiply 20
1024x1024
matrices with 20
other corresponding 1024x1024
matrices in the python
implementation using numpy
.
- CPU: Ryzen 3950x (16/32 Cores)
- RAM: 64GB
- GPU: N/A (Maybe in future, we can do CUDA)
- C: MinGW, C23, CMake
- JDK: 18 (Corretto)
- Python: 3.9 (Conda)
- OS: Windows 10
TBA if requested
c-std
: C, being the low level system language with lowest overhead, will be fastest (well, 2nd most; see below).c-pointer
: Using pointer arithmetic, we should be able to squeeze out a little more performance and should be fastest (faster thanc-std
).java-for
: Java, being byte-code interpreted language during runtime, will be slower than C variations.java-no-jit
: WithoutJIT
optimization, Java will be extremely slow.java-stream
: This should be really fast as Java will utilize multiple CPU cores but can it be faster than C?python-for
: This is similar tojava-for
but should be even slower as Python's interpreter is much slower than Java.python-np
: This should get us similar (but slightly slower) results toc-std
asnumpy
is an extension module written inC
.
The chart contains data 7x
data points for each variations and there are 7
variations as well. So we have 49
data points in total.
NOTE: We have ignored three
data points in the above chart; namely java-no-jit @ 1024
, python-for @ 512
and python-for @ 1024
as each of these took over ~500 seconds and made it harder to visualize other points of interest. We will talk about these values later.
- C is really fast for smaller dataset but is not performing as expected for larger datasets
- Java turned out to be much faster than expected
- Stock Python is many orders slow compared to everything else
We'll be explaining each of these shorty.
- Variations of C show absolute dominance.
java-stream
appears to be very slow.python-for
is already slow compared to others (exceptjava-stream
)python-np
closely followsc-std
as predicted.
- Variations of
c
are closely holding up. java
variations are slow (as expected) andjava-no-jit
is look really bad; almost as bad aspython-for
.python-for
, is already showing the extremely lack of performance of python.python-for 256, 86s
vspython-np 256, 0.319s
. Just by replacingfor-loops
withnumpy
's functions which execute usingc
, we are seeing insane amount of performance difference.python-np
closely followingc-std
as before.
Things get really interesting here. Compared to other matrix sizes, these two are huge. The amount of mathematical operations is way higher and requires much more memory.
For example, mathematical operations for a matrix is O(n^3)
.
-
256x256
:16M
-
512x512
:134M
(8 times compared to256x256
) -
1024x1024
:1B
(62 times compared to256x256
) -
As expected of
c
variations, their times shot up due to huge number of operations increase. -
While
java-for
andjava-stream
were slower than variations ofc
before, they have turned it upside down and now are faster thanc
variations and in one particular case, amazingly faster. We'll explain this below. -
java-no-jit
is showing how useless Java is without JIT optimization and is only trailed bypython-for
. -
python-for
is taking over 5000 seconds for 1024x1024 compared to C's ~100 seconds. -
As expected,
python-np
is trailing C's performance closely but slightly slowly.
Assuming c-std
as baseline performance,
- Why
c-pointer
is slower thanc-std
? - Why does
java-for
suddenly do better as matrix size increases? - How did
java-stream
outperform everything? - Why
java-no-jit
is so slow? - Why
python-for
is even slower and excessively so? - Why
python-np
is so much faster thanpython-for
and almost same asc-std
?
- While theoretically,
c-pointer
should be faster, modern C compilers do amazing optimizations during compilation resulting in extremely fast machine codes. By hand-rolling array system using pointers, we have prevented the compiler for making every optimization it could and thus,c-pointer
is slightly slower thanc-std
. java-for
implementation runs the java program with default settings whereJIT Optimization
is enabled. This causes java to utilize various loop optimizations such asloop unrolling
andloop tiling
. This is possible to do inC
too but we will require a huge amount of code/time for this. Libraries likeBLAS
implements these optimizations and will be even faster than any Java implementation (to be tested though).java-stream
implementation is similar tojava-for
but it is fully parallel [ Link to Code ]. It can and will saturate every ounce of CPU power available and come up with the result more faster as you can give it more CPU cores. Again, if we implement similar algorithm in C (very difficult), C variation will outperform the java variation.java-no-jit
is 100% pure interpretation and without any single optimization. This is theoretically as slow java as possible.python-for
, same asjava-no-jit
except their interpreter is even slower.- As mentioned before,
numpy
delegates it's operations to modules written in pure C (well, 95% C, 5% C++) as such it can harness the performance of C.
In general, C
is faster but for large datasets, we need to implement non-trivial amount of optimizations ourselves (or use libraries) otherwise, it will fall behind.
Java, on the other hand, is a breeze to write and automatically optimizes code path using JIT. It is truly an excellent balance between performance and ease of use. Finally, python is the slowest of them all but python can delegate computationally expensive tasks to extension modules written in C and harness that power. Therefore, as long as python has great extension modules, it is an excellent choice for most operations.
Loop unrolling
and Loop Tiling
are excellent techniques and can provide a huge amount of performance boost if the programmer knows what they are doing. Cache Oblivious Algorithm
is another concept to explore to further utilize loop tiling
to maximize cache memory usage and increase performance by many folds.
- END