# mathematics
a
Latest matrix dot (1000 x 1000 matrices) benchmark results for different KMath algebras (after the tensor algebra fix by @Ivan Kylchik):
jvm summary:
Benchmark                            Mode  Cnt   Score   Error  Units
DotBenchmark.bufferedDot            thrpt    5   1.017 ± 0.399  ops/s
DotBenchmark.cmDot                  thrpt    5   0.589 ± 0.348  ops/s
DotBenchmark.cmDotWithConversion    thrpt    5   0.627 ± 0.395  ops/s
DotBenchmark.doubleDot              thrpt    5   1.168 ± 0.125  ops/s
DotBenchmark.doubleTensorDot        thrpt    5   1.263 ± 0.235  ops/s
DotBenchmark.ejmlDot                thrpt    5   2.573 ± 0.427  ops/s
DotBenchmark.ejmlDotWithConversion  thrpt    5   2.332 ± 0.239  ops/s
DotBenchmark.multikDot              thrpt    5  16.067 ± 0.959  ops/s
DotBenchmark.tensorDot              thrpt    5   0.614 ± 0.053  ops/s
DotBenchmark.tfDot                  thrpt    5   3.905 ± 0.541  ops/s
The Multik result (cc @Pavel Gorgulov) is a bit surprising. I did not expect it to be much faster than TensorFlow. KMath-core results are good enough for the default implementation.
🙌 4
a
Maybe TensorFlow is slower because it is optimized for the specific use case of a chain of operations on the GPU? I think it is designed to "assemble" a pipeline on the GPU and then execute it all locally in GPU memory, limiting exchanges between devices. Also, TensorFlow is at its core a C library, so I think the Java bindings inherit the performance flaws of JNI.
a
We are talking about huge matrices (1000 x 1000) and a cubic algorithm, so boundary overhead should be negligible. Multik uses native code for that as well. The thought about optimization is valid, and I think TF will win on long chains (KMath uses a lazy graph for TF). It is possible that the BLAS used in Multik employs a more advanced algorithm with lower complexity, which could not be used in a TF-like environment.
Here is the result for smaller matrices (100 x 100):
jvm summary:
Benchmark                            Mode  Cnt     Score    Error  Units
DotBenchmark.bufferedDot            thrpt    5   648.707 ± 31.499  ops/s
DotBenchmark.cmDot                  thrpt    5  1235.923 ± 13.664  ops/s
DotBenchmark.cmDotWithConversion    thrpt    5   853.729 ± 23.544  ops/s
DotBenchmark.doubleDot              thrpt    5   762.809 ±  9.829  ops/s
DotBenchmark.doubleTensorDot        thrpt    5   173.659 ±  5.294  ops/s
DotBenchmark.ejmlDot                thrpt    5  2619.588 ± 98.003  ops/s
DotBenchmark.ejmlDotWithConversion  thrpt    5  1338.647 ±  9.905  ops/s
DotBenchmark.multikDot              thrpt    5  1670.550 ± 72.283  ops/s
DotBenchmark.tensorDot              thrpt    5   214.748 ± 12.770  ops/s
DotBenchmark.tfDot                  thrpt    5   401.539 ± 37.549  ops/s
Interesting dynamics.
Again, Multik gives a very nice result, and KMath-core is reasonable.
All in all, switching to the Multik context in KMath for super-heavy computations seems to be a good idea.
👍 2
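For reference, the "cubic algorithm" point above can be sketched with a naive triple-loop dot product. This is a minimal illustration, not KMath's actual DotBenchmark code: a 1000 x 1000 multiply performs on the order of 10^9 multiply-adds, so a fixed per-call boundary cost (e.g. a JNI crossing) is negligible at that size, while at 100 x 100 (roughly 10^6 multiply-adds) wrapper and conversion overhead starts to dominate, which matches the dynamics in the two tables.

```kotlin
// Naive O(n^3) matrix multiplication, sketched to illustrate the scaling
// argument. Names here (matMul) are illustrative, not from KMath.
fun matMul(a: Array<DoubleArray>, b: Array<DoubleArray>): Array<DoubleArray> {
    val n = a.size          // rows of a
    val k = b.size          // inner dimension
    val m = b[0].size       // columns of b
    require(a[0].size == k) { "inner dimensions must match" }
    val out = Array(n) { DoubleArray(m) }
    for (i in 0 until n) {
        for (p in 0 until k) {
            // i-p-j loop order keeps the inner loop streaming over
            // contiguous rows of b and out, which is cache-friendly
            val aip = a[i][p]
            val bRow = b[p]
            val outRow = out[i]
            for (j in 0 until m) {
                outRow[j] += aip * bRow[j]
            }
        }
    }
    return out
}

fun main() {
    val a = arrayOf(doubleArrayOf(1.0, 2.0), doubleArrayOf(3.0, 4.0))
    val b = arrayOf(doubleArrayOf(5.0, 6.0), doubleArrayOf(7.0, 8.0))
    val c = matMul(a, b)
    println(c.joinToString("\n") { it.joinToString() })
}
```

BLAS implementations beat this loop by large constant factors through blocking and SIMD, and Strassen-style algorithms lower the exponent below 3, which is one plausible reason Multik's native backend outruns a straightforward cubic kernel at n = 1000.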