By the way, here are the results of naive tests (no JMH, so be careful with conclusions) on GraalVM11:
Boxing addition completed in 22157 millis
Specialized addition completed in 1840 millis
Nd4j specialized addition completed in 1309 millis
Viktor addition completed in 1966 millis
Parallel stream addition completed in 1457 millis
Automatic field addition completed in 1773 millis
Lazy addition completed in 14157 millis
ND4J uses OpenBLAS under the hood, and I think @Iaroslav Postovalov told me that it uses parallel execution. I wonder if there is a large overhead on top of BLAS, because the results are very close.
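For readers unfamiliar with this kind of measurement, a naive (non-JMH) timing of element-wise addition could look roughly like the Java sketch below. This is not the benchmark code that produced the numbers above. The class name `NaiveAdditionSketch`, the method names, the array size, and the repetition count are all hypothetical and chosen only for illustration. It compares boxed `Double[]` addition, primitive `double[]` addition, and a parallel-stream variant. Like the test above, it has no warm-up or JIT control, so its timings should be read with the same caution.

```java
import java.util.stream.IntStream;

public class NaiveAdditionSketch {

    // Boxed variant: every element is a Double object, so each addition
    // unboxes two values and allocates a new box for the result.
    static Double[] addBoxed(Double[] a, Double[] b) {
        Double[] out = new Double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // Specialized variant: plain primitive arrays, no allocation per element.
    static double[] addPrimitive(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // Parallel variant: indices are split across the common fork-join pool.
    // Each index writes to its own slot, so there are no write conflicts.
    static double[] addParallel(double[] a, double[] b) {
        double[] out = new double[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> out[i] = a[i] + b[i]);
        return out;
    }

    // Naive wall-clock timing: no warm-up, no protection against JIT effects.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 1_000_000;   // hypothetical array size
        int reps = 10;       // hypothetical repetition count

        double[] a = new double[n];
        double[] b = new double[n];
        Double[] boxedA = new Double[n];
        Double[] boxedB = new Double[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2.0 * i;
            boxedA[i] = a[i];
            boxedB[i] = b[i];
        }

        long boxed = timeMillis(() -> {
            for (int r = 0; r < reps; r++) addBoxed(boxedA, boxedB);
        });
        long primitive = timeMillis(() -> {
            for (int r = 0; r < reps; r++) addPrimitive(a, b);
        });
        long parallel = timeMillis(() -> {
            for (int r = 0; r < reps; r++) addParallel(a, b);
        });

        System.out.println("Boxing addition completed in " + boxed + " millis");
        System.out.println("Specialized addition completed in " + primitive + " millis");
        System.out.println("Parallel stream addition completed in " + parallel + " millis");
    }
}
```

A JMH harness would add warm-up iterations and guard against dead-code elimination, which is why the numbers from a loop like this are only a rough indication.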
Iaroslav Postovalov
01/24/2021, 12:22 PM
Nd4j uses MIMD
altavir
01/24/2021, 12:23 PM
SIMD is not parallel. Viktor and JDK11 also use SIMD
Iaroslav Postovalov
01/24/2021, 12:23 PM
I said MIMD
Iaroslav Postovalov
01/24/2021, 12:24 PM
Of course, if the target supports AVX2 or AVX512
altavir
01/24/2021, 12:25 PM
I did not enable it for this test. And actually, it did nothing when I uncommented your lines in the build