Finally got the Multik wrapper working. For linear add operations, performance is of the same order for the KMath-specialized buffer, Viktor, and Multik (the difference is about 10-20%, which is not significant). @Ролан's tensor in-place operations are much faster (around 300%). My guess is that memory allocation is the most expensive part of simple operations. In KMath it is possible to pool memory buffers inside a fixed-size algebra context. I wonder whether that is worth doing.
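
To make the pooling idea concrete, here is a minimal sketch in plain Kotlin. It is only an illustration under assumptions: `FixedSizeBufferPool` and `PooledVectorContext` are hypothetical names, not KMath, Multik, or Viktor API, and the sketch is single-threaded and uses raw `DoubleArray` instead of any library buffer type.

```kotlin
// Hypothetical sketch: none of these types come from KMath, Multik, or Viktor.

/** A pool of DoubleArray buffers that all share one fixed size. Not thread-safe. */
class FixedSizeBufferPool(val size: Int) {
    private val free = ArrayDeque<DoubleArray>()

    /** Number of fresh allocations performed so far. */
    var allocations: Int = 0
        private set

    /** Returns a recycled buffer if one is available, otherwise allocates a new one. */
    fun acquire(): DoubleArray =
        free.removeLastOrNull() ?: DoubleArray(size).also { allocations++ }

    /** Hands a buffer back to the pool; its contents are left as they are. */
    fun release(buffer: DoubleArray) {
        require(buffer.size == size) { "Buffer size ${buffer.size} does not match pool size $size" }
        free.addLast(buffer)
    }
}

/** A toy "algebra context" for vectors of one fixed size. */
class PooledVectorContext(size: Int) {
    val pool = FixedSizeBufferPool(size)

    /** Element-wise sum written into a pooled buffer; the caller releases it when done. */
    fun add(a: DoubleArray, b: DoubleArray): DoubleArray {
        require(a.size == pool.size && b.size == pool.size) { "Operands must match the context size" }
        val out = pool.acquire()
        for (i in out.indices) out[i] = a[i] + b[i]
        return out
    }

    /** In-place variant: accumulates b into a, with no output buffer at all. */
    fun addInPlace(a: DoubleArray, b: DoubleArray) {
        require(a.size == b.size) { "Operands must have the same size" }
        for (i in a.indices) a[i] += b[i]
    }
}
```

With this shape, a loop that calls `add` and then `pool.release(result)` allocates only once, no matter how many iterations it runs. The cost is that the caller has to release buffers explicitly and must not touch them afterwards, so whether it pays off would still need to be measured.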