# kotlindl
Hello everyone, we are working on integrating the PyTorch C++ API into KMath (for both native and JVM). Have a look at the prototype here: https://github.com/mipt-npm/kmath/tree/feature/torch/kmath-torch
Comments and suggestions are very welcome. Here are some rough local benchmarks, mostly to get you excited about the project as well:

Benchmarking 20 x 20 Real matrices on CPU:
K/N: 2.17us p.o. with 100000 iterations
JVM: 2.25us p.o. with 100000 iterations
C++: 1.98us p.o. with 2115299 iterations

Benchmarking 200 x 200 Real matrices on CPU:
K/N: 108us p.o. with 10000 iterations
JVM: 85.1us p.o. with 10000 iterations
C++: 81.36us p.o. with 53283 iterations

Benchmarking 2000 x 2000 Real matrices on CPU:
K/N: 75.3ms p.o. with 20 iterations
JVM: 75.1ms p.o. with 20 iterations
C++: 74.7ms p.o. with 58 iterations

Benchmarking 20 x 20 Float matrices on CPU:
K/N: 2.06us p.o. with 100000 iterations
JVM: 2.05us p.o. with 100000 iterations
C++: 1.82us p.o. with 378177 iterations

Benchmarking 20 x 20 Float matrices on CUDA(index=0):
K/N: 7.10us p.o. with 100000 iterations
JVM: 7.02us p.o. with 100000 iterations
C++: 6.93us p.o. with 101669 iterations

Benchmarking 200 x 200 Float matrices on CPU:
K/N: 41.6us p.o. with 10000 iterations
JVM: 42.6us p.o. with 10000 iterations
C++: 42.9us p.o. with 16088 iterations

Benchmarking 200 x 200 Float matrices on CUDA(index=0):
K/N: 10.3us p.o. with 10000 iterations
JVM: 10.3us p.o. with 10000 iterations
C++: 10.6us p.o. with 65344 iterations

Benchmarking 2000 x 2000 Float matrices on CPU:
K/N: 36.2ms p.o. with 20 iterations
JVM: 37.6ms p.o. with 20 iterations
C++: 36.8ms p.o. with 76 iterations

Benchmarking 2000 x 2000 Float matrices on CUDA(index=0):
K/N: 1.46ms p.o. with 1000 iterations
JVM: 1.48ms p.o. with 1000 iterations
C++: 1.78ms p.o. with 1000 iterations

Benchmarking generation of 100000 Normal samples on CPU:
K/N: 688us p.o. with 100 iterations
JVM: 855us p.o. with 100 iterations
C++: 684us p.o. with 4149 iterations

Benchmarking generation of 100000 Normal samples on CUDA(index=0):
K/N: 5.94us p.o. with 100000 iterations
JVM: 6.31us p.o. with 100000 iterations
C++: 5.60us p.o. with 490027 iterations

Benchmarking generation of 100000 Uniform samples on CPU:
K/N: 396us p.o. with 100 iterations
JVM: 476us p.o. with 100 iterations
C++: 402us p.o. with 1765 iterations

Benchmarking generation of 100000 Uniform samples on CUDA(index=0):
K/N: 5.74us p.o. with 100000 iterations
JVM: 6.21us p.o. with 100000 iterations
C++: 5.59us p.o. with 126191 iterations
👍 3