I'm not really sure if this is something for this project, but I know that with large matrices even simple operations like dot product, transposition, etc. can be very slow due to the sheer number of elements. A while back I was looking into cache-oblivious algorithms: for matrix multiplication, the idea is to recursively divide the matrix into smaller block matrices, so that at some recursion depth the blocks fit into each level of the CPU cache — without the algorithm ever needing to know the cache sizes. These cache-oblivious algorithms (some search algorithms as well) ran at nearly constant efficiency regardless of the dimensions.

Like I said, I'm not sure if this belongs here or elsewhere, but attempting to implement some of the more intensive algorithms this way may be useful. For instance, I know TensorFlow can at points be very slow purely due to tensor operations.
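To make the idea concrete, here is a minimal NumPy sketch of the recursive-splitting scheme described above. All names and the base-case cutoff are illustrative assumptions, and a real implementation would use a tight compiled inner kernel instead of delegating small blocks to `@` — this just shows how the recursion reaches cache-sized blocks without being told the cache size:

```python
import numpy as np

def co_matmul(A, B, base=64):
    """Cache-oblivious style matrix multiply (sketch).

    Recursively splits the largest dimension in half; at some depth the
    sub-blocks fit in every cache level, whatever its size. `base` is an
    illustrative cutoff where we stop recursing and multiply directly.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    if max(n, k, m) <= base:
        # Block is small: multiply directly (stands in for a tuned kernel).
        return A @ B
    if n >= k and n >= m:
        # Split A horizontally: [A1; A2] @ B = [A1 @ B; A2 @ B]
        h = n // 2
        return np.vstack([co_matmul(A[:h], B, base),
                          co_matmul(A[h:], B, base)])
    if m >= k:
        # Split B vertically: A @ [B1 | B2] = [A @ B1 | A @ B2]
        h = m // 2
        return np.hstack([co_matmul(A, B[:, :h], base),
                          co_matmul(A, B[:, h:], base)])
    # Split the shared dimension: A @ B = A1 @ B1 + A2 @ B2
    h = k // 2
    return co_matmul(A[:, :h], B[:h], base) + co_matmul(A[:, h:], B[h:], base)
```

The same divide-until-it-fits pattern also improves transposition, which is otherwise a worst case for strided memory access.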