Managed to beat numpy's performance on ndarray operations for large arrays. (For small arrays I can't: it uses some kind of stack magic there, but that doesn't scale.) This is with generic arrays, and without code generation or ugly code. I'll run some more tests and upload it later.